
A REAL-TIME MULTI-TASK SINGLE SHOT FACE DETECTOR

Jun-Cheng Chen∗, Wei-An Lin∗, Jingxiao Zheng, and Rama Chellappa

University of Maryland, College Park


pullpull@cs.umd.edu, walin@terpmail.umd.edu, jxzheng@umiacs.umd.edu, rama@umiacs.umd.edu

∗ The first two authors contributed equally to this work.

ABSTRACT

Face detection, fiducial point detection, and 3D head pose estimation are important face preprocessing modules for face recognition, yet they are usually performed separately and are only loosely coupled. In this paper, we propose a unifying framework that simultaneously detects faces, fiducial points, and head pose in real time. In addition, since no single dataset contains all the required annotations, we develop a progressive training strategy to overcome the annotation discrepancy across different datasets. Extensive experiments on face detection, fiducial detection, and pose estimation benchmarks demonstrate that the proposed approach achieves performance comparable to a state-of-the-art system [1] while running 60 times faster (i.e., 20 frames per second).

Fig. 1: For the multi-task single-shot detector (MTSSD), predictions are made for a set of default boxes, indicated by dotted boxes of different aspect ratios with respect to each solid box. For each default box, a confidence score for face vs. non-face (Conf: (Cb, Cf)) is predicted, as are offsets for the face bounding box (Loc: Δ(x, y, w, h)) and the facial landmarks (Fiducial: (F1, ..., F5)) with respect to the default box, and the 3D head pose (Pose: (pitch, roll, yaw)).

Index Terms— face detection, fiducial detection, head pose estimation

1. INTRODUCTION

Face recognition has been one of the most active research problems studied in computer vision for decades. Recently, deep convolutional neural network (DCNN)-based approaches [2, 3, 4] have yielded recognition performance surpassing human performance on the well-known Labeled Faces in the Wild (LFW) dataset [5]. However, to be useful in practice, face recognition still needs robust and efficient face and fiducial detection modules that can localize faces with large pose and illumination variations and thereby automate the recognition pipeline.

To address this issue, several multi-task face detectors have been developed recently. Ranjan et al. proposed a multi-task face detector, HyperFace [6], which can simultaneously detect faces, localize facial landmarks, classify gender, and perform pose estimation. Ranjan et al. further extended this design to an All-in-One convolutional neural network model [1], which simultaneously performs more tasks than HyperFace; the additional tasks include smile detection, age estimation, and identity determination. That system achieves state-of-the-art performance on multiple tasks. Although these systems can effectively fuse features from multiple layers to perform face analysis, the processing speed of both HyperFace and All-in-One CNN is bounded by the region proposal generation step using selective search [7], which takes 3.5 seconds per image and limits their effectiveness in real-world applications. Motivated by recent developments in multi-task learning for DCNNs and the single-shot object detector (SSD) [8] based on a fully convolutional network, we propose a Multi-task SSD (MTSSD) face detector as the face preprocessing module. It performs multi-scale face detection, five-point facial landmark localization, and pose estimation at the same time at much higher speed (i.e., around 20 fps on an NVidia Titan X for 500×500 RGB input images), since it bypasses the region proposal step.

The rest of this paper is organized as follows: in Section 2, we present a brief review of relevant related works on face preprocessing. In Section 3, we discuss the details of the proposed real-time multi-task face detector. In Section 4, we present experimental results for the face preprocessing task on challenging face datasets. Finally, we conclude the paper in Section 5.

2. RELATED WORK

In this section, we briefly review some relevant works for face preprocessing modules.

Multi-task Face Preprocessing: Caruana [9] first analyzed multi-task learning (MTL) in detail, and MTL techniques have since been used to solve many computer vision problems. In [10], Zhu et al. used a mixture of deformable part models with shared pools of parts to perform face detection, landmark localization, and head pose estimation at the same time. Recently, as suggested in [11, 12], DCNNs trained with a large-scale annotated dataset encode rich hierarchical information, and different layers are good at different tasks. Several methods have thus incorporated the MTL framework with DCNNs for computer vision tasks. Fast R-CNN [13] performed object classification and bounding box regression with MTL for object detection. HyperFace [6] further trained an MTL network for simultaneous face detection, landmark localization, and pose and gender estimation by fusing the features of intermediate DCNN layers for improved feature extraction. Zhang et al. [14] improved landmark localization by training it jointly with head-pose estimation and facial attribute inference. Recent methods for face detection based on DCNNs, such as Faceness [15], HyperFace [6], Faster R-CNN [16], All-in-One CNN [1], MTCNN [17],
Supervised Transformer Network [18], etc., have significantly outperformed traditional approaches such as TSM [10] and NDPFace [19]. Existing methods for landmark localization mainly focus on near-frontal faces [20], [21], [22], where all the essential facial landmarks are visible. Recent methods such as PIFA [23], 3DDFA [24], HyperFace [6], All-in-One CNN [1], and CCL [25] have explored face alignment over varying pose angles. The task of pose estimation is to infer the 3D head pose of a person with respect to the camera. TSM [10], FaceDPL [26], HyperFace [6], and All-in-One CNN [1] have also achieved impressive results for this task.

3. PROPOSED APPROACH

In the following subsections, we describe the details of the proposed MTSSD, which simultaneously detects all the possible faces along with their facial landmarks and 3D head pose (i.e., each detected face can be immediately aligned into canonical coordinates by a similarity transform computed from the predicted fiducial points and pose, and then passed through another deep convolutional network for face recognition).
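For concreteness, the alignment step described above can be implemented with an off-the-shelf similarity transform estimator. The following is a minimal sketch, assuming scikit-image is available; the canonical template coordinates and the 112×112 crop size are illustrative choices, not values taken from the paper.

```python
import numpy as np
from skimage import transform as trans

# Illustrative canonical positions (within a 112x112 crop) for the five
# fiducials: left eye, right eye, nose tip, left/right mouth corners.
TEMPLATE = np.array([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                     [41.5, 92.4], [70.7, 92.2]], dtype=np.float64)

def align_face(image, fiducials, out_size=112):
    """Warp a detected face into canonical coordinates using a similarity
    transform estimated from the five predicted fiducial points."""
    tform = trans.SimilarityTransform()
    tform.estimate(np.asarray(fiducials, dtype=np.float64), TEMPLATE)
    # skimage's warp expects the inverse map (output -> input coordinates).
    return trans.warp(image, tform.inverse, output_shape=(out_size, out_size))
```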
Fig. 2: The network architecture of the multi-task single-shot face detector. The images are first resized to 500×500 and then fed as inputs into the network (VGG16 base: conv4_3, fc7). Extra feature maps (conv6_2, conv7_2, conv8_2, conv9_2, pool6) are added in place of the final layers of the VGG-16 network, and small convolutional filters (conv: 3×3×(k×(nClasses+4+nPoses+nFids)) for k default boxes per location) produce estimates for class, pose, bounding box offsets, and fiducial point offsets, which are processed through per-task non-maximum suppression to make the final detections. Red and green indicate additions to the architecture of SSD [8]. We follow the naming convention for the extra layers used by the source code of [8].

3.1. Single-Shot Multibox Object Detector

The single-shot multibox detector [8] is a feed-forward fully convolutional network which uses a truncated VGG16 [27] as the base network (the proposed approach can also be applied to other base networks, such as ResNet [28]). The pool5 layer in VGG16 is converted to 3×3 with stride one, and fc6 and fc7 are converted to convolutional layers with the atrous algorithm [29]. Additional convolutional layers and a global average pooling layer are added after the base network, and the size of each layer decreases progressively. Predictions for a regularly spaced set of possible detections are computed by applying a collection of 3×3 filters to the channels of one of the feature layers. Each 3×3 filter produces one value at each location, and the outputs are either classification scores or localization offsets. We refer the interested reader to [8] for more details about SSD. Since SSD uses a fully convolutional framework, it is fast and performs detection in a single shot.
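To make the prediction mechanism concrete, the sketch below builds one such per-feature-map head as a single 3×3 convolution whose output channels pack all per-default-box estimates, mirroring the classifier annotation in Fig. 2. This is a hedged reconstruction: the exact channel packing (e.g., whether nFids counts points or x/y coordinate pairs) is an assumption, and PyTorch is used only for illustration.

```python
import torch.nn as nn

def make_prediction_head(in_channels, n_anchors, n_classes=2, n_pose=3, n_fid=5):
    """One 3x3 convolutional prediction head applied to a feature layer.
    For each of the n_anchors default boxes per spatial location, it emits
    class scores, 4 bounding-box offsets, 3 pose angles, and 2 offsets
    (x, y) per fiducial point."""
    out_channels = n_anchors * (n_classes + 4 + n_pose + 2 * n_fid)
    return nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

# e.g., a head for conv4_3 (512 channels, 3 default boxes per location):
head = make_prediction_head(in_channels=512, n_anchors=3)
```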
3.2. Multi-task Learning

Multi-task learning is an inductive transfer learning technique in which two or more learning machines are trained cooperatively [9]. In MTL settings, there is a mechanism by which knowledge learned for one task is transferred to the other tasks; the idea is that each task can benefit from the knowledge learned while training for other related tasks. Backpropagation has been recognized as an effective method for learning distributed representations [30]. For example, in the multi-task setting, we can jointly minimize one global loss function for a DCNN model as follows:

L = \sum_{t=1}^{T} \sum_{i=1}^{N_t} \lambda_t L_t(f(x_{ti}), y_{ti}),   (1)

where T is the number of tasks, N_t is the number of training samples for task t, y_{ti} is the ground truth label for training sample x_{ti}, f is a multi-output function computed by the network, and L_t is the loss function for task t.
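As a hedged illustration of Eq. (1), the snippet below combines per-task losses on shared network outputs with task weights λ_t. The task names and the choice of cross-entropy and smooth L1 losses are placeholders, not the paper's exact configuration.

```python
import torch.nn.functional as F

def multitask_loss(outputs, targets, weights):
    """Eq. (1): a weighted sum of per-task losses L_t over shared network
    outputs. `outputs` and `targets` map a task name to that task's
    predictions and labels (already batched over its N_t samples)."""
    losses = {
        "cls": F.cross_entropy(outputs["cls"], targets["cls"]),   # classification task
        "reg": F.smooth_l1_loss(outputs["reg"], targets["reg"]),  # regression task
    }
    return sum(weights[t] * losses[t] for t in losses)
```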
3.3. Multi-Task Single-Shot Detector for Faces (MTSSD)

The MTSSD face detector is built on SSD, with additional 3×3 convolution layers for the prediction of the five fiducial points and the 3D head pose, as shown in Figure 2. The new training objective is a weighted sum of four losses: Lcls, Lloc, Lfid, and Lpose (i.e., the class (face vs. non-face), face localization, facial landmark, and head pose losses, respectively). To compute the loss and backpropagate through the network, we first need to match the ground truth detections and other annotations to the appropriate default boxes. We follow the paradigm presented in [8] and match a default box to a ground truth box if their Intersection over Union (IoU) ratio is greater than 0.5.
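The matching step is a standard IoU test between all default boxes and the ground truth boxes. A minimal NumPy sketch follows, assuming boxes in (x1, y1, x2, y2) corner format; the function and variable names are illustrative.

```python
import numpy as np

def match_default_boxes(defaults, gt_boxes, thresh=0.5):
    """Match each default box to a ground-truth box when their IoU exceeds
    `thresh`, as in SSD [8]. Returns, for each default box, the index of
    the matched ground-truth box, or -1 for background."""
    if len(gt_boxes) == 0:
        return -np.ones(len(defaults), dtype=int)
    ious = np.zeros((len(defaults), len(gt_boxes)))
    for j, g in enumerate(gt_boxes):
        ix1 = np.maximum(defaults[:, 0], g[0])
        iy1 = np.maximum(defaults[:, 1], g[1])
        ix2 = np.minimum(defaults[:, 2], g[2])
        iy2 = np.minimum(defaults[:, 3], g[3])
        inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
        area_d = (defaults[:, 2] - defaults[:, 0]) * (defaults[:, 3] - defaults[:, 1])
        area_g = (g[2] - g[0]) * (g[3] - g[1])
        ious[:, j] = inter / (area_d + area_g - inter)
    best = ious.argmax(axis=1)
    return np.where(ious.max(axis=1) > thresh, best, -1)
```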
Then, the training objective is computed as

L_{all} = \frac{1}{N} \left( L_{cls} + \lambda_{loc} L_{loc} + \lambda_{fid} L_{fid} + \lambda_{pose} L_{pose} \right),   (2)

where \lambda_{loc}, \lambda_{fid}, and \lambda_{pose} are all set to 1. Since a ground truth bounding box can be matched to multiple default boxes, the objective L_{all} is normalized by N, the total number of default boxes matched to a ground truth bounding box. L_{loc}, L_{fid}, and L_{pose} are all smooth L1 regression losses [13], and L_{cls} is a softmax loss. Note that, in contrast to HyperFace [6] or All-in-One CNN [1], which fuse features of different layers and make a fixed number of predictions, different numbers of detections are produced by the different feature layers. Since the detector is based on a fully convolutional network and detection runs in a single shot, the detection speed is high. In addition, fiducial point annotations are not always available due to occlusion in the AFLW dataset, so we force the loss for the occluded points to zero.
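A hedged sketch of Eq. (2) is given below, including the occlusion handling: multiplying both the predicted and target landmark offsets by a 0/1 visibility mask zeroes the loss, and hence the backpropagated gradient, for occluded fiducials. The tensor layout and field names are assumptions for illustration.

```python
import torch.nn.functional as F

def mtssd_loss(pred, target, fid_mask,
               lam_loc=1.0, lam_fid=1.0, lam_pose=1.0):
    """Eq. (2), computed over the matched default boxes. `fid_mask` is a
    0/1 tensor marking visible fiducial coordinates; occluded entries
    contribute zero loss and zero gradient."""
    n = max(pred["loc"].shape[0], 1)  # N: default boxes matched to a ground truth
    l_cls = F.cross_entropy(pred["cls"], target["cls"], reduction="sum")
    l_loc = F.smooth_l1_loss(pred["loc"], target["loc"], reduction="sum")
    l_fid = F.smooth_l1_loss(pred["fid"] * fid_mask,
                             target["fid"] * fid_mask, reduction="sum")
    l_pose = F.smooth_l1_loss(pred["pose"], target["pose"], reduction="sum")
    return (l_cls + lam_loc * l_loc + lam_fid * l_fid + lam_pose * l_pose) / n
```

In the full detector, the classification term would also include hard-mined negative default boxes, as in [8]; that bookkeeping is omitted here for brevity.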
3.4. Training

To train the MTSSD, we use (1) the face bounding box annotations of the WIDER face benchmark [31], whose faces are captured in different environments with large variations in pose, illumination, occlusion, and resolution, and (2) the AFLW dataset [32] for face bounding boxes, fiducial points, and 3D head pose information. However, if we sequentially fine-tune the SSD model on WIDER face and then on AFLW (or vice versa) to train the MTSSD, the face detection performance degrades significantly, since the annotation styles for the face bounding boxes differ between the WIDER
face and AFLW datasets: WIDER face has much tighter face bounding boxes than AFLW, and faces in AFLW are much easier to detect than those in WIDER face. Thus, we first train an SSD face detector on WIDER face and apply it to AFLW to acquire normalized annotations for face bounding boxes, fiducial points, and head pose in the WIDER face style. Inspired by [33], which regularizes the objective with the loss of the old task when fine-tuning the model, we combine WIDER face and AFLW in random order and directly train the MTSSD on this combined dataset. When the fiducial point and pose annotations for an image are unavailable, we force the corresponding loss and the backpropagated gradient to zero for the branches of convolutional layers responsible for fiducial point detection or head pose estimation. We set the learning rate to 1e-3 and the momentum to 0.9, and train the model with SGD for 20,000 iterations, fine-tuning the pre-trained VGG16 model [27] re-sampled by the atrous algorithm [29]. In addition, we also perform hard negative mining for face bounding boxes, data augmentation for face bounding boxes and fiducial points, and the sampling strategy for different aspect ratios, as in [8].
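The progressive re-annotation step might look like the following sketch: the WIDER-trained detector is run over AFLW images, and each AFLW face adopts the overlapping detection as its WIDER-style box while keeping its original fiducial and pose labels. All names here (`detector`, the sample fields) are hypothetical; this is one interpretation of the procedure, not the authors' released code.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def reannotate_aflw(detector, aflw_samples, iou_thresh=0.5):
    """Run the WIDER-trained SSD detector on AFLW and replace each face's
    loose AFLW box with the overlapping WIDER-style detection, keeping
    the original fiducial and pose annotations."""
    normalized = []
    for sample in aflw_samples:
        detections = detector(sample["image"])  # WIDER-style boxes
        for face in sample["faces"]:
            ious = [box_iou(face["box"], d) for d in detections]
            if ious and max(ious) > iou_thresh:
                normalized.append({"box": detections[int(np.argmax(ious))],
                                   "fiducials": face["fiducials"],
                                   "pose": face["pose"]})
    return normalized
```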

4. EXPERIMENTAL RESULTS

In this section, we present the evaluation results of the proposed approach for face detection, facial landmark localization, and pose estimation.

Fig. 3: Sample detection results (a)-(d) on the FDDB dataset. The white ellipses are the ground truth annotations. The red boxes are the detected bounding boxes. The magenta boxes are detected bounding boxes with confidence scores greater than 0.5 for which no annotation is present. A careful investigation shows that there are 102 such boxes in total, and only 12 of them are not faces. We remove those magenta boxes in the performance evaluation. Note that there are still some red boxes that contain human faces but are counted as false positives (e.g., in Figure 3-(b) and Figure 3-(c)).

4.1. Face Detection

We evaluate the face detection module of the proposed network on three challenging public datasets: FDDB [34], AFW [10], and PASCAL faces [35]. These datasets contain faces with large variations in pose, illumination, scale, blur, and appearance. In [36], Mathias et al. pointed out that the evaluations on existing face datasets are biased due to different annotation criteria. They proposed remedial measures and new annotations for both the AFW and PASCAL datasets, but the FDDB dataset is viewed as immutable since elliptical annotations are used. In this work, we use the evaluation toolbox provided by [36] for the AFW and PASCAL face datasets and the evaluation code of [34] for the FDDB dataset. In order to compensate for the annotation bias in the FDDB dataset, we remove detected bounding boxes with confidence scores greater than 0.5 if they actually contain faces but no annotation is present. Figure 3 presents some sample detection results and illustrates how we handle incorrect annotations.

The ROC curves for the FDDB dataset and the precision-recall curves for the AFW and PASCAL face datasets are shown in Figure 4. Our detection component achieves results comparable to state-of-the-art algorithms on the FDDB dataset without fine-tuning on the validation set as in [1]. For the AFW and PASCAL datasets, the proposed method achieves higher AP, although the precision is slightly lower than the state of the art at some operating points. In addition, our proposed face detection module outputs bounding boxes by passing images through a single-stage network without further re-evaluation such as the second-stage RCNN network in [18] or the Iterative Region Proposals and Landmark-based NMS used in [1]. This shows that the proposed face detector module can be trained end-to-end without the burden of tuning several pretrained subnetworks, and is thus able to operate more efficiently.

4.2. Ablative Analysis of Different Layers for Face Detection

We present an ablative study of the influence of each layer on the performance of the proposed MTSSD face detector in Figure 5. Since the results of facial landmark and 3D pose estimation depend on the detected faces, we only show results for face detection, removing each layer in turn (i.e., the number of detected faces differs as each layer is removed). From the figure, we find that the performance drops significantly when we remove the lower layers conv4_3, conv6_2, and fc7. This corresponds to the fact that most faces are of medium size with respect to the image, while the higher layers mainly handle close-up faces, which are few in the datasets.

4.3. Landmark Localization and Pose Estimation

Besides face detection, we also evaluate the landmark and pose estimation modules on the AFW dataset. Figure 6 shows a comparison of our landmark estimation module with RCPR [37], SDM [38], TSPM [10], and Zhang et al. [14]. It is clear that our method has higher localization accuracy at all five facial landmarks than these methods. Figure 7 shows the performance of the pose estimation module compared with All-In-One CNN [1], HyperFace [6], and FaceDPL [26]. The proposed method predicts yaw within 15° for more than 90% of the faces, which is adequate for a preprocessing module in a face identification/verification framework. In the proposed framework, the pose estimate can also be used as additional information to evaluate whether all the predicted facial landmarks are reliable. For example, if the predicted yaw is +60° (right orientation), then we can expect that the fiducial points on the right side of the face should be discarded when performing face alignment.
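This pose-gated reliability check can be made concrete with a small helper. The ±45° gate and the sign convention (positive yaw meaning the face is turned toward its right, as in the paper's +60° example) are illustrative assumptions.

```python
def reliable_fiducials(fiducials, yaw_deg, yaw_limit=45.0):
    """Drop landmarks that are likely self-occluded at large yaw before
    estimating the alignment transform. Fiducial order: left eye, right
    eye, nose tip, left mouth corner, right mouth corner."""
    keep = [True] * 5
    if yaw_deg > yaw_limit:         # face turned right: right side occluded
        keep[1] = keep[4] = False   # right eye, right mouth corner
    elif yaw_deg < -yaw_limit:      # face turned left: left side occluded
        keep[0] = keep[3] = False   # left eye, left mouth corner
    return [f for f, k in zip(fiducials, keep) if k]
```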

Fig. 4: Comparisons with state-of-the-art approaches on (a) the FDDB dataset (ROC: true positive rate vs. number of false positives; compared methods include All-In-One CNN, HyperFace, Supervised Transformer Network, Faster R-CNN, and Cascade CNN), (b) the AFW dataset, and (c) the PASCAL face dataset (precision vs. recall). Average precision on AFW: Ours 98.95%, All-In-One CNN 98.50%, Supervised Transformer Network 98.35%, HyperFace 97.90%, Faceness 97.2%, HeadHunter 97.10%. Average precision on PASCAL faces: Ours 97.09%, All-In-One CNN 95.01%, Supervised Transformer Network 94.10%, HyperFace 92.46%, HeadHunter 89.63%.

Fig. 5: Face detection results on the FDDB dataset for the ablative study, removing each individual layer (conv4_3, conv6_2, conv7_2, conv8_2, conv9_2, fc7, pool6) from MTSSD in turn (ROC: true positive rate vs. number of false positives).

Fig. 7: Evaluation of pose estimation on the AFW dataset against All-In-One CNN, HyperFace, and DPL. The performance is measured by the fraction of test faces whose mean estimation error (in degrees) is below a given threshold.

Fig. 6: Performance comparison on the AFW dataset for landmark localization against RCPR, SDM, TSPM, and Zhang et al. The mean prediction errors (%) for the five facial landmarks (left eye, right eye, nose tip, left mouth corner, right mouth corner) are reported.

5. CONCLUSION

In this paper, we propose a multi-task single-shot face detector which can simultaneously perform face detection, facial landmark localization, and head pose estimation. The proposed approach runs in real time and still achieves accuracy comparable to the state-of-the-art system [1] (i.e., with 60x faster processing speed than [1]). The real-time performance not only makes it easier for researchers to inspect the robustness of face recognition with the proposed approach under different situations (e.g., large variations in pose, illumination, occlusion, etc.), but also makes the system more practical to use in real-world scenarios.

6. ACKNOWLEDGMENTS

This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

7. REFERENCES

[1] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa, “An all-in-one convolutional neural network for face analysis,” in IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2017, pp. 17–24.

[2] Y. Taigman, M. Yang, M. A. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1701–1708.

[3] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” arXiv preprint arXiv:1503.03832, 2015.

[4] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in British Machine Vision Conference (BMVC), 2015.

[5] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” in Workshop on Faces in Real-Life Images: Detection, Alignment, and Recognition, 2008.

[6] R. Ranjan, V. M. Patel, and R. Chellappa, “Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[7] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.

[8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed, “SSD: Single shot multibox detector,” in European Conference on Computer Vision (ECCV), 2016.

[9] R. Caruana, “Multitask learning,” in Learning to Learn, 1998.

[10] X. Zhu and D. Ramanan, “Face detection, pose estimation and landmark estimation in the wild,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[11] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” in International Conference on Machine Learning (ICML), 2014, pp. 647–655.

[12] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 3320–3328.

[13] R. Girshick, “Fast R-CNN,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.

[14] Z. Zhang, P. Luo, C. Loy, and X. Tang, “Facial landmark detection by deep multi-task learning,” in European Conference on Computer Vision (ECCV), 2014.

[15] S. Yang, P. Luo, C. C. Loy, and X. Tang, “From facial parts responses to face detection: A deep learning approach,” in IEEE International Conference on Computer Vision (ICCV), 2015.

[16] H. Jiang and E. Learned-Miller, “Face detection with the Faster R-CNN,” in IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2017.

[17] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multi-task cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.

[18] D. Chen, G. Hua, F. Wen, and J. Sun, “Supervised transformer network for efficient face detection,” in European Conference on Computer Vision (ECCV), 2016.

[19] S. Liao, A. Jain, and S. Li, “A fast and accurate unconstrained face detector,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.

[20] X. Cao, Y. Wei, F. Wen, and J. Sun, “Face alignment by explicit shape regression,” 2014.

[21] S. Ren, X. Cao, Y. Wei, and J. Sun, “Face alignment at 3000 fps via regressing local binary features,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[22] V. Kazemi and J. Sullivan, “One millisecond face alignment with an ensemble of regression trees,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1867–1874.

[23] A. Jourabloo and X. Liu, “Pose-invariant 3D face alignment,” in IEEE International Conference on Computer Vision (ICCV), 2015.

[24] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li, “Face alignment across large poses: A 3D solution,” arXiv preprint arXiv:1511.07212, 2015.

[25] C. Li, C. C. Loy, and X. Tang, “Unconstrained face alignment via cascaded compositional learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[26] X. Zhu and D. Ramanan, “FaceDPL: Detection, pose estimation, and landmark localization in the wild,” preprint, 2015.

[27] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[28] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.

[29] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.

[30] G. E. Hinton, “Learning distributed representations of concepts,” in Annual Conference of the Cognitive Science Society, 1986, vol. 1, p. 12.

[31] S. Yang, P. Luo, C. C. Loy, and X. Tang, “Wider face: A face detection benchmark,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[32] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof, “Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization,” in First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies, 2011.

[33] Z. Li and D. Hoiem, “Learning without forgetting,” in European Conference on Computer Vision (ECCV), 2016.

[34] V. Jain and E. Learned-Miller, “FDDB: A benchmark for face detection in unconstrained settings,” Tech. Rep. UM-CS-2010-009, University of Massachusetts, Amherst, 2010.

[35] J. Yan, X. Zhang, Z. Lei, and S. Z. Li, “Face detection by structural models,” Image and Vision Computing, pp. 790–799, 2014.

[36] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool, “Face detection without bells and whistles,” in European Conference on Computer Vision (ECCV), 2014.

[37] X. P. Burgos-Artizzu, P. Perona, and P. Dollár, “Robust face landmark estimation under occlusion,” in IEEE International Conference on Computer Vision (ICCV), 2013.

[38] X. Xiong and F. De la Torre, “Supervised descent method and its application to face alignment,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
