Professional Documents
Culture Documents
MM 2
MM 2
Constrained Local Neural Fields for robust facial landmark detection in the wild
Abstract
Detected facial
al
Facial feature detection algorithms have seen great landmarks
progress over the recent years. However, they still struggle
in poor lighting conditions and in the presence of extreme wj Non-uniform
optimisation tak
optimisation taking
pose or occlusions. We present the Constrained Local Neu- patch expert
Individual
ral Field model for facial landmark detection. Our model landmark reliabilities into
account
includes two main novelties. First, we introduce a prob- reliabilities
x4
x1
θ5θΘ’s
5θΘ’s
x5
5 Θ
x2
spatial locality
Regularised Landmark Mean-Shift optimisation technique, Landmark
y4
θ5θΘ’s
5θΘ’s
1 Θ
y5
θ5θΘ’s
5θΘ’s
2 Θ
y4
y1
y5
y2
patch expert x4
x1
x5
x2
θΘ’s
5 Θ’sΘ
x1
θ5θΘ’s
5θΘ’s
5 Θ
x2
1. Introduction
Facial expression is a rich source of information which Figure 1: Overview of our CLNF model. We use our
provides an important communication channel for human Local Neural Field patch expert to calculate more reliable
interaction. Humans use them to reveal intent, display af- response maps. Optimisation over the patch responses is
fection, and express emotion [13]. Automated tracking and performed using our Non-Uniform Regularised Mean-Shift
analysis of such visual cues would greatly benefit human method that takes the reliability of each patch expert into
computer interaction [13]. A crucial initial step in many account leading to more accurate fitting. Note that only 3
affect sensing, face recognition, and human behaviour un- patch experts are displayed for clarity.
derstanding systems is the detection of certain facial feature
points such as eyebrows, corners of eyes, and lips. This is
an interesting and still an unsolved problem, especially for lighting conditions, in the presence of occlusion, and when
faces in the wild — exhibiting variability in pose, lighting, detecting landmarks in unseen datasets.
facial expression, age, gender, race, accessories, make-up, In this paper, we present the Constrained Local Neural
occlusions, background, focus, and resolution. Field (CLNF), a novel instance of CLM that deals with
There have been many attempts, with varying success, the issues of feature detection in complex scenes. First
at tackling the problem of accurate and person independent of all, our CLNF model incorporates a novel Local Neu-
facial landmark detection. One of the most promising is the ral Field (LNF) patch expert, which allows us to both cap-
Constrained Local Model (CLM) proposed by Cristinacce ture more complex information and exploit spatial relation-
and Cootes [5], and various extensions that followed [1, 3, ships between pixels. We also propose Non-Uniform Reg-
15, 20]. CLM methods, however, they still struggle in poor ularised Landmark Mean-Shift (NU-RLMS), a novel CLM
355
Input Ground LNF LNF
SVR
truth (no edge) (edge)
356
pixels nearby should have similar alignment probabilities. 1
h(θ, x) = T , (11)
Secondly, in the whole area the patch expert is evaluated, 1 + e−θ x
only one peak should be present. We want to enforce some 1 (g )
sparsity in the response. gk (yi , yj ) = − Si,jk (yi − yj )2 , (12)
2
We visually show the advantages of modelling spatial de-
1 (l )
pendencies and input non-linearities in Figure 2, that shows lk (yi , yj ) = − Si,jk (yi + yj )2 . (13)
patch responses maps from SVR patch experts [20], our 2
LNF patch expert without spatial constraints, and our full Vertex features fk represent the mapping from the input xi
LNF patch expert with similarity and sparsity constraints. to output yi through a one layer neural network and θ k is the
Note how our patch expert response has fewer peaks and is weight vector for a particular neuron k. Θ can be thought of
smoother than the one without edge features, and both of as a set of convolution kernels that are applied to an area of
them are more accurate than the SVR patch expert. These interest. The corresponding αk for vertex feature fk repre-
spatial constraints are designed to improve the patch re- sents the reliability of the k th neuron (convolution kernel).
sponse convexity, leading to more accurate fitting. Edge features gk represent the similarities between ob-
servations yi and yj . In our LNF patch expert gk enforces
3.1.1 Model definition smoothness on connected nodes. This is also affected by
the neighbourhood measure S (gk ) , which allows us to con-
LNF is an undirected graphical model that can model the trol where the smoothness is to be enforced. For our patch
conditional probability of a continuous valued vector y (the expert we define S (g1 ) to return 1 (otherwise return 0) only
probability that a patch is aligned) depending on continu- when the two nodes i and j are direct (horizontal/vertical)
ous x (the pixel intensity values in the support region). A neighbours in a grid. We also define S (g2 ) to return 1 (oth-
graphical illustration of our model can be seen in Figure 3. erwise 0) when i and j are diagonal neighbours in a grid.
In our discussion we will use the following notation: Edge features lk represent the sparsity constraint be-
X = {x1 , x2 , . . . , xn } is a set of observed input variables, tween observations yi and yj . For example the model is
y = {y1 , y2 , . . . , yn } is a set of output variables that we penalised if both yi and yj are high, but is not penalised if
wish to predict, xi ∈ Rm is a vectorised pixel intensities in both of them are zero. This has a slightly unwanted conse-
patch expert support region (e.g. m = 121 for an 11 × 11 quence of penalising just yi or yj being high, but the penalty
support region), yi ∈ R, and n is the area of interest where for both of them being high is much bigger. This is con-
we evaluate our patch expert. trolled by the neighbourhood measure S (lk ) that allows us
Our model for a particular set of observations is a condi- to define regions where sparsity should be enforced. We
tional probability distribution with the probability density: empirically defined the neighbourhood region S (l) to return
1 only when two nodes i and j are between 4 and 6 edges
exp(Ψ)
P (y|X) = ∞ (8) apart (where edges are counted from the grid layout of our
−∞
exp(Ψ)dy LNF patch expert).
∞
Above −∞ exp(Ψ)dy is the normalisation (partition) func-
tion which makes the probability distribution a valid one (by 3.1.3 Learning and Inference
making it sum to 1). The following section describes the po- In this section we describe how to estimate the parameters
tential function used by our LNF patch expert. {α, β, γ, Θ}. It is important to note that all of the parame-
ters are optimised jointly.
3.1.2 Potential functions We are given training data {x(q) , y(q) }M
q=1 of M patches,
(q) (q) (q)
Our potential function is defined as: where each x(q) = {x1 , x2 , . . . , xn } is a sequence of
K1 inputs (pixel values in the area of interest) and each y(q) =
(q) (q) (q)
Ψ =
i k=1 αk fk (yi , X, θ k )+ {y1 , y2 , . . . , yn } is a sequence of real valued outputs
K2
βk gk (yi , yj )+ , (9) (expected response maps).
i,j k=1K3
i,j k=1 γk lk (yi , yj )
In learning we want to pick the α, β, γ and Θ values
that maximise the conditional log-likelihood of LNF on the
where model parameters α = {α1 , α2 , . . . αK1 }, Θ = training sequences:
{θ 1 , θ 2 , . . . θ K1 }, and β = {β1 , β2 , . . . βK2 }, γ =
M
{γ1 , γ2 , . . . γK3 } are learned and used for inference during
testing. We define three types of potentials in our model, L(α, β, γ, Θ) = log P (y(q) |x(q) ) (14)
q=1
vertex features fk , and edge features gk , and lk :
fk (yi , X, θ k ) = −(yi − h(θ k , xi ))2 , (10) (ᾱ, β̄, γ̄, Θ̄) = arg max(L(α, β, γ, Θ)) (15)
α,β,γ,Θ
357
It helps with the derivation of the partial derivatives of algorithm. In order to make the optimisation more accurate
Equation 14 and with explanation of inference to convert the and faster we used the partial derivatives of the log P (y|X).
Equation 8 into multivariate Gaussian form (see Baltrušaitis To train our LNF patch expert, we need to define the
et al. [2] for a similar derivation): output variables yi . Given an image with a true landmark
at z = (u, v)T we can model the probability of it being
1 1
P (y|X) = n exp(− (y − μ)T Σ−1 (y − μ)),
1
aligned at zi as yi = N (zi ; z, σ) (we experimentally found
(2π) |Σ|
2 2 2 that best results are achieved with σ = 1). We can then
(16) sample the image at various locations to get training sam-
Σ−1 = 2(A + B + C) (17) ples. An example of such synthetic response maps can be
The diagonal matrix A represents the contribution of α seen in Figure 2, in the ground truth column.
terms (vertex features) to the covariance matrix, and the
3.2. Non-uniform RLMS
symmetric B and C represent the contribution of the β, and
γ terms (edge features). A problem facing CLM fitting is that each of the patch
⎧ experts is equally trusted, but this should clearly not be the
K1
⎨
⎪ case. This can be seen in Figures 1 and 2, where the re-
αk , i = j sponse maps of certain features are noisier. RLMS does not
Ai,j = (18)
⎪
⎩ k=1
take this into consideration. To tackle this issue, we propose
0, i = j
minimising the following objective function:
⎧ K2 n K2
⎪
⎪ (gk )
(g ) arg min(||p + Δp||2Λ−1 + ||JΔp − v||2W . (23)
⎪
⎪ ( β S ) − ( βk Si,jk ), i=j
⎨ k i,r Δp
k=1 r=1 k=1
Bi,j = K2
⎪
⎪ The diagonal weight matrix W allows for weighting of
⎪
⎪ −
(g )
βk Si,jk , i = j
⎩ mean-shift vectors. Non-linear least squares with Tikhonov
k=1 Regularisation leads to the following update rule:
⎧ K2 (19)
⎪
⎪ n
(lk )
K2
(l )
Δp = −(J T W J + rΛ−1 )(rΛ−1 p − J T W v). (24)
⎪
⎪ ( γ S ) + ( γk Si,jk ), i=j
⎨ k i,r
k=1 r=1 k=1 Note that, if we use a non-informative identity W = I, the
Ci,j = K2
⎪
⎪ above collapses to the regular RLMS update rule.
⎪
⎪ (l )
γk Si,jk , i = j
⎩ To construct W , we compute the correlation scores of
k=1 each patch expert on the holdout fold of training data. This
(20)
leads to W = w·diag(c1 ; . . . ; cn ; c1 ; . . . cn ), where ci is the
It is useful to define a vector d, that describes the linear
correlation coefficient of the ith patch expert on the holdout
terms in the distribution, and μ which is the mean value of
test fold and w is determined experimentally. The ith and
the Gaussian form of the CCNF distribution:
i + nth elements on the diagonal represent the confidence
d = 2αT h(ΘX). (21) of the ith patch expert. Patch expert reliability matrix W
is computed separately for each scale and view. This is a
μ = Σd. (22) simple but effective way to estimate the error expected from
th a particular patch. Example reliabilities are displayed in
Above X is a matrix where the i column is xi , Θ is
Figure 4a.
the concatenated neural network weights and h(M ), is an
element-wise application of sigmoid (activation function) 4. Experiments
on each element of M , and thus h(ΘX) represents the re-
sponse of each of the gates (neural layers) at each xi . We conducted a number of experiments to validate the
Intuitively d is the contribution from the the vertex fea- benefits of the proposed CLNF model. First, an experi-
tures. These are the terms that contribute directly from input ment was performed to confirm the benefit of both the LNF
features x towards y. Σ on the other hand, controls the in- patch experts and the NU-RLMS approach to model fitting.
fluence of the edge features on the output. Finally, μ is the The second experiment explored how well the CLNF model
expected value of the distribution, hence it is the value of y generalises to unseen illumination. The final set of exper-
that maximises P (y|x). iments evaluated how well our approach generalises out of
In order to guarantee that our partition function is inte- database and in the wild.
grable, we constrain αk > 0 and βk > 0, γk > 0 [12],
4.1. Baselines
while Θ is unconstrained.
The log-likelihood can be maximised using constrained As the first baseline we used the CLM model proposed
BFGS. We use the standard Matlab implementation of the by Saragih et al. [15]. It uses SVR patch experts and RLMS
358
fitting for landmark detection. We extended it to a multi- For fitting we used a multi-scale approach where we first
scale formulation for a fairer comparison, hence in the ex- fit on the smallest scale moving to largest. We use {15 ×
periments it is called CLM+. The exact same training data 15, 21 × 21, 21 × 21} areas of interest for each scale, and
and initialisation was used for CLM+ and CLNF. 11×11 patch expert support regions. Other parameters used
Another baseline used, was the tree based face and land- are: ρ = 1.5, r = 25, w = 7.
mark detector, proposed by Zhu and Ramanan [22]. It has For fitting on images, the rigid shape parameters were es-
shown good performance at locating the face and the land- timated from an off-the-shelf Viola-Jones [19] face detector
mark features on a number of datasets. Two differently and non-rigid parameters were set to 0. If a face was not
trained models were used: trained on Multi-PIE [7] data detected in an image the rigid parameters were initialised
by the authors (called p99) [22], trained on in the wild data away from the perfect value by the amount expected from
by Asthana et al. (called p204) [1]. the face detector.
Active Orientation Model (AOM) is a generative model Cumulative error curves of using LNF and NU-RLMS
of facial shape and appearance [17]. We used the trained can be seen in Figure 4b. The benefit of both our patch
model (on close to frontal Multi-PIE) and the landmark de- expert (LNF) and the fitting technique (NU-RLMS) can be
tection code provided by the authors. In the experiments it clearly seen, their combination (CLNF) leads to best results.
was initialised using same face detection as CLNF. Interestingly, the effect of NU-RLMS is greater when using
As an additional baseline, we used the Discriminative LNF, this is possibly due to the weights associated with each
Response Map Fitting (DRMF) model of CLM [1]. We patch expert being more accurate for LNF.
used the code and the model provided by the authors. It was
trained using LFPW [4] and Multi-PIE datasets. The model 4.3. Unseen illumination
was initialised using a tree based face detector (p204), as We conducted an experiment to validate the CLNF abil-
that lead to the best results. ity to generalise to unseen illuminations, which is a very
As a final baseline, the Supervised Descent Method difficult task for facial landmark detectors. The experi-
(SDM) was used [21]. This approach is trained on the ment was performed on the left, right and poorly illumi-
Multi-PIE and LFW [8] datasets. It relies on face detection nated faces from the Multi-PIE dataset (8001 images).
from a Viola-Jones face detector, therefore the image was We used the same models as in the previous experiment
cropped around the desired face for a fairer comparison. – trained on frontal illumination Multi-PIE. The fitting ap-
The above baselines were trained to detect 49, 66, or 68 proach was identical to the one in the previous section. We
feature points, making exact comparisons difficult. How- did not want the landmark detection to be affected by face
ever, they all share 49 feature points that they detect – the detection accuracy so the detection was initialised using pa-
feature points in Figure 4a without the face outline. The rameters from frontally illuminated images.
accuracy on these points was reported in our experiments. The approaches compared were all trained on frontal il-
lumination Multi-PIE. We do not include other baselines as
4.2. CLNF they were exposed to more difficult illuminations, making
We performed an experiment to see how our novel patch the comparison unfair. The results of this experiment can
expert and fitting approach affect landmark detection accu- be seen in Figure 4c. The lower error rates of CLNF when
racy on the same dataset. This experiment evaluated the compared to other models can be seen. This highlights the
effect of using an LNF patch expert instead of an SVR one, benefit of our new patch expert and fitting approach.
and NU-RLMS fitting instead of RLMS.
4.4. In the wild
The experiment was performed on the CMU Multi-PIE
dataset [7]. For this experiment we used 3557 frontal and The final set of experiments conducted, evaluated how
close to frontal images (at poses 051, 050, 140 correspond- our approach generalises on unseen and completely uncon-
ing to −15, 0, 15 degrees yaw) with all six expressions and strained in the wild datasets.
at frontal illumination. A quarter of subjects were used for For training we used the subsets of two in the wild
training (890 images) and the rest for testing (2667). datasets: Labelled Face Parts in the Wild (LFPW) [4] and
Each of the images was sampled in 21 locations (20 away Helen [9] datasets. Both of them contain unconstrained im-
from the landmark and 1 near to it) in window sizes of ages of faces in indoor and outdoor environments. In total
19 × 19 pixels - this led to 21 × 89 samples per image. 1176 training images were used for training the LNF patch
It was made sure that the same person never appeared in experts. As before, nine patch expert sets were trained. La-
both training and testing. Nine sets of patch experts were bels from the Helen and LFPW datasets were used to learn
trained in total: at three orientations — −20, 0, 20 degrees the PDM, using non-rigid structure from motion [16].
yaw; and three scales – 17px, 23px and 30px of interocular For fitting we used a multi-scale approach, with (15 ×
distance. The PDM trained by Saragih et al. [15] was used. 15, 21 × 21, 21 × 21) areas of interest for each scale, and
359
0.9 1
0.8
0.8
Proportion of images
Proportion of images
0.6
0.6
0.4
0.4
SVR + RLMS (CLM+) CLNF
0.2 SVR + NU−RLMS 0.2 CLM+
LNF + RLMS Tree based (p99)
LNF + NU−RLMS (CLNF) AOM
0 0
0.02 0.0275 0.035 0.0425 0.05 0.01 0.03 0.05 0.07 0.09 0.11 0.13 0.15
Eye distance normalised shape RMS error Eye distance normalised shape RMS error
(a) Patch reliabilities (b) Fitting on Multi-PIE (c) Fitting across illumination
Figure 4: a) The reliabilities of CLNF patch experts, smaller circles represent more reliability (less variance). b) Fitting on
the Multi-PIE dataset, observe the performance boost from both the LNF patch expert and the NU-RLMS fitting method.
Their combination (CLNF) leads to the lowest error, since NU-RLMS performs even better on more reliable response maps.
c) Fitting on the unseen illumination Multi-PIE subset. Note the good generalisability of CLNF.
11×11 patch expert support regions. Other parameters used landmark detection accuracy over state-of-art approaches.
are: ρ = 2.0, r = 25, w = 5. Our LNF patch expert exploits spatial relationships be-
In order to evaluate the ability of CLNF to generalise on tween patch response values, and learns non-linear rela-
unseen datasets we evaluated our approach on the datasets tionships between pixel values and patch responses. Our
labelled for the 300 Faces in-the-Wild Challenge [14]. For Non-uniform Regularised Landmark Mean-Shift optimisa-
testing we used three datasets: Annotated Faces in the Wild tion technique, allows us to take into account the reliabil-
[22], IBUG [18] and 300 Faces in-the-Wild Challenge (300- ities of each patch expert leading to better accuracy. We
W) [18] datasets. The IBUG, AFW, and 300-W datasets have demonstrated the benefit of our approach on a number
contains 135, 337, and 600 images respectively. of publicly available datasets.
To initialise model fitting, we used the bounding boxes CLNF is also a fast approach: a Matlab implementation
provided by the 300 Faces in-the-Wild Challenge [18] that can process 2 images per second on in the wild data, and
were initialised using the tree based face detector [22]. In 10 images per second on Multi-PIE data, on a 3.5GHz dual
order to deal with pose variation the model was initialised at core Intel i7 machine. Lastly, all of the code and testing
5 orientations – (0, 0, 0), (0, ±30, 0), (0, 0, ±30) degrees of scripts to recreate the results will be made publicly avail-
roll, pitch and yaw. The final model with the lowest align- able.
ment error (Equation 4) was chosen as the correct one. This
makes the approach five times slower, but more robust. Acknowledgements
Note that we were unable to use bounding box initialisa- We acknowledge funding support from Thales Research
tions for the SDM method, and they were unnecessary for and Technology (UK) and from the European Community’s
tree based methods. Seventh Framework Programme (FP7/2007- 2013) under
The results of this experiment can be seen in Figure 5. grant agreement 289021 (ASC-Inclusion). This material is
Our approach can be seen outperforming all of the other partially supported by the U.S. Army Research, Develop-
baselines tested. This confirms the benefits of CLNF and ment, and Engineering Command (RDECOM). The con-
its ability to generalise well to unseen data. Some exam- tent does not necessarily reflect the position or the policy
ples of landmark detections can be seen in Figure 6. CLNF of the Government, and no official endorsement should be
gap between CLNF and CLM+ is much greater on in the inferred.
wild images than on the constrained ones. Furthermore,
note the huge discrepancy between CLNF and the AAM References
baseline provided by the authors [10]. This illustrates the [1] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Robust
greater generalisability of our proposed model over other discriminative response map fitting with constrained local
approaches. models. In CVPR, 2013.
[2] T. Baltrušaitis, N. Banda, and P. Robinson. Dimensional af-
5. Conclusions fect recognition using continuous conditional random fields.
In FG, 2013.
We have presented a Constrained Local Neural Field [3] T. Baltrušaitis, P. Robinson, and L.-P. Morency. 3D Con-
model for facial landmark detection and tracking. The two strained Local Model for Rigid and Non-Rigid Facial Track-
main novelties of our approach show an improvement in ing. In CVPR, 2012.
360
1 1 1
CLNF (68)
CLNF (51)
Proportion of images
0.8 0.8 0.8 AAM (51)
Proportion of images
Proportion of images
AAM (68)
0.6 0.6 0.6
Figure 5: Fitting on the wild datasets using our CLNF approach and other state-of-the-art methods. All of the methods have
been trained on in the wild data from different than test datasets. The benefit of CLNF in generalising on unseen data is clear.
Furthermore, CLNF remains consistently accurate across all three test sets. a) Testing on the AFW dataset. b) Testing on the
IBUG dataset. c) Testing on the challenge dataset [18], error rates using just internal and all of the points are displayed.
Figure 6: Examples of landmark detections in the wild using CLNF. Top row is from IBUG dataset, bottom row from AFW.
Notice the generalisability across pose, illumination, and expression. The model can also deal with occlusions.
[4] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Ku- [14] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic.
mar. Localizing parts of faces using a consensus of exem- A semi-automatic methodology for facial landmark annota-
plars. In CVPR, 2011. tion. In Workshop on Analysis and Modeling of Faces and
[5] D. Cristinacce and T. Cootes. Feature detection and tracking Gestures, 2013.
with constrained local models. In BMVC, 2006. [15] J. Saragih, S. Lucey, and J. Cohn. Deformable Model Fitting
[6] N. Dalal and B. Triggs. Histograms of oriented gradients for by Regularized Landmark Mean-Shift. IJCV, 2011.
human detection. In CVPR, pages 886–893, 2005. [16] L. Torresani, A. Hertzmann, and C. Bregler. Nonrigid
[7] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. structure-from-motion: estimating shape and motion with hi-
Multi-pie. IVC, 28(5):807 – 813, 2010. erarchical priors. TPAMI, 30(5):878–892, May 2008.
[8] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. [17] G. Tzimiropoulos, J. Alabort-i Medina, S. Zafeiriou, and
Labeled Faces in the Wild: A Database for Studying Face M. Pantic. Generic Active Appearance Models Revisited.
Recognition in Unconstrained Environments. 2007. In ACCV, pages 650–663, 2012.
[9] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang. Inter- [18] G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces
active facial feature localization. In ECCV, 2012. in-the-wild challenge, 2013.
[10] I. Matthews and S. Baker. Active appearance models revis- [19] P. Viola and M. J. Jones. Robust real-time face detection.
ited. IJCV, 60(2):135–164, 2004. IJCV, 57(2):137–154, 2004.
[11] J. Peng, L. Bo, and J. Xu. Conditional neural fields. NIPS, [20] Y. Wang, S. Lucey, and J. Cohn. Enforcing convexity for im-
2009. proved alignment with constrained local models. In CVPR,
[12] T. Qin, T.-Y. Liu, X.-D. Zhang, D.-S. Wang, and H. Li. 2008.
Global ranking using continuous conditional random fields. [21] X. Xiong and F. De la Torre. Supervised descent method and
In NIPS, 2008. its applications to face alignment. In CVPR, 2013.
[13] P. Robinson and R. el Kaliouby. Computation of emo-
[22] X. Zhu and D. Ramanan. Face detection, pose estimation,
tions in man and machines. Philosophical Transactions B,
and landmark localization in the wild. CVPR, 2012.
364(1535):3441–3447, 2009.
361