
GLOBAL FACIAL ATTRIBUTE DETECTION

USING DEEP LEARNING


Kaushika Sree A
Master of Engineering, VLSI Design
Department of ECE
Government College of Technology
Coimbatore, India
sreegct95@gmail.com

Saraniya O
Assistant Professor
Department of ECE
Government College of Technology
Coimbatore, India
saranya@gct.ac.in

Abstract – Face recognition has a wide range of real-time applications such as criminal identification, access and security, healthcare, finding missing persons and helping the blind. These face recognition applications suffer from inaccuracy due to similar attributes existing between male and female faces. To overcome this problem, the facial attributes of a person are predicted from the original image datasets. Based on these attributes, faces can be classified as male or female, along with their expressions. This paper proposes a global transformation method and a global transfer learning technique to predict the attributes of facial images. The global transformation is done by bilinear transformation of the image pixels using the adjacent pixels, with no constraints such as equal scaling, rotation, etc. This gives full flexibility to discover the transformation that is beneficial for predicting the attributes of any specific input image. The global learning net is used to establish dependencies among multiple face attributes. Learning a shared face representation for multiple attribute prediction is thus far better than learning a separate face representation for each individual attribute. This deep learning technique provides higher accuracy than the traditional handcrafted feature technique. The experimental results show that this method can effectively predict the gender, expression and youthfulness of the facial image with an accuracy of 91.05% and a validation loss of 0.2079 on the CelebA dataset with a lower learning rate, compared to the alignment-based method, whose accuracy is 88%.

I. INTRODUCTION

A face recognition system is a technology capable of identifying or verifying a person from a digital image or a video frame from a video source. Facial recognition systems work in multiple ways, but in general they compare selected facial attributes from a given image with faces within a database. Face attribute prediction is therefore an important task in face analysis. Face recognition is also described as a Biometric Artificial Intelligence based application that can uniquely identify a person by analyzing patterns based on the person's facial attributes and texture.

While initially a form of computer application, it has seen wider use in recent times on mobile platforms and in other forms of technology, such as robotics. It is typically used for access control in security systems and can be compared to other biometrics such as fingerprint or iris recognition systems. Although the accuracy of facial recognition as a biometric technology is lower than that of iris recognition and fingerprint recognition, it is widely adopted because it is contactless and non-invasive. Recently, it has also become popular as a commercial identification and marketing tool. Other applications include advanced human-computer interaction, video surveillance, automatic indexing of images, and video databases, among others.

Face attribute prediction is usually tackled via a Detection-Alignment-Recognition (DAR) pipeline. Within DAR, an off-the-shelf face detector is used to detect faces in images in the detection stage. Then, in the alignment stage, a face landmark detector is applied to the faces, followed by establishing correspondence between the detected landmarks and canonical locations, for which domain experts' input is required. Finally, the faces are aligned by transformations estimated from the correspondence. In the recognition stage, features are extracted from the aligned faces and fed into a classifier to predict the face attributes.

However, the alignment stage in the DAR pipeline suffers from several issues. It depends heavily on the quality of the landmark detection results. Despite good performance on near-frontal faces, current face landmark detectors cannot give satisfactory results on unconstrained faces with large pose angles, occlusion or blurriness. Errors in landmark localization harm the performance of attribute prediction. Besides, even with accurate facial landmarks, one still needs to handcraft specific face alignment protocols (canonical locations, transformation methods, etc.), which demands substantial domain expertise. Some warping artifacts from mapping landmark locations to canonical positions are also inevitable when aligning the faces. Thus facial attribute prediction error accumulates due to a combination of erroneous landmark detection and handcrafted protocols.

In this work, we propose a landmark-free global facial attribute detection method which directly learns a global transformation and part localizations on each input face end-to-end, removing the reliance on landmarks and hard-wired face alignment found in DAR. The method learns the transformation and localization globally across all images of the dataset.

II. RELATED WORKS

Attributes, such as person attributes, object attributes and face attributes, are mid-level representations which convey compact semantic information. Traditional methods usually use hand-crafted features such as SIFT and HOG to



perform the attribute prediction tasks. Most recent advances in attribute prediction, especially face attribute prediction, are driven by deep learning. Existing methods for attribute prediction can be roughly categorized into local methods and global methods. Global methods use the entire object for representation learning and attribute prediction without part information. In contrast, local methods extract features from relevant regions or parts for attribute prediction.

Several global methods have been used for facial attribute prediction. One model takes faces and facial landmark points as input and uses a Restricted Boltzmann Machine for joint feature learning of all attributes. A multi-branch structure in multi-task learning has been used to predict multiple facial attributes simultaneously, with strong performance on face attribute prediction tasks. Another method uses a deep Convolutional Neural Network (CNN) pre-trained for face recognition to obtain a global face representation, and binary linear SVM classifiers are built on the global face representations to classify face attributes. A similar global method shows that intermediate representations from the hidden layers of CNNs are also useful in predicting facial attributes. The Mixed Objective Optimization Network (MOON) method uses a multi-task deep CNN with a loss function which mixes multiple task objectives with a tailored re-weighting scheme. The MOON method aims to tackle the data imbalance issue commonly seen in facial attribute prediction, and achieves state-of-the-art performance. In another CNN-based facial attribute prediction method, a home-brewed dataset of facial identities and contextual information about geo-location and weather conditions is used to learn global facial representations as a pre-trained model, which is then fine-tuned to predict facial attributes. The semantic segmentation method designs special neural network architectures to make use of semantic segmentation information of faces in model training, and shows that such information can help boost the performance of facial attribute prediction.

Local methods usually need part information in order to learn representations of each part and build classifiers on the aggregated part representations, so alignment is usually required to ensure that representations are learned from the right locations. In the part-based method, the appearances at particular parts are used to distinguish particular classes of interest. Given facial landmarks, one facial attribute classification method first builds correlation maps between attributes and pixel values at different landmarks and then extracts features from the relevant parts to perform face attribute prediction. A similar pipeline of extracting features from pre-detected regions of interest has also been presented; in particular, it uses both face images and some attributes to predict other attributes by reasoning about the relations among them. The Pose Aligned Networks for Deep Attributes method combines part-based models with deep learning to train pose-normalized CNNs for face attribute prediction. In the End-to-end Localization method, an automatic part localization approach based on a Spatial Transformer Network (STN) is used to locate facial parts on roughly aligned face images, and the representations of global faces and facial parts are aggregated to rank relative facial attributes.

III DEEP LEARNING

Deep learning systems are modeled after the neural networks in the neocortex of the human brain, where higher-level cognition occurs. In the brain, a neuron is a cell that transmits electrical or chemical information. When connected with other neurons, it forms a neural network. In machines, the neurons are virtual, essentially bits of code running statistical regressions. While many traditional machine learning algorithms are linear, deep learning algorithms are stacked in a hierarchy of increasing complexity and abstraction.

Deep learning methods are able to leverage very large datasets of faces and learn rich and compact representations of faces, allowing modern models first to match and later to outperform the face recognition capabilities of humans.

IV PROPOSED METHOD

The proposed method consists of a global TransNet and part LocNets, responsible for learning a global transformation and part localizations respectively. Through the hierarchical transformations, both the global face representation and the facial part representation are learned together for the purpose of face attribute prediction. We now introduce the global transformation learning component and the part localization learning component one by one.

Fig 3.1 Flowchart of Proposed Method
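As a rough illustration of what a learned global transformation does at the pixel level, the sketch below warps a small grayscale image with an unconstrained 2×3 affine matrix and bilinear interpolation. It is a minimal pure-Python sketch, not the implementation used in this work: the function names `bilinear_sample` and `affine_warp`, the list-of-rows image format, the normalized coordinate range [-1, 1] (borrowed from the usual STN convention) and the border clamping are all assumptions made for the example.

```python
import math


def bilinear_sample(img, x, y):
    """Sample a grayscale image (list of rows) at a real-valued pixel (x, y).

    Coordinates outside the image are clamped to the border (an assumption).
    """
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bottom = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bottom


def _normalize(index, size):
    """Map a pixel index in [0, size - 1] to a coordinate in [-1, 1]."""
    return 2.0 * index / (size - 1) - 1.0 if size > 1 else 0.0


def affine_warp(img, t, out_h, out_w):
    """Warp img with a 2x3 affine matrix t (hypothetical stand-in for T_g).

    t maps normalized output coordinates to normalized input coordinates;
    no constraint (equal scaling, rotation only, ...) is placed on t.
    """
    h, w = len(img), len(img[0])
    out = []
    for r in range(out_h):
        yn = _normalize(r, out_h)
        row = []
        for c in range(out_w):
            xn = _normalize(c, out_w)
            xs = t[0][0] * xn + t[0][1] * yn + t[0][2]
            ys = t[1][0] * xn + t[1][1] * yn + t[1][2]
            # Convert the normalized source location back to pixel units.
            px = (xs + 1.0) * (w - 1) / 2.0
            py = (ys + 1.0) * (h - 1) / 2.0
            row.append(bilinear_sample(img, px, py))
        out.append(row)
    return out


image = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
zoom = [[0.75, 0.0, 0.0], [0.0, 0.75, 0.0]]  # scale 0.75, no rotation or shift
same = affine_warp(image, identity, 3, 3)
cropped = affine_warp(image, zoom, 3, 3)
```

With the identity matrix the image is returned unchanged, while a pure scale of 0.75 samples a central crop of the input. In a trained network the six entries of the matrix would be predicted per image rather than fixed by hand.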
3.1) GLOBAL COMPONENT IN AFFAIR:

GLOBAL TRANSNET:

The global TransNet in AFFAIR takes the detected face (which is enlarged to include more background) as input, and produces a set of optimized transformation parameters tailored to the original input face for attribute representation learning. The set of parameters for the global transformation is denoted as $T_g$. The transformation establishes the mapping between the globally transformed face image and the input face image via

$(x_i^{input}, y_i^{input})^{\top} = T_g \, (x_i^{g}, y_i^{g}, 1)^{\top}$

Then, given the learned transformation parameters $T_g$, we can obtain the globally transformed face image pixel by pixel. The pixel value at location $(x_i^{g}, y_i^{g})$ of the transformed image is obtained by bilinearly interpolating the pixel values of the input face image centered at $(x_i^{input}, y_i^{input})$. No constraints are imposed on the parameters $T_g$, such as equal scaling in the horizontal and vertical directions, rotation only, etc. This gives AFFAIR full flexibility to discover the transformation that is beneficial for predicting attributes for the specific input face.

Parametrized by $\theta_g^T$, the global TransNet learns the proper transformation $T_g$ on an input face I, i.e., $T_g = f_{\theta_g^T}(I)$. We back-propagate the gradient in the global representation learning net to the global TransNet with the learning strategy of STN, so the global TransNet and the global feature representation learning net are trained end-to-end for attribute prediction.

GLOBAL REPRESENTATION LEARNING NET:

Multiple face attributes usually have dependencies on each other. For example, the attribute “Male” has a strong dependency on the attribute “Goatee”; the attribute “Straight Hair” provides strong negative evidence for the attribute “Wavy Hair”. Learning a shared face representation for multiple attribute prediction is therefore better than learning a separate face representation for each individual attribute. The global representation learning net considers all the facial attributes simultaneously. More explicitly, denote the output face from the global TransNet as $f_{\theta_g^T}(I)$. Then the global face representation learning net, parametrized by $\theta_g^F$, maps the transformed image from the raw pixel space to a feature space beneficial for predicting all the facial attributes, denoted as $f_{\theta_g^F,\theta_g^T}(I)$. Suppose there are N attributes in total to predict. Then, based on the common feature space, N independent classifiers, parametrized by $\theta_{g_i}^C$, are built for attribute-specific classification. Denote the overall mapping from an input face image to the i-th attribute prediction as $f_{\theta_{g_i}^C,\theta_g^F,\theta_g^T}(I)$.

COMPETITIVE LEARNING STRATEGY:

Through end-to-end training, the global TransNet can learn to transform the face to one that is favorable for attribute prediction. However, faces captured in the wild usually present large variations. Unlike objects with simple shapes, such as digits or traffic signs, whose optimal global transformations are easy to learn, high-quality transformations for faces are much more difficult to learn. The global TransNet needs to find a good scale and the necessary rotation and translation to best transform the face for accurate attribute prediction. To this end, we design a competitive learning strategy, where the learning outcome of the transformed face competes against that of the original face image. A similar training strategy has been used in a recent work to perform a fine-grained classification task.

Within this strategy, a Siamese-like network is added after the global TransNet to force the global TransNet to learn the optimal global transformations. More concretely, there are two parallel branches. The upper branch is connected with the globally transformed face image and the lower branch is connected with the original input face image. The global TransNet takes the whole face image as input and learns to produce transformation parameters for the face image. The globally transformed face image is then fed into the upper branch of the Siamese-like network to perform attribute prediction. At the same time, the lower branch of the Siamese-like network takes as input the original face image with no transformation. Both branches have the same architecture.

Fig 3.2 Siamese Like Network

3.2) PART LOCALIZATION LEARNING NET:

In AFFAIR, all the part LocNets share the main trunk of the network (i.e. the convolution layers) with the global representation learning net (parametrized by $\theta_g^F$). The additional parameters that generate the transformation $T_{p_i}$ in the part LocNet for the i-th attribute are denoted by $\theta_{p_i}^T$. Thus $T_{p_i} = f_{\theta_{p_i}^T,\theta_g^F,\theta_g^T}(I)$. The face image is transformed by the part localization parameter $T_{p_i}$. The locally transformed face image is then processed by the i-th

part representation learning net parametrized by $\theta_{p_i}^F$ and the i-th part classifier with parameter $\theta_{p_i}^C$.

Note that some attributes correspond to the same local regions, e.g. the attributes “Mouth Open” and “Wearing Lipstick” both correspond to the mouth region. To save computation, different attributes that correspond to the same local face region may share the same part LocNet parameters $\theta_{p_i}^T$ and part feature extraction net parameters $\theta_{p_i}^F$.

3.3) DATA AUGMENTATION:

The proposed AFFAIR method is evaluated on the large-scale Celebrity Faces Attributes (CelebA) [11] dataset. The CelebA dataset contains over 200k celebrity images, each fully annotated with 40 attributes such as “Pointy Nose”, “Wavy Hair” and “Oval Face”. The CelebA dataset has two versions: one of unaligned face images in the wild, and one of faces aligned using ground truth facial landmarks. We use the unaligned version in this experiment. The face images cover large pose variations and cluttered backgrounds, and are thus quite challenging. For evaluation, the official training/testing split protocol of the dataset is used. An ablation study of AFFAIR is also provided.

3.4) IMPLEMENTATION DETAILS:

To learn $T_g$, the global TransNet contains two convolutional layers and two fully connected layers. In the competitive learning strategy, we use the InceptionV3 net as the main trunk of the Siamese-like net and a Euclidean distance loss for classification of attributes. For the global representation learning net and the part representation learning nets, we adopt ResNet18 with 18 layers. The input image size is 224 × 224 for the global representation learning net and 112 × 112 for the part representation learning nets. The loss for attribute prediction is the cross-entropy loss. The convolutional layers of the part LocNet are shared with the global representation learning net. The difference is that the part LocNet uses the feature before the final pooling layer, which preserves the spatial information. The parameters of the networks are randomly initialized and we train the overall network on the TensorFlow platform. The base learning rate for all the networks is 0.001, which is reduced by 1/625 every 5 epochs. The learning rate for the transformation layers generating $T_g$ is multiplied by $10^{3}$, and those for the localization layers generating the scaling $s_x$, $s_y$ and the translation $d_x$, $d_y$ in $T_p$ are multiplied by $10^{-4}$ and $10^{-5}$, respectively, to ensure the stability of the learned transformations. The initial $T_g$ is set to 0.75 of the input size with no rotation or translation. $T_{p_i}$ is initialized with zero translation and a scale of 0.3, 0.4, or 0.5 of the input size, chosen according to the size of the attribute region. To train AFFAIR, we crop face images according to face detection bounding boxes provided by an off-the-shelf face detector. When the detector fails to detect a face, the face bounding box is inferred from the given landmark ground truth. We use random cropping (crop 224 × 224 patches from 256 × 256 images) and random mirroring to augment the training data. No alignment or other pre-processing is performed.

IV RESULTS AND DISCUSSION

The facial attributes gender, expression and youthfulness of the facial image are detected using Google Colab with a TensorFlow backend on the publicly available CelebA dataset.

I. GLOBAL TRANSNET:

The global TransNet performs bilinear interpolation of the image. Fig 4.1 shows the bilinearly interpolated image.

Fig 4.1 Global TransNet

II. DATA AUGMENTATION:

Data augmentation creates new training data from existing data by changing their angles, flipping and rotating. Fig 4.2 shows the data augmented at various angles of rotation.

Fig 4.2 Data Augmentation

III. DATASET SUMMARY:

Fig 4.3 shows the summary of the training phase for each epoch. The validation loss and accuracy after each epoch are obtained. It also gives the computation time for training on each set of data.
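The per-epoch loss and accuracy figures reported in a training summary of this kind can be computed as sketched below. This is a minimal standard-library illustration, assuming sigmoid outputs for three binary attributes (for example male, smiling, young), binary cross-entropy as the loss and a 0.5 decision threshold; the function names, the threshold and the toy values are assumptions for the example and are not taken from the training code used here.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all samples and attributes."""
    total, count = 0.0, 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def per_attribute_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples classified correctly, one value per attribute."""
    n_samples = len(y_true)
    n_attrs = len(y_true[0])
    correct = [0] * n_attrs
    for true_row, prob_row in zip(y_true, y_prob):
        for j, (t, p) in enumerate(zip(true_row, prob_row)):
            pred = 1 if p >= threshold else 0
            if pred == t:
                correct[j] += 1
    return [c / n_samples for c in correct]


# Toy validation batch: columns are (male, smiling, young), values are made up.
labels = [[1, 0, 1], [0, 1, 1]]
probs = [[0.9, 0.2, 0.6], [0.4, 0.3, 0.8]]

val_loss = binary_cross_entropy(labels, probs)
val_acc = per_attribute_accuracy(labels, probs)
mean_acc = sum(val_acc) / len(val_acc)
```

Averaging the per-attribute accuracies gives a single figure comparable to the overall accuracy quoted for a model, while the per-attribute values show which attribute is hardest to predict.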
Fig 4.3 Dataset Summary

IV. GENDER RECOGNITION GRAPH:

Fig 4.4 shows the graph depicting the counts of male and female facial images among all facial images of the CelebA dataset. ‘0’ represents the female count and ‘1’ represents the male count.

Fig 4.4 Gender Recognition Graph

V. LOSS FUNCTION GRAPH:

Fig 4.5 shows the training and validation loss for the CelebA dataset. A lower loss indicates a better face attribute detection method. The training loss obtained with this method is 0.23 and the validation loss obtained is 0.21, which is comparatively low.

Fig 4.5 Loss Function Graph

VI. ACCURACY GRAPH:

Fig 4.6 shows the training and validation accuracy for the CelebA dataset. A higher accuracy indicates a better face attribute recognition system. The training accuracy obtained with this method is approximately 91.5% and the validation accuracy obtained is approximately 92.5%.

Fig 4.6 Accuracy Graph

VII. FACIAL ATTRIBUTE DETECTION:

Fig 4.7 depicts the attributes obtained from the facial image. The attributes detected are gender, expression and youthfulness. The gender output indicates whether the face is male or not, the expression output indicates whether the face is smiling or not, and the youthfulness output indicates whether the face is young or not. The results are obtained as binary values.

Fig 4.7 Facial Attribute Detection

V CONCLUSION AND FUTURE WORK

In this project, AFFAIR, a landmark-free global face attribute prediction method using the InceptionV3 network architecture, is proposed. AFFAIR learns to transform the face to the best configuration, and simultaneously localizes the most discriminative facial part for specific attribute prediction through the transformation-localization network. Independent of face landmark detectors or ground truth annotations, AFFAIR does not require face alignment as preprocessing and provides state-of-the-art results on the publicly available CelebA dataset. This method shows superior results compared to the traditional hand-crafted attribute detection method, because the hand-crafted feature technique is less accurate and more time consuming when training on a large number of images.

This facial attribute detection technique can be further used for higher end biometric face recognition

systems that depend on the attributes of the facial image for classifying and recognizing human faces. It can be used in the healthcare field to monitor the expressions of patients. It can be used in applications that need to distinguish between male and female faces. It also finds a variety of applications in safety and security systems, criminal identification, human-robot interaction, mobile applications, etc.

REFERENCE

[1] Jianshu Li, Fang Zhao, Jiashi Feng, Sujoy Roy, Shuicheng Yan, 2018, “Landmark free Face Attribute Prediction”, IEEE Trans. Image Processing, vol. 27, pp. 4651–4662.
[2] M. M. Kalayeh, B. Gong, and M. Shah, 2017, “Improving facial attribute prediction using semantic segmentation”, IEEE Conference on Computer Vision and Pattern Recognition.
[3] M. Ehrlich, T. J. Shields, T. Almaev, and M. R. Ame, 2016, “Facial attributes classification using multi-task representation learning”, IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 47–55.
[4] C. Huang, Y. Li, C. Change Loy, and X. Tang, 2016, “Learning deep representation for imbalanced classification”, IEEE Conference on Computer Vision and Pattern Recognition, pp. 5375–5384.
[5] H. Lai, S. Xiao, Y. Pan, Z. Cui, J. Feng, C. Xu, J. Yin, and S. Yan, 2016, “Deep recurrent regression for facial landmark detection”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, pp. 1144–1157.
[6] Y. Zhong, J. Sullivan, and H. Li, 2016, “Face attribute prediction using off-the-shelf CNN features”, International Conference on Biometrics, pp. 1–7.
[7] Y. Zhong, J. Sullivan, and H. Li, 2016, “Leveraging mid-level deep representations for predicting face attributes in the wild”, IEEE International Conference on Image Processing, pp. 3239–3243.
[8] E. M. Rudd, M. Günther, and T. E. Boult, 2016, “Moon: A mixed objective optimization network for the recognition of facial attributes”, European Conference on Computer Vision, pp. 19–35, Springer.
[9] R. Torfason, E. Agustsson, R. Rothe, and R. Timofte, 2016, “From face images and attributes to attributes”, Asian Conference on Computer Vision, pp. 313–329, Springer.
[10] H. Dibeklioglu, F. Alnajar, A. A. Salah, and T. Gevers, 2015, “Combining facial dynamics with appearance for age estimation”, IEEE Transactions on Image Processing, vol. 24, pp. 1928–1943.
