
Application of Face Filters on Cats Given an Image or a Video

Viviana Dias, Image Analysis and Processing, Faculty of Sciences of the University of Porto

Abstract— This project consists of the application of filters to cats' faces, given an image or a video of a cat. In the course of this work, the tasks of cat face detection and facial key-point regression were also performed.

I. INTRODUCTION

This work was proposed by the Image Analysis and Processing course within the framework of the Mathematical Engineering Master's Degree at FCUP (Faculty of Sciences of the University of Porto). The use of facial filters in social media has changed the way people communicate and share their emotions. Every day, millions of users generate video content with visual filters, manually selecting and applying these filters on top of their videos or images. However, most of the filters on social media apps can only be applied to humans. This project fills a gap dear to the cat lovers' community, since it aims to build an algorithm that applies fun and cute filters onto cats' faces. The project is divided into three phases: detecting whether there is a cat's face in the video frame (or image) and, if there is, retrieving the coordinates of its bounding box; regressing specific key-points of the face (eyes, ears and nose); and finally, applying a pre-chosen filter to each video frame (or to a single image) that contains a cat face.

II. DATASET

The dataset used for training the models of the first two stages of this project includes almost 10000 cat images labelled with the coordinates of 9 points (two for the eyes, one for the mouth, and six for the ears) - see Figure 1 - provided by [1]. We divided this dataset into train, validation and test subsets with 6623, 1616 and 1757 images, respectively.

Fig. 1. Example of an image from the CAT Dataset with a representation of its respective labels.

III. CAT FACE DETECTION

The foundation of this process is the detection of a cat's face in a frame/image. If there is no cat face in a given image, the process does not continue and skips to the next image/frame. It is important to note that the objective of the model is to detect whether there is a cat face (a cat looking at the camera, or at an angle close enough to capture the front of its face), and not merely whether there is a cat, since the following steps assume they will receive a cat's face as input. For this task we used YOLOv3 [2], the most recent version of YOLO [3], an object detection algorithm. YOLO uses features learned by a convolutional neural network to return bounding boxes for the objects it finds in an image, together with a confidence value for each bounding box (see Figure 2). This confidence value is the product of the box confidence score (which reflects how likely the box is to contain an object and how accurate the box is) and the conditional class probability (the probability that the object belongs to a certain class, given that an object is present).

Besides using a convolutional neural network to extract features, YOLOv3 makes predictions at 3 different scales to facilitate the prediction of objects of different sizes. Dividing the image into a grid, each grid cell predicts B boundary boxes and gives a box confidence score for each one using logistic regression. These boxes are predicted as adjustments to B anchor boxes (which are determined via k-means clustering of the ground-truth bounding box sizes). The loss function used to optimize this model is a combination of the localization loss for the bounding box offset predictions and the classification loss for the conditional class probabilities.
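To make the decoding described above concrete, the following sketch shows how one raw YOLO-style prediction can be turned into a box and a confidence value. It is a simplified, NumPy-only illustration for a single grid cell and a single anchor (all names are illustrative), not the actual YOLOv3 implementation used in the project.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_prediction(t_x, t_y, t_w, t_h, t_obj, t_cls,
                      cell_x, cell_y, grid_size, anchor_w, anchor_h):
    """Decode one raw YOLO-style output into a box and a confidence value."""
    # Box center, normalized to [0, 1] over the whole image; the sigmoid
    # keeps the center inside its grid cell.
    b_x = (sigmoid(t_x) + cell_x) / grid_size
    b_y = (sigmoid(t_y) + cell_y) / grid_size
    # Box size as an adjustment of the anchor dimensions.
    b_w = anchor_w * np.exp(t_w)
    b_h = anchor_h * np.exp(t_h)
    # Final confidence = objectness * conditional class probability
    # (with a single "cat face" class there is only one class term).
    confidence = sigmoid(t_obj) * sigmoid(t_cls)
    return (b_x, b_y, b_w, b_h), confidence

# Example: one prediction from cell (6, 4) of a 13 x 13 grid.
box, conf = decode_prediction(0.2, -0.1, 0.3, 0.1, 2.0, 3.0,
                              cell_x=6, cell_y=4, grid_size=13,
                              anchor_w=0.25, anchor_h=0.30)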
In our case there is only one class - cat face - so we changed the structure of the algorithm, which was built for 80 classes, to focus on only 1 class. However, we start our training from model weights previously trained on the COCO Dataset [4], because some of the features already learned can be relevant for our problem. The ground truth for the bounding boxes we want the algorithm to predict was built from 7 of the 9 key-points labelled in the CAT Dataset. We excluded points 5 and 8 because some images in the dataset do not show those ear points, so their labels were misleading; furthermore, these points are not needed for the filters' application. A bounding box was then created for each image, having the same center and 1.2 times the width and height of the minimal axis-aligned rectangle containing all of the remaining points. Before the images and labels are given to the model, they undergo data augmentation that includes changing the saturation and value channels (in the HSV color representation), flipping upside down and/or left to right, and a random affine transformation. Finally, so that we can train in batches, the images are all resized to 416 × 416.
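As an illustration of how such a ground-truth box can be derived from the labelled points, the sketch below (NumPy, with the 7 kept key-points as an array of (x, y) pairs) builds the 1.2× box described above; the project's own code may differ in its details.

import numpy as np

def box_from_keypoints(points, scale=1.2):
    """Axis-aligned box with the same center as the minimal rectangle
    enclosing the key-points, enlarged by `scale` in width and height.

    `points` is an (N, 2) array of (x, y) coordinates (here the 7 kept
    key-points). Returns (x_min, y_min, x_max, y_max).
    """
    x_min, y_min = points.min(axis=0)
    x_max, y_max = points.max(axis=0)
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    half_w = scale * (x_max - x_min) / 2.0
    half_h = scale * (y_max - y_min) / 2.0
    return cx - half_w, cy - half_h, cx + half_w, cy + half_h

# Example with 7 made-up key-points (eyes, mouth, and the 4 kept ear points).
kpts = np.array([[120, 140], [180, 138], [150, 185],
                 [100, 90], [115, 60], [185, 62], [200, 92]], dtype=float)
bbox = box_from_keypoints(kpts)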
The results were satisfactory. For the train set, the final loss value was 0.261. This value is composed of the x + y loss (the sum of the mean squared errors of each coordinate of the bounding box top-left corner), which was 0.075; the height and width loss (mean squared error of the square roots of height and width), which was 0.025; the box confidence loss (binary cross-entropy), which was 0.162; and the class probability loss, which in this case does not make sense to compute, since with only 1 class it is impossible to attribute a different one. Regarding the cats' faces found in the validation set, the precision was 0.965 and the recall 0.979, giving an F1-score of 0.972. These scores were obtained with a 0.1 confidence threshold, and the final loss value was 0.351. Keeping the same confidence threshold, the F1-score on the test set was 0.963, based on a precision of 0.960 and a recall of 0.965.

Fig. 2. Output of YOLOv3 represented on a sample image.

IV. FACIAL KEY-POINTS REGRESSION

Once we know there is a cat's face in the frame/image, the location of some facial key-points is needed in order to know exactly where to apply the filters. Therefore, we take advantage of the bounding box output by the YOLOv3 model and crop the frame/image to the area of interest. We keep the image region that has the same center as the bounding box but twice its size, to allow some leeway for mistakes (clipping the size if it would exceed the original image limits). This is an important step to ease the task of detecting the key-points, since this way the cats' faces end up at roughly the same scale. Once again we resize the images, this time to 200 × 200. Having applied these transformations to the images, we update the key-point labels accordingly. Once more, we exclude points 5 and 8 for the reasons already discussed, and the images pass through data augmentation.
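A minimal sketch of this cropping step is shown below, assuming OpenCV is available for resizing (the report does not name the library); the function takes the detected box and the labelled key-points and returns the 200 × 200 crop with updated coordinates.

import numpy as np
import cv2  # assumed here only for resizing

def crop_face_region(image, bbox, keypoints, out_size=200):
    """Crop a region centered on the bounding box but twice its size
    (clipped to the image), resize it to out_size x out_size, and update
    the key-point coordinates accordingly.

    bbox is (x_min, y_min, x_max, y_max); keypoints is an (N, 2) array.
    """
    h, w = image.shape[:2]
    x_min, y_min, x_max, y_max = bbox
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    bw, bh = x_max - x_min, y_max - y_min
    # Twice the box size, clipped to the image limits.
    x0 = int(max(0, cx - bw))
    y0 = int(max(0, cy - bh))
    x1 = int(min(w, cx + bw))
    y1 = int(min(h, cy + bh))
    crop = image[y0:y1, x0:x1]
    # Shift the key-points into the crop, then scale them to the new size.
    kp = keypoints.astype(float).copy()
    kp[:, 0] = (kp[:, 0] - x0) * out_size / (x1 - x0)
    kp[:, 1] = (kp[:, 1] - y0) * out_size / (y1 - y0)
    resized = cv2.resize(crop, (out_size, out_size))
    return resized, kp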
A residual convolutional neural network, ResNet [5], was used for this task. ResNet introduced skip connections, which pass the input of a layer to a later layer without any modification. Skip connections make it possible to train deeper networks without the vanishing gradient problem, which led ResNet to win several competitions and become widely known. There are pre-established ResNets of many sizes, the smallest (and fastest) being ResNet-18, so this is the one we chose for this task. For training we used weights pretrained on the ImageNet Dataset (a dataset of over 15 million labelled images with around 22000 categories). We also froze (i.e., did not update during training) the first 19 layers, so we could take advantage of the general features that the pretrained weights had already captured in the early layers. Since the final layer was suited to classification but not to point regression, we replaced it with a linear layer that outputs 14 values (x and y coordinates for each of the 7 points). The images and labels were normalized and then given to the model.
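The sketch below shows one way to set this up, assuming PyTorch and torchvision (version 0.13 or later for the weights enum), which the report does not explicitly name; in particular, which layers count as "the first 19" is interpreted loosely here as the stem plus the first two residual stages.

import torch.nn as nn
from torchvision import models

# Load ResNet-18 with ImageNet-pretrained weights.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the early layers (an approximation of "the first 19 layers"):
# the stem and the first two residual stages keep their pretrained weights.
for module in [model.conv1, model.bn1, model.layer1, model.layer2]:
    for param in module.parameters():
        param.requires_grad = False

# Replace the classification head with a regression head that outputs
# 14 values: (x, y) for each of the 7 key-points.
model.fc = nn.Linear(model.fc.in_features, 14)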
The loss function used was mean squared error (MSE), but we also calculated the mean absolute error (MAE) for each of the coordinates. After normalization, the coordinates lie between 0 and 1. The results are as follows. On the validation set the loss was 0.0003 and the MAE was 0.0117, which corresponds to roughly 2 pixels on a 200 × 200 image. On the test set the loss was 0.0004 and the MAE was 0.0129. Calculating the MAE for each coordinate, we could see that the model performs better on the eye and nose coordinates (MAE around 0.008). That is expected, since the ears have more "noise" around them because of the different lengths of hair and positions.

V. APPLICATION OF FACE FILTERS

Once we have the coordinates of the cat's face key-points, we are ready to apply some filters. Regarding the required transformations and key-point usage, we can group the filters into 4 categories: top of the head (hats or bows), eyes (glasses), cheeks (blush), and nose/mouth. Except for the blush, the process involves another image containing the sticker to apply on top of our cat image (see Figure 3). Briefly, this sticker image first goes through rotation and resizing. Then we pad or crop the transformed sticker image so that the sticker overlaps our cat image in the right place. Next, we compute the corresponding binary sticker mask and use it to blank out the covered region of the cat image and to keep only the drawn region of the coloured sticker. Finally, we add the masked coloured sticker to the cat image (see Figure 4).

Fig. 3. Sample of the sticker images used.

Fig. 4. Sample of cat images with filter applications.
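A minimal sketch of this compositing step, under the assumption that the sticker has already been rotated, resized and padded to the frame size and that its background is black outside the drawn region (this is one reading of the masking/subtraction described above):

import numpy as np

def paste_sticker(frame, sticker):
    """Composite an already rotated/resized/padded sticker onto a frame.

    Both arrays are H x W x 3 uint8 images of the same shape; the sticker
    is assumed to be zero (black) outside the drawn region.
    """
    # Binary mask: 1 where the sticker has content, 0 elsewhere.
    mask = (sticker.sum(axis=2, keepdims=True) > 0).astype(np.uint8)
    # Remove the cat pixels under the sticker, keep the sticker pixels,
    # then add the two parts together.
    return frame * (1 - mask) + sticker * mask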

In more detail, for the eyes/glasses category we specify two points in the sticker image that we want to overlap with the eye key-points (1 and 2). Taking those two pairs of points as vectors, we can find the angle between them. We then rotate the sticker image by the calculated angle and also resize it so that the distance between the chosen sticker points matches the distance between key-points 1 and 2. From there, we just pad and/or crop the image in order to put the points in the right location, binarize, do the subtractions, and add the final transformed sticker image to the cat image/frame. For the other sticker filters the process is similar. The rotation is always determined by the angle of the eyes' key-point vector, since these points are the ones with the lowest error in the key-point regression results. The resize for the hats category is done relative to the distance between points 6 and 7 instead of 1 and 2; for the nose/mouth category it is done as a ratio of the distance between the cat's eyes.
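The rotation angle and scale factor can be computed from the two point pairs as in the following sketch (NumPy; the anchor coordinates in the example are invented for illustration):

import numpy as np

def rotation_and_scale(sticker_p1, sticker_p2, eye_left, eye_right):
    """Angle (degrees) and scale factor that align the two chosen sticker
    points with the two eye key-points."""
    v_sticker = np.asarray(sticker_p2, float) - np.asarray(sticker_p1, float)
    v_eyes = np.asarray(eye_right, float) - np.asarray(eye_left, float)
    # Angle between the two vectors (difference of their orientations).
    angle = np.degrees(np.arctan2(v_eyes[1], v_eyes[0])
                       - np.arctan2(v_sticker[1], v_sticker[0]))
    # Scale so the sticker point distance matches the eye distance.
    scale = np.linalg.norm(v_eyes) / np.linalg.norm(v_sticker)
    return angle, scale

# Example: glasses anchors at (30, 60) and (170, 60) on the sticker image.
angle, scale = rotation_and_scale((30, 60), (170, 60), (112, 140), (175, 133))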
For the blush filter we draw the red circles, with 40% transparency, directly on the image. The coordinates of the circles' centers are computed as follows. First, we find the slope of the line that passes through the eyes' key-points. Then we (imaginarily) trace the segment between the midpoint of the eyes and the nose, and mark its midpoint. Given this point and the previously calculated slope, we obtain a line that is parallel to the eyes' segment and passes through that point. The circles' centers are given by the intersections of that line with a circumference of radius 0.7× the length of the eyes' segment, centered on that midpoint. The radii of the blush circles are also a ratio of the length of the eyes' segment.
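Since the line is parallel to the eyes' segment, those two intersections are simply the midpoint shifted by ±0.7 of the eye distance along the eye direction, as in this small sketch (the coordinates in the example are invented):

import numpy as np

def blush_centers(eye_left, eye_right, nose, ratio=0.7):
    """Centers of the two blush circles, following the construction above."""
    eye_left, eye_right, nose = (np.asarray(p, float)
                                 for p in (eye_left, eye_right, nose))
    eye_mid = (eye_left + eye_right) / 2.0
    # Midpoint of the segment between the eyes' midpoint and the nose.
    anchor = (eye_mid + nose) / 2.0
    # Unit vector along the eyes' direction (the parallel line through anchor).
    eye_vec = eye_right - eye_left
    direction = eye_vec / np.linalg.norm(eye_vec)
    offset = ratio * np.linalg.norm(eye_vec)
    # Intersections of that line with the circle of radius `offset`.
    return anchor - offset * direction, anchor + offset * direction

left_center, right_center = blush_centers((112, 140), (175, 133), (144, 185))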
VI. VIDEO PROCESSING

Finally, after having all the pieces in place, it is time to bring everything together and apply filters to a cat video. We receive a video as input (most likely with a cat in it), as well as the filters the user chooses to apply and the name to give to the new video. There is a trade-off between the frame rate and the processing time of the video, so we let the frames per second (FPS) be an input too. We set 10 FPS as the default, since the motion of the video still looks reasonable; below 10 FPS the motion starts to feel very artificial.

The first thing to do is to load the video into a format that allows iterating through the frames. We also save the audio, if it exists, so it can be added back after the frames are processed. Having set the FPS, we iterate frame by frame and perform the cat face detection on each one, which takes around 0.07 seconds per frame¹. We create an empty list and append the values of the bounding box if there is a cat's face, or null values if no cat face is found.

After having those values in a list, we fill the null values if they lie between non-null values within an interval of 0.1 seconds (for FPS from 10 to 19 this only fills 1 frame value that lies between 2 non-null frame values). When that condition holds, the null values are filled with the mean of the values obtained 0.1 seconds before and 0.1 seconds after. We chose this small value of 0.1 seconds because cats can turn their faces really fast; even so, this technique helps to reduce false negatives.
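A sketch of this gap-filling step (pure Python, with None marking frames where no face was detected; the helper name is ours):

def fill_short_gaps(boxes, fps, max_gap_seconds=0.1):
    """Fill short runs of missed detections (None) with the mean of the
    boxes found just before and just after the gap.

    `boxes` is a list with one entry per frame: either an (x, y, w, h)
    tuple or None when no cat face was detected.
    """
    max_gap = max(1, int(round(max_gap_seconds * fps)))
    filled = list(boxes)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            j = i
            while j < len(filled) and filled[j] is None:
                j += 1
            # Fill only short gaps that have detections on both sides.
            if 0 < i and j < len(filled) and (j - i) <= max_gap:
                before, after = filled[i - 1], filled[j]
                mean_box = tuple((b + a) / 2.0 for b, a in zip(before, after))
                for k in range(i, j):
                    filled[k] = mean_box
            i = j
        else:
            i += 1
    return filled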
Having finished the cat detection and bounding box stage, we proceed to the key-point regression. We create a new empty list, ready to receive the key-point coordinates for each frame. Iterating over the frames once again, we check whether the corresponding bounding box is null or not. If it is null, we append a list of 14 zeros to the key-points list. Otherwise, we crop the frame in the same way as in the pre-processing of the key-point regression model, i.e., given the coordinates of the box bounding the cat's face, we cut away the parts of the frame that lie outside 2× the bounding box size. Then, once again, we resize the frame and normalize it. We can now give the frame as input to the facial key-point regression model and receive a list of 14 values as output; the model returns its output within 0.04 seconds per frame¹. The points obtained are still in the range between 0 and 1 and need some processing before they represent coordinates in the original frame. We therefore denormalize them by multiplying by 200; rescale them according to the size of the crop before it was resized to 200 × 200; and add to the y-coordinates the number of top rows that were cropped away because of the bounding box, and to the x-coordinates the number of left columns that were cropped away. They then represent the key-point coordinates in the original frame, and we can append them to the key-points list.
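The mapping back to the original frame can be sketched as follows (NumPy; crop_w, crop_h and the offsets come from the cropping step described above):

import numpy as np

def keypoints_to_frame_coords(points, crop_w, crop_h, x_offset, y_offset,
                              model_size=200):
    """Map the model output (14 values in [0, 1] for the 200 x 200 crop)
    back to pixel coordinates in the original frame."""
    pts = np.asarray(points, float).reshape(7, 2)
    # 1. Denormalize from [0, 1] to the 200 x 200 model input.
    pts = pts * model_size
    # 2. Rescale to the crop size it had before being resized to 200 x 200.
    pts[:, 0] *= crop_w / model_size
    pts[:, 1] *= crop_h / model_size
    # 3. Shift by the columns/rows removed on the left/top of the crop.
    pts[:, 0] += x_offset
    pts[:, 1] += y_offset
    return pts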
Following the key-point stage, we notice that tiny variations in the coordinates between frames can make the face filters appear to tremble on the cat. This should be smoothed out so that the filters applied to the video look more natural. To resolve this problem we used a Gaussian filter; this filter is commonly used in 2 dimensions to reduce image noise, but here we use its 1-dimensional version. Basically, it computes a weighted average whose weights are sampled from a zero-mean Gaussian distribution, and we can control the standard deviation (σ) to perform more or less smoothing. We set the standard deviation of the filter to 2 if the standard deviation of the key-points, in proportion to the cat's face size, is not too high; for higher values we lower σ so that it does not smooth the fast movements too much. We apply the 1D Gaussian filter to each coordinate track, so the filter is applied to 14 arrays with length equal to the total number of frames. The null values are ignored, since they correspond to frames where no cat face was detected.
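A sketch of this smoothing, assuming SciPy's gaussian_filter1d and NaN as the marker for frames without a detection (smoothing each valid run separately is one way to "ignore" the null values):

import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_coordinate_track(track, sigma=2.0):
    """Smooth one coordinate over time with a 1D Gaussian filter,
    leaving frames without a detection (NaN) untouched.

    `track` is a 1D array with one value per frame; NaN marks frames
    where no cat face was detected.
    """
    track = np.asarray(track, float)
    smoothed = track.copy()
    idx = np.where(~np.isnan(track))[0]
    if idx.size == 0:
        return smoothed
    # Smooth each contiguous run of valid frames separately so that the
    # filter never mixes values across a gap.
    breaks = np.where(np.diff(idx) > 1)[0]
    for run in np.split(idx, breaks + 1):
        smoothed[run] = gaussian_filter1d(track[run], sigma=sigma)
    return smoothed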
Now that we have the final values for the key-points, we iterate through the frames again. If a cat face was detected in a frame, we give the frame and its corresponding final key-points as input to the face filter function (or functions, if more than one filter was chosen); if no cat face was detected, the frame remains unchanged. The application of the face filter takes around 0.1 seconds per frame for a sticker filter and a negligible amount of time for the blush case¹. At last, we finish the process by joining all the new frames together into a video with the chosen FPS and attaching the original audio, if it existed.
VII. CONCLUSIONS AND FUTURE WORK

This project was developed in the context of the Image Analysis and Processing course. The techniques used involve image processing tools and Deep Learning models. Since most social apps' filters target human faces, we achieved the goal of creating videos that apply similar face filters on top of cats' faces. However, there is room for improvement. Currently, the system only applies filters to one cat per image/frame; in the future, it should be able to apply filters to every cat present in the image/frame. Also, it sometimes detects a cat's face when the cat is in profile, which means the filters are applied in an artificial way. This could be addressed by adding more images of cats in profile, labelled as containing no cat face, to the Cat Face Detection training dataset. Other techniques could also be explored to improve speed, or even to make the process run in real time on a webcam.

¹Time values recorded on an i7-6700HQ CPU.

REFERENCES

[1] Weiwei Zhang, Jian Sun, and Xiaoou Tang, "Cat head detection - how to effectively exploit shape and texture features," Oct. 2008, vol. 5305, pp. 802–816.
[2] Joseph Redmon and Ali Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[3] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
[4] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," in European conference on computer vision. Springer, 2014, pp. 740–755.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
