
Abstract for UPF DTIC Workshop

Title: Fully end-to-end deep learning model to track faces in the wild.

The human face is arguably one of the most well-studied deformable objects, owing to its
numerous applications. However, compared with other face analysis tasks, face tracking remains
underdeveloped, as can be seen in the slow progress of recent tracking methods. Furthermore, to
the best of our knowledge, few if any tracker models have been trained on the large datasets now
available. Used correctly, such data could be beneficial, for example with deep learning
techniques that can exploit the patterns in this volume of data.

To bridge this gap, in this study we attempt to implement a fully end-to-end model based on a
state-of-the-art recurrent tracker combined with a convolutional landmark localization model.
We experimentally fuse these two deep models to assess their capability to track faces in the
wild, using the largest face tracking dataset available today: the 300 Videos in the Wild
dataset (300VW).

Our tracker pipeline is based on the Re3 tracker, modified to work on facial landmark key-point
data. Specifically, we replace its convolutional layers with the state-of-the-art landmark
key-point localizer, the Hourglass (HG) network, and pass its outputs to a series of Long
Short-Term Memory (LSTM) layers followed by a regression layer. For training, we use curriculum
learning with the momentum optimizer in the TensorFlow™ library. We trained the model for two
days, although we note that it still needs more training time and tuning, since we are still
optimizing our graph definition and execution.
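
The data flow described above (per-frame features, then recurrent layers, then a regression
layer) can be illustrated with a minimal NumPy sketch. This is not our TensorFlow
implementation: the Hourglass network is replaced by a random linear projection, a single LSTM
layer is used, and the names and sizes (frame_features, FEATURE_DIM, HIDDEN_DIM) are
hypothetical. Only the count of 68 landmarks follows the 300VW annotation scheme.

```python
import numpy as np

NUM_LANDMARKS = 68   # 300VW annotates 68 points per face
FEATURE_DIM = 32     # hypothetical size of the per-frame feature vector
HIDDEN_DIM = 16      # hypothetical LSTM state size


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def frame_features(frame, proj):
    """Stand-in for the Hourglass network: flatten the frame and project it."""
    return np.tanh(frame.reshape(-1) @ proj)


def lstm_step(x, h, c, W, U, b):
    """One LSTM step; gates are stacked as [input, forget, output, candidate]."""
    z = x @ W + h @ U + b
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new


def track(frames, params):
    """Run the recurrent tracker over a clip and return landmarks per frame."""
    h = np.zeros(HIDDEN_DIM)
    c = np.zeros(HIDDEN_DIM)
    outputs = []
    for frame in frames:
        x = frame_features(frame, params["proj"])
        h, c = lstm_step(x, h, c, params["W"], params["U"], params["b"])
        coords = h @ params["W_out"] + params["b_out"]  # regression layer
        outputs.append(coords.reshape(NUM_LANDMARKS, 2))
    return np.stack(outputs)


def init_params(frame_size, seed=0):
    """Random, untrained weights; only the shapes matter for this sketch."""
    rng = np.random.default_rng(seed)
    return {
        "proj": rng.normal(0, 0.01, (frame_size, FEATURE_DIM)),
        "W": rng.normal(0, 0.1, (FEATURE_DIM, 4 * HIDDEN_DIM)),
        "U": rng.normal(0, 0.1, (HIDDEN_DIM, 4 * HIDDEN_DIM)),
        "b": np.zeros(4 * HIDDEN_DIM),
        "W_out": rng.normal(0, 0.1, (HIDDEN_DIM, NUM_LANDMARKS * 2)),
        "b_out": np.zeros(NUM_LANDMARKS * 2),
    }


clip = np.random.default_rng(1).random((5, 8, 8))  # 5 tiny grayscale frames
landmarks = track(clip, init_params(frame_size=8 * 8))
print(landmarks.shape)
```

Because the LSTM state (h, c) is carried from frame to frame, each prediction can depend on
earlier frames, which is the property the recurrent tracker relies on.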

Our current results suggest that this end-to-end model is capable of identifying and tracking
the face, and in particular of locating the boundary (bounding box) of the face correctly, even
though the landmark points are not yet well aligned. Overall, our model is still inferior to the
top performers of the 300VW challenge, with a gap in normalized error of around 0.05. The
normalized error of our model on Category 1 of the 300VW dataset, along with visual results for
our best and worst tracked faces, is shown in Figure 1 below:

Figure 1: A) Overall error comparison between our model (blue line) and the top performers on
300VW dataset Category 1. B) Our best tracked result. C) Our worst tracked result.
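
For readers unfamiliar with the metric, a normalized landmark error can be sketched as follows.
The normalization is an assumption here, since the abstract does not state it: the sketch
divides the mean point-to-point distance by the inter-ocular distance, taken between the outer
eye corners at 0-based indices 36 and 45 of the 68-point markup.

```python
import numpy as np

LEFT_EYE_OUTER = 36   # assumed 0-based index in the 68-point markup
RIGHT_EYE_OUTER = 45  # assumed 0-based index in the 68-point markup


def normalized_error(pred, truth):
    """Mean point-to-point distance divided by the inter-ocular distance."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    per_point = np.linalg.norm(pred - truth, axis=1)
    inter_ocular = np.linalg.norm(truth[LEFT_EYE_OUTER] - truth[RIGHT_EYE_OUTER])
    return per_point.mean() / inter_ocular


# Synthetic check: points on a line 1 pixel apart, every prediction off by 5 pixels.
truth = np.zeros((68, 2))
truth[:, 0] = np.arange(68)
pred = truth + np.array([3.0, 4.0])
print(normalized_error(pred, truth))  # 5 px error over a 9 px inter-ocular distance
```

Under this definition the error is scale-free, so a gap of 0.05 corresponds to 5% of the
normalizing distance regardless of how large the face appears in the frame.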

In our observation, the high error values arise mainly because the LSTM model infers the facial
landmark locations from the mean key-point locations instead of their actual positions. This may
be due to the limited training time, or the model may need to be extended, for example with a
multi-layered LSTM. Further inspection of the LSTM internals also reveals that the weights of
every LSTM gate show a structured activation pattern, in contrast to the randomly initialized
weights. This indicates that the LSTM has already learned the dynamics it needs to produce the
facial key-point locations, even though they remain at their mean locations, as explained above.
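
One way to check whether a tracker is effectively predicting a mean shape is to compare its
error with that of a mean-shape baseline. The sketch below is a hypothetical diagnostic, not
the procedure we used: it centers each training shape, averages them, and places the result at
the ground-truth centroid. All data are synthetic, and the function names are illustrative.

```python
import numpy as np


def mean_shape_baseline(train_shapes, truth):
    """Place the centered mean training shape at the ground-truth centroid."""
    centered = train_shapes - train_shapes.mean(axis=1, keepdims=True)
    mean_shape = centered.mean(axis=0)
    return mean_shape + truth.mean(axis=0)


def mean_point_error(pred, truth):
    """Average Euclidean distance between predicted and true landmarks."""
    return np.linalg.norm(pred - truth, axis=1).mean()


rng = np.random.default_rng(0)
train_shapes = rng.normal(0.0, 1.0, (100, 68, 2))    # synthetic stand-in shapes
truth = rng.normal(5.0, 1.0, (68, 2))                # synthetic ground truth
model_pred = truth + rng.normal(0.0, 0.1, (68, 2))   # pretend model output

baseline = mean_shape_baseline(train_shapes, truth)
print(mean_point_error(model_pred, truth), mean_point_error(baseline, truth))
```

If a model's error is close to the baseline's, it is adding little beyond locating the face,
which would be consistent with the behaviour described above.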

These partial results suggest that this coupled LSTM-CNN model, given a careful training
procedure, is capable of tracking the location of the face in the wild. However, the current
model needs further tuning to locate the landmarks precisely. In future work, we expect to
evaluate this model on an occluded face dataset to test the memory capability of the LSTM.

Title: Fully end-to-end deep learning model to track faces in the wild.

The human face is arguably one of the most well-studied deformable objects, owing to its
numerous applications. However, compared with other face analysis tasks, face tracking remains
underdeveloped. Furthermore, to the best of our knowledge, few if any tracker models have been
trained on large datasets.

To bridge this gap, in this study we attempt to implement a fully end-to-end model based on the
state-of-the-art recurrent tracker Re3 combined with the convolutional landmark localization
model Hourglass Network (HG). We use the largest face tracking dataset available today, the
300 Videos in the Wild dataset (300VW), for training and comparison.

Our current results suggest that this end-to-end model is capable of identifying and tracking
the face, and in particular of locating the boundary (bounding box) of the face correctly, even
though the model's landmark points are not yet well aligned. Overall, our model still performs
worse than the top performers on the 300VW dataset, with a gap in normalized error of around
0.05.

These partial results suggest that this fully end-to-end model is capable of tracking the
location of the face in the wild. However, the current model needs further tuning to locate the
landmarks precisely. In future work, we expect to evaluate this model on an occluded face
dataset to test the memory capability of our tracker.
