Abstract—Automatic eye gaze estimation has interested researchers for a while now. In this paper, we propose an unsupervised learning based method for estimating the eye gaze region. To train the proposed network "Ize-Net" in a self-supervised manner, we collect a large 'in the wild' dataset containing 1,54,251 images from the web. For the images in the database, we divide the gaze into three regions using an automatic technique based on pupil-center localization, and then use a feature-based technique to determine the gaze region. The performance is evaluated on the Tablet Gaze and CAVE datasets by fine-tuning Ize-Net for the task of eye gaze estimation. The learned feature representation is also used to train traditional machine learning algorithms for eye gaze estimation. The results demonstrate that the proposed method learns a rich data representation, which can be efficiently fine-tuned for any eye gaze estimation dataset.

I. INTRODUCTION

Eye gaze estimation aims to determine the line-of-sight of the pupil. It provides information about human visual attention and cognitive processes [1]. It aids several applications such as human-computer interaction [2], student engagement detection [3], video games with basic human interaction [4], driver attention modelling [5], psychology research [6], etc. Gaze estimation techniques can be broadly classified into two types: intrusive and non-intrusive. Intrusive techniques require contact with human skin or eyes; they include the use of head-mounted devices, electrodes and scleral coils [7]–[9]. These devices provide accurate gaze estimation but cause an unpleasant user experience. Non-intrusive techniques do not require physical contact [10]. Image processing based gaze estimation methods fall under the non-intrusive category. These methods face a number of challenges, which include partial occlusion of the iris by the eyelid, illumination conditions, head pose, specular reflection if the user wears glasses, the inability to use standard shape fitting for iris boundary detection, and effects like motion blur and over-saturation of the image [10]. To deal with these challenges, most accurate gaze estimation methods operate under constrained environments, with fixed head pose, illumination conditions, camera angle, etc. Such methods require a huge amount of high-resolution labelled images. Robust gaze estimation needs accurate pupil-center localization, and fast and accurate pupil-center localization is still a challenging task [11], particularly for images with low resolution.

With the success of supervised deep learning techniques, especially convolutional neural networks, much progress has been witnessed in most problems in computer vision. This is primarily due to the availability of graphics processing unit (GPU) hardware and large labelled databases. Furthermore, it has been noted that labelling complex vision tasks is a noisy and error-prone process. Hence, there is an interest in exploring deep learning based unsupervised techniques for computer vision tasks [12]–[15].

In this paper, we propose an unsupervised (self-supervised) technique for learning a discriminative eye-gaze representation. The method is based on exploiting the domain knowledge generated by analyzing YouTube videos. The aim is to learn a feature representation for eye gaze, which can be used by itself or fine-tuned for complex eye gaze related tasks. The experimental results show the effectiveness of our technique in predicting the eye gaze as compared to supervised techniques. The main contributions of this paper are as follows:

1) A dataset (Figure 1) of 1,54,251 facial images of 100 different subjects from YouTube videos has been collected. These images are automatically labeled using the proposed method of eye gaze region estimation.
2) A deep neural network, "Ize-Net", is proposed and trained on the collected dataset. The results show that unsupervised techniques can be used to learn a rich representation for eye gaze.
3) A method to detect whether the subject in an input image is looking towards his/her left, right, or center region. The gaze region is estimated from the relative position of both (left and right) pupils in the eye sockets.
4) A method to localize the pupil-center using facial landmarks, OTSU thresholding [16] and the Circular Hough Transform (CHT) [17].

The remainder of this paper is organized as follows: Section II describes some of the related studies. Section III presents the details of the proposed pupil-center localization and gaze estimation methods. In Section IV, we empirically study the performance of the proposed approach. Section V contains the conclusion and future work.

II. RELATED WORK

The proposed method contains pupil-center localization and eye gaze estimation techniques. Accordingly, the literature survey covers some of the relevant pupil-center localization and eye-gaze estimation methods.

The most popular solutions for pupil-center localization can be broadly classified into active and passive methods [10]. Active pupil-center localization methods use dedicated devices to precisely locate the pupil-center, such as infrared cameras [7], contact lenses [8] and head-mounted devices [9]. These devices require a pre-calibration phase to perform accurately; they are generally very expensive and cause an uncomfortable user experience.

Passive eye localization methods try to infer the pupil-center from the supplied image or video frame. Valenti et al. [18] used isophotes to infer circular patterns and applied machine learning for the prediction task. An open eye can be characterized by its shape and its components, such as the iris and pupil contours, and this structure can be used to localize it in an image. Such methods can be broadly divided into voting-based methods [19], [20] and model-fitting methods [21], [22]. Although these methods seem very intuitive, they do not provide good accuracy. Several machine learning based pupil-center localization methods have also been proposed. One such method was proposed by Campadelli et al. [23], who trained two Support Vector Machines (SVM) on properly selected Haar wavelet coefficients. In [24], randomized regression trees were used. These supervised learning based methods require the tiresome process of data labeling.

In this paper, we propose a method which overcomes the aforementioned limitations. It is a geometric feature-based pupil-center localization method, which gives accurate results for images captured in uncontrolled environments. In the past, various visible imaging based eye gaze tracking methods have been proposed, which can be broadly classified into feature-based methods and appearance-based methods.

Feature-based methods use prior knowledge to detect the subject's pupil-centers from simple pertinent features based on shape, geometry, color and symmetry. These features are then used to extract eye movement information. Morimoto et al. [25] assumed a flat cornea surface and proposed a polynomial regression method for gaze estimation. In [26], Zhu and Yang extracted intensity features from an image and used a Sobel edge detector to find the pupil-center; the gaze direction was determined via a linear mapping function. The detected gaze direction was sensitive to the head pose, so users had to stabilize their heads. In [27], Torricelli et al. performed iris and corner detection to extract geometric features, which were mapped to screen coordinates by a general regression neural network. In [18], Valenti et al. estimated the eye gaze by combining the information of eye location and head pose.

Appearance-based gaze tracking methods do not explicitly extract features; instead, they use the whole image for eye gaze estimation. These methods normally do not require geometry information or camera calibration, since the gaze mapping is performed directly on the image content. They usually require a large number of images to train the estimator. To reduce the training cost, Lu et al. [28] proposed a decomposition scheme, consisting of an initial gaze estimation and subsequent compensations, so that gaze estimation performs effectively with few training samples. Huang et al. [29] proposed an appearance-based gaze estimation method in which video captured from a tablet was processed using HoG features and Linear Discriminant Analysis (LDA). In [30], an eye gaze tracking system was proposed which extracted texture features from the eye regions using a local pattern model and then fed the spatial coordinates into a Support Vector Regressor to obtain a gaze mapping function. Zhang et al. [31] proposed GazeNet, a deep gaze estimation method. Williams et al. [32] proposed a sparse and semi-supervised Gaussian process model to infer the gaze, which simplified the process of collecting training data.

In this paper, we propose an unsupervised method to detect whether a subject is looking towards his/her left, right or center region. We use the relative position of the pupils in the eye sockets to judge the gaze region. This allows us to predict the gaze region for a variety of images containing different textures, races, genders, specular attributes, illuminations and camera qualities.

III. PROPOSED METHOD

This section explains the pipeline of the proposed gaze region estimation method. At first, we localize the pupil-centers. Then, we use them to estimate the region of eye gaze with an intuitive approach which works well for images captured in the wild. Further, for learning an eye gaze representation, we collect a large dataset of human faces. The domain knowledge based pupil-centers and facial points are used to create noisy labels representing the gaze regions (left, right, or center). A network is then trained on this task. Later, we show the usefulness of the representation learned from the fully automatically generated noisy labels.

A. Dataset Collection

In recent years, several gaze estimation datasets have been published. Most of these datasets contain little variety of images in terms of head poses, illumination, number of images, collection duration per subject and camera quality. To demonstrate that our proposed method is versatile, we collect a dataset containing 1,54,251 facial images belonging to 100 different subjects. The overall statistics of our dataset are shown in Table I. We download different types of videos from YouTube's Creative Commons section. These videos are mainly of the category where a single subject is seen on the screen at a time, like news reporting, makeup tutorials, speech
Figure 1: Sample images from the proposed dataset. There is huge variation in illumination, facial attributes of subjects, specular reflection, occlusion, etc. The first and second rows from the top show images for which the gaze region is correctly estimated, and the third row shows images where the gaze region is not correctly estimated. First row: the first and second images look towards the left region, the third and fourth images look towards the right region, and the fifth and sixth images look towards the central region. The second row contains images of challenging scenarios, such as occlusion and specular reflection, for which we obtain a correct gaze region estimate. The last row contains images where our method fails due to insufficient information for determining the correct gaze region. (Image source: YouTube Creative Commons)
videos, etc. We have considered every third frame of the collected videos for dataset creation. For training purposes, the dataset has been split into training and validation sets with 70% and 30% uniform partitions over the subjects. An overview of our proposed dataset is shown in Figure 1. In this figure, we can observe that our dataset contains a huge variety of images with varying illumination, occlusion, blurriness, color intensity, etc. Table II provides a comparison of state-of-the-art gaze datasets with our proposed dataset.

B. The Pupil-Center Localization

Accurate pupil-center localization plays an important role in eye gaze estimation. We take a face image as input and extract the eyes from this image using the facial landmarks obtained with the Dlib-ml library [33]. Further processing is performed on the extracted eye images. We localize the pupil-center using two methods, i.e. blob center detection and CHT, and take the average of the pupil-centers obtained by both methods to calculate the final pupil-center.

The steps of the proposed pupil-center localization method are explained below:

1) Extract eyes using the facial landmark information.
2) Apply OTSU thresholding on the extracted eyes to take advantage of the unique contrast property of the eye region during pupil circle detection.
3) Apply blob center detection on the extracted iris contours to calculate 'primary' pupil-centers.
4) Crop regions near these centers to perform the center rectification task. The crop length is decided by applying equation (1):

   Crop length = (Height of eye contour) / 2 + offset    (1)

5) Compute adaptive thresholding and apply the Canny edge detector [34] to make the iris region more prominent.
6) Apply CHT over the edge image to find 'secondary' pupil-centers.
7) Compute the average of the primary and secondary pupil-centers to finalize the value of the pupil-centers.

Empirically, we noticed that the pupil-center localization accuracy increases by taking the average of the pupil-centers calculated by the above two methods. A few sample results of pupil-center localization are shown in Figure 2. The blue, green and pink dots represent the pupil-centers obtained by our primary method, secondary method and their average, respectively.

C. Heuristic for Eye Gaze Region Estimation

The pupil-center is the most decisive feature of the face for determining the gaze direction. The eyeballs move in the eye sockets to change the direction of the gaze. By using the relative
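As a minimal sketch of the Section III-C idea — judging the gaze region from the pupil-center's relative horizontal position between the eye corners — the following classifier uses an illustrative margin of 0.15; the exact thresholds used by the authors are not stated in this excerpt, and the mapping of "image-left" to the subject's left or right depends on camera mirroring:

```python
def gaze_region(pupil_x, corner_left_x, corner_right_x, margin=0.15):
    """Classify gaze as left / center / right from the pupil-center's
    relative position inside the eye socket.

    The 0.15 margin is an illustrative assumption, not a value taken
    from the paper.
    """
    # 0.0 = pupil at the left eye corner, 1.0 = at the right corner.
    ratio = (pupil_x - corner_left_x) / float(corner_right_x - corner_left_x)
    if ratio < 0.5 - margin:
        return "left"
    if ratio > 0.5 + margin:
        return "right"
    return "center"

# Pupil midway between eye corners at x = 10 and x = 50:
print(gaze_region(30, 10, 50))   # -> center
```

Since the paper uses the relative position of both pupils, one could average the per-eye ratios of the left and right eyes before applying the thresholds.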
Figure 2: Results of the pupil-center localization method. Green, blue and pink represent the pupil-centers as described in Section III-B. (Image source: [35]; best viewed in color.)
Table IV: Results on Tablet Gaze compared to the baselines [35]. The effectiveness of the features learnt by Ize-Net is demonstrated by fine-tuning the network and by training an SVR over various FC layer features.
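As context for the fine-tuning experiments in Table IV, the recipe of Section III-D (append two 256-dimensional FC layers, train with mean-square-error loss at learning rate 0.0001) might be sketched as follows. The small backbone here is a stand-in, since Ize-Net's full architecture is not given in this excerpt, and the 2-D regression output is an assumption about the gaze target format:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in backbone: a placeholder for the pre-trained Ize-Net trunk.
backbone = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 64, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
)
# The two FC layers of dimension 256 added for fine-tuning, followed
# by a hypothetical 2-D gaze regression output.
head = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 2),
)
model = nn.Sequential(backbone, head)

# Freeze the trunk so only the new layers train; unfreezing more layers
# gives the "last 8 / last 12 FC layers / full network" variants the
# paper reports.
for p in backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
loss_fn = nn.MSELoss()            # mean-square-error loss, as in the text

x = torch.randn(16, 1, 64, 64)    # dummy batch of face crops
y = torch.randn(16, 2)            # dummy gaze targets
before = loss_fn(model(x), y).item()
for _ in range(10):               # 10 epochs (Tablet Gaze); 15 for CAVE
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
after = loss_fn(model(x), y).item()
print(before, after)
```

Training an SVR over intermediate FC activations, as in the Table IV rows, would simply extract `backbone(x)` features and fit a regressor on them.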
gaze region prediction. The consideration of face geometry is in accordance with the proposed heuristic used to label the images of the collected dataset. We validate the performance of the proposed network on the CAVE dataset. The angular labels of the CAVE dataset images have been mapped into three gaze regions. After categorizing the images into their corresponding gaze regions, we fine-tune Ize-Net on the entire CAVE dataset to cross-check the performance of the network. We fine-tune our network for 10 epochs with a learning rate of 0.0001 [35]. As mentioned in TABLE III, our network gives 82.80% five-fold cross-validation accuracy on the CAVE dataset.

D. Fine Tuning Results on Tablet Gaze and CAVE Datasets

To fine-tune the base model for prospective datasets, we add two Fully-Connected (FC) layers at the end of the proposed Ize-Net network. We fine-tune the network on the Tablet Gaze and CAVE datasets. The two FC layers added to the base network each have dimension 256 for both datasets. The fine-tuning results for Tablet Gaze are shown in TABLE IV and those for CAVE in TABLE V. As depicted in TABLES IV and V, we report the results for different levels of fine-tuning: the last 8 FC layers, the last 12 FC layers and the complete network are fine-tuned one-by-one for the empirical analysis. For fine-tuning the proposed network on the Tablet Gaze dataset, we used a learning rate of 0.0001 with 10 epochs, and for the CAVE dataset a learning rate of 0.0001 with 15 epochs. For both datasets, the fine-tuning is done with the mean square error loss function. The experimental results demonstrate that the proposed method outperforms the state-of-the-art gaze prediction for both the Tablet Gaze and CAVE datasets. For the experiments, we try our best to follow the protocols discussed in [43] and [29]; however, there can be a few differences in frame extraction and selection. To demonstrate that the network learned efficient features, we trained a Support Vector Regressor (SVR) over the features learned in the 31st FC layer and the 34th FC layer for the Tablet Gaze dataset. As depicted in TABLE IV, the low gaze prediction errors of the SVR confirm that the learned features are highly efficient.

Methods                     Accuracy (%)
                        e ≤ 0.05   e ≤ 0.10   e ≤ 0.25
Ours                      56.97     100.00     100.00
Poulopoulos et al. [46]   87.10      98.00     100.00
Leo et al. [10]           80.70      87.30      94.00
Campadelli et al. [47]    62.00      85.20      96.10
Cristinacce et al. [48]   57.00      96.00      97.10
Asadifard et al. [49]     47.00      86.00      96.00

Table VI: Comparison of the proposed pupil-center localization method with other state-of-the-art methods.

V. CONCLUSION AND FUTURE WORK

In this paper, we propose a method which learns a rich eye gaze representation using an unsupervised learning technique. Using the relative position of the pupil-centers in the left and right eyes, the images are labeled by gaze region, i.e. left, right, or center. To demonstrate the robustness of the proposed method, we collect a large dataset of facial images. We also propose the Ize-Net network, which is trained on the collected dataset. The weights of this trained model can be used on any facial image to detect the region of gaze. Machine learning methods can be applied to the learned gaze region representation to calculate the eye gaze. Experimental results confirm the efficiency of the proposed method.

The proposed gaze estimation method can be widely used for many human-computer interaction based applications without the prior need for a troublesome data labelling task.

Currently, our method is robust to head pose movement within −10° to 10°. In the future, we plan to fully utilize head pose information while estimating the gaze region. We also plan to perform real-time pupil-center localization and gaze region estimation on a video-based dataset.

ACKNOWLEDGEMENT

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
REFERENCES

[1] M. Mason, B. Hood, and C. Macrae, "Look into my eyes: Gaze direction and person memory," Memory, 2004.
[2] B. Ghosh, A. Dhall, and E. Singla, "Speech-gesture mapping and engagement evaluation in human robot interaction," arXiv, 2018.
[3] A. Kaur, A. Mustafa, L. Mehta, and A. Dhall, "Prediction and localization of student engagement in the wild," in IEEE Digital Image Computing: Techniques and Applications, 2018.
[4] P. Barr, J. Noble, and R. Biddle, "Video game values: Human–computer interaction and games," Interacting with Computers, 2007.
[5] L. Fridman, P. Langhans, J. Lee, and B. Reimer, "Driver gaze region estimation without use of eye movement," IEEE Intelligent Systems, 2016.
[6] E. Birmingham and A. Kingstone, "Human social attention," Annals of the New York Academy of Sciences, 2009.
[7] D. Xia and Z. Ruan, "IR image based eye gaze estimation," in IEEE ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, 2007.
[8] D. Robinson, "A method of measuring eye movement using a scleral search coil in a magnetic field," IEEE Transactions on Bio-Medical Electronics, 1963.
[9] A. Tsukada, M. Shino, M. Devyver, and T. Kanade, "Illumination-free gaze estimation method for first-person vision wearable device," in IEEE International Conference on Computer Vision Workshop, 2011.
[10] M. Leo, D. Cazzato, T. De Marco, and C. Distante, "Unsupervised eye pupil localization through differential geometry and local self-similarity," Public Library of Science, 2014.
[11] C. Gou, Y. Wu, K. Wang, K. Wang, F. Wang, and Q. Ji, "A joint cascaded framework for simultaneous eye detection and eye state estimation," Pattern Recognition, 2017.
[12] X. Wang and A. Gupta, "Unsupervised learning of visual representations using videos," in IEEE International Conference on Computer Vision, 2015.
[13] I. Misra, C. L. Zitnick, and M. Hebert, "Shuffle and learn: unsupervised learning using temporal order verification," in European Conference on Computer Vision. Springer, 2016, pp. 527–544.
[14] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, "Unsupervised learning of depth and ego-motion from video," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1851–1858.
[15] S. Datta, G. Sharma, and C. Jawahar, "Unsupervised learning of face representations," in IEEE International Conference on Automatic Face & Gesture Recognition, 2018.
[16] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Transactions on Systems, Man, and Cybernetics, 1979.
[17] S. Pedersen, "Circular Hough transform," Vision, Graphics, and Interactive Systems, 2007.
[18] R. Valenti and T. Gevers, "Accurate eye center location through invariant isocentric patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
[19] K. Kim and R. Ramakrishna, "Vision-based eye-gaze tracking for human computer interface," in IEEE Transactions on Systems, Man, and Cybernetics, 1999.
[20] A. Peréz, M. Córdoba, A. Garcia, R. Méndez, M. Munoz, J. Pedraza, and F. Sanchez, "A precise eye-gaze detection and tracking system," UNION Agency, 2003.
[21] J. Daugman, "The importance of being random: statistical principles of iris recognition," Pattern Recognition, 2003.
[22] D. Hansen and A. Pece, "Eye tracking in the wild," Computer Vision and Image Understanding, 2005.
[23] P. Campadelli, R. Lanzarotti, and G. Lipori, "Precise eye and mouth localization," International Journal of Pattern Recognition and Artificial Intelligence, 2009.
[24] N. Markuš, M. Frljak, I. S. Pandžić, J. Ahlberg, and R. Forchheimer, "Eye pupil localization with an ensemble of randomized trees," Pattern Recognition, 2014.
[25] C. Morimoto, D. Koons, A. Amir, and M. Flickner, "Pupil detection and tracking using multiple light sources," Image and Vision Computing, 2000.
[26] J. Zhu and J. Yang, "Subpixel eye gaze tracking," in IEEE International Conference on Automatic Face and Gesture Recognition, 2002.
[27] D. Torricelli, S. Conforto, M. Schmid, and T. D'Alessio, "A neural-based remote eye gaze tracker under natural head motion," Computer Methods and Programs in Biomedicine, 2008.
[28] F. Lu, T. Okabe, Y. Sugano, and Y. Sato, "Learning gaze biases with head motion for head pose-free gaze estimation," Image and Vision Computing, 2014.
[29] Q. Huang, A. Veeraraghavan, and A. Sabharwal, "TabletGaze: unconstrained appearance-based gaze estimation in mobile tablets," arXiv, 2015.
[30] H. Lu, G. Fang, C. Wang, and Y. Chen, "A novel method for gaze tracking by local pattern model and support vector regressor," Signal Processing, 2010.
[31] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling, "MPIIGaze: Real-world dataset and deep appearance-based gaze estimation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[32] O. Williams, A. Blake, and R. Cipolla, "Sparse and semi-supervised visual mapping with the S^3GP," in IEEE Computer Vision and Pattern Recognition, 2006.
[33] D. King, "Dlib-ml: A machine learning toolkit," Journal of Machine Learning Research, 2009.
[34] J. Canny, "A computational approach to edge detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 1986.
[35] B. Smith, Q. Yin, S. Feiner, and S. Nayar, "Gaze locking: passive eye contact detection for human-object interaction," in ACM User Interface Software and Technology, 2013.
[36] A. Villanueva, V. Ponz, L. Sesma-Sanchez, M. Ariz, S. Porta, and R. Cabeza, "Hybrid method based on topography for robust detection of iris center and eye corners," ACM Transactions on Multimedia Computing, Communications, and Applications, 2013.
[37] U. Weidenbacher, G. Layher, P. Strauss, and H. Neumann, "A comprehensive head pose and gaze database," 2007.
[38] Q. He, X. Hong, X. Chai, J. Holappa, G. Zhao, X. Chen, and M. Pietikäinen, "OMEG: Oulu multi-pose eye gaze dataset," in Scandinavian Conference on Image Analysis, 2015.
[39] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling, "Appearance-based gaze estimation in the wild," in IEEE Computer Vision and Pattern Recognition, 2015.
[40] K. Krafka, A. Khosla, P. Kellnhofer, H. Kannan, S. Bhandarkar, W. Matusik, and A. Torralba, "Eye tracking for everyone," in IEEE Computer Vision and Pattern Recognition, 2016.
[41] S. Sabour, N. Frosst, and G. Hinton, "Dynamic routing between capsules," in Neural Information Processing Systems, 2017.
[42] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling, "It's written all over your face: Full-face appearance-based gaze estimation," in IEEE Computer Vision and Pattern Recognition Workshop, 2017.
[43] E. Skodras, V. G. Kanas, and N. Fakotakis, "On visual gaze tracking based on a single low cost camera," Signal Processing: Image Communication, 2015.
[44] S. Jyoti and A. Dhall, "Automatic eye gaze estimation using geometric & texture-based networks," in IEEE International Conference on Pattern Recognition, 2018.
[45] O. Jesorsky, K. Kirchberg, and R. Frischholz, "Robust face detection using the Hausdorff distance," in International Conference on Audio- and Video-Based Biometric Person Authentication, 2001.
[46] N. Poulopoulos and E. Psarakis, "A new high precision eye center localization technique," in IEEE International Conference on Image Processing, 2017.
[47] P. Campadelli, R. Lanzarotti, and G. Lipori, "Precise eye localization through a general-to-specific model definition," in British Machine Vision Conference, 2006.
[48] D. Cristinacce, T. Cootes, and I. Scott, "A multi-stage approach to facial feature detection," in British Machine Vision Conference, 2004.
[49] M. Asadifard and J. Shanbezadeh, "Automatic adaptive center of pupil detection using face detection and CDF analysis," in International Multi-Conference of Engineers and Computer Scientists, 2010.
[50] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in Neural Information Processing Systems, 2012.
[51] O. M. Parkhi, A. Vedaldi, A. Zisserman et al., "Deep face recognition," in British Machine Vision Conference, vol. 1, no. 3, 2015, p. 6.