
Deep Learning based Eye Gaze Tracking for Automotive Applications: An Auto-Keras Approach


Adrian Bublea, Cătălin Daniel Căleanu
Department of Applied Electronics
Faculty of Electronics, Telecommunications and Information Technologies
Politehnica University Timișoara
Timisoara, Romania
adrian.bublea@student.upt.ro, catalin.caleanu@upt.ro

Abstract—We propose a deep neural network-based gaze sensing method in which the design of the neural architecture is performed automatically, through a network architecture search (NAS) algorithm called Auto-Keras. First, the neural model is generated using the Columbia Gaze Data Set. Then, the performance of the solution is estimated in an online scenario, which demonstrates the generalization ability of our model. In comparison to a geometrical approach, which uses Dlib facial landmarks, filtering and morphological operators for gaze estimation, the proposed method provides superior results and certain advantages.

Keywords—eye gaze tracking, deep learning, automotive
I. INTRODUCTION

Nowadays, more and more cars are making their way onto the streets, crowding them and making driving more challenging. This, together with the increase in the top speed and acceleration capability of cars, makes driving more tiresome and demands much more attention and awareness. Among the main traffic problems that cause accidents, we could mention the violation of traffic rules, speeding, and driving under the influence of alcohol and drugs. Still, 80% of crashes involve driver distraction, making Advanced Driver Assistance Systems (ADAS) an important component for alerting the driver in dangerous situations.

The aim of this work was to develop a Deep Neural Network (DNN) based gaze zone estimator for automotive applications that monitors the driver during the trip from one location to another, ensuring a safer driving environment for the driver and the other traffic participants. The application can be used to provide drivers with assistance and warnings so that they can take appropriate actions.

The paper is organized as follows: Section II gives a brief overview of previous work on the eye gaze estimation problem; Section III describes the proposed system from an algorithmic perspective; the experimental results and the conclusions are presented in Section IV and Section V, respectively.

II. RELATED WORK

The first studies in the field of gaze estimation date back to the 1980s and were dedicated to helping paralyzed people use eye-gaze-controlled computers (T. E. Hutchinson [1] and J. L. Levine [2]). However, one of the first researchers to consider employing eye gaze for ordinary users was R. J. K. Jacob [3]. In the early 2000s, in his work [4], A. T. Duchowski pointed out that eye gaze tracking could be the basis for one of the most promising types of Human Machine Interface (HMI).

In the beginning, much time and effort were put into eye-gaze tracking research using various head-mounted systems, in order to measure the gaze more accurately. However, such systems are no longer of interest for consumers in general, or for the automotive industry in particular, because wearing bulky headwear is impractical. Recently, due to the improvements in embedded image acquisition and processing capabilities, remote monitoring of eye gaze has emerged as an attractive solution. The problems generated by head pose and orientation with regard to eye-gaze tracking were tackled using either model-based or appearance-based methods, e.g., the work of J. G. Wang [5] or Y. Sugano [6]. Other researchers have opted to use near-infrared (NIR) illumination [7], stereo imaging [8], zoom cameras in combination with wide-angle cameras [9], or a combination of these to increase coverage and allow larger head movements.

In recent years, owing to the remarkable performance of DNNs in visual computing tasks, deep learning-based solutions for gaze estimation have gained increased popularity. For example, S. Vora et al. compared the performance of several Convolutional Neural Network (CNN) architectures (AlexNet, VGG16, ResNet50 and SqueezeNet) in predicting 6 gaze zones plus an eyes-closed case [10]. A Recurrent-CNN architecture that combines appearance, shape and temporal information for video-based gaze estimation is introduced in [11]. In order to overcome the problem of head rotation, H. S. Yoon et al. propose using a single combined image from dual near-infrared cameras [12]. They use a Deep Convolutional Neural Network (DCNN) that processes both image types simultaneously. The conventional ResNet model was modified by replacing its last 7 x 7 average (AVG) pooling layer with an additional convolutional layer, due to the problem of high inter-class similarity.

For a more in-depth review of CNNs for gaze estimation, see [13].

In contrast to many of the above-mentioned approaches, ours does not require explicit personal calibration for each user and is able to differentiate a higher number (nine) of gaze zones. To the best of our knowledge, our work is the first one employing a NAS algorithm for designing a gaze detection model. It also provides top results on the Columbia Gaze Data Set: 85% accuracy for a 78%-22% training-testing split. It further shows cross-driver and real-time capabilities in realistic driving scenarios.



III. PROPOSED FRAMEWORK

The following two approaches were considered for the gaze estimation problem:

A. The geometrical approach

It consists of a facial keypoint predictor from the Dlib library, which relies on a face detection algorithm, and an eye gaze tracking system that detects several gaze directions as well as when a person blinks or keeps the eyes shut. The Dlib facial-feature tracker is a pretrained detector, trained on the iBUG 300-W face landmark dataset, that uses an ensemble of regression trees to directly detect 68 facial landmarks in the captured driver face images. For our purpose, only the keypoints situated in the eye region are of interest (fig. 1).

Fig. 1. From Dlib's 68 facial keypoints, only 12 points are selected: 36 up to 47.

The darkest component of an eye is the pupil; this observation can be used to detect it by applying a filter on eye-only frames. For marking and isolating the pupil, a succession of operations is performed: bilateral filtering, morphological erosion, binary thresholding and contour finding. The eye keypoints are used to define the two eye regions. Then, the pupil is assigned to one of the four quadrants of the trigonometric circle by comparing the detected eye center coordinates with the coordinates of the midpoint of the eye region. This procedure needs both empirically determined thresholds and user-dependent calibrations.
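To make the pipeline above concrete, a minimal sketch in Python with OpenCV and Dlib is given below. It is an illustration under stated assumptions rather than our exact implementation: the shape_predictor_68_face_landmarks.dat model file, the binarization threshold of 40 and the quadrant convention are placeholders that must be tuned per camera and user, as noted above.

```python
# Sketch of the geometrical pipeline. Assumptions: the pretrained
# shape_predictor_68_face_landmarks.dat file is available locally, and the
# threshold value is an empirically tuned placeholder.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def pupil_offset(gray, eye_pts, thresh=40):
    """Isolate the pupil in one eye region; return its (dx, dy) offset
    from the region midpoint, or None when no dark blob is found."""
    x, y, w, h = cv2.boundingRect(eye_pts)
    eye = gray[y:y + h, x:x + w]
    eye = cv2.bilateralFilter(eye, 9, 75, 75)        # edge-preserving smoothing
    eye = cv2.erode(eye, np.ones((3, 3), np.uint8), iterations=2)
    _, binary = cv2.threshold(eye, thresh, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None                                  # blink or eyes shut
    m = cv2.moments(max(contours, key=cv2.contourArea))
    if m["m00"] == 0:
        return None
    # Pupil centroid relative to the midpoint of the eye region.
    return m["m10"] / m["m00"] - w / 2, m["m01"] / m["m00"] - h / 2

def gaze_quadrant(frame):
    """Assign the gaze to one of the four quadrants of the trigonometric
    circle, from a single BGR frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return "no face"
    shape = predictor(gray, faces[0])
    # Keypoints 36-41 outline one eye in Dlib's 68-point scheme.
    eye_pts = np.array([(shape.part(i).x, shape.part(i).y)
                        for i in range(36, 42)], dtype=np.int32)
    offset = pupil_offset(gray, eye_pts)
    if offset is None:
        return "blink"
    dx, dy = offset
    return ("up" if dy < 0 else "down") + ("-left" if dx < 0 else "-right")
```

The hard-coded threshold in pupil_offset is exactly what makes this approach require per-user calibration, as discussed above.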
B. The Auto-Keras approach

The field of Automated Machine Learning (AutoML) is a new and active research direction among AI practitioners, due to its promise of automatically generating models that fit the training data. Currently, there are several such approaches, e.g., auto-sklearn, TPOT, HyperOpt and several Neural Architecture Search (NAS) implementations. If the problem requires the design of DNN architectures, the latter option should be employed. NAS has already demonstrated its viability in many tasks such as image recognition [14] or language modeling [15]. Among the working principles behind NAS, reinforcement learning (RL), evolutionary algorithms (EA), gradient descent and Bayesian optimization could be mentioned.

Google Cloud AutoML is a commercially available solution that uses proprietary technologies, probably an RL and EA combination, for finding suitable DNN architectures. The main problem is that NAS algorithms are notorious for their prohibitive computational demands, which makes Google's tools cost prohibitive. In this paper we employed a Bayesian-optimization-guided network morphism tool called Auto-Keras. It is a free, open-source NAS alternative built on top of the TensorFlow/Keras framework [16].
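To illustrate how little code drives the search, a minimal Auto-Keras sketch matching the setup used in Section IV is given below; the data file names are our placeholders for the cropped eye images and zone labels, not fixed by the tool.

```python
# Minimal Auto-Keras sketch: NAS over image classifiers for the nine gaze
# zones. The .npy file names are placeholders for the prepared data.
import numpy as np
import autokeras as ak

x_train, y_train = np.load("x_train.npy"), np.load("y_train.npy")
x_val, y_val = np.load("x_val.npy"), np.load("y_val.npy")

clf = ak.ImageClassifier(max_trials=10, overwrite=True)   # 10 NAS trials
clf.fit(x_train, y_train, epochs=50,                      # 50 epochs per trial
        validation_data=(x_val, y_val))

# Export the best architecture found by the Bayesian-optimization-guided
# network morphism as a regular Keras model.
model = clf.export_model()
model.summary()
print(clf.evaluate(x_val, y_val))                         # [loss, accuracy]
```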
IV. EXPERIMENTAL RESULTS

The gaze estimation experiments were performed in two situations: offline and in real traffic.

A. Offline estimation

The first experiment was performed using the Columbia Gaze Data Set (CAVE DB) [17], which contains a total of 5,880 images of 56 different people (32 male, 24 female) with a resolution of 5,184 x 3,456 pixels; 21 of the subjects were Asian, 19 were White, 8 were South Asian, 7 were Black, and 4 were Hispanic or Latino [18]. The subjects ranged from 18 to 36 years of age, and 21 of them wore prescription glasses. For each subject, there are 5 head poses (0°, ±15°, ±30°) and 21 gaze directions: seven horizontal gaze directions (0°, ±5°, ±10°, ±15°) and three vertical gaze directions (0°, ±10°). The available images undergo a Haar-cascade-based eye region selection. The resulting detections are grouped into nine classes: center, left, right, up, down, up-left, up-right, down-left and down-right (fig. 2).

Fig. 2. Starting from the CAVE database, we manually grouped the various head poses and gaze orientations into a new dataset format having 9 gaze zones.

An Auto-Keras image classifier was initialized with a maximum number of trials of 10 and a maximum number of epochs of 50. Then, it was fed with the training data described above. The details of the network architecture search process are provided in Tab. 1, and the resulting DNN architecture is summarized in fig. 3. The test accuracy of the geometrical approach (42%) was half of that obtained with the Auto-Keras-generated model, in a setup with 5,000 training and 880 validation samples.

TABLE I. AUTO-KERAS NAS: ACCURACY VS. NUMBER OF TRIALS

Trial        Elapsed time   Val_loss   Val_accuracy
Trial 1      00h 02m 38s    2.0882     0.2446
Trial 2      00h 30m 22s    0.4566     0.8441
Trial 3      00h 12m 35s    1.5692     0.3777
Trial 4      00h 34m 31s    0.4586     0.8532
Trial 5      00h 16m 03s    0.8788     0.7304
Trial 6      00h 10m 11s    2.1126     0.1536
Trial 7      00h 37m 57s    0.7843     0.7918
Trial 8      00h 32m 08s    0.4656     0.8419
Trial 9      00h 15m 08s    2.1033     0.1638
Trial 10     00h 23m 59s    0.5628     0.8214
Best model                  0.6331     0.8521
Total elapsed time: 03h 35m 38s

Fig. 3. The Auto-Keras-generated DNN architecture for gaze estimation.
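For reference, the eye-region selection and the nine-zone grouping can be sketched as below. The CAVE-style filename pattern (vertical and horizontal gaze angles encoded with V and H suffixes) and the 64 x 64 crop size are assumptions for illustration; the sign-to-direction mapping must be checked against the actual annotation convention.

```python
# Sketch of the data preparation: Haar-cascade eye cropping plus grouping of
# CAVE gaze angles into nine zones. The filename pattern and the crop size
# are assumptions; verify the angle sign convention on the real data.
import re
import cv2

eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def crop_eye(img, size=(64, 64)):
    """Return the first Haar-cascade eye detection, resized for the DNN."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    eyes = eye_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(eyes) == 0:
        return None
    x, y, w, h = eyes[0]
    return cv2.resize(img[y:y + h, x:x + w], size)

def zone_from_name(fname):
    """Map vertical (V) and horizontal (H) gaze angles embedded in a
    CAVE-style filename to one of the nine zones."""
    m = re.search(r"(-?\d+)V_(-?\d+)H", fname)
    v, h = int(m.group(1)), int(m.group(2))
    vert = "up" if v > 0 else "down" if v < 0 else ""
    horiz = "left" if h < 0 else "right" if h > 0 else ""
    return "-".join(p for p in (vert, horiz) if p) or "center"
```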
B. Real traffic

We further collected a naturalistic driving video sequence. An in-action view of the application is presented in fig. 4 a) and b). The setup consisted of a laptop on which the application was running and a mobile phone used as a camera. This choice offers better mobility, easier camera positioning (behind the steering wheel) and safer driving.

Fig. 4. Drive test: gaze detection while looking into the mirrors: a) looking in the right mirror; b) looking in the left mirror. Both the geometrical and the Auto-Keras methods perform well.

This experiment also enabled us to:
- Perform a cross-subject test, as the Auto-Keras-generated model was trained on different subjects.
- Study the influence of the camera position: the CAVE DB has the camera placed at eye level, whereas in our setup the camera is placed well below eye level.

In this situation, the frames were not annotated with the eye gaze orientation, so only a subjective evaluation of the two methods was performed. This time, both methods provided quasi-identical and accurate responses, as the real traffic scenario was not as challenging as the Columbia dataset.
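The real-time loop behind this experiment can be sketched as follows; the camera source, the exported model path and the reuse of the crop_eye helper from the preparation sketch above are placeholders for our actual setup.

```python
# Sketch of the online evaluation loop: classify the gaze zone on live
# frames with the exported Auto-Keras model. Paths and camera source are
# placeholders; crop_eye is the helper from the preparation sketch above.
import cv2
import numpy as np
import autokeras as ak
from tensorflow.keras.models import load_model

model = load_model("gaze_model", custom_objects=ak.CUSTOM_OBJECTS)
cap = cv2.VideoCapture(0)            # or the phone-camera stream URL

while True:
    ok, frame = cap.read()
    if not ok:
        break
    eye = crop_eye(frame)            # Haar-cascade crop, as in training
    if eye is not None:
        zone = int(np.argmax(model.predict(eye[np.newaxis], verbose=0)))
        cv2.putText(frame, f"zone {zone}", (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    cv2.imshow("gaze", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```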
V. CONCLUSION

In recent years, owing to the remarkable performance of DNNs in visual computing tasks, deep learning-based solutions have gained increased popularity [19]-[21].

The driver monitoring application is an important safety feature in automotive. With the help of eye gaze tracking technology, a driver's behavior can be determined, since the driver's visual recognition behavior provides most of the information needed for safe driving. This technology allows us to implement HMI functionalities and to determine the driver's drowsiness level or whether the driver is distracted and not paying attention to the road. These detections can further be used to provide driver support and assistance.

In this article, an automated procedure for generating a neural model for eye gaze sensing has been proposed. It scores 85% accuracy on the Columbia Gaze Data Set with a 78%-22% training-testing data split. The best model (tab. 1) has an accuracy different from any of the intermediate trial results because Auto-Keras, to speed up the search, applies early stopping to all candidate models. However, the final model is retrained at the end, to ensure it trains for the full number of epochs specified by the user.

The research in this paper is of significant practical value, as the proposed method proved to be accurate in both online and offline evaluations, without requiring calibration steps.

As possible further improvements and research directions, we could mention:
- Dealing with occlusions and images that contain only partial data by employing better face detectors.
- Speeding up inference time using ML accelerators.
- Investigating other AutoML implementations.
- Experimenting on more challenging data sets, e.g., Gaze360 [22] or Gaze-in-the-Wild [23].

ACKNOWLEDGMENT

The authors are thankful to the Night Vision Team @ Veoneer Timișoara for fruitful discussions and constructive suggestions in elaborating Adrian Bublea's Bachelor Thesis regarding gaze estimation for automotive applications.

REFERENCES

[1] T. E. Hutchinson, K. P. White, W. N. Martin, K. C. Reichert and L. A. Frey, "Human-computer interaction using eye-gaze input", IEEE Transactions on Systems, Man, and Cybernetics, vol. 19, no. 6, pp. 1527-1534, 1989.
[2] J. L. Levine, "An eye-controlled computer", IBM Thomas J. Watson Research Center, Research Report RC-8857, Yorktown Heights, N.Y., 1981.
[3] R. J. K. Jacob, "The use of eye movements in human-computer interaction techniques: what you look at is what you get", ACM Transactions on Information Systems, vol. 9, no. 2, pp. 152-169, 1991.
[4] A. T. Duchowski, "A breadth-first survey of eye-tracking applications", Behavior Research Methods, Instruments, & Computers, vol. 34, no. 4, pp. 455-470, 2002.
[5] J. G. Wang and E. Sung, "Study on eye gaze estimation", IEEE Trans. Syst., Man, and Cybernetics – Part B, vol. 32, no. 3, pp. 332-350, 2002.
[6] Y. Sugano, Y. Matsushita and Y. Sato, "Appearance-Based Gaze Estimation Using Visual Saliency", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 2, pp. 329-341, Feb. 2013, doi: 10.1109/TPAMI.2012.101.
[7] J. Wu, W. Ou and C. Fan, "NIR-based gaze tracking with fast pupil ellipse fitting for real-time wearable eye trackers", 2017 IEEE Conference on Dependable and Secure Computing, Taipei, 2017, pp. 93-97, doi: 10.1109/DESEC.2017.8073839.
[8] S. W. Shih and J. Liu, "A novel approach to 3-D gaze tracking using stereo cameras", IEEE Trans. Syst., Man, and Cybernetics – Part B, vol. 34, no. 1, pp. 234-245, 2004.
[9] D. H. Yoo and M. J. Chung, "A novel non-intrusive eye gaze estimation using cross-ratio under large head motion", Computer Vision and Image Understanding, vol. 98, no. 1, pp. 25-51, Apr. 2005.
[10] S. Vora, A. Rangesh and M. M. Trivedi, "Driver Gaze Zone Estimation Using Convolutional Neural Networks: A General Framework and Ablative Analysis", IEEE Transactions on Intelligent Vehicles, vol. 3, no. 3, pp. 254-265, Sept. 2018, doi: 10.1109/TIV.2018.2843120.
[11] C. Palmero, J. Selva, M. A. Bagheri and S. Escalera, "Recurrent CNN for 3D Gaze Estimation using Appearance and Shape Cues", British Machine Vision Conference, 2018.
[12] H. S. Yoon, N. R. Baek, N. Q. Truong and K. R. Park, "Driver Gaze Detection Based on Deep Residual Networks Using the Combined Single Image of Dual Near-Infrared Cameras", IEEE Access, vol. 7, pp. 93448-93461, 2019, doi: 10.1109/ACCESS.2019.2928339.
[13] A. A. Akinyelu and P. Blignaut, "Convolutional Neural Network-Based Methods for Eye Gaze Estimation: A Survey", IEEE Access, vol. 8, pp. 142581-142605, 2020, doi: 10.1109/ACCESS.2020.3013540.
[14] H. Cai, T. Chen, W. Zhang, Y. Yu and J. Wang, "Efficient architecture search by network transformation", AAAI, 2018.
[15] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning", ICLR, 2017.
[16] H. Jin, Q. Song and X. Hu, "Auto-Keras: An efficient neural architecture search system", Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM, 2019.
[17] Columbia Gaze Data Set (CAVE DB), https://www.cs.columbia.edu/CAVE/databases/columbia_gaze
[18] B. A. Smith, Q. Yin, S. K. Feiner and S. K. Nayar, "Gaze Locking: Passive Eye Contact Detection for Human-Object Interaction", ACM Symposium on User Interface Software and Technology (UIST), pp. 271-280, Oct. 2013.
[19] E. R. Tomodan and C.-D. Căleanu, "Bag of Features vs Deep Neural Networks for Face Recognition", 2018 13th International Symposium on Electronics and Telecommunications (ISETC'18), Timisoara, 2018, pp. 1-4.
[20] R. Mîrșu, G. Simion, C.-D. Căleanu and I. M. Pop-Calimanu, "A PointNet-Based Solution for 3D Hand Gesture Recognition", Sensors, vol. 20, no. 11, 2020, open access: https://www.mdpi.com/1424-8220/20/11/3226, https://doi.org/10.3390/s20113226
[21] R. Mîrșu, G. Simion, C.-D. Căleanu and O. Ursulescu, "Deep Neural Networks vs Bag of Features for Hand Gesture Recognition", Telecommunications and Signal Processing (TSP), Budapest, Hungary, July 1-3, 2019, https://doi.org/10.1109/TSP.2019.8768812
[22] P. Kellnhofer, A. Recasens, S. Stent, W. Matusik and A. Torralba, "Gaze360: Physically Unconstrained Gaze Estimation in the Wild", 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019, pp. 6911-6920, doi: 10.1109/ICCV.2019.00701.
[23] R. Kothari, Z. Yang, C. Kanan et al., "Gaze-in-wild: A dataset for studying eye and head coordination in everyday activities", Scientific Reports, vol. 10, 2539, 2020.
