You are on page 1of 13

To Appear in IEEE Transactions on Intelligent Transportation Systems 2014

Hand Gesture Recognition in Real-Time for Automotive Interfaces:
A Multimodal Vision-based Approach and Evaluations
Eshed Ohn-Bar and Mohan Manubhai Trivedi, Fellow, IEEE

comfort for certain users (as opposed to interaction approaches
based solely on tangible controllers), we propose an alternative or
supplementary solution to the in-vehicle interface. As each
modality has its strengths and limitations, we believe a multi-modal
interface should be pursued for leveraging advantages from each
modality and allowing customization to the user.

Abstract—In this work, we develop a vision-based system
that employs a combined RGB and depth descriptor in order to
classify hand gestures. The method is studied for a humanmachine interface application in the car. Two interconnected
modules are employed: one that detects a hand in the region
of interaction and performs user classification, and the second
performing the gesture recognition. The feasibility of the
system is demonstrated using a challenging RGBD hand
gesture data set collected under settings of common
illumination variation and occlusion..

Advantages for developing a contact-less vision-based
interface solution: The system proposed in this paper may offer
several advantages over a contact interface. First, camera
input could possibly serve multiple purposes, in addition to the
interface. For instance, it allows for analysis of additional hand
activities or salient objects inside the car (as in [8]–[12]),
important for advanced driver assistance systems.
Furthermore, it allows for the determination of the user of the
system (driver or passenger), which can be used for further
customization. Second, it offers flexibility to where the gestures
can be performed, such as close to the wheel region. A
gestural interface located above the wheel using a heads up
display was reported to have high user acceptability in [2]. In
addition to al-lowing for interface location customization and a
non-intrusive interface, the system can lead to further novel
applications, such as for use from outside of the vehicle. Third,
there may be some potential advantages in terms of cost, as
placing a camera in the vehicle involves a relatively easy
installation. Just as contact gestural interfaces showed certain
advantages compared to conventional interfaces, contact-free
interfaces and their effect on driver visual and mental load
should be similarly studied. For instance, accurate
coordination may be less needed when using a contact-free
interface as opposed to when using a touch screen, thereby
possibly reducing glances at the interface.

Index Terms—Human-machine interaction, hand gesture
recognition, driver assistance systems, infotainment, depth cue

Recent years have seen a tremendous growth in novel
devices and techniques for human-computer interaction (HCI).
These draw upon human-to-human communication modalities
in order to introduce a certain intuitiveness and ease to the
HCI. In particular, interfaces incorporating hand gestures have
gained popularity in many fields of application. In this paper,
we are concerned with the automatic visual interpretation of
dynamic hand gestures, and study these in a framework of an
in-vehicle interface. A real-time, vision-based system is
developed, with the goal of robust recognition of hand gestures
performed by driver and passenger users. The techniques and
analysis extend to many other application fields requiring hand
gesture recognition in visually challenging, real-world settings.

Motivation for in-vehicle gestural interfaces: In this paper,
we are mainly concerned with developing a vision-based,
hand gesture recognition system that can generalize over
different users and operating modes, and show robustness
under challenging visual settings. In addition to the general
study of robust descriptors and fast classification schemes
for hand gesture recognition, we are motivated by recent
research showing advantages of gestural interfaces over
other forms of interaction for certain HCI functionalities.

Challenges for a vision-based system: The method must
generalize over users and variation in the performance of the
gestures. Segmentation of continuous temporal gesture events
is also difficult. In particular, gesture recognition in the volatile
environment of the vehicle’s interior differs significantly from
gesture recognition in the constrained environment of an
office. Firstly, the algorithm must be robust to varying global
illumination changes and shadow artifacts. Secondly, since the
camera is mounted behind the front-row seat occupants in our
study and gestures are performed away from the sensor, the
hand commonly self-occludes itself throughout the
performance of the gestures. Precise pose estimation (as in
[13], [14]) is difficult and was little studied before in settings of
harsh illumination changes and large self-occlusion, yet many
approaches rely on such pose information for producing the
discriminatory features for gesture classification. Finally, fast
computation (ideally real-time) is desirable.

Among tactile, touch, and gestural in-vehicle interfaces,
gesture interaction was reported to pose certain advantages
over the other two, such as lower visual load, reduced driving
errors, and a high level of user acceptability [1]–[3]. The
reduction in visual load and non-intrusive nature led many
automotive companies to research such HCI [4] in order to
alleviate the growing concern of distraction from interfaces with
increasingly complex functionality in today’s vehicles [5]–[7].
Following a trend in other devices where multi-modal
interfaces opened ways to new functionality, efficiency, and
The authors are with the Laboratory for Intelligent and Safe
Automobiles (LISA), University of California San Diego, San Diego, CA
92093-0434 USA (e-mail:;

In order to study these challenges, we collected a RGB1

Hand detection is commonly performed using skin analysis [33]. etc. These can be extracted locally using spatio-temporal interest points (as in [22].g. Information of pose. The type of gestures varies from coarse hand motion to fine finger motion. hand gesture 2 . 1. video games [40]. spatiotemporal interest points as in Dollar´ et al. Hand gesture recognition systems commonly use depth information for background removal purposes [46]–[49]. [42]. facilitated the development of many gesture recognition systems. [20]. Video descriptors for spatio-temporal gesture analysis: Recent techniques for extracting spatio-temporal features from video and depth input for the purpose of gesture and activity recognition are surveyed in [19]. Minnen et al. Generally. The results of this study demonstrate the feasibility of an in-vehicle gestural interface using a real-time system based on RGBD cues. showing its difficulty (Table IV). ! (b) Example gestures in the dataset. Different common spatio-temporal feature extraction methods were tested on the dataset. Depth (RGBD) dataset of 19 gestures. In particular. Hand gesture recognition with RGBD cues: The introduc-tion of high-quality depth sensors at a lower cost. rotate clockwise. Such features may be hand crafted. There has been some work in adapting color descriptors to be more effective when applied to depth data. A Hidden Markov Model (HMM) may also be used [50] for gesture modeling. on recognition performance. driver assistance [35]. (b) The gestures shown are (top to bottom): clockwise O swipe.To Appear in IEEE Transactions on Intelligent Transportation Systems 2014 Time (a) Large variation in illumination and performance of the gestures. as done in this paper. [46] proposed using a Finger-Earth Mover’s Distance (FEMD) for recognizing static poses. high contrast shadows. [53]) In this paper. [36].) throughout the performance of the gestures in the dataset are shown. Finally. In [34]. such as the Microsoft Kinect. the gesture dataset is used to study effects of different training techniques. as in [17]. As noted by [52]. [39]. In [33]. the effects of occlusion. as demonstrated in [25]–[30]. [23]) or sampled densely. Examples of gesture samples and the challenging settings are shown in Fig. common RGB based techniques (e. [48]. 1: Examples of the challenges for a vision-based in-vehicle gesture interface. Relevant literature related to gesture recognition and user interfaces is summarized below. scroll up. such as user-specific training and testing. The classification of static gestures is performed using an average neighborhood margin maximization classifier combined with depth and hand rotation cues. depth infor-mation is used to segment the hand and estimate its orientation using PCA with a refinement step. Fig. although difficult to obtain in our appli-cation. medical in-strumentation [41]. resulting in frequent self-occlusion. [51] used features of global image statistics or grid coverage. Each of the descriptors is compared over the differ-ent modalities (RGB and depth) with different classification schemes (kernel choices for a Support Vector Machine classifier [18]) for finding the optimal combination and gaining insights into the strengths and limitations of the different approaches. Gestures are performed away from the sensor. II. The gesture recognition system studied is shown to be suitable for a wide range of functionalities in the car. and a randomized decision forest for depth-based static hand pose recognition. recognition methods may extract shape and motion features that represent temporal changes corresponding to the gesture performance. and illumination variability due to the position of the interface in the top part of the center console. a nearest neighbor classifier with a dynamic time warping (DTW) measure was used to classify dynamic hand gestures of digits from zero to nine. A set of common spatio-temporal descriptors [15]– [17] are evaluated in terms of speed and recognition accuracy. performed 2-3 times by 8 subjects (each subject preformed the set as both driver and passenger) for a total of 886 instances. we pursue a no-pose approach for recognition of gestures. pinchnzoom-in. hand gesture recognition systems were developed with applications in fields of sign language recognition [31]–[34]. Illumination artifacts (saturation. The dataset collected allows for studying user and orientation invariance. RELATED RESEARCH STUDIES As the quality of RGB and depth output from cameras improve and hardware prices decline. is also highly useful for recognition. a wide array of applications spurred an interest in gesture recognition in the research community. or learned using a convolutional network [24]. and other human-computer interfaces [43]–[45]. smart environments [37]– [39].

Experimental Setup and Dataset The proposed system uses RGB and depth images in a region of interest (ROI). III.To Appear in IEEE Transactions on Intelligent Transportation Systems 2014 Temporal Segmentation Module Depth image Feature RGB image extraction Hand detection and Temporal Temporally user determination filtering segmented videos input output Gesture Recognition Module Spatio-temporal RGBD video feature extraction input Gesture Time Temporally segmented classificati on output Fig. [55]. [57] studied 17 hand gestures and six head gestures using an infrared camera. The inventory is not explicitly mentioned. we briefly review works with affinity to the vehicle domain. First. and a HMM and rule-based classifier. [58] used a Theremin device. we . as well as the speed of the algorithm. 1 and Fig. 2: Outline of the main components of the system studied in this paper for in-vehicle gesture recognition. Althoff et al. 3). Endres et al. Each descriptor is applied to the RGB and depth modality separately. There also has been some work towards standardization of the in-vehicle gestural interaction space [56]. In this work we focus on approaches that do not involve tracking of hand pose. motion boundary descriptors and dense trajectories [17]. Common spatio-temporal feature extraction methods such as a histogram of 3D oriented gradients (HOG3D) [16]. Hand gesture interfaces in the car: Finally. HAND GESTURE RECOGNITION IN THE CAR A. where a CCD camera and NIR LEDs illumination in a simulator were used to perform gesture recognition out of an elaborate gesture inventory of 15 gestures. and only one subject was used. a contactless device consisting of two metal antennas. This is followed by spatio-temporal feature analysis for performing gesture classification. and need to be adjusted as in [54]. For classification. generating a signal which is fed to a DTW classifier. Moving the hand alters the capacity of an oscillating current. this ROI was chosen to be the instrument panel (shown in Fig. an SVM classifier is employed [18]. which is either the passenger or the driver. A similar effort to ours was reported in Zobl et al. A HMM is employed to perform the dynamic gesture recognition. and other forms of gradient-based spatio-temporal feature extraction techniques [15] will be evaluated on the challenging dataset. The gestures used were both static and dynamic. Static gestures may be used to activate the dynamic gesture recognizer. and finally these are earlyfused together by concatenation. the hand detection module provides segmentation of gestures and determines the user. In order to demonstrate the feasibility of the system. In our experiments. may not work well on the output of some depth sensors.

The main focus of this work is recognition of gestures under illumination artifacts. 3 . and not the effects of the interface on driving. 4 shows the illumination variation among videos and subjects. Interface location: Among previously proposed gestural interfaces. the location of the interface varies significantly. subjects were requested to drive slowly in a parking lot while performing the gestures. A large variation in the dataset is observed in Fig. A temporal sum was performed over the number of pixel intensities above a threshold in each gesture video to produce an average intensity score for the video. Each gesture was performed about three times by eight sub-jects. the average number of high intensity pixels over the m n images It in a video of length T . It was observed that following the initial learning of the gesture set. These large and inaccurate movements provided natural Fig. At times this resulted in the hand partially leaving the pre-defined infotainment ROI.ucsd. 1 Intensity Score = t m n T X =1:T (1) jf(x. the dataset contains 886 gesture samples. Therefore. and the ROI is 115 250. as the gestures were verbally instructed. In our study. The gestures are all dynamic. y) : It(x.variations which were incorporated into the training and testing set. Altogether. as these are common in human-to-human communication and existing gestural interfaces. collected a dataset containing 19 hand both within the same subject and among subjects. The size of the RGB and depth maps are both 640 480. the gestures were performed by the center console. Each subject performed the set two times. both passenger and driver carried the gestures more naturally. Subjects 1 and 4 performed the gestures in a stationary vehicle. y) > 0:95gj That is. as strokes became large and more flowing. The dataset is publicly available at http://cvrr. once as the driver and once as the passenger.html. 4.

Weather conditions are indicated as overcast (O) and sunny (S). 436g 10−3 −4 2 10 1 3 4 5 6 7 8 Subject Fig. Scrolls provide volume control. swipe right. swipe X. and not with the hand. Swipe up and swipe down are used for transition between bird-eye view to street view. 1). Swipes provide the ‘next’ and ‘previous’ controls. the X and V swipes provide feedback and ranking of a song. The triangles plot the overall mean for each subject. 1-bottom. The motion in these is mostly performed with the fingers. GS2 involves additional gestures for music control. a fist following a spread open palm or vice-versa. depth. This gesture set contains gestures that can be used for general navigation through other menus if needed. Since recognition was found to be a challenging task on its own. and with three fingers allows for a voice search of a song. the scrolls for moving throughout a map. two finger rotation gestures rotate the view. and the type of feedback that will be used. We note that there were small variations in the performance of some of the gestures. A tap with one finger pauses. scroll down. spatio-temporal features are evaluated in terms of speed. the more complex GS3 contains refined gestures purposed for picture or navigation control. swipe up. Each point corresponds to one gesture sample video. Finally. 4 . as opposed to the scroll where the fingers move with the entire hand in the direction of the scrolling: scroll left. One tap gestures can be done with one or three fingers. position of the hand. and point cloud) for the in-vehicle vision-based gesture recognition system studied in this work. # Passengerg = f450. the open and close gestures are used to answer or end a call. it is the main focus of this paper. scroll right. one tap-1 and one tap-3. Finally. depending on the starting B. the location of the interface should depend on whether the system aims to replace or supplement existing secondary controls. and expand and pinch allows for zoom control. Hand Detection and User Determination Both recognition and temporal segmentation must be addressed. so that the user can ‘like’ or ‘dis-like’ songs. 3. In particular. Next we have the open and close. swipe down. and scroll up. and the expand (opposite motion). Gesture inventory: The inventory is as follows. 4: Illumination variation among different videos and sub-jects as the average percent of high pixel intensities (see Eqn. for instance the swipe X and swipe + can be performed in multiple ways. as well as rotate counterclockwise and rotate clockwise (Fig. For GS1 (phone). A one finger tap is used for ‘select’. 0 Score 10 −1 10 −2 10 Intensity Subject Gender Weather Skin Color 1 M O C2 2 M S C2 3 M S C2 4 M O C3 5 M S C1 6 M S C3 7 F S C3 8 M S C1 Total Samples: f# Driver. swipe V. A set of function-alities is proposed for each gesture. TABLE I: Attribute summary of the eight recording sequences of video data used for training and testing. We chose a position that would be difficult for a vision-based system due to illumination artifacts and self-occlusion.To Appear in IEEE Transactions on Intelligent Transportation Systems 2014 Fig. we use a two finger pinch as shown in Fig. Gesture functionality: The 19 gestures are grouped into three subsets with increasing complexity for different in-vehicle applications as shown in Table II. as shown in Fig. In future design. Skin-color varies from light (C1) to intermediate (C2) and dark brown/black (C3). 1-second row). 3: Camera setup (color. and the swipe + provides the ‘info/settings/bring up menu’ button. Finally. Time of capture was done in afternoon and mid-afternoon. swipe + (plus). Videos with little to no illumination variation were taken using subjects 1 and 4. Two-finger swipe gestures: swipe left.

the spatial HOG algorithm 1 1 described in Section III-B is applied again using a M N grid of 1 cells and B angle quantization bins to extract a compact 1 1 1 temporal descriptor of size M N B .7 0. : : : . M Ng. N. or 3) passenger. Although temporal segmentation is a difficult problem as well. This is done with a simplified version of the histogram of oriented (HOG) algorithm [60] which will be described below and an SVM classifier. : : : .y where 2 f + s :2 2 B (2) : g and 1 is the indicator m MN ]: d (I1. and then again on Region integration for improved hand detection: The specific setup of location and size of the ROI can have a sig-nificant impact on the illumination variation and background noise in the ROI. These are compared in Table III in terms of extraction time and dimensionality. as shown in Fig. For instance. HOG : Another choice of is motivated by [15].15 gear shift. Given a set of video frames. The discrete derivatives Gx and Gy are approximated using a 1D centered first difference [ 1. the spatial descriptors are collected over time to form a 2D array (visualized in Fig. each is applied to the RGB and depth video independently. hT (4) The pipeline for this algorithm contains three parameters. y) be an m n signal. Changes in the feature vector correspond to changes in the shape and location of the hand. : : : .9 Pos itives Gesture Set 2 (GS2) MusicnMenu Control SwipeX SwipeV OneTap3 ScrollUp2 ScrollDown2 OneTap1 SwipeRight SwipeLeft True Gesture Set 1 (GS1) Phone SwipePlus SwipeV Close ScrollUp2 ScrollDown2 Open 1 is X s Gx. Gesture Set 3 (GS3) PicturenNavigation SwipeUp SwipeDown OneTap1 ScrollUp2 ScrollDown2 ScrollRight2 ScrollLeft2 RotateCntrClwse RotateClwse ExpandnZoomOut PinchnZoomIn Rate 0. The approach is termed 2 HOG . The local histogram is normalized using an L2s s s normalization: h ! h = k(h )k2 + . the video is first resized to T = 20 frames by linear interpolation so that the descriptor is fixed in size. 5). Let G . for s 2 f1. and varying generalization.6 (3) 2 For additional details and analysis on this part of the algorithm we refer the reader to [59]. : : : . the descriptor at frame is the p ht = [h . so that the hand must leave the ROI between different gestures. The classification may be binary. Because the location of the ROI in our setup produces common illumination artifacts. and side hand-rest regions were shown to increase detection accuracy for the driver’s hand in the ROI (Fig. function. 6.8 0. : : : . IT ) = h1. 0. We found that over-lapping the blocks produces improved results. a three class classification performs recognition of the user: 1) no one. The image is split into M N blocks. 50% overlap between the cells is used.y h (q) = x. HOG spatial feature extraction: Let I(x. we detail the implementation of the visual features extraction used in this work. as in [59]. and B. and quantized orientation angles into B bins. and throughout the experiments a th Region Integration Fig.2 C. s s 0. and average over the videos in the dataset. 2) of size T (M N B). we time feature extraction for each video. 1] to obtain the magnitude. we found that using visual information from other ROIs in the scene improves hand detection performance under ambiguous and challenging settings [61]. and fix M = N. namely M. : R R R ! R . As the instrument panel region is large with common illumination artifacts. in this work we employ a simple segmentation of temporal gestures using a hand presence detector. detecting whether a hand or not is present in the ROI. [62].1 0 False Positives Rate The first module in the system performs hand detection in a chosen ROI. features extracted from the wheel.25 The first module described in the previous section produces a video sequence. so that the q the histogram descriptor for the cell No Integration 0. 5 .5 performance. 5: Driver hand presence detection in the instrument panel region. For clarity and reproducibility. y) = ] 0. G. divide by the number of frames. 0. for producing the d dimensional feature vector for gesture classification. T g. h n descriptor function. so that only one parameter can be varied. concatenation of the histograms from the cells 1 T HOG: A straightforward temporal descriptor is produced by choosing a vectorization operator on the spatial descriptors in each frame. In the latter case. t 2 f1. Finally. In this case. since it involves applying the same algorithm twice (once in the spatial domain. Consequently. B t 0. or multiclass for user determination. h t. cues from other regions in the scene (such as the wheel region) can increase the robustness of the hand detection in the instrument panel region.05 0. Spatio-Temporal Descriptors from RGB and Depth Video denote a cell 1[ (x. we choose a bin for q 2 f1. We consider four approaches. In this case. which then requires spatio-temporal feature extraction for the classification of the gesture instance. . : : : . 2) driver. Bg in s 0. In calculation of extraction time. We use B = 8 orientation bins in all of the experiments.To Appear in IEEE Transactions on Intelligent Transportation Systems 2014 TABLE II: Three subsets of gestures chosen for evaluation of application-specific gesture sets.

The temporal HOG descriptor shows best results across modalities and kernels. we experimented with a range of descriptors. Results are shown on the entire 19 gestures dataset using leave-one-subjectout cross-validation (cross-subject test settings). EXPERIMENTAL EVALUATION AND DISCUSSION Spatio-temporal descriptor analysis: The descriptors mentioned in Section III were compared with the three kernels in Table IV. [17]. The other techniques involve a global descriptor computed over the entire image patch. Overall. Dimensionality 2560 256 256 2000* 1000* K2( 2 kernel. and at test time project the temporal HOG concatenation feature using the eigenspace to derive a compact feature vector. we optimize k over k 2 f500. 7 4 5 6 hT As in [15]. Asterisk * prefix . n m :R T R R ! RM1 N1 B1 (I1 . a video sequence is represented as a bag of local spatiotemporal features. D. those histograms over time). . The operation is performed on a dense grid.25 54 372 2 1 We emphasize that in our implementation. 6: Varying the cell size parameters in the HOGbased gesture recognition algorithm with a linear SVM for a RGB. As shown in Fig. we can reduce the dimension-ality of the concatenated histograms descriptor (HOG) using Principal Component Analysis (PCA).8 2. xj) = xi xj an RBF- N = N. the linear kernel is given as.. DTM (Heng et al. Trajectory shape descriptors encode local motion patterns. Given two d data points. the DTM and HOG3D baselines . In this case. The results are shown for the entire 19 ges-ture dataset with leave-one-subject-out cross validation (crosssubject test settings). we also use the mean of each of the spatial HOG features over time in the feature set. and motion boundary histograms (MBH) are extracted along the x and y directions.RGB or depth. Similarly to HOG3D.07 GHz x 8. 2000. In this case. and RGB+Depth descriptors. X (8) k KHI (xi. [16]): A spatio-temporal extension of HOG. but we fix those to be the same 1 as in the spatial HOG feature extraction so that M = M. xj) =min(xik. a Mercer similarity or kernel function needs to be defined. [17]): The dense trajectories and motion boundary descriptor uses optical flow to extract dense trajec-tories. where 3D gradient orientations are binned using convex regular polyhedrons in order to produce the final histogram descriptor. 4000g. Studying this operation is useful 2 mainly for comparison with HOG . we pre-compute the eigenspace using the training samples. 3000. k-means is run five times and the best results are reported. 1 Extraction Time (in ms) 2. There are three additional parameters for the second operation of HOG. Furthermore. 6. Classifier Choice SVM [18] is used in the experiments due to its popularity in the action recognition literature with varying types of de-scriptors [16]. (6) KLIN (xi.requires codebook construction. In our experiments. the dimension2 ality of HOG is much lower than the corresponding temporal HOG concatenation. depth.83 3.(%) To Appear in IEEE Transactions on Intelligent Transportation Systems 2014 TABLE III: Comparison of average extraction time per frame in milliseconds for each descriptor and for one modality . Experiments were done in C++ on a Linux 64-bit system with 8GB RAM and Intel Core i7 950 @ 3. In these. IT ) = HOG ( 2 . xj 2 R . only HOG3D and DTM require codebook construction with k-means. : : : . but even after parameter optimization these did not show improvement over the aforementioned baselines. We study the following three kernel choices. around which shape (HOG) and motion (histograms of optical flow) descriptors are extracted. we follow the author original implemen-tation with a dense sampling grid and a codebook produced by k-means. and each video is represented as a frequency histogram of the visual words (assignment to visual words is performed using the Euclidean distance). HOG3D (Klaser¨ et al. a 4 4 cell size in computing the HOG-based descriptors was shown to work well. xi. T HOG-PCA: Alternatively. xjk) =1:d IV. and a codebook is produced using k-means. Generally. the benefits for HOG are small. In SVM classification. and a histogram intersection kernel (HIK). 1000. Although lower performing descriptors benefit significantly from the non-linear kernels. k-means is used to produce the codebook by which to quantize features. B = B. ) = exp( 1 xi xj 2C k X (xik xjk) 2 x +x ik =1:d ) (7) jk 2 where C is the mean value of the distances over the training samples. A fixed 8 bin orientation histogram is used. h1 3 ) (5) . Note that extracting RGBD cues from both modalities will require about twice the time. 70 60 Accuracy 50 40 30 RGB Depth RGBD 20 1x1 2x2 3x3 4x4 5x5 6x6 Descriptor HOG HOG 7x7 HOG-PCA DTM[17] HOG3D[16] 8x8 Cell Size Fig. such as the Cuboids [53] and HON4D [52].

6 .

first a hand detection and user determination step was used. Although our work is concerned with gesture recognition in naturalistic settings and not the psychological aspects of the interface. as expected. Following a trend in other devices where multi-modal interfaces opened ways to new functionality and efficiency. it is used in the remaining experiments. it appears to contain complementary information to the HOG scheme when combined. Evaluation on gesture subsets: As mentioned in Section IIIA. 7) that provides a basic gesture interaction at high recognition accuracy (shown in Table VI).. Table In an attempt to propose a complete system. it is outperformed by the HOG scheme. and not for improving the results over HOG. A set of three subsets was chosen and experiments were done using three testing methods. with results shown in Table V. and histogram intersection kernel. 8. In this paper. followed 7 . Because HIK SVM with V. Average and standard deviation over the 8 folds are shown. we studied the feasibility of an in-vehicle. The three test settings are as follows: 1/3-Subject: a 3-fold cross validation where each time a third of the samples from each subject are reserved for training and the rest for testing. we observe that although the HOG shows comparable results to DTM and HOG3D. The reason is that within the 8 subjects there were large variations in the execution of each gesture. RBF. Interestingly. This is the main reason for which HOG-PCA was studied in this work. CONCLUDING REMARKS 2 the HOG+HOG descriptor showed good results.To Appear in IEEE Transactions on Intelligent Transportation Systems 2014 TABLE IV: Comparison of gesture classification results using the different spatio-temporal feature extraction methods on the entire 19 gesture dataset in a leave-one-subject-out cross validation (cross-subject test settings). as well as the effect of user-specific training. K LIN Descriptor n Modality DTM HOG3D HOG HOG HOG -PCA -PCA 2 HOG+HOG Descriptor n Modality DTM HOG3D HOG HOG -PCA 2 -PCA HOG+HOG 2 HOG+HOG Descriptor n Modality DTM HOG3D HOG HOG HOG 44:3 2 HOG+HOG HOG 41:3 14:1 35:8 9:5 44:1 11:8 -PCA 2 -PCA HOG+HOG 2 HOG+HOG 45:4 8:8 33:1 8:9 47:9 12:7 13:8 37:1 9:5 40:6 7:8 55:2 13:9 49:1 11:9 46:9 12:8 55:9 57:5 13:6 14:6 47:8 13:2 36:7 8:5 61:8 15:7 56:2 12:4 49:6 14:6 63:3 15:3 62:2 15:8 are outperformed by the rest. The confusion matrix for each gesture subset using 2/3-Subject test settings are shown in Fig. a 19 gesture dataset may not be suitable for the application of an automotive interface. Basic interface with a mode switch: Equipped with insight on the least ambiguous gestures from Fig. with additional comfort for some users. 2/3-Subject: Similarly to 1/3-Subject. 8. such as a one tap with three fingers (OneTap3) in order to navigate among functionality modes while keeping the same gesture set. vision-based gesture recognition system. we study a final gesture subset (Fig. more so than when using the HOG-PCA scheme (although the two descriptors have the same dimensionality). provide appropriate feedback. but two thirds of the samples are reserved for training from each subject and the remaining third for testing. The best result overall is prefixed by an asterisk. and maximize intuitiveness. Results are done over 8 subjects and averaged. we sought a similar solution to the in-vehicle interface. Inspecting the different HOG descriptors studied in this 2 work. Each should be designed and studied carefully in order to avoid a high mental workload in remembering or performing gestures. We propose to use one of the gestures. K2 RGB (%) 47:0 12:3 39:1 8:5 46:5 15:9 38:0 47:2 9:2 35:4 9:1 50:8 14:3 17:2 Depth (%) 40:8 9:9 43:0 11:4 57:0 17:0 48:7 13:7 K HI 47:7 12:0 37:8 6:4 47:3 14:1 42:1 49:0 11:6 34:9 8:6 43:2 11:8 44:2 8:6 57:4 15:6 48:8 49:6 14:4 49:0 14:7 17:9 58:6 15:8 57:1 57:6 16:7 RGB+Depth (%) 51:5 15:3 41:3 9:0 62:1 15:5 56:5 52:3 62:5 13:7 13:5 16:0 64:5 16:9 14:1 52:3 16:2 57:8 13:4 15:7 54:0 14:8 44:6 9:7 62:2 16:8 57:3 52:3 14:5 62:1 13:2 16:1 63:1 16:7 V reveals a lower accuracy on the challenging cross-subject testing. Linear. As each interaction modality has its strengths and limitations. In bold are the best results for each modality and for each kernel for the 2 SVM classifier. our experimental design attempted to accompany other successful existing gestural interaction interfaces. we believe a multi-modal interface should be pursued for leveraging advantages from each modality and allowing customization to the user. Cross-subject: leave-one-subject-out cross validation. possibly since these are densely sampled over the ROI yet background information does not contain useful information for the recognition (unlike other action recognition scenarios). The purpose of such a study is mostly in evaluating the generalization of the proposed algorithm.

B. M. 2010. Safety. 8 . X. 2013. Finally. A.” in IEEE Conf. “Vision on VI. Since incorporating samples of a subject in training resulted in a significantly higher recognition performance for that subject in testing. [12] E. Horrey.” in British Machine Vision Conf. 4–7. “Predicting driver maneuvers by learning holistic features. [10] C. “Driver hand activity analysis in naturalistic driving studies: challenges. Out of a set of 19 gestures. Tang. 2014. Wei. Cuong Tran and Mr. Ashish Tawari for their helpful advice. with one of the gestures used to switch between different functionality modes. and J. 1/3-Subject GS4 98:4 0:5 2/3-Subject Cross-Subject RGB+Depth (%) 99:7 0:6 92:8 8:8 [2] M. Increasing the number of user-specific samples results in improved recognition. Symp. Thomassen. and T. Gelau. Veh. 2009. J.. online learning could further improve classification rates. G. Minardo. 7: Equipped with the analysis of the previously proposed gesture subsets. vehicle. 597–614. pp. Skov. Martin. A. [13] I. Trivedi. Trivedi. 22. 1/3-Subject GS1 GS2 GS3 Overall 95:0 91:0 91:4 92:5 1:1 1:7 2:0 1:6 2/3-Subject RGB (%) 96:5 2:7 94:7 1:3 94:6 1:7 95:3 1:9 GS1 GS2 GS3 Overall 92:7 90:5 87:0 90:1 0:3 1:5 2:1 1:3 Depth (%) 94:1 1:6 93:6 1:9 90:3 2:1 92:3 1:9 80:9 12:4 72:6 19:4 67:3 16:0 73:6 15:9 GS1 GS2 GS3 Overall 95:6 92:9 93:2 93:9 1:1 1:8 1:9 1:6 RGB+Depth (%) 96:5 1:6 96:1 1:2 96:0 2:2 96:2 1:7 82:4 15:1 73:8 13:7 72:0 15:6 76:1 14:8 Mode Switch Phone Cross-Subject Music Navigation Gesture Set 4 (GS4) OneTap3 SwipeV ScrollUp2 ScrollDown2 ScrollRight2 ScrollLeft2 75:5 16:7 63:8 16:6 56:2 14:7 65:2 16:0 Fig. “Skill aquisition while operating in-vehicle information systems: Interface design determines the level of safety-relevant distractions. M. The authors would like to thank Dr. N. Rep. 2011. Guo. Tran and M. and the reviewers and editor for their helpful com-ments. 2007. Trivedi. The studied RGBD feature set might be useful for studying naturalistic hand gestures [8].” in IEEE Conf. Electron.. and C. DOT HS 811 334. Human Factors in Computing Systems. algorithms. Sun. L. four subsets were constructed for different interactivity applications. RGB+Depth is the two descriptors concatenated and a HIK SVM. 2014. The location of the ROI can be studied in order to determine optimal natural interactivity and the gesture subset can be further refined and evaluated. Gregers. Computer Vision and Pattern Recognition Workshops-Mobile Vision.To Appear in IEEE Transactions on Intelligent Transportation Systems 2014 TABLE V: Recognition accuracy and standard deviation over cross-validation using different evaluation methods discussed in Section IV. Sun. Parada-Loira. F. pp. [9]. Klauer. Tawari. Ohn-Bar. no. Oikonomidis. Argyros. “Assessing the effects of in-vehicle tasks on driving performance.” in IEEE Conf.” in CHI Human factors in computing systems. [8] E. Sudweeks. a final gesture set composed of less ambiguous gestures is defined and studied. [5] W. we hope that the evaluation in this work and the public dataset will inspire the development of new and improved hand gesture recognition techniques. 2014. Trivedi. Future extensions should further analyze the role of each of the spatio-temporal descriptors in increasing illumination-. The overall category is the mean over the column for each modality.” Human Factors and Ergonomics Society. Symp. and surround for on-road maneuver analysis. 2011. Trivedi. “Developing a car gesture interface for use as a secondary task. [11] C. We thank our colleagues for their help in the data collection process. 19. 2011. [7] S. Tran and M. no. 2014. Burnham. E. Alpern and K. A careful evaluation of different temporal descriptors showed the challenging nature of the dataset. no. Veh. Temporal segmentation of gestures without requiring the hand to leave the ROI may result in a more comfortable interface to use. for showing the best modality settings and the effects of the test settings.. but you can’t look: interacting with in-vehicle systems. Richardson. [6] G. and A. [9] E. and experimental studies. 2003. Tech. S. Veh. J.” in IEEE Intell. A. Dingus. “Vision for driver assistance: Looking at people in a vehicle. and M. Kyriazis. Computer Vision and Pattern Recognition. Jahn. J. “Realtime and robust hand tracking from depth. S.” Journal of Electronic Imaging. M. Tawari. X. M. Alba-Castro. and N. Krems. Martin.C. by a real-time spatio-temporal descriptor and gesture classi-fication scheme. J. F. M. 2008. 2009. D. with RGBD fusion proving to be beneficial for recognition. and J. and M. “Efficient modelbased 3D tracking of hand articulations using Kinect. Washington. “A research study of hand gesture recognition technologies and applications for human vehicle interaction. “Hand gestures to control infotainment equipment in cars. vol. and M.” National Highway Traffic Safety Administration. and M. and subject-invariance of the system.” Ergonomics in Design. “An analysis of driver inattention using a case-crossover approach on 100-car data: Final report. Qian. Pickering.” in Institution of Engineering and Technology Conference on Automative Electronics. occlusion-. 136–151.” in SIGCHI Conf. 4.” in IEEE Intell. REFERENCES [1] J. Ohn-Bar. Martin. A. ACKNOWLEDGMENTS This research was performed at the UCSD Computer Vision and Robotics Research and LISA: Laboratory for Intelligent and Safe Automobiles. [14] C. Ohn-Bar. Y. “Driver assistance for “keeping hands on the wheel and eyes on the road”. K. M. pp. [4] C. wheels: Looking at driver. The subset is designed for basic interaction. J. Gonzalez´-Agulla.” in Visual Analysis of Humans. S. [3] F. TABLE VI: Recognition accuracy using RGB+Depth and a HIK SVM on Gesture Set 4. 51.. “You can touch.

Sel. and R. Computer Vision Workshops.04 0.92 0. Nicolescu. Zhu. Topics Signal Process. 6. techniques evaluation. Chen.-L. G. Singh.02 OneTap1 0. Gall. 108. Kirac. pp. vol. [27] G. pp.” in IEEE Conf. “LIBSVM: A library for support vector machines.” Image and Vision Computing. M.02 0. W.93 ScrollRight2 1.02 0. Y. G.” in IEEE Intl. [20] R. Arrate. [34] P. no. [29] J.98 0. Z. [33] D. [28] R. Taylor.07 0. [16] A. V.-J. Springer International Pub-lishing. Guo. May. 2014.¨ C. [17] W. Wu. ser. [31] N. and Y.93 0. Schmid. 2014. 2014.02 0.98 0. den Bergh.00 ScrollLeft2 1.02 Pinch/ZoomIn 0. and F. G. and L. 2011. “Skeletal quads: Human action recognition using joint quadruples. Chang and C. Nebout.08 0. and Tech. Trivedi. V. Klaser.” Image and Vision Computing. Kara. pp. Uebersax. 27:1–27:27.” Intl. and L. Gool.02 0. Journal of Computer Vision. [21] S.00 RotateCntrClwse 0. Computer Vision Workshops. Computer Vision and Pattern Recognition. SpringerBriefs in Computer Science.00 SwipeV OneTap3 Open ScrollDown2 OneTap1 SwipeRight 1. Trivedi. Conf. Conf.02 SwipeV OneTap3 1.98 ScrollUp2 ScrollDown2 OneTap1 SwipeRight SwipeLeft (b) MusicnMenu Control (GS2) SwipeUp 0. R..” in IEEE Intl. 32. Moeslund. no. Keskin. 8: Results for the three gesture subsets for different in-vehicle applications using 2=3-Subject test settings. Computer Vision and Pattern Recognition. [25] A. G. Average correct classification rates are shown in Table V. Sathyanarayana. Liu.96 ScrollUp2 ScrollDown2 0. Brian.” ACM Intell.04 0. 103. 2010. B. Wang. Ohn-Bar and M. 12. 2011.” in Human Action Recognition with Depth Cameras. B. 2007.” in Visual Analysis of Humans. Conf. and G. “Sign language recognition. 2011. Bowden. Holte. 2013. Neverova. [18] C.96 Close 0. pp. Liu. “Action recognition on motion capture data using a dynemes and forward differences representation. G. 28. C. Bartlett.” in IEEE Conf. Heng. [24] N. Richard. H. Akarun.89 0. A. “A survey on vision-based human action recognition.09 Expand/ZoomOut 0. 5. Computer Vision Workshops. 976–990. Helen. 2014. A. 2013.” in British Machine Vision Conf.02 0. “Comparing gesture recognition accuracy using color and depth in- . vol. Computer Vision Workshops.98 SwipeUp OneTap1 ScrollUp2 SwipeDown RotateCntrClwseRotateClwse Expand/ZoomOut Pinch/ZoomIn ScrollDown2 ScrollRight2 ScrollLeft2 (c) PicturenNavigation (GS3) Fig.02 SwipeDown 0. “Hollywood 3D: Recognizing actions in 3D natural scenes. “Spelling it out: Real-time ASL finger-spelling recognition. Conf. Wolf. [23] Y. 2013. and V.95 Open 0.05 0. “Dense trajectories and motion boundary descriptors for action recognition. Schmid. D. Paci. Doliotis. Kapsouras and N. 6. “Human pose estimation and activity recognition from multi-view videos: Compara-tive explorations of recent developments.00 ScrollUp2 0.98 0. Hadfield and R. no.02 0. “Joint angles similarities and HOG for action recognition.96 ScrollUp2 0.” in IEEE Conf. C. [26] C. and X. 52–73.¨ M. M.96 0.04 0. pp. “Real-time sign language letter and word recognition from depth data. where 2=3 of the samples are used for training and the rest for testing in a 3-fold cross validation. no. Evangelidis. Stefan. Marszalek. [32] C.02 SwipeX (a) Phone (GS1) 0. D. “Human action recognition by representing 3d human skeletons as points in a lie group. F.. M. 539–562.” in IEEE Intl. 8. Computer Vision and Pattern Recognition Workshops-Human Activity Understanding from 3D Data. 2. Conf. Boyle. Bowden. Sommavilla. Pattern Recognition.00 0. Klaser.” in IEEE Intl. 538–552. 2014.96 0.98 ScrollUp2 0. and C. [30] I. “Learning actionlet ensemble for 3d human action recognition. Horaud. pp. “A multi-scale approach to gesture detection and recognition. no. 2008.” IEEE J. 2013. 453–464. Conf. 2011. Bebis. vol. Chellappa. 2011. Athitsos.” Journal of Visual Communication and Image Representation.02 RotateClwse 0. F. and B. Poppe. Syst.” in IEEE Intl.040. and C. Littlewort. [22] S.02 0. “Hand gestures for intelligent tutoring systems: Dataset. J. 11–40. pp.-C..” in IEEE Intl. “A spatio-temporal descriptor based on 3D-gradients.92 ScrollDown2 0. 2012. Erol. C.96 0. vol.02 0. Pugeault and R.02 0. [19] M. 2 [15] E. 1.930.02 0. and M. “Real time hand pose estimation using depth sensors.96 0. Eckhard. Lin.To Appear in IEEE Transactions on Intelligent Transportation Systems 2014 SwipePlus SwipeV SwipeX 0. Nikolaidis.08 1. Twombly.02 0. pp. 2013.” Computer Vision and Image Understanding.02 ScrollDown2 SwipePlus SwipeV Close SwipeLeft 0. 60–79. W. vol. “Vision-based hand pose estimation: A review. Tran. Vemulapalli. A RGB+Depth combined descriptor was used. “Evaluating spatiotemporal interest point features for depth-based action recognition. and T. M. Computer Vision Workshops. McMurrough. M.02 0. vol. and R. G.

9 .

[53] L. Landsiedel.” Procedia Engineering. V. San Diego (UCSD). 2011. 2011. Y. He has given over 70 Keynote/Plenary talks at major conferences. Kolsch. 11. UT. Martin. and J. 248–262. 2011. Ohn-Bar. 2012. Ciampi. J. [46] J. 43. Belongie. “HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. Anup Doshis dissertation judged among the five finalists in the 2011 by the Western (USA and Canada) Association of Graduate Schools. These include developing an autonomous robotic team for Shinkansen tracks. “Behavior recognition via sparse spatio-temporal features. Rogner.” in IEEE RO-MAN. and M. human-centered mul-timodal interfaces. 6947.¨ A. Gallo. degree with specialization in signal and image processing. Ohn-Bar and M. [38] G. Ohn-Bar and M. M. A. “Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera. and S. 2011. Lindl. Zobl. Tran. Vatavu. Tappe.” in Human-Computer Interaction-INTERACT. “Controller-free exploration of medical image data: Experiencing the Kinect. [52] P. C. L. Trivedi. Edan. Macedo. Dollar. 2013. N. Rabaud. E. “Open/closed hand classification using Kinect data. Transp. P.” in IEEE Conf. Wakid.” in ACM Conf. and M. M. and J. Kao and C. Gool. Larson.S. He received the IEEE Intelligent Transportation Systems Societys highest honor. and E. Xia and J. vol.” in ACM Automotive User Interfaces and Interactive Vehicular Applications. Cheng and M. Stern. [45] Z. and L. Panger. pp. [47] M. Triggs. Endres. [49] C. Nieschulz. (with honors) degree in electronics from Birla In-stitute of Technology and Science. Trivedi.D. 2012.” in IEEE Intell. 2013 (best paper award). 2011. marking and finger-count menus.” in IEEE Intl. on Virtual and Augmented Reality. pp. His team has played key roles in several major research initiatives. and J.-D. and S. K. 2013.. Computer Vision and Pattern Recognition.” in European conference on Interactive TV and Video. A. Walchshaeusl. “Towards robust cross-user hand tracking and shape recognition.. den Bergh. [54] M. Bailly. “Kinect in the kitchen: testing depth camera interactions in practical home environments. Sibert. “Vision-based infotainment user determination by hand recognition for driver assistance. [44] E. pp. Ozcelik and G. machine learning. [50] D.” ACM Commun. Symp. [61] E. Trivedi is a Fellow of the International Association of Pattern Recognition (for contributions to vision systems for situational awareness and human-centered vehicle safety) and the Society for Optical Engineering (for contributions to the field of optical engineering). and hand patterns for driver activity recognition. Lecolinet. intelligent vehicles.To Appear in IEEE Transactions on Intelligent Transportation Systems 2014 formation. A. and Signal Process.-S. Schwartz. degree in electrical engineering in 2013 from the University of California. [42] E. Muller.” in IEEE Conf. and the Distinguished Alumni Award from Utah State University. the Pioneer Award (Technical Activities) and the Meritorious Service Award of the IEEE Computer Society. Buss. 2011. Prof. 15. 2011. in 1976 and 1979. Logan. Wachs. [56] F. D. 86–89. Tawari. 2012. Conf. Hahn. Univer-sity of California San Diego (UCSD). and M. “Real-time 3D hand gesture interaction with a robot for understanding directions from humans. R. Carton. D. [55] A. Conf. “Vision-based hand-gesture applications. D. T. driver assistance.” in ComputerBased Medical Systems. B. 0 Mohan Manubhai Trivedi (S 14) received the B. He is a co-author of a number of papers winning Best Papers awards. Cemgil. respectively. a human-centered vehicle collision avoidance system. Bachmair. 2012.. Ferscha. where he and his team are currently pursuing research in machine and human perception. [40] C. T. and driver assistance systems. “DTW based clustering to improve hand gesture recognition. R. and Y. Patel. La Jolla. 759–764. D. 2012. 72–81. Reis. “Head. “Dante vision: In-air and touch gesture [58] S.-Y. vol. N. R. 2011. Meng.¨ “Geremin”: 2D microgestures for drivers based on electric field sensing. Muttenthaler. [48] G.” in ACM Conf. Conf. Computer Vision and Pattern Recognition Workshops-Analysis and Modeling of Faces and Gestures. pp. intelligent transportation. and lane/turn/merge intent prediction modules for advanced driver assistance. [43] C. Pilani. Puhringer.. Sengul.” in IEEE Conf. CA. and L. [60] E.¨ H. A. He is currently working towards a Ph. Emerging Signal Process. Keskin. C. He has also established the Laboratory for Intelligent and Safe Automobiles. Syst. “Comparing free hand menu techniques for distant displays using linear. 2012.” in IEEE Symp. Kirmizibayrak. and M. Commun. eye. Intell.” in Human Behavior Unterstanding. Trivedi. and Ph. Geiger. USA. 10 . and F. [57] C. Althoff. 2010. Riener. “User-defined gestures for free-hand tv control. “Depth camera based hand gesture recognition and its applications in human-computer-interaction. Aggarwal. D. “The power is in your hands: 3D analysis of hand gestures in naturalistic video. [35] E. Fahn. vol. 2013.” in IEEE Conf. [37] R. India. “Histograms of oriented gradients for human detection. “In-vehicle hand activity recognition using integration of regions. His research in-terests include computer vision. UCSD. [39] J. Saba. 2013. Europe. “A human-machine interaction technique: Hand gesture recognition based on hidden markov models with trajectory of hand motion. no. Zafrulla. 7065.” in VDI-Tagung: Der Fahrer im 21. Computer Vision and Robotics Research Laboratory. and Asia. Cottrell. He is currently a Professor of electrical and computer engineering and he is the Founding Director of the Computer Vision and Robotics Research Laboratory. pp. R.. Applicat. Minnen and Z. A. 2011. Inform. Pervasive Technologies Related to Assistive Environments. “Gesture-based interaction for learning: time to make the dream a reality. vol. 3.” in IEEE Conf. USA. pp. P. M. M. 3. Oreifej and Z. Kelner. J. vol. S. 0 Eshed Ohn-Bar (S 14) received his M. Computer Vision and Pattern Recognition. He serves on the Board of Governors of IEEE ITS Society and on the Editorial advisory board of the IEEE Trans on Intelligent Transportation Systems. Trivedi.´ V. 2005. “Standardization of the in-car gesture interaction space. User Interfaces. He regularly serves as a Consultant to industry and government agencies in the United States. Lemme. 2014. degrees in electrical engineering from Utah State University. Outstanding Research Award in 2013. Placitelli. Computer Vision and Pattern Recognition. Kuehnlenz. [36] E. M. Pattern Recognition. Shinko Cheng 2008 and Dr.” in ACM Automotive User Interfaces and Interactive Vehicular Applications. Walter. Hagmuller. Weger. [51] O. Veh. 2011. V. F. Logan. Ren. Akarun.” in IEEE Intl. Two of his students were awarded Best Dissertation Awards by the IEEE ITS Society (Dr. Philbeck. [41] L. A. and C. 2011. Brendan Morris 2010) and his advisee Dr. no. Dalal and B. J.S. 2915. Teixeira.¨ H.E. Yuan. a vision-based passenger protection system for smart airbag deployment. Sep.” in Virtual Reality Continuum and Its Applications in Industry. 367–368. J. [59] N.. “Hand gesture-based visual user interface for infotainment. “Evaluation of gesture based interfaces for medical volume visualization tasks.” British Journal of Educational Technology.” in Gesture-Based Communication in Human-Computer Interaction. Radeva. M. in 1974 and the M. 2004. 2005. Mller. Intell. G. Lang. “Robust multimodal hand and head gesture recognition for controlling automotive infotainment systems. Ning. sensing for natural surface interaction with combined depth and thermal cameras. Nijs. M. and G.” in IEEE Conf.” IEEE Trans. 3739–3743. UT. Liu.” in ACM Human Factors in Computing Systems. M. Computer Vision Workshops. M.” in IEEE Intl. active safety systems and Naturalistic Driving Study (NDS) analytics. Mitsou. Ohn-Bar. vol. Trivedi. Aug.D. Rigoll. S. “Gesture components for natural interaction with in-car devices. Wollherr. 2005. USA. Computer Vision Workshops.