
One-Shot Learning of Gestures using Kinect data

Author: Shashank Sonkar, IIT Kanpur (shashank@cse.iitk.ac.in)
Advisor: Prof. H. C. Karnick, IIT Kanpur (hk@cse.iitk.ac.in)

Abstract
In this report, we deal with the problem of gesture recognition. With the help of the Kinect and the Microsoft SDK, we can now track the human skeleton in real time. Given the coordinates of the body joints, we aim to distinguish between different gestures. The approach we take is to build, for each gesture, a unique signature that can distinguish it from other gestures. The trajectories that the body joints trace through space differ from gesture to gesture, and we use this fact to formulate our signature. We then apply a string matching algorithm to these gesture signatures. Our final aim is to come up with a unique gesture signature using only one instance of the gesture. We have been given a dataset of 12 gestures containing around 6000 gesture instances, and we want to build a signature strong enough that the machine can identify a gesture after being given only a single instance of it. Currently, the gesture signature that we propose uses 50 instances of the same gesture to distinguish it and gives an accuracy of about 75-80%. This suggests that the logic we are using to formulate the gesture signature is not flawed; we just need to extend it to increase the "uniqueness" of each gesture. Later in our discussion, we argue that our method, when extended, has a high probability of recognising a gesture from only a single instance. This has applications in sign language recognition.

1 Introduction

Humans are capable of recognizing patterns like hand gestures after seeing just one example. We want our machines to be able to do the same. Take, for instance, a hand gesture: humans essentially know the trajectory through which the hand has moved. Can machines somehow see the world in the form of voxels, just as humans do? A voxel (volumetric pixel, or volumetric picture element) is a volume element representing a value on a regular grid in three-dimensional space; it is analogous to a pixel, which represents 2D image data. This is possible with the help of the Kinect, a sensor that provides 3D data. The machine can know the different voxels in space that the wrist joint has moved through when a gesture is performed, and this can then be used to make a signature of that gesture. We want to test this idea. This is a relatively new topic of research, as depth data was previously unavailable in real time; with the Kinect, it became possible. One hurdle still remained: tracking the human skeleton in real time. With the release of the skeletal tracking library in the Microsoft SDK, solving this problem became feasible, and we can now test our theories. In this report, we test this idea after breaking the space into 27 cubes so that the skeleton fits into them. It gives reasonable results given the simplicity of the techniques involved, so we want to explore the idea further: we may get better results after breaking the space into even finer cubes or by making use of multiple simple signatures.

1.1 Dataset

Microsoft Research Cambridge-12 Kinect gesture data set [1] has been used. There are a total of 12 different gestures of two kinds: iconic and metaphoric. Iconic gestures are those that imbue a correspondence between the gesture and the reference; metaphoric gestures are those that represent abstract content. For the former, there are six gestures from a first-person shooter game, and for the latter, there are six gestures for a music player. The dataset comprises 6244 gesture instances, collected from 30 people performing the 12 gestures. For each gesture instance, the space coordinates (x, y, z) of 20 body joints are given for each frame. Frames are collected at a rate of 30 per second. The joints are: head, shoulder center, shoulder left, shoulder right, elbow left, elbow right, wrist left, wrist right, hand left, hand right, hip center, hip left, hip right, spine, knee left, knee right, ankle left, ankle right, foot left, and foot right. The origin is the Kinect sensor; the z coordinate represents the depth of the person performing the action, and the y coordinate represents the axis along which one can measure the height of a person. For one instance of the "shoot" gesture (gesture 8), the start frame, the mid frame where the person brings both hands together as if holding a gun, and the last frame are visualised in Figure 1.

Figure 1: Visualisation of gesture from dataset

As of now, we have used this dataset. Later on, we plan to integrate our system with the skeletal tracking library provided by the Microsoft Kinect SDK, which provides real-time tracking of body joints.

1.2 Kinect Skeletal Tracking using MS SDK

We tested whether the library could track the human skeleton in real time. When we stood in front of the Kinect, it tracked the skeleton almost instantly, after a delay of only 2-3 seconds, and continued tracking it until we moved out of the range of the sensor. There was very little time lag (1-2 seconds) between body joints moving and the Kinect tracking that movement. Moreover, there was an API through which a user can access the space coordinates of 20 body joints; Microsoft had not limited it to be used only for the games created by its developers.

1.3 Methodology

The idea is to compare two gestures. For each gesture, we aim to build a unique signature that can demarcate it from other gestures. The gesture completion time varies across different instances of the same gesture, as a person can perform a gesture at varying speeds; moreover, different people perform the same gesture, so again their speeds may vary. So, the gesture signature has to be time invariant, and should therefore only take into account the space through which the joint has travelled. The approach that we take is to discretise the space into cubes. The signature is the ordered sequence of cubes through which the joint has moved. We then compare two gestures by applying a string matching algorithm to their signatures. So, broadly, we plan on performing the following steps:

1. Discretization of space

2. Making the gesture signature time invariant

3. Similarity calculation using string matching

2 Discretization of space

The idea is to logically divide the space into cubes so that the cubes span the entire skeleton. The logical partition is performed by the following steps:

1. First, we construct a cube with edge length equal to the height of the person. The height is taken to be the distance between the head and the foot-right joint. The center is taken to be the shoulder-center joint, and the edges are parallel to the x, y, and z axes.

2. This cube is then divided into 27 smaller cubes. The center of the innermost cube remains the shoulder-center joint.

We took the edge length of the initial big cube to be the height of the person because, when both hands are fully stretched outwards, the distance between the left and right wrists cannot exceed the height of the person. Leonardo da Vinci's drawing of the Vitruvian Man (Figure 2) justifies our assumption.

Figure 2: Vitruvian Man

However, the knee and foot coordinates can still lie outside the big cube. To avoid this, the lower bound (w.r.t. the y coordinate) of the bottommost layer of 9 cubes has been relaxed. The 2-D front view of the layers of cubes is shown in Figure 3, with the cube boundaries drawn in grey. Notice that the first and second layers have all their boundaries defined, whereas the third, bottommost layer has only an upper bound on the y coordinate. Similarly, the lower bounds on x for the leftmost layer of nine cubes and the upper bounds on x for the rightmost layer of nine cubes can be relaxed.

Figure 3: Skeleton inside the cubes

Now, knowing a joint coordinate, the next task is to transform it into a cube ID, where the cube ID lies between 1 and 27. The mapping is discussed below.

2.1 Mapping from coordinates to Cube IDs

Each cube is identified by a triple (i, j, k) which represents its position. Each of i, j, k can take a value from the set S = {0, 1, 2} independently of the others, hence we can form 3 × 3 × 3 = 27 different triples. Working in coordinates relative to the shoulder-center joint, the algorithm to transform a joint coordinate (x, y, z) into a triple is:

if x < -height/6, then i = 0
if x > height/6, then i = 2
if -height/6 ≤ x ≤ height/6, then i = 1

Similarly, y and z can be used to calculate j and k respectively. The first and second cases bound x on one side only, which reflects the relaxed lower and upper bounds discussed above.
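A minimal sketch of this mapping in Python (our own illustration: the function names and the flattening of (i, j, k) into a single ID in 1..27 are assumptions; joint coordinates are taken relative to the shoulder-center joint):

def axis_index(value, height):
    """Map one coordinate to 0, 1, or 2. The outer bins are unbounded on
    one side, matching the relaxed lower/upper bounds described above."""
    if value < -height / 6:
        return 0
    if value > height / 6:
        return 2
    return 1

def coordinate_to_cube_id(x, y, z, height):
    """Map a joint coordinate (relative to the shoulder-center) to a
    cube ID between 1 and 27 by flattening the (i, j, k) triple."""
    i = axis_index(x, height)
    j = axis_index(y, height)
    k = axis_index(z, height)
    return 9 * i + 3 * j + k + 1  # (0,0,0) -> 1, ..., (2,2,2) -> 27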

3 Making gesture signature time invariant

To compare two gestures, we only need to know the space through which the joints have travelled, that is, the ordered sequence (ordered over time) of cube IDs that a joint has passed through. So, the gesture signature should be constructed so that the gesture completion time plays no role in it; it should only reveal the space through which the joint has travelled. The importance of the discretization of space for removing time as a parameter is discussed below. If a joint's coordinate lies in some cube, say x, and in the next frame it still lies in the same cube, this implies that the joint is still in the same region of space and is merely taking time to cross it. So, if we replace all consecutive occurrences of x in the gesture sequence by a single x, the signature will record only that the gesture passed through the space marked x, with the time taken to cross it no longer playing a role. Thus, the same gesture performed at varying speeds by different people potentially has the same signature. So, the algorithm is to replace all consecutive occurrences of a cube ID by a single instance. But this raises the question of the granularity of the discretization, which we take up below.
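A minimal sketch of this collapsing step (the function name is our own; the input is the per-frame sequence of cube IDs for one joint):

from itertools import groupby

def collapse_repeats(cube_ids):
    """Collapse runs of consecutive identical cube IDs into one, making
    the signature independent of how fast the gesture is performed."""
    return [cube_id for cube_id, _ in groupby(cube_ids)]

# A slow and a fast performance of the same motion yield the same signature:
assert collapse_repeats([5, 5, 5, 14, 14, 23]) == [5, 14, 23]
assert collapse_repeats([5, 14, 23]) == [5, 14, 23]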

3.1 How discretised should the space be?

We have currently discretised the space into 27 cubes. On increasing the number of cubes, the number of gestures that can be identified increases; moreover, gestures that vary only slightly can be told apart if we break the space into finer cubes.

4 String Comparison

For string comparison, we have used the Needleman-Wunsch algorithm. It finds a similarity measure between two strings. The basic idea is to count the number of insertions, deletions, and replacements required to transform one string into the other: the fewer such operations, the greater the similarity. The algorithm uses dynamic programming and is given below for two strings A and B [2].

F(i, 0) ← i * d for all i; F(0, j) ← j * d for all j
for i = 1 to length(A) do
  for j = 1 to length(B) do
    Match ← F(i-1, j-1) + S(A[i], B[j])
    Delete ← F(i-1, j) + d
    Insert ← F(i, j-1) + d
    F(i, j) ← max(Match, Insert, Delete)
  end for
end for

4.1 What are "strings" here?

By strings, we mean the gesture signatures. Each gesture signature is an ordered list of the voxels (cube IDs) a joint has travelled through during the course of the gesture. Thus each element of our string is a voxel in space.

4.2 Similarity between two cube IDs

Suppose we have two identical strings which mismatch at exactly one position. The algorithm will replace the element of the first string at that position with the element of the second string. These elements correspond to cube IDs in our gesture signature. Consider two cases: one where the voxels representing the two mismatching cube IDs are close to each other, and another where they are far apart. The two gestures in the first case are closer than in the second, and our algorithm should be able to capture this observation. S(A[i], B[j]) does exactly that: it captures the similarity between the i-th element of the first string and the j-th element of the second string. In our case, the similarity between two cube IDs is given by the distance between their centers.
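A minimal sketch of such a similarity term (our own illustration: the cube center is recovered from the flat ID scheme introduced earlier, and the score is the negative Euclidean distance, so identical cubes score 0 and distant cubes score increasingly negatively):

import math

def cube_center(cube_id, height):
    """Center of a cube, relative to the shoulder-center joint, for the
    flat 1..27 ID scheme used earlier."""
    n = cube_id - 1
    i, j, k = n // 9, (n % 9) // 3, n % 3
    step = height / 3  # edge length of each small cube
    return tuple((t - 1) * step for t in (i, j, k))

def S(a, b, height=1.0):
    """Similarity between two cube IDs: negative distance between centers."""
    return -math.dist(cube_center(a, height), cube_center(b, height))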

4.3 Penalty

"Delete ← F(i-1, j) + d" in the algorithm above denotes the penalty one must pay to delete an element from the first string in order to match it with the second string. We make this penalty a function of i and j: if the insertion or deletion takes place in the vicinity of the middle of either string, the penalty is higher than for an insertion or deletion near the ends. Giving such weightage is justified by the belief that the middle part of any gesture is the most informative.

5 Experiments

5.1 Experiment 1

We took 30 instances of each of the 12 gestures. For each gesture type, we compared each instance with the remaining instances of the same type and with all instances of the other gesture types. We then calculated the mean and variance of the similarity scores between gesture types.
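A sketch of this computation (assuming signatures_by_gesture maps each gesture type to its list of signatures, and score_fn is a pairwise scorer such as the needleman_wunsch function above):

import numpy as np

def mean_similarity_matrix(signatures_by_gesture, score_fn):
    """Mean pairwise similarity between gesture types; comparing an
    instance with itself is skipped."""
    ids = sorted(signatures_by_gesture)
    matrix = np.zeros((len(ids), len(ids)))
    for a, ga in enumerate(ids):
        for b, gb in enumerate(ids):
            scores = [score_fn(s, t)
                      for i, s in enumerate(signatures_by_gesture[ga])
                      for j, t in enumerate(signatures_by_gesture[gb])
                      if ga != gb or i != j]
            matrix[a, b] = np.mean(scores)
    return matrix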

        1        2        3        4        5        6        7        8        9       10       11       12
 1   -50.70   -85.55   -79.18  -106.24   -95.66  -100.36   -71.91   -91.57   -88.30  -100.83   -78.38   -96.44
 2   -85.55   -49.28   -64.35   -81.82   -55.93   -80.55   -64.98   -74.21   -79.50   -65.36   -82.58   -77.84
 3   -79.18   -64.35   -42.68   -88.86   -56.25   -86.82   -45.84   -71.76   -81.67   -72.81   -80.87   -81.46
 4  -106.24   -81.82   -88.86   -51.68   -90.23   -71.09   -67.72   -84.76   -56.44   -85.73   -63.64   -89.12
 5   -95.66   -55.93   -56.25   -90.23   -36.49   -88.54   -40.85   -68.32   -88.98   -60.94   -90.42   -68.66
 6  -100.39   -80.55   -86.82   -71.09   -88.54   -33.04   -68.11   -61.46   -81.13   -83.31   -92.70   -90.13
 7   -71.91   -64.98   -45.84   -67.72   -40.85   -68.11   -33.94   -60.06   -68.56   -51.31   -72.61   -65.62
 8   -91.57   -74.21   -71.76   -84.71   -68.32   -61.46   -60.06   -45.14   -88.64   -72.18   -90.73   -81.44
 9   -88.30   -79.50   -81.67   -56.44   -88.98   -81.13   -68.56   -88.64   -50.64   -90.77   -55.23   -86.90
10  -100.83   -65.36   -72.81   -85.73   -60.94   -83.31   -51.31   -72.18   -90.77   -56.77   -90.31   -73.22
11   -78.38   -82.58   -80.87   -63.64   -90.42   -92.70   -72.61   -90.73   -55.23   -90.31   -51.27   -85.59
12   -96.44   -77.84   -81.46   -89.12   -68.66   -90.13   -65.62   -81.44   -86.90   -73.22   -85.59   -55.51

Table 1: Similarity Matrix

5.1.1 Results

The similarity between gestures of the same type is higher than the similarity between gestures of different types. Table 1 depicts the calculated means: each entry (i, j) is the mean similarity score between the i-th and j-th gesture types, and the higher (less negative) the value, the more similar the gestures, so the diagonal entries are the largest in each row. We have also plotted a 3-D bar plot in which the bar height signifies dissimilarity: the smaller the height, the greater the similarity. For easier understanding, the color of the bars changes from blue to yellow to red as dissimilarity increases. Figure 4 shows the front view and the top view.

Figure 4: Front and top views of the plot

5.2 Experiment 2

We try to predict a new gesture given 50 instances of each gesture type. On average, out of 100 gestures, it can correctly predict 75-80% of them.
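The report does not spell out the exact decision rule, so the following is a plausible minimal sketch (nearest class by mean signature similarity, with training_sigs mapping each gesture type to its 50 training signatures and score_fn a pairwise scorer such as needleman_wunsch above):

def predict_gesture(new_sig, training_sigs, score_fn):
    """Return the gesture type whose training signatures are, on average,
    most similar to the new signature."""
    best_type, best_score = None, float("-inf")
    for gesture_type, sigs in training_sigs.items():
        mean_score = sum(score_fn(new_sig, s) for s in sigs) / len(sigs)
        if mean_score > best_score:
            best_type, best_score = gesture_type, mean_score
    return best_type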

The result for each gesture is given below.

Gesture No.   Description              Accuracy (%)
1             Lift outstretched arms   59.0164
2             Duck                     75.4098
3             Push right               88.8889
4             Goggles                  71.4918
5             Wind it up               88.7097
6             Shoot                    88.5246
7             Bow                      81.5246
8             Throw                    86.4407
9             Had enough               72.4138
10            Change weapon            74.4138
11            Beat both                45.4545
12            Kick                     79.0323

Interpretation of results

The voxel method seems promising, but gesture 1 and gesture 11 have lower accuracy. Figures 5 and 6 show the two gestures. In both, the person stretches his hands upwards, but in gesture 11 he also brings them close to his head. If he does not bring his hands close enough to his head, we cannot capture the voxels enclosing his head and the region above it, so the two gestures tend to be very close to each other. We have thus identified the problem: the space has to be discretised more finely to be able to distinguish these two gestures.

Figure 5: gesture 1

Figure 6: gesture 11

In the course of the next semester, we want to test this idea and break our space into finer voxels. Moreover, the string matching algorithm takes O(n²) time, so we want to modify it to run faster.

7 Future Work

We have the basic framework over which we can test our ideas. During the course of the next semester, we plan to implement the following ideas.

7.1 More voxels and faster comparison

We plan to increase the number of voxels and check whether we can perform one-shot recognition using the resulting signature. The time complexity of the string comparison algorithm also has to be reduced: with more voxels, the length of the signature (symbolising the trajectory through which the joint has passed) will also increase.

7.2 Hidden Markov Models

A hidden Markov model (HMM) [3] is a statistical model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. In a Markov process, predictions about the future of the process can be based solely on its present state. Our system can also be modelled as a Markov process, because the voxel that a joint can move to in the next frame depends only on the voxel it was in during the last frame, and not on all the voxels lying along the trajectory carved out so far. HMMs have been used extensively in speech recognition, and they can also deal with varying gesture completion time: we face a similar problem to speech recognition, where different people take different amounts of time to speak a sentence, but the HMM takes that parameter into account. However, this approach will require multiple gesture instances for training.
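As an illustration of this direction (our own sketch, not part of the current system): one discrete HMM could be trained per gesture, with the cube IDs as observations, and a new gesture assigned to the model under which its signature is most likely. The forward algorithm that computes that likelihood is small enough to sketch:

def sequence_likelihood(obs, start_p, trans_p, emit_p):
    """Forward algorithm: probability of an observation sequence under a
    discrete HMM. start_p[s], trans_p[s][t], and emit_p[s][o] are the
    usual initial, transition, and emission probabilities; obs is a
    non-empty list of cube IDs."""
    states = list(start_p)
    # alpha[s]: probability of the prefix seen so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}
    for o in obs[1:]:
        alpha = {t: sum(alpha[s] * trans_p[s][t] for s in states)
                    * emit_p[t].get(o, 0.0)
                 for t in states}
    return sum(alpha.values())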

References

1. Fothergill, S., Mentis, H. M., Kohli, P., Nowozin, S. (2012). Instructing people for training gestural interactive systems. In Proceedings of the ACM Conference on Human Factors in Computing Systems (pp. 1737-1746).

2. http://en.wikipedia.org/wiki/Needleman-Wunsch_Algorithm

3. http://en.wikipedia.org/wiki/Hidden_Markov_model

