Recognition of American Sign Language
Shreyas Bethur, Pritish Gandhi, and Anupama Kuruvilla

Abstract—This paper outlines the ongoing development of a pattern recognition technique for recognizing the finger-spelled alphabet of the American Sign Language (ASL) vocabulary and converting it to text. This work is phase two of a broader ongoing project. The recognition methodology applies a group of classification techniques: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and both linear and kernel Support Vector Machines (SVM). We describe the approach used to obtain individual frames from a real-time webcam video, crop the part of each image that will be used for classification, and then run each of the classifiers for recognition. We also use a spell checker to correct and ensure the accuracy of the output text on screen. Finally, the intermediate results of each classifier are discussed along with future work.

Index Terms—Image segmentation, Clustering, American Sign Language, Finger Spelling, PCA, LDA

I. INTRODUCTION

The most common mode of communication in the deaf community is sign language. It comprises gestures and visual cues, which may or may not be accompanied by motion. In this work we use the finger-spelling alphabet of the American Sign Language [1] vocabulary. American Sign Language (ASL) is the fourth most commonly used language in the United States and Canada [2]. Finger spelling uses a sign for each letter of the alphabet to spell out a complete English word. In ASL the finger-spelling gestures are made with a single hand, and most of them do not require motion (except for the letters 'j' and 'z'). Each finger-spelled letter is distinguished by the positioning of the signer's fingers, and each configuration that makes a letter of the alphabet is called a handshape.

In this project our approach is to use Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and Support Vector Machines (SVM) as image classification techniques to recognize the finger-spelled alphabet in front of a webcam, using MATLAB. This process first requires us to train our classifiers on a training set of images of all 27 signs (26 letters and space) of the ASL alphabet. During testing, only the appropriate and required image frames from the webcam are extracted and manipulated so that they can be fed to the classifier. Finally, the spelled letter is recognized and can be displayed as text in real time.

Frames from the webcam are taken at specific intervals, and from these the images of the individual finger-spelled letters are segmented out. The method of segmentation of the individual letters is explained in the course of the paper. The images are then cropped to reduce the effect of the background and of other insignificant details that may impair classification. Once the images are cropped, they are fed to the classifiers, and the results obtained from each of the classifiers are discussed. After a letter has been classified it is immediately displayed on the screen, and the webcam continues to obtain new images, making the process real-time. In this phase we have also introduced a new feature that verifies the spelling of the text displayed on screen.

II. PREVIOUS WORK

ASL being an inherently gestural language, many previous works have also taken the background scene, facial expressions, and eyebrow movements into account to recognize signs [3], [4]. In this project we consider only the finger-spelled alphabet of ASL. In much of the previous work the signer wears input devices such as gloves while signing; the position of the hand and the transitions between letters are recorded, and these inputs are used to extract and recognize the signed letters. The gloves worn by the user help in obtaining hand position and orientation data [5], which is sent to a computer for further analysis. Though this approach has been shown to yield acceptable results, it is an invasive and expensive solution. Some image processing approaches recognized letters using the lowest Mean Square Error (MSE), while other approaches used images collected from more than one camera, with algorithms ranging from Hidden Markov Models, to modeling the position and velocity of the hand [6], to Neural Networks. The research that most closely correlates with what we propose to accomplish uses the SecDia Fisher Linear Discriminant (FLD) [7], in which the training images are rearranged based on the secondary diagonal; Fisher Linear Discriminant analysis is then applied to the rearranged images to obtain fingerspelling recognition.

III. APPROACH

In this project we consider the alphabet of ASL and try to convert it to digital text. The approach to obtaining and recognizing the finger-spelled letters can be formulated in three steps:
1. Obtaining the required image data (Data Acquisition).
2. Manipulating the data for classification (Image Preprocessing).
3. Extracting salient features of the data, classifying the data using the above-mentioned algorithms, and recognizing the letters (Feature Extraction and Classification).

Fig. 1. Block diagram of the different stages of the finger-spelling recognition system.

IV. IMPLEMENTATION

1. Data Acquisition
The data acquisition can be looked at as two parts: first, obtaining the training data set and, second, processing the input video.

For the training data set, images of the 26 letters and a sign for the space between words, giving a total of 27 signs with 9 samples of each, were obtained. Three subjects (each repeating 3 instances of a letter) were used to perform the signing, so each letter was imaged 9 times with 3 images per signer. These static images are used as our training set. The training images were cropped to obtain the region of interest and to discard insignificant detail, such as the background, that might reduce the accuracy of classification.

For processing the input video, a webcam was interfaced with MATLAB and 12 frames were captured with a pause of 0.01 s between frames. Out of these 12 frames, only the ones containing finger-spelled letters were saved, while the frames that were transitions between two letters were discarded. To accomplish this, each frame was compared to the previous one and the difference between the two images was calculated; for the 12 frames this gives 11 difference images. If there was little motion between two frames, the resulting difference image is dark (black), while if the difference is large (i.e., large motion between frames) the difference image is white where the change took place, as shown in Fig. 2. The closer a difference image is to 0, the less motion or change occurred; the further it is from 0, the more change was observed between frames. The Euclidean distance of each difference image from 0 was therefore calculated, and, based on a heuristic threshold (e = 7 or 8 depending on the background and illumination), the frames with smaller differences were preserved; a still image indicates that a letter has been spelled. Finally, consecutive still frames were also cast off, since they represent the same letter and it would otherwise be counted more than once: we look at the 8 past images, and if any of them is the same letter, the new frame is not considered. Consequently, the signer must hold a sign for at most 8 frames, or 800 ms, else that letter would be considered again.

Fig. 2. Two of the 12 images captured from the webcam during run-time, and their difference image, which determines whether a frame is cast away or kept for classification.

2. Image Preprocessing
The images obtained are color (RGB) images of size 352x288. Since we do not extract any features from the color information, all images are converted to gray scale. The gray-scale images are then cropped to obtain the region of interest and remove the background. For ease of computation, and to eliminate artifacts arising from the background, the images were captured against a plain black background; this also makes it easy to crop the images using thresholding. Binary thresholding is performed to obtain a black-and-white image, which helps in finding the coordinates at which to crop. The black-and-white image is cropped by detecting three edges (the left, the right, and the top), and the edges are detected by counting the number of white pixels encountered in each line, since the white pixels represent the hand. The images are then resized, since PCA and LDA require all training and testing images to have the same dimensions. Individual PCA, global PCA, LDA, linear SVM and kernel SVM are then run on these images, and the performance and clustering ability of all these algorithms are obtained.
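The frame-selection rule just described can be sketched as follows. This is a Python sketch of the MATLAB logic; the function name and the per-pixel normalization of the distance are illustrative assumptions, since the paper does not specify how the Euclidean distance is scaled.

```python
import numpy as np

def select_still_frames(frames, e=7.0):
    """Keep frames whose difference from the previous frame is small.

    frames : list of 2-D uint8 gray-scale arrays of equal size.
    e      : heuristic threshold on the difference image (7 or 8 in the
             paper, depending on background and illumination).
    """
    kept = []
    for prev, curr in zip(frames, frames[1:]):
        diff = curr.astype(np.float64) - prev.astype(np.float64)
        # Euclidean distance of the difference image from the all-zero
        # image, scaled per pixel (an assumption) so the threshold does
        # not depend on image resolution.
        dist = np.linalg.norm(diff) / np.sqrt(diff.size)
        if dist < e:          # small change => a held (still) sign
            kept.append(curr)
    return kept
```

A near-black difference image passes the threshold and is kept; a transition between two letters produces a large difference and is cast off.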

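The thresholding-and-scan cropping of the preprocessing step can be sketched as below. This is a Python sketch of the MATLAB procedure: the threshold value, the nearest-neighbour resize, and all names are illustrative assumptions; only the more-than-10-white-pixels rule, the cushioning gap, and the 120x200 target size come from the paper.

```python
import numpy as np

def crop_hand(gray, thresh=128, min_white=10, pad=5, out_shape=(120, 200)):
    """Binary-threshold, find the hand's top/left/right edges, crop, resize."""
    bw = gray > thresh                       # binary thresholding
    row_white = bw.sum(axis=1)               # white pixels per row
    col_white = bw.sum(axis=0)               # white pixels per column
    # A line with more than `min_white` white pixels marks the hand.
    top = int(np.argmax(row_white > min_white))
    left = int(np.argmax(col_white > min_white))
    right = len(col_white) - 1 - int(np.argmax((col_white > min_white)[::-1]))
    # Cushioning parameter: leave `pad` pixels between the detected
    # coordinate and the actual cropping line.
    top = max(top - pad, 0)
    left = max(left - pad, 0)
    right = min(right + pad, gray.shape[1] - 1)
    cropped = gray[top:, left:right + 1]     # bottom edge is not detected
    # Nearest-neighbour resize to a common size for PCA/LDA (an assumption;
    # any standard resize would do).
    r = np.linspace(0, cropped.shape[0] - 1, out_shape[0]).astype(int)
    c = np.linspace(0, cropped.shape[1] - 1, out_shape[1]).astype(int)
    return cropped[np.ix_(r, c)]
```

The same routine is applied to training and run-time images so that all inputs to the classifiers share one fixed size.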
V. During testing. then we find the reconstruction error of each reconstructed image to the original test image and classify based on which class gave the smallest reconstruction error.3 starting from top to bottom to find the top edge. Individual PCA. 3. For a large value(N_Tr =8) of training samples the classifier returned a 100% accuracy.The methodology we have adopted for LDA is based on the work of Belhumeur et al. 3. and then each training image is projected onto this subspace to obtain the feature vector for each class. This process is done on the training images too while modeling the classifier. Fig. Linear Discriminant Analysis . The recognition accuracy for the various cases is shown in Table I. We have concluded that since it is a post processing technique it can act to better the results without adversely affecting the outcome in most cases. and then each training image of reduced dimensionality is projected onto this subspace to obtain the feature vector for each class. The nine images were captured using three signers (three per signer). 4. This is introduced as a post processing step to improve the accuracy of the text returned as the output of the recognition phase. to perform Spell checker A new feature has been incorporated in this phase of the project. The spell checker compares the recognition output to each of the word contained in the library. Global PCA – Principal component analysis is done on all the training images of all the classes (i. [8]. Thresholding the image to find the coordinates of the boundary and then cropping the grayscale image. A cushioning parameter is also provided which leaves that many pixels gap between the coordinate found by detection and the actual cropping line. letters) together to find a universal linear feature subspace. we generated a database of nine images for each of the twenty six letters and a sign for space between words. 
Individual PCA – Principal component analysis is done on the training images of each class separately to obtain linear feature subspace for each of the classes. To overcome this problem the images are resized to 120x200. It can also correct errors due to misclassification at the recognition phase. The images were captured with consistent illumination and without any lateral or rotational changes. During testing we project the testing image onto each of the subspace and reconstruct the image using the projections from each of the subspace. if a word is found to be a perfect match the score is added a zero and the search stops. we project the testing image onto the universal linear subspace to obtain the feature vector for the test image and we find the distance of this vector to each of the trained class vectors using the Euclidean distance metric and classify based on the nearest neighbor rule. If there was no successful match it tries to assign the word to an element of the dictionary with the minimum Levenshtein distance measure. principal component analysis to reduce the dimensionality of the training images and then perform linear discriminant analysis to find the universal linear feature subspace. deletion of a required character or a wrong character substitution which can be viewed as propagated errors from the image capture stage. subject variability and change in handshape of each of the letter. 2. and its coordinates are used to crop the gray scale image. We had added spell checker module that is constructed on the Levenshtein distance metric.Linear SVM has been used to classify each of the classes. EXPERIMENTAL RESULTS For our experiments. During testing. Feature Extraction and Classification For feature extraction and classification we have used PCA and LDA in this phase of our work. Support Vector Machine (SVM). from left to right to find the left edge and the right to left to find the right edge. 
we project the testing image onto the universal linear subspace to obtain the feature vector for the test image and we find the distance of this vector to each of the trained class vectors using the eucledian distance metric and classify based on the nearest neighbor rule. . This resizing has to be done since PCA and LDA algorithms run only when the training and testing data have the same dimensions. The cropped images will be of varying sizes due to the distance between the hand and the camera. The SVM is implementation is based on the method illustrated in [9]. We have used two types of PCA: 1. LDA and SVM) of our classifiers by partitioning the database into training and testing set of different sizes.e. When more than 10 white pixels are encountered in a line that line is detected as the beginning of the hand. To reduce dimensionality the PCA coefficients were fed into the SVM classifier. The current library of vocabulary used consists of 58112 words of English. The purpose of adding this new feature is to rectify the errors that can occur due to insertion of an extra character. We have limited to the present collection of words keeping in focus the objective of keeping the computational time low. during training. The number of words in the vocabulary can be varied to increase or decrease the size of the library. We tested all four (Global PCA.
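The individual-PCA decision rule described above (classify by smallest reconstruction error over per-class subspaces) can be sketched as follows. This is a Python sketch of the MATLAB method; the function names, the number of retained components k, and the toy data layout are illustrative assumptions.

```python
import numpy as np

def fit_class_subspaces(train, k=5):
    """train: dict mapping class label -> (n_i, d) array of vectorized
    images of that class. Returns per-class (mean, top-k components)."""
    models = {}
    for label, X in train.items():
        mu = X.mean(axis=0)
        # Rows of Vt are the principal directions of the centered data.
        _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
        models[label] = (mu, Vt[:k])
    return models

def classify(x, models):
    """Project x onto each class subspace, reconstruct, and pick the
    class with the smallest reconstruction error."""
    best, best_err = None, np.inf
    for label, (mu, V) in models.items():
        xc = x - mu
        recon = mu + V.T @ (V @ xc)          # reconstruction from projection
        err = np.linalg.norm(x - recon)      # reconstruction error
        if err < best_err:
            best, best_err = label, err
    return best
```

Global PCA differs only in fitting a single subspace on all classes together and classifying by nearest neighbor among the projected class feature vectors.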

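The spell-checker rule described earlier can be sketched with the standard dynamic-programming Levenshtein distance (insertions, deletions, and substitutions each cost one). The 58112-word library is replaced here by a toy lexicon; the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (classic DP, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def spell_check(word, lexicon):
    """Return `word` if it is a perfect match (score zero, search stops);
    otherwise the lexicon entry with the minimum Levenshtein distance."""
    if word in lexicon:
        return word
    return min(lexicon, key=lambda w: levenshtein(word, w))
```

A one-character insertion, deletion, or substitution introduced by the recognition stage is thus mapped back to the nearest dictionary word.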
TABLE I. Recognition accuracy of Global PCA, Individual PCA, LDA, SVM on raw images, SVM on PCA coefficients, and SVM on LDA coefficients, for N_Tr = 6, 7, and 8. * Euclidean distance is used for GPCA, IPCA and LDA. N_Tr is the number of training images used per letter.

In this phase of the project we have implemented a linear SVM and tested its performance with different inputs: results obtained using raw image data, PCA coefficients of the images, and LDA coefficients of the images are compared. The features extracted were also varied to check whether an improvement could be achieved with another feature set; edge features, obtained using the Sobel operator with a threshold of 0.35, were used as input to the linear SVM, but the results were not as promising and no improvement was recorded, and hence we have used the PCA and LDA coefficients as inputs to the SVM. As can be seen from Table I, the recognition accuracy of all the classifiers has a very similar relationship with the number of training images: with three training images per letter the accuracy is poor, but it is much better with four and five training images, and when the number of training images is taken to be 8, an accuracy of 100% is attained. For five training images per letter, the two kinds of PCA give the same accuracy. Moreover, we determined the relationship between the true recognition rate and the false acceptance rate using the receiver operating characteristic (ROC) curve for the PCA and LDA classifiers, as shown in Fig. 4; LDA has a much better trade-off between true recognition rate and false acceptance rate than PCA.

Fig. 4. ROC curves for PCA and LDA.

VI. FUTURE WORK

In feature extraction and classification we will work on finding the optimal number of training images for PCA and LDA. We will also look at other classification techniques such as correlation filters, and a further approach would be to work with a kernel SVM.

REFERENCES
[1] C. Baker-Shenk and D. Cokely, "American Sign Language: A Teacher's Resource Text on Grammar and Culture," Washington, 1991.
[2] Deaf Library, http://www.deaflibrary.org/asl.html
[3] I. Essa, T. Darrel and A. Pentland, "Tracking facial motion," IEEE Workshop on Nonrigid and Articulated Motion.
[4] C. Vogler and D. Metaxas, "ASL recognition based on a coupling between HMMs and 3D motion analysis," International Conference on Computer Vision, Mumbai, India, 1998, pp. 363-369.
[5] A. Burgett, "Assistive technology: Communication devices for the deaf," 2004.
[6] T. Starner, J. Weaver and A. Pentland, "Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video," IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), pp. 1371-1375, 1998.
[7] M.G. Suraj and D.S. Guru, "Secondary Diagonal FLD for Fingerspelling Recognition," Proceedings of the International Conference on Computing: Theory and Applications (ICCTA '07).
[8] P.N. Belhumeur, J.P. Hespanha and D.J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, pp. 711-720, 1997.
[9] T. Joachims, SVMlight, http://svmlight.joachims.org
