Recognition of American Sign Language


Shreyas Bethur, Pritish Gandhi, and Anupama Kuruvilla

Abstract—This paper outlines the ongoing development of a pattern recognition technique for recognizing the finger-spelled alphabet of the American Sign Language (ASL) vocabulary and converting it to text. This work is phase two of a broader ongoing project. The recognition methodology applies a group of classification techniques: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and both linear and kernel Support Vector Machines (SVM). We describe the approach used to obtain individual frames from a real-time webcam video, crop the part of each image that is used for classification, and run each of the classifiers for recognition. Finally, the intermediate results of each classifier are discussed along with future work. We also use a spell checker to correct and ensure the accuracy of the output text on screen.

Index Terms—Image segmentation, Clustering, American Sign Language, Finger Spelling, PCA, LDA

I. INTRODUCTION

The most common mode of communication amongst the deaf community is sign language. It comprises gestures and visual cues which may or may not be accompanied by motion. In this work we use the finger-spelling alphabet of the American Sign Language [1] vocabulary. American Sign Language (ASL) is the fourth most commonly used language in the United States and Canada [2]. Finger spelling uses signs for each letter of the alphabet to spell out a complete English word. In ASL the finger-spelling gestures are made with a single hand, and most of them do not require motion (except for the letters 'i' and 'z'). Each finger-spelled letter is distinguished by the positioning of the signer's fingers; each configuration that makes a letter of the alphabet is called a handshape.

In this project our approach is to use Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and Support Vector Machines (SVM) as image classification techniques to recognize the finger-spelled alphabet in front of a webcam, using MATLAB. This process initially requires us to train our classifier on a training set of images of all 27 signs (26 letters and space) of the ASL alphabet. During testing, only the appropriate image frames from the webcam are extracted and manipulated so that they can be fed to the classifier. Finally, the spelled letter is recognized and can be displayed as text in real time. Frames from the webcam are taken at specific intervals and the images of individual finger-spelled letters are segmented out of them. The method of segmenting the individual letters will be explained in the course of the paper. The images are then cropped to reduce the effect of the background and other insignificant details that may impair the classification. Once the images are cropped they are fed to the classifiers, and the results obtained from each of the classifiers are discussed. After the letters have been classified they are immediately displayed on the screen, and the webcam continues to obtain new images so that the process remains real-time. In this phase we have also introduced a new feature which verifies the spelling of the text displayed on screen.

II. PREVIOUS WORK

ASL being an inherently gestural language, many previous works have also taken the background scene, facial expressions and eyebrow movements into account to recognize the signs [3], [4]. In this project we consider only the finger-spelled alphabet of ASL. In many of the previous works the signer wears input devices such as gloves while signing. The position of the hand and the transition between letters are recorded, and these inputs are used to extract and recognize the signed letters. The gloves worn by the user help in obtaining the hand position and orientation data [5], which is sent to a computer for further analysis. Though this approach has proved to yield acceptable results, it is an invasive and expensive solution. Some approaches that involved image processing used the Mean Square Error (MSE) and recognized the letter with the lowest MSE, while other approaches used images collected from more than one camera, with algorithms ranging from Hidden Markov Models, to modeling the position and velocity of the hand [6], to Neural Networks. The research that correlates most closely with what we propose to accomplish uses the SecDia Fisher Linear Discriminant (FLD) [7], in which the training images are rearranged based on the secondary diagonal; Fisher Linear Discriminant analysis is then applied to them to obtain fingerspelling recognition.

III. APPROACH

In this project we consider the alphabet of ASL and try to convert it to digital text. The approach to obtain and recognize the finger-spelled letters can be formulated in three steps, as shown:
1. Obtaining the required image data – Data Acquisition. The data acquisition comprises obtaining the required images from the webcam during run-time.
2. Manipulating the data for classification – Image Preprocessing.
3. Extracting salient features of the data, classifying the data based on the above-mentioned algorithms and recognizing the letters – Feature Extraction and Classification.

Fig. 1. Block diagram of the different stages of the finger-spelling recognition system.

IV. IMPLEMENTATION

Training Data
1. Images of the 26 letters and a sign for the space between words, 27 signs in total with 9 samples of each, were obtained. Three subjects, each repeating 3 instances of a letter, performed the signing. These static images are used as our training set.
2. The training images were cropped to obtain the region of interest and discard insignificant detail, such as the background, that might reduce the accuracy of classification.
3. The images are resized, since PCA and LDA need all training and testing images to have the same dimensions.
4. Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and two variations of the Support Vector Machine (SVM), linear and kernel, are run on these images, and the performance and clustering ability of all these algorithms are recorded.

Data Acquisition
The data acquisition can be looked at as two parts.

First, for the training data set, images were captured using a webcam. Each letter was imaged 9 times, with 3 images per signer. The captured images were of size 352x288. For ease of computation, and to eliminate artifacts arising from the background, the images were captured against a plain black background. Having a plain black background also makes it easy to crop the images using thresholding.

Second, for processing the input video, a webcam was interfaced with MATLAB and 12 frames were captured with a pause of 0.01 s between frames. Out of these 12 frames only the ones showing finger-spelled letters were saved, while the rest, which captured transitions between two letters, were discarded. To accomplish this, each frame was compared to the previous one and the difference between the two images was calculated. If there was little motion between the images, the resulting difference image is dark (black); if the difference is large (i.e. there is large motion between the images), the difference image looks white where the change took place, as shown in Fig. 2. This information was used to decide which image had the least relative motion or change from the previous image; a still image indicates that a letter has been spelled. Hence, for the 12 frames we have 11 difference images. The closer these difference images are to 0, the less motion or change they contain; the further they are from 0, the more change was observed between frames. So the Euclidean distance of each of these difference images from 0 was calculated, and finally, based on a heuristic threshold value (e = 7 or 8, depending on the background and illumination), the images with smaller values of e (i.e. small differences) were preserved.
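The frame-selection rule can be summarized in a short sketch. This is an illustrative Python/NumPy rendering, not the authors' MATLAB code, and the per-pixel normalization that makes the threshold e = 7 or 8 meaningful is our assumption:

```python
import numpy as np

def select_still_frames(frames, e=7.0):
    """frames: list of 2-D grayscale arrays (the 12 captured frames).
    Returns the indices of frames with little motion relative to the
    previous frame, i.e. frames in which a letter is being held."""
    kept = []
    for i in range(1, len(frames)):            # 12 frames -> 11 differences
        diff = frames[i].astype(float) - frames[i - 1].astype(float)
        # Euclidean distance of the difference image from the all-zero
        # image, normalized per pixel (the exact scaling is an assumption).
        dist = np.linalg.norm(diff) / np.sqrt(diff.size)
        if dist < e:                           # e = 7 or 8 in the paper
            kept.append(i)
    return kept
```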

Fig. 2. Two of the 12 images captured using the webcam are shown; their difference, shown below them, determines whether a frame should be cast away or kept for classification.

Finally, consecutive images among those kept were also cast off, since they express the same letter, which would otherwise be counted more than once. We look at the 8 previous images, and if any of them shows the same letter, the current image is not considered. Hence, the user or signer has to hold a sign for at most 8 frames, or 800 ms, otherwise that letter will be counted again.

Image Preprocessing
The images obtained are color (RGB) images. Since we do not extract any features from the color information, all the images are converted to gray scale. The gray-scale images are then cropped to obtain the region of interest and remove the background. Binary thresholding is performed to obtain a black-and-white image, which helps in finding the coordinates at which to crop the image and thus eases the cropping. The black-and-white images are cropped by detecting three edges, the left, the right and the top, as shown in Fig. 3. The edges are detected by counting the number of white pixels encountered in each line, since the white pixels represent the hand. The white pixels are counted line by line, from top to bottom to find the top edge, from left to right to find the left edge, and from right to left to find the right edge. When more than 10 white pixels are encountered in a line, that line is detected as the beginning of the hand, and its coordinates are used to crop the gray-scale image. A cushioning parameter is also provided, which leaves that many pixels of gap between the coordinate found by detection and the actual cropping line.

Fig. 3. Thresholding the image to find the coordinates of the boundary and then cropping the grayscale image.

The cropped images will be of varying sizes due to the distance between the hand and the camera, subject variability and the change in handshape for each letter. To overcome this problem the images are resized to 120x200. This resizing has to be done since the PCA and LDA algorithms run only when the training and testing data have the same dimensions. The same process is applied to the training images while modeling the classifier.
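The cropping and resizing pipeline might look as follows. This is a minimal sketch using Python with OpenCV and NumPy (the paper's implementation is in MATLAB); the binarization threshold, the cushion value and the helper name crop_and_resize are illustrative assumptions:

```python
import cv2
import numpy as np

def crop_and_resize(bgr, bin_thresh=128, min_white=10, cushion=5):
    """Convert to gray scale, threshold, detect the top/left/right edges by
    counting white pixels per line, crop with a cushion, resize to 120x200."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    bw = gray > bin_thresh                      # white (True) pixels = hand

    rows = bw.sum(axis=1)                       # white-pixel count per row
    cols = bw.sum(axis=0)                       # white-pixel count per column
    top = int(np.argmax(rows > min_white))      # first row with >10 white pixels
    left = int(np.argmax(cols > min_white))     # first such column from the left
    right = len(cols) - 1 - int(np.argmax(cols[::-1] > min_white))

    top = max(top - cushion, 0)                 # cushioning parameter: leave a
    left = max(left - cushion, 0)               # few pixels around the hand
    right = min(right + cushion, gray.shape[1] - 1)

    cropped = gray[top:, left:right + 1]        # no bottom edge is detected
    return cv2.resize(cropped, (200, 120))      # 120x200 pixels, as in the paper
```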
Feature Extraction and Classification
For feature extraction and classification we have used PCA and LDA in this phase of our work (an illustrative sketch of the classifiers follows the list below). We have used two types of PCA:
1. Global PCA – Principal component analysis is done on the training images of all the classes (i.e. letters) together to find a universal linear feature subspace, and each training image is projected onto this subspace to obtain the feature vector for its class. During testing, we project the testing image onto the universal linear subspace to obtain the feature vector for the test image, find the distance of this vector to each of the trained class vectors using the Euclidean distance metric, and classify based on the nearest-neighbor rule.
2. Individual PCA – Principal component analysis is done on the training images of each class separately to obtain a linear feature subspace for each class. During testing we project the testing image onto each of the subspaces, reconstruct the image using the projections from each subspace, compute the reconstruction error of each reconstructed image with respect to the original test image, and classify based on which class gave the smallest reconstruction error.
3. Linear Discriminant Analysis – The methodology we have adopted for LDA is based on the work of Belhumeur et al. [8]. During training, we perform principal component analysis to reduce the dimensionality of the training images and then perform linear discriminant analysis to find the universal linear feature subspace; each training image of reduced dimensionality is projected onto this subspace to obtain the feature vector for its class. During testing, we project the testing image onto the universal linear subspace to obtain the feature vector for the test image, find the distance of this vector to each of the trained class vectors using the Euclidean distance metric, and classify based on the nearest-neighbor rule.
4. Support Vector Machine (SVM) – A linear SVM has been used to classify each of the classes. To reduce dimensionality, the PCA coefficients were fed into the SVM classifier. For a large number of training samples (N_Tr = 8) the classifier returned 100% accuracy. The SVM implementation is based on the method illustrated in [9].
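As a concrete illustration of items 1 and 4, the sketch below builds the global PCA feature subspace, classifies by nearest class vector, and trains a linear SVM on the PCA coefficients. It uses scikit-learn in Python rather than the MATLAB and SVM-light [9] tools used in the paper, and it treats each trained class vector as the mean of that class's projected training images, which is one reading of the description above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def train(X_train, y_train, n_components=40):
    """X_train: (n_samples, 120*200) flattened gray-scale images;
    y_train: letter labels. n_components is an illustrative choice."""
    pca = PCA(n_components=n_components)        # universal linear subspace
    feats = pca.fit_transform(X_train)          # feature vector per image
    classes = np.unique(y_train)
    # One trained class vector per letter (here: the class mean).
    class_vecs = np.array([feats[y_train == c].mean(axis=0) for c in classes])
    svm = LinearSVC().fit(feats, y_train)       # linear SVM on PCA coefficients
    return pca, classes, class_vecs, svm

def classify_nn(pca, classes, class_vecs, x_test):
    f = pca.transform(x_test.reshape(1, -1))    # project the test image
    d = np.linalg.norm(class_vecs - f, axis=1)  # Euclidean distances
    return classes[np.argmin(d)]                # nearest-neighbor rule
```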
Spell checker
A new feature has been incorporated in this phase of the project: a spell-checker module built on the Levenshtein distance metric. It is introduced as a post-processing step to improve the accuracy of the text returned as the output of the recognition phase. The current vocabulary library consists of 58112 English words. The number of words in the vocabulary can be varied to increase or decrease the size of the library; we have limited it to the present collection with the objective of keeping the computational time low.

The purpose of adding this new feature is to rectify errors that can occur due to the insertion of an extra character, the deletion of a required character or a wrong character substitution, which can be viewed as errors propagated from the image capture stage. It can also correct errors due to misclassification at the recognition phase. The spell checker compares the recognition output to each of the words contained in the library; if a word is found to be a perfect match it receives a score of zero and the search stops. If there is no exact match, it assigns the word to the element of the dictionary with the minimum Levenshtein distance. We have concluded that, since it is a post-processing technique, it can improve the results without adversely affecting the outcome in most cases.
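A minimal sketch of this spell checker is given below, assuming the 58112-word library has been loaded into a Python list (the loading itself is omitted). The dynamic-programming edit distance counts exactly the insertions, deletions and substitutions described above:

```python
def levenshtein(a, b):
    """Edit distance between words a and b (insertions, deletions and
    substitutions all cost 1), computed with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(word, vocabulary):
    """Return the word itself on an exact match (score zero, search stops),
    otherwise the vocabulary entry at minimum Levenshtein distance."""
    if word in vocabulary:
        return word
    return min(vocabulary, key=lambda w: levenshtein(word, w))
```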
V. EXPERIMENTAL RESULTS

For our experiments, we generated a database of nine images for each of the twenty-six letters and a sign for the space between words. The nine images were captured using three signers (three images per signer). The images were captured with consistent illumination and without any lateral or rotational changes. We tested all four of our classifiers (Global PCA, Individual PCA, LDA and SVM) by partitioning the database into training and testing sets of different sizes. The recognition accuracy for the various cases is shown in Table I.

TABLE I. Recognition accuracy for different numbers of training images.

                  N_Tr = 6    N_Tr = 7    N_Tr = 8
Global PCA         49.4 %      68.5 %      92.6 %
Individual PCA     48.2 %      83.3 %      96.3 %
LDA                51.2 %      85.2 %      92.6 %
SVM on raw         59.3 %      83.3 %      100 %
SVM on PCA         59.3 %      83.3 %      100 %
SVM on LDA         55.5 %      81.5 %      92.3 %

* Euclidean distance is used for GPCA, IPCA and LDA.
N_Tr is the number of training images used per letter.

As can be seen from Table I, the recognition accuracies of all the classifiers have a very similar relationship with the number of training images. With three training images per letter the accuracy is poor, but it is much better with four and five training images. Moreover, for five training images the two kinds of PCA give the same accuracy. It was observed that when the number of training images is taken to be 8, an accuracy of 100% was attained.

By setting the number of training images to five per letter, we determined the relationship between the true recognition rate and the false acceptance rate using the receiver operating characteristic (ROC) curve for the PCA and LDA classifiers, as shown in Fig. 4. LDA has a much better relationship between true recognition rate and false acceptance rate than PCA.

Fig. 4. ROC curve for PCA and LDA.
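The paper does not specify how match scores are derived for the ROC, so the sketch below is our assumption: use the negative Euclidean distance to a claimed class vector as the match score and let scikit-learn sweep the acceptance threshold.

```python
import numpy as np
from sklearn.metrics import roc_curve

def roc_from_distances(distances, is_genuine):
    """distances: Euclidean distance from each test feature vector to the
    class vector of the claimed letter; is_genuine: 1 where the claim is
    correct, 0 otherwise. Returns false-acceptance and true-recognition
    rates for a swept acceptance threshold."""
    scores = -np.asarray(distances)       # smaller distance = stronger match
    far, trr, _ = roc_curve(is_genuine, scores)
    return far, trr
```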
VI. FUTURE WORK

In feature extraction and classification we will work on finding the optimal number of training images for PCA and LDA. We will also look at other classification techniques such as correlation filters and kernel Support Vector Machines.

In this phase of the project we have implemented a linear SVM and tested its performance on different inputs. The results obtained using raw image data, the PCA coefficients of the images and the LDA coefficients of the images are compared. The extracted features were also varied to check whether an improvement could be achieved with another feature set: edge features obtained using the Sobel operator with a threshold of 0.35 were used as input to the linear SVM. The results were not as promising and no improvement was recorded; hence we have used PCA and LDA coefficients as inputs to the SVM. The further approach would be to work with a kernel SVM.

REFERENCES
[1] C. Baker-Shenk and D. Cokely, American Sign Language: A Teacher's Resource Text on Grammar and Culture, Washington, D.C., April 1991.
[2] D. Burgett, "Assistive technology: Communication devices for the deaf," 2004.
[3] http://www.deaflibrary.org/asl.html
[4] I. Essa, T. Darrel and A. Pentland, "Tracking facial motion," IEEE Workshop on Nonrigid and Articulated Motion.
[5] T. Starner, J. Weaver and A. Pentland, "Real-time American Sign Language recognition using desk and wearable computer based video," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 12, pp. 1371-1375, 1998.
[6] C. Vogler and D. Metaxas, "ASL recognition based on a coupling between HMMs and 3D motion analysis," International Conference on Computer Vision, pp. 363-369, Mumbai, India, 1998.
[7] M.G. Suraj and D.S. Guru, "Secondary diagonal FLD for fingerspelling recognition," Proceedings of the International Conference on Computing: Theory and Applications (ICCTA '07).
[8] P.N. Belhumeur, J.P. Hespanha and D.J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, pp. 711-720, 1997.
[9] http://svmlight.joachims.org