
2021 3rd International Conference on Signal Processing and Communication (ICPSC) | 13 – 14 May 2021 | Coimbatore

An Efficient Approach for Interpretation of Indian Sign Language using Machine Learning

Dhivyasri S, Department of ECE, PSG College of Technology, Coimbatore, India (dhivyasri2000@gmail.com)
Krishnaa Hari K B, Department of ECE, PSG College of Technology, Coimbatore, India (krishnaaharikbk@gmail.com)
Akash M, Department of ECE, PSG College of Technology, Coimbatore, India (akashmohan1999m@gmail.com)
Sona M, Department of ECE, PSG College of Technology, Coimbatore, India (sonamani05121998@gmail.com)
Divyapriya S, Department of ECE, PSG College of Technology, Coimbatore, India (divyapriyasivam98@gmail.com)
Dr. Krishnaveni V, Department of ECE, PSG College of Technology, Coimbatore, India (vk.ece@psgtech.ac.in)

Abstract—Sign language is a form of non-verbal communication used by people with hearing or speech disabilities to express their thoughts and feelings. However, most people find it difficult to understand the hand gestures of the specially challenged because they do not know the meaning of the sign language gestures. Usually, a translator is needed when a speech or hearing impaired person wants to communicate with an ordinary person and vice versa. To enable the specially challenged to communicate effectively with the people around them, this paper proposes a system that translates the Indian Sign Language (ISL) hand gestures of numbers (1-9), English alphabets (A-Z) and a few English words to understandable text and vice versa. This is done using image processing techniques and Machine Learning algorithms. Different neural network classifiers are developed, tested and validated for their performance in gesture recognition, and the most efficient classifier is identified.

Index Terms—Indian Sign Language, hand gestures, interpreter, SURF, Convolutional Neural Network, Recurrent Neural Network, K-means clustering, Support Vector Machine

I. INTRODUCTION

Sign languages vary throughout the world: around 300 different sign languages are used across various parts of the world, because sign languages developed naturally among people belonging to different ethnic groups. India does not have a single standard sign language; lexical variations and different dialects of Indian Sign Language are found in different parts of India. Recently, however, efforts have been taken to standardize the Indian Sign Language (ISL). The ISL hand gestures are divided into two broad categories: (i) static gestures and (ii) dynamic gestures. The static ISL hand gestures of numbers (0-9), English alphabets (A-Z), and some English words are shown in Fig. 1.

Fig. 1. ISL Hand Gestures

According to the 2011 census, there are around 50 lakh (5 million) people in India with speech or hearing impairments, but fewer than 300 educated and trained sign language interpreters. As a result, people with speech or hearing impairments tend to become isolated and lonely, as they face difficulties in communicating with others. This has a tremendous effect on both their social and working life.

Due to these challenges, this paper proposes an automated real-time system that can translate English words to ISL and vice versa. The system makes it easier for the specially challenged to communicate effectively with the rest of the world. The proposed system performs two major tasks: (i) Gesture to Text conversion and (ii) Speech to Gesture conversion. Gesture to text conversion is done using neural network classifiers; speech to gesture conversion is done using the Google Speech Recognition API.

This paper focuses on converting standard Indian Sign Language gestures to English, and spoken English words to Indian Sign Language gestures, with the highest possible accuracy. For this, different neural network classifiers are developed and their performance in gesture recognition is tested. The most accurate and efficient classifier is chosen and used to develop an application that converts ISL gestures to their corresponding English text, and speech to the corresponding ISL gestures.


II. PROPOSED METHODOLOGY

As mentioned in the previous section, the proposed system for ISL interpretation performs two major tasks: (i) Gesture to Text conversion and (ii) Speech to Gesture conversion.

A. Gesture to Text Conversion

Gesture to text conversion involves four major steps: (i) Dataset collection, (ii) Segmentation, (iii) Feature Extraction and (iv) Classification. The concept diagram of gesture to text conversion is given in Fig. 2.

Fig. 2. Concept Diagram - Gesture to Text Conversion

The first step in gesture to text conversion is dataset collection. An image dataset consisting of ISL hand gestures of 9 numbers (1-9), 26 English alphabets and a few English words is collected. Once the dataset is ready, all the images are pre-processed to mask unwanted areas and to remove noise. Pre-processing the images before feeding them to a classifier improves the efficiency, accuracy and performance of the system, so this step is very important in the image classification process.

Data pre-processing involves the following steps (a minimal sketch of this pipeline is given after the list):
1) Resizing the images to the same size (for uniformity)
2) Conversion of RGB image to Grayscale image
3) Median Blur
4) Skin Masking and Skin Detection
5) Canny Edge Detection (to detect sharp edges in the image)
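The paper does not include code for these steps; the following is a minimal OpenCV/Python sketch of one plausible implementation, in which the HSV skin-color bounds, the target image size and the Canny thresholds are assumed values rather than those used by the authors.

import cv2
import numpy as np

def preprocess(image_bgr, size=(128, 128)):
    """Illustrative pre-processing chain: resize, grayscale, median blur,
    skin masking and Canny edge detection. All thresholds are assumptions."""
    # 1) Resize every image to the same size for uniformity
    img = cv2.resize(image_bgr, size)
    # 2) Convert RGB (BGR in OpenCV) to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # 3) Median blur to suppress salt-and-pepper noise
    blurred = cv2.medianBlur(gray, 5)
    # 4) Skin masking in HSV space (bounds are illustrative, not the paper's)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    skin_mask = cv2.inRange(hsv, np.array([0, 20, 70]), np.array([20, 255, 255]))
    masked = cv2.bitwise_and(blurred, blurred, mask=skin_mask)
    # 5) Canny edge detection to bring out the hand contour
    edges = cv2.Canny(masked, 60, 180)
    return edges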

The next step in gesture to text conversion is feature extraction, which is performed on the pre-processed images. Feature extraction is a very important step in computer vision and image classification: it converts the raw data (images) into numerical features that the classification algorithm can process, while preserving the information contained in the original data.

Here, feature extraction is done using the Speeded-Up Robust Features (SURF) method. SURF can be used as a feature descriptor or as a feature detector and is often used for applications such as object detection and image classification. It is a fast and robust algorithm for representing and comparing images, and it acts as a blob detector. The SURF features are calculated by finding the interest points in the image that carry meaningful information, using the determinants of Hessian matrices. For each interest point found in this process, a scale-invariant descriptor is constructed.

The Hessian matrix and the approximation of its determinant used by SURF are given in (1) and (2) respectively:

H(f(x, y)) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x \partial y} \\ \dfrac{\partial^2 f}{\partial x \partial y} & \dfrac{\partial^2 f}{\partial y^2} \end{bmatrix}    (1)

\det(H) = D_{xx} D_{yy} - (0.9 D_{xy})^2    (2)
The extracted image features are fed as input to different machine learning algorithms, namely the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN) and the Support Vector Machine (SVM).

SVM is a supervised machine learning algorithm that uses a hyperplane to separate different classes of data. The SVM classifier is used together with a K-means clustering classifier and a Bag of Visual Words (BoV) model to achieve better accuracy. K-means clustering is an unsupervised algorithm that groups similar data points into 'k' clusters, where 'k' is the number of classes in the dataset. The output of the K-means clustering model is fed to the BoV model, which represents each image by the counts of visual words (clustered image features) that occur in it. The output of the BoV model is fed to the SVM, which is the main classifier with which training and testing are done. For the SVM classifier, about 80% of the data in the dataset is used for training and the remaining 20% is used for testing.
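The paper does not give implementation details for this pipeline; the following scikit-learn sketch shows one common way to build a Bag of Visual Words on top of K-means and train an SVM on the resulting histograms. The descriptor and label variables, the linear kernel and the normalization are assumptions made for illustration (k = 42 follows the paper).

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Assume surf_descriptors is a list with one (n_i x 64) SURF descriptor array per
# image, and labels holds the 42 class labels (both loaded elsewhere).

def build_bov_histograms(surf_descriptors, kmeans):
    k = kmeans.n_clusters
    histograms = []
    for desc in surf_descriptors:
        words = kmeans.predict(desc)                   # map each descriptor to a visual word
        hist, _ = np.histogram(words, bins=np.arange(k + 1))
        histograms.append(hist / max(hist.sum(), 1))   # normalized word counts
    return np.array(histograms)

# 1) Learn the visual vocabulary with K-means (k = number of classes, per the paper)
all_desc = np.vstack(surf_descriptors)
kmeans = KMeans(n_clusters=42, random_state=0).fit(all_desc)

# 2) Represent every image as a Bag of Visual Words histogram
X = build_bov_histograms(surf_descriptors, kmeans)

# 3) Train and test the SVM on an 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=0)
svm = SVC(kernel="linear").fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))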
The CNN and RNN classifier models are also designed, and their performance in gesture recognition is noted. For both CNN and RNN, the dataset is divided into three parts: (i) 60% of the data is used for training, (ii) 20% is used for testing and (iii) the remaining 20% is used for validation.

The performance of all the above classifiers is then compared to identify the most efficient image classifier for gesture recognition. The classifier with the highest accuracy is used to detect ISL gestures in a live video. Recognizing gestures in a live video is essentially the same as identifying gestures in a still image. The steps involved in identifying gestures in live (real-time) video are listed below (a small sketch of this loop follows the list):
1) Video is captured using a web camera
2) Each frame in the video is captured as an image


3) The captured images are resized and pre-processed
4) SURF features are extracted
5) The image features are passed to the classifier
6) Gestures are predicted
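A minimal sketch of such a loop with OpenCV is given below; it assumes the preprocess() function, the surf detector and the BoV/SVM objects from the earlier sketches, so it illustrates the flow of the six steps rather than the authors' exact application.

import cv2

cap = cv2.VideoCapture(0)                             # 1) capture video from the web camera
while True:
    ok, frame = cap.read()                            # 2) grab each frame as an image
    if not ok:
        break
    edges = preprocess(frame)                         # 3) resize and pre-process the frame
    _, desc = surf.detectAndCompute(edges, None)      # 4) extract SURF features
    if desc is not None:
        hist = build_bov_histograms([desc], kmeans)   # map features to visual words
        label = svm.predict(hist)[0]                  # 5) pass features to the classifier
        cv2.putText(frame, str(label), (10, 40),      # 6) show the predicted gesture
                    cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)
    cv2.imshow("ISL interpreter", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()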

B. Speech to Gesture Conversion

The conversion of speech to ISL gestures is done in three steps:
1) Conversion of speech to text
2) Comparison of the text obtained in the previous step with the database
3) Display of the corresponding ISL gesture output

Speech to text conversion is done using PyAudio and the Google Speech Recognition API. The concept diagram of speech to gesture conversion is shown in Fig. 3.

Fig. 3. Concept Diagram - Speech to Gesture Conversion
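Speech capture and recognition of this kind are commonly done with the speech_recognition package (which uses PyAudio for microphone access) and its recognize_google() method, which wraps the Google Speech Recognition web API. The sketch below is an assumed implementation along those lines, including the 5-second listening window described later in the paper; it is not taken from the authors' code.

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:                       # microphone access via PyAudio
    print("Speak now (5 seconds)...")
    audio = recognizer.listen(source, phrase_time_limit=5)   # 5-second listening window

try:
    text = recognizer.recognize_google(audio)         # Google Speech Recognition API (needs internet)
    print("Recognized:", text)
    # The recognized word would then be compared with the gesture database and the
    # corresponding ISL gesture displayed (application-specific, omitted here).
except sr.UnknownValueError:
    print("Speech could not be understood")
except sr.RequestError as err:
    print("API request failed:", err)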

III. EXPERIMENTAL RESULTS

A. Gesture to Text Conversion

A dataset consisting of ISL hand gestures of numbers (1-9), English alphabets (A-Z), and 7 English words (BOAT, FRIEND, HOLIDAY, OK, SWING, SMILE and STAND) was collected (refer Fig. 1). The dataset therefore consisted of 42 (9 + 26 + 7) classes of images, with 1200 different images captured for each class.

The images in the dataset were pre-processed to mask unwanted areas and to remove noise, as described in the previous section. The various pre-processing steps applied to a sample image from the dataset are shown in Fig. 4.

Fig. 4. Data pre-processing

The SURF feature matrix was then calculated for every pre-processed image. The SURF features extracted from a sample image are shown in Fig. 5; the blue circles of varying sizes are the SURF feature points. The SURF features of all the images were extracted, stored in a pickle file and then fed into the different classifiers. The accuracy of each classifier tested is discussed below.

Fig. 5. SURF Features

1) Support Vector Machine: The input images were passed to the K-means clustering and Bag of Visual Words classifiers before being passed to the SVM classifier. As there are 42 classes of images in the dataset, k = 42 for the K-means clustering classifier. The visual words were collected for both the training and test sets after applying the K-means clustering algorithm. There are 50,391 images in the dataset in total; 40,320 of these were used to train the SVM model and the remaining 10,071 were used to test the classifier. A testing accuracy of around 99.5% was achieved. Other performance metrics, namely the precision, recall and F1 scores, were also calculated; these are shown in Fig. 6.

Fig. 6. Accuracy of SVM Classifier
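The paper does not say how these metrics were computed; a typical way with scikit-learn is sketched below, assuming the y_test split and the trained svm from the earlier sketch. With 42 classes the scores must be averaged across classes; weighted averaging is an assumption here.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = svm.predict(X_test)

# Multi-class metrics: average the per-class scores, weighted by class support
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="weighted"))
print("Recall   :", recall_score(y_test, y_pred, average="weighted"))
print("F1 score :", f1_score(y_test, y_pred, average="weighted"))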
2) Convolutional Neural Network: A Convolutional Neural Network was modeled and developed using the Keras library in Python. Around 30,240 images (60% of the images in the dataset) were used to train the classifier model, and the classifier was trained with different numbers of epochs. A maximum average testing accuracy of around 88.89% was obtained.
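The paper does not specify the network architecture, input size or hyperparameters, so the following Keras sketch is only a plausible baseline for a 42-class gesture classifier on 64x64 grayscale images; every layer choice here is an assumption.

from tensorflow.keras import layers, models

def build_cnn(num_classes=42, input_shape=(64, 64, 1)):
    """Small illustrative CNN; the architecture is assumed, not the authors'."""
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model.fit(x_train, y_train, epochs=20, validation_data=(x_val, y_val)) would then
# use the 60/20/20 train/test/validation split described in Section II.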


3) Recurrent Neural Network: A Recurrent Neural Network was modeled and developed using the Keras library in Python. Around 30,240 images were used to train the classifier model, and the classifier was trained with different numbers of epochs. A maximum overall testing accuracy of around 82.3% was obtained.
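The paper likewise gives no details of the RNN, including how the image features are arranged into sequences. The sketch below assumes each image is represented by a fixed-length sequence of 64-dimensional SURF descriptors fed to an LSTM layer; this is purely an assumed configuration for illustration.

from tensorflow.keras import layers, models

def build_rnn(num_classes=42, seq_len=50, feat_dim=64):
    """Illustrative RNN over a padded sequence of SURF descriptors per image (assumed)."""
    model = models.Sequential([
        layers.Masking(mask_value=0.0, input_shape=(seq_len, feat_dim)),
        layers.LSTM(128),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model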
From the above results, it is concluded that the combination of K-means clustering, BoV and SVM classifiers has the highest accuracy in recognizing the hand gestures, and is therefore the most reliable option for gesture recognition.

B. Gesture Recognition in Live Video

A real-time gesture recognition system was developed using the SVM classifier. When the user shows an ISL hand gesture in front of the camera, the corresponding English text is displayed. The time taken to predict a hand gesture in the real-time video is about 0.04 s. Fig. 7 shows screenshots of the real-time gesture recognition system.

Fig. 7. Real-time gesture recognition

C. Speech to Gesture Conversion

Speech to gesture conversion is done using the Google Speech Recognition API and PyAudio. As the Google Speech Recognition API is used, this process requires an active internet connection. The speech duration is set to 5 seconds, i.e. the user is given 5 seconds to speak the word into the microphone. The Google Speech Recognition API then converts the speech into text, and the ISL gesture corresponding to the recognized text is displayed. A screenshot of the speech to gesture conversion of the word "Hello" is given in Fig. 8.

Fig. 8. Speech to Gesture Conversion

IV. CONCLUSION

From the results obtained, it is inferred that the SVM classifier, together with the K-means clustering and BoV classifiers, is best suited for gesture recognition. A user-friendly application that can interpret Indian Sign Language has been developed using this SVM classifier (for gesture to text conversion) and the Google Speech Recognition API (for speech to gesture conversion). Thus, a more reliable sign language interpretation system has been developed.

ACKNOWLEDGMENT

We would like to thank the ECE Department of PSG College of Technology for providing an opportunity to work on this project. We thank Dr. V. Krishnaveni, Professor (CAS) & Head In-charge, Department of Electronics and Communication Engineering, for the encouragement and support that she extended towards this project work. We also thank our Programme Coordinator Dr. M. Santhanalakshmi, Associate Professor, Department of ECE, and our Tutor Ms. P. Prabavathi, Assistant Professor, Department of ECE, for their advice and constructive feedback regarding various aspects of the project.
