ABSTRACT:
A text-to-speech synthesizer can then be used to convert the recognized text to
speech. The input text is first preprocessed and normalized, followed by
linguistic/prosodic processing, which translates the words into segments with
durations and an F0 contour. Finally, the waveform is generated and the output is
produced as speech.
PROGRESS:
Dataset Specifications:
To begin the project, gesture videos of 76 ISL signs were recorded in the
anechoic chamber of the ECE Department, JNTUHUCEH. 50 videos were recorded for
each sign, which sums to a total of 3,800 videos. For training, 40 videos of each
gesture are used; the remaining 10 videos of each gesture are used for testing.
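As a quick sanity check, the counts above (76 signs, 50 videos per sign, split 40/10 per gesture) can be verified with simple arithmetic:

```python
# Dataset bookkeeping for the recorded ISL gesture videos.
NUM_SIGNS = 76
VIDEOS_PER_SIGN = 50   # recorded per sign
TRAIN_PER_SIGN = 40    # used for training
TEST_PER_SIGN = 10     # used for testing

total_videos = NUM_SIGNS * VIDEOS_PER_SIGN
train_videos = NUM_SIGNS * TRAIN_PER_SIGN
test_videos = NUM_SIGNS * TEST_PER_SIGN

# The per-sign split must account for every recorded video.
assert TRAIN_PER_SIGN + TEST_PER_SIGN == VIDEOS_PER_SIGN
print(total_videos, train_videos, test_videos)  # 3800 3040 760
```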
Following are the gesture names categorized into 4 sections: numbers, alphabets,
greetings and medical terms.
0 1 2 3 4
5 6 7 8 9
Table 1. Numbers
a b c d e f g
h i j k l m n
o p q r s t u
v w x y z
Table 2. Alphabets
all the best    bye              excuse me           good afternoon   good evening
good morning    good night       hello               how are you      i am fine
my name is      nice to meet you no                  please           sorry
thank you       welcome          what is your name   yes
Table 3. Greetings
The ISL gestures "all the best" and "ECG" are shown in Fig 1 and Fig 2.
The system consists of three stages: the preprocessing stage, the sign-
language-to-text conversion stage, and the text-to-speech conversion stage.
In the first stage, i.e., the preprocessing stage, the aim is to convert the
video into frames and then extract the region of the hands from each frame.
The MediaPipe Python library provides detection solutions for hands, faces, etc.
MediaPipe Hands outputs 21 landmarks per hand. Using these landmarks, the
region of the hands can be segmented.
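One simple way to segment the hand region from the 21 landmarks is to take their bounding box and crop the frame to it. A minimal sketch, assuming the landmarks are (x, y) pairs normalized to [0, 1] as MediaPipe Hands returns them; the `pad` margin is an illustrative parameter, not taken from the project code:

```python
def hand_bbox(landmarks, width, height, pad=0.1):
    """Pixel bounding box around normalized (x, y) hand landmarks.

    landmarks: iterable of (x, y) pairs in [0, 1] (MediaPipe convention).
    width, height: frame size in pixels.
    pad: fractional margin added on every side of the box.
    """
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    x0, x1 = min(xs) - pad, max(xs) + pad
    y0, y1 = min(ys) - pad, max(ys) + pad
    # Clamp to the frame before converting to pixel coordinates.
    x0, y0 = max(x0, 0.0), max(y0, 0.0)
    x1, y1 = min(x1, 1.0), min(y1, 1.0)
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))
```

The returned box can then be used to crop each frame to the hand region before feature extraction.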
After preprocessing the frames of the video, features must be extracted from
these frames, and the sign language gesture has to be converted to text by
training and testing/classifying with neural networks. The gesture video
contains both spatial and temporal features, which, when combined, make it
possible to detect a gesture.
To extract the spatial features, the dataset is trained using a Convolutional
Neural Network (CNN). The Inception-v3 model, available through the TensorFlow
library, is a deep CNN: a large image-classification model with millions of
parameters.
In a gesture video, information also lies in the temporal sequence. A Recurrent
Neural Network (RNN) models this time dependence and uses it in the recognition
task. Long Short-Term Memory (LSTM) is a type of RNN that is able to learn
long-term dependencies. Cascading the CNN and the LSTM makes it possible to
detect the gestures.
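The long-term memory of an LSTM comes from its gating. Below is a single forward step of a scalar LSTM cell written out in plain Python to show the gates; this is purely illustrative (real models use a library such as TensorFlow, and the weights here are arbitrary):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell.

    x: current input; h_prev/c_prev: previous hidden and cell state.
    w: dict mapping gate name -> (input weight, recurrent weight, bias).
    """
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])    # forget gate
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])    # input gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2])  # candidate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])    # output gate
    c = f * c_prev + i * g   # cell state carries long-term information
    h = o * math.tanh(c)     # hidden state is the short-term output
    return h, c
```

When the forget gate saturates near 1 and the input gate near 0, the cell state passes through almost unchanged, which is how the LSTM retains information across many time steps of a gesture sequence.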
The last stage is to convert the text to speech. This requires segments
(phonemes), durations for them, and a tune (the F0 contour).
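As a toy illustration of the waveform-generation step, the sketch below renders each segment as a sine tone at its F0 for its duration. Real synthesizers (e.g., Festival/Festvox) are far more elaborate, and the segment values here are made up:

```python
import math

SAMPLE_RATE = 16000  # Hz

def synthesize(segments):
    """segments: list of (duration_seconds, f0_hz). Returns float samples in [-1, 1]."""
    samples = []
    phase = 0.0
    for duration, f0 in segments:
        n = round(duration * SAMPLE_RATE)
        for _ in range(n):
            samples.append(math.sin(phase))
            phase += 2.0 * math.pi * f0 / SAMPLE_RATE  # advance phase by one sample
    return samples

# Hypothetical two-segment contour: 100 ms at 120 Hz, then 150 ms at 100 Hz.
audio = synthesize([(0.1, 120.0), (0.15, 100.0)])
print(len(audio))  # 4000 samples (0.25 s at 16 kHz)
```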
Pre-Processing:
The next step after recording the dataset is preprocessing. The videos are first
converted into frames, and the region of the hands is then segmented from each
frame using the MediaPipe hand landmarks described above.
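The video-to-frame step can be sketched as follows. The OpenCV calls are one common way to do it (assumed here, not taken from the project code), and `frame_indices` is a small helper for sampling a fixed number of frames per video:

```python
def frame_indices(total_frames, n):
    """Indices of n frames spread evenly across a video with total_frames frames."""
    if n >= total_frames:
        return list(range(total_frames))
    step = total_frames / n
    return [int(i * step) for i in range(n)]

def extract_frames(path, n=40):
    """Read n evenly spaced frames from a video file using OpenCV."""
    import cv2  # pip install opencv-python
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    wanted = set(frame_indices(total, n))
    frames = []
    for idx in range(total):
        ok, frame = cap.read()
        if not ok:
            break
        if idx in wanted:
            frames.append(frame)
    cap.release()
    return frames
```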
The preprocessing outputs are as follows. Figure 6 shows a frame of an ISL
gesture obtained after converting the video to frames; Figure 7 shows the same
frame after applying MediaPipe Hands to it.
After preprocessing, features have to be extracted from the frames and the
gesture converted to text by training and testing/classifying with neural
networks. As noted above, the spatial features are extracted with the
Inception-v3 CNN.
The architecture of Inception-v3 is as follows:
After the videos are converted to frames, the frames are passed to the CNN to
extract spatial features. Since information also lies in the temporal sequence,
the CNN features of successive frames are then fed to an LSTM, a type of RNN
that can learn long-term dependencies; cascading the CNN and the LSTM makes
detection of the gestures possible.
The voice of Ms. Sangeetha, a scientist at the ISRO Inertial Systems Unit
(IISU), was recorded for training and building the TTS engine.
The recorded speech was segmented into 1,235 .wav files of 5-15 s each, and a
corresponding text prompt file was prepared for training on the database.
Each .wav file is a mono stream sampled at 16 kHz and quantized with signed
16 bits/sample. Using Festvox, a new voice model is built from the trained
dataset.
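The file format described above (mono, 16 kHz, signed 16-bit PCM) can be written with Python's standard `wave` module; a small sketch (the filename and samples are made up):

```python
import struct
import wave

SAMPLE_RATE = 16000   # Hz
SAMPLE_WIDTH = 2      # bytes per sample -> signed 16 bits/sample
CHANNELS = 1          # mono

def write_wav(path, samples):
    """Write float samples in [-1, 1] as a 16 kHz mono 16-bit PCM .wav file."""
    with wave.open(path, "wb") as w:
        w.setnchannels(CHANNELS)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(SAMPLE_RATE)
        pcm = struct.pack("<%dh" % len(samples),
                          *(int(s * 32767) for s in samples))
        w.writeframes(pcm)

# At this format, 1 s of audio occupies 16000 * 2 = 32,000 bytes of sample data,
# so each 5-15 s clip holds roughly 160,000-480,000 bytes of audio.
```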
The design of the system is tuned through the hyperparameters, i.e., the number
of epochs and the batch size. Training was repeated for different numbers of
epochs and batch sizes, and the resulting accuracies are noted in Table 5 and
Table 6.
CNN                               LSTM
Batch Size   Epochs   Accuracy    Batch Size   Epochs   Accuracy
200          4000     64.6%       32           75       73.13%
                                  32           100      82.34%
500          4000     64.7%       32           75       78.89%
                                  32           100      86.55%
1000         4000     64.9%       32           75       78.60%
                                  32           100      90.13%
Table 5. Varying CNN batch size and LSTM epochs
CNN                               LSTM
Batch Size   Epochs   Accuracy    Batch Size   Epochs   Accuracy
1000         4000     64.9%       32           75       78.60%
                                  32           100      90.13%
1000         7000     70.5%       32           75       75.65%
                                  32           100      87.65%
1000         10000    79.8%       32           75       96.26%
                                  32           100      94.26%
Table 6. Varying CNN epochs and LSTM epochs
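The tuning behind Tables 5 and 6 can be organized as a simple grid search over the hyperparameter values. A skeleton is sketched below; `train_and_evaluate` is a hypothetical stand-in for the actual CNN+LSTM training, and the tables above report a subset of the full grid:

```python
import itertools

# Hyperparameter values taken from Tables 5 and 6.
cnn_grid = {"batch_size": [200, 500, 1000], "epochs": [4000, 7000, 10000]}
lstm_grid = {"batch_size": [32], "epochs": [75, 100]}

def grid(params):
    """Yield one dict per combination of the listed values."""
    keys = list(params)
    for values in itertools.product(*(params[k] for k in keys)):
        yield dict(zip(keys, values))

configs = [(c, l) for c in grid(cnn_grid) for l in grid(lstm_grid)]
print(len(configs))  # 3 * 3 * 1 * 2 = 18 combinations

# for cnn_cfg, lstm_cfg in configs:
#     accuracy = train_and_evaluate(cnn_cfg, lstm_cfg)  # hypothetical training call
```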
When the batch size is increased, there is a slight improvement in the CNN
accuracy, and when the number of epochs is increased, there is a drastic
improvement in the CNN accuracy. Similar behavior is observed in the LSTM
accuracy. On observing all the models, the model trained with the
hyperparameters CNN batch size 1000, CNN epochs 10000, LSTM batch size 32 and
LSTM epochs 75 gives the best accuracy (96.26%).
Fig 13. Output of 1st conv layer. Fig 14. Output of 2nd conv layer.
The LSTM result and confusion matrix are shown in Fig 12 and Fig 13.
Text to Speech: