
M.TECH PROJECT REVIEW – III

NAME OF THE STUDENT : S. Nadia Begum

ROLL NUMBER : 17011P0410

NAME OF THE GUIDE : Dr. K. Anitha Sheela,
Professor, Department of ECE, JNTUHUCEH

TITLE OF THE PROJECT : Design and Simulation of Indian Sign Language to Speech Conversion System
PROJECT STATUS REPORT

a) Whether the project work is progressing as per proposal :

b) No. of specified objectives met :

c) Whether the progress of the student in project execution is satisfactory :

d) Percentage of completion of project work :

e) Expected amount of time required for completion of project :

References:
Reference A: S. Masood, A. Srivastava, H. C. Thuwal and M. Ahmad, "Real-Time Sign Language Gesture (Word) Recognition from Video Sequences Using CNN and RNN", Springer Nature Singapore Pte Ltd., 2018.
Reference B: Kartik Shenoy, Tejas Dastane, Varun Rao and Devendra Vyavaharkar, "Real-time Indian Sign Language (ISL) Recognition", 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), IEEE.

Signature of the student Signature of the Guide


AIM: To design and simulate an Indian Sign Language to Speech conversion
system.
OBJECTIVES:
• To create a dataset,
• To perform preprocessing, i.e., to segment the region of the hands,
• To convert ISL gesture signs to text using CNN and LSTM,
• To convert text to speech.

ABSTRACT:

DESIGN AND SIMULATION OF

INDIAN SIGN LANGUAGE TO SPEECH CONVERSION SYSTEM

Indian Sign Language (ISL) serves as a medium of communication for the hearing and speech impaired, who constitute a significant portion of the Indian population. Most people find it difficult to understand ISL gestures. This has created a communication gap between the hearing and speech impaired and those who do not understand ISL.

This project aims to bridge this communication gap by developing a model that converts Indian Sign Language to speech. A sign language gesture is represented as a video sequence, which contains both spatial and temporal features. Spatial features are extracted from the individual frames of the video, while temporal features are extracted by relating the frames to one another over time. A model can be trained on the spatial features using a CNN and on the temporal features using an LSTM to convert Indian Sign Language to text.

Then, a text-to-speech synthesizer can be used to convert the text to speech. The input text is first preprocessed and normalized, followed by linguistic/prosodic processing, which translates the words into segments with durations and an F0 contour. Finally, the waveform is generated and the output is produced as speech.
PROGRESS:
Dataset Specifications:
To begin the project, gesture videos of 76 ISL signs were recorded in the anechoic chamber of the ECE Department, JNTUHUCEH. 50 videos were recorded for each sign, which sums to a total of 3800 videos. For training, 40 videos of each gesture are used; for testing, the remaining 10 videos of each gesture are used (a sketch of this split is shown after the specification list below).

The specifications of the dataset are as follows:

• Dataset size: 70.5 GB
• Number of gestures: 76
• Number of persons: 10
• Number of videos per person (per gesture): 5
• Video details:
  o Frame width: 1920
  o Frame height: 1080
  o Duration: 2-4 seconds
  o Frame rate: 50 fps
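
For illustration only, and assuming a hypothetical directory layout of dataset/<gesture_name>/*.mp4 (not necessarily how the recorded data is actually organized), the 40/10 per-gesture train/test split described above could be produced as follows:

import os
from glob import glob

DATASET_DIR = "dataset"  # hypothetical root: dataset/<gesture_name>/*.mp4

def split_videos(train_per_gesture=40):
    """Return {gesture: (train_videos, test_videos)} using the 40/10 split."""
    splits = {}
    for gesture in sorted(os.listdir(DATASET_DIR)):
        videos = sorted(glob(os.path.join(DATASET_DIR, gesture, "*.mp4")))
        # First 40 videos of each gesture for training, remaining 10 for testing.
        splits[gesture] = (videos[:train_per_gesture], videos[train_per_gesture:])
    return splits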

Following are the gesture names categorized into 4 sections: numbers, alphabets,
greetings and medical terms.

0 1 2 3 4
5 6 7 8 9

Table 1. Numbers

a b c d e f g
h i j k l m n
o p q r s t u
v w x y z

Table 2. Alphabets
all the best | bye | excuse me | good afternoon | good evening
good morning | good night | hello | how are you | i am fine
my name is | nice to meet you | no | please | sorry
thank you | welcome | what is your name | yes

Table 3. Greetings

accident | allergies | asthma | blood pressure | breathe
cancer | diabetes | doctor | ecg | emergency
fever | headache | health | heart attack | hospital
insurance | medicine | operation | stomachache | thermometer
virus | vomit

Table 4. Medical terms

The ISL gestures 'All the best' and 'ECG' are shown in Fig 1 and Fig 2.

Fig 1. All the best ISL gesture

Fig 2. ECG ISL gesture


This project can be divided into three stages, as shown in the block diagram below.

Fig 3. Block diagram

These stages are preprocessing stage, sign language to text conversion stage
and text to speech conversion stage.

In the first stage, i.e., the preprocessing stage, the aim is to convert the video into frames and then extract the region of the hands from each frame.

Fig 4. Preprocessing stage

The MediaPipe Python library provides detection solutions for hands, face, etc. MediaPipe Hands outputs 21 landmarks per hand. Using these landmarks, the region of the hands can be segmented.

After preprocessing the frames of the video, features must be extracted from these frames, and the sign language gesture has to be converted to text by training and testing/classifying with neural networks. The gesture video contains both spatial and temporal features, which, when combined, make it possible to detect a gesture.

To extract the spatial features, the dataset is trained using a Convolutional Neural Network. The Inception-v3 model from the TensorFlow library is a deep Convolutional Neural Network: a large image-classification model with millions of parameters.
In the gesture video, information also lies in the sequence of frames. A Recurrent Neural Network has time-based functionality and uses this sequential information in the recognition task. Long Short-Term Memory (LSTM) is a type of RNN that is able to learn long-term dependencies. Cascading the CNN and the LSTM makes it possible to detect the gestures.

The last stage is to convert the text to speech. For this, we require segments (phonemes), their durations and a tune (F0 contour).

Fig 5. Text to Speech conversion stage

The input text is first preprocessed and normalized, followed by linguistic/prosodic processing, which translates the words into segments with durations and an F0 contour. Finally, the waveform is generated and the output is produced as speech.

Pre-Processing:
The next step after recording the dataset is preprocessing. The videos are first converted into frames, and the region of the hands is then segmented from each frame. The MediaPipe Python library provides detection solutions for hands, face, etc. MediaPipe Hands outputs 21 landmarks per hand; using these landmarks, the region of the hands can be segmented, as sketched below.
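
As a rough illustration, and not the project's exact code, the following sketch uses OpenCV together with MediaPipe Hands to read a gesture video, detect the hand landmarks in each frame and crop a padded bounding box around the detected hands; the padding value is an arbitrary choice.

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def crop_hand_regions(video_path, pad=20):
    """Yield a cropped hand-region image for each frame where hands are detected."""
    cap = cv2.VideoCapture(video_path)
    with mp_hands.Hands(static_image_mode=True,
                        max_num_hands=2,
                        min_detection_confidence=0.5) as hands:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            h, w = frame.shape[:2]
            # MediaPipe expects RGB input; OpenCV reads frames as BGR.
            result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if not result.multi_hand_landmarks:
                continue
            # Bounding box over the 21 landmarks of every detected hand.
            xs = [lm.x * w for hand in result.multi_hand_landmarks for lm in hand.landmark]
            ys = [lm.y * h for hand in result.multi_hand_landmarks for lm in hand.landmark]
            x1, y1 = max(int(min(xs)) - pad, 0), max(int(min(ys)) - pad, 0)
            x2, y2 = min(int(max(xs)) + pad, w), min(int(max(ys)) + pad, h)
            yield frame[y1:y2, x1:x2]
    cap.release()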
The preprocessing outputs are as follows.

Fig 6. Input frame        Fig 7. Plotting landmarks

Figure 6 shows a frame of an ISL gesture obtained after converting the video to frames. Figure 7 shows the same frame after applying MediaPipe Hands to it.

Conversion of ISL gesture to text:

After preprocessing the frames of the video, features must be extracted from these frames, and the sign language gesture has to be converted to text by training and testing/classifying with neural networks.

To extract the spatial features, the dataset is trained using a Convolutional Neural Network. The Inception-v3 model from the TensorFlow library is a deep Convolutional Neural Network: a large image-classification model with millions of parameters.
The architecture of Inception-v3 is as follows:

Fig 8. Inception-v3 architecture
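
As a minimal sketch of this step, assuming the Keras build of Inception-v3 with ImageNet weights rather than the exact retraining setup used in the project, per-frame spatial features can be extracted as follows; the 299x299 input size and the 2048-dimensional average-pooled output are properties of the standard Keras model.

import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input

# Inception-v3 without its classification head; global average pooling
# turns each frame into a 2048-dimensional feature vector.
feature_extractor = InceptionV3(weights="imagenet", include_top=False,
                                pooling="avg", input_shape=(299, 299, 3))

def frames_to_features(frames):
    """frames: list of HxWx3 RGB arrays (e.g., the cropped hand regions)."""
    batch = np.stack([tf.image.resize(f, (299, 299)).numpy() for f in frames])
    return feature_extractor.predict(preprocess_input(batch), verbose=0)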

After converting the videos to frames, the frames are passed to the CNN to extract the spatial features. In the gesture video, information also lies in the sequence of frames. A Recurrent Neural Network has time-based functionality and uses this sequential information in the recognition task. Long Short-Term Memory (LSTM) is a type of RNN that can learn long-term dependencies. By cascading the CNN and the LSTM, detection of gestures is possible.

Fig 9. LSTM architecture
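
A minimal sketch of the cascaded CNN-LSTM idea described above, assuming fixed-length sequences of 2048-dimensional Inception-v3 features per video and 76 gesture classes; the sequence length and layer sizes here are illustrative, not the project's exact configuration:

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 76      # one class per ISL gesture
SEQ_LEN = 100         # illustrative number of frames per video
FEATURE_DIM = 2048    # Inception-v3 average-pooled feature size

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, FEATURE_DIM)),
    layers.LSTM(256),                                  # temporal modelling
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),   # gesture label
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# features: (num_videos, SEQ_LEN, FEATURE_DIM), labels: (num_videos,)
# Batch size 32 and 100 epochs mirror the LSTM hyperparameters reported below.
# model.fit(features, labels, batch_size=32, epochs=100)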


Conversion of Text to Speech:
The next stage is to convert the text to speech. This is done using the Festival speech synthesis system. To build a natural-sounding, human-like voice model, 4 hours 10 minutes of speech was recorded at IISU, Thiruvananthapuram.

The voice of Ms. Sangeetha, a scientist at the ISRO Inertial Systems Unit (IISU), was recorded for training and building the TTS engine.

The recorded speech was segmented into 1235 .wav files of 5-15 seconds each, and a corresponding text prompt file was prepared for training. Each .wav file is a mono stream sampled at 16 kHz and quantized with signed 16 bits per sample. Using Festvox, a new voice model was built from this training data.
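
Once a Festvox-built voice is installed under Festival, synthesis can be driven from Python by calling Festival's text2wave utility; the voice name below is a placeholder, not the project's actual voice identifier.

import subprocess

def synthesize(text, out_wav, voice="voice_sangeetha_cg"):  # placeholder voice name
    """Convert text to a wav file using Festival's text2wave tool."""
    subprocess.run(
        ["text2wave", "-o", out_wav, "-eval", "({})".format(voice)],
        input=text.encode("utf-8"),
        check=True,
    )

# Example: speak the text predicted by the gesture-recognition model.
# synthesize("thank you", "thank_you.wav")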

The quality of the synthesized speech is measured using Mel-Cepstral Distortion (MCD). The MCD objective measure is used to evaluate the generated TTS voice model; the values obtained are 5.31 for around 3.5 hours of data and 5.24 for around 4 hours of data.
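
For reference, a commonly used definition of MCD averages, over time-aligned frames, the quantity (10/ln 10) * sqrt(2 * sum_d (mc_d_ref - mc_d_syn)^2) computed over the mel-cepstral coefficients. A minimal sketch, assuming the two coefficient sequences have already been extracted and time-aligned (e.g., by dynamic time warping):

import numpy as np

def mel_cepstral_distortion(mc_ref, mc_syn):
    """Frame-averaged MCD in dB between two aligned mel-cepstral
    sequences of shape (num_frames, order); the 0th (energy)
    coefficient is excluded by convention."""
    diff = mc_ref[:, 1:] - mc_syn[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))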
Results:
A total of 3040 videos (40 per gesture) were used to train the CNN model. The LSTM is trained on the features produced by the CNN and then outputs the text corresponding to the ISL gesture.

The system was designed by tuning the hyperparameters, i.e., the number of epochs and the batch size. Training was repeated for different numbers of epochs and batch sizes, and the resulting accuracies are noted in Table 5 and Table 6.

CNN Batch Size | CNN Epochs | CNN Accuracy | LSTM Batch Size | LSTM Epochs | LSTM Accuracy
200 | 4000 | 64.6% | 32 | 75 | 73.13%
200 | 4000 | 64.6% | 32 | 100 | 82.34%
500 | 4000 | 64.7% | 32 | 75 | 78.89%
500 | 4000 | 64.7% | 32 | 100 | 86.55%
1000 | 4000 | 64.9% | 32 | 75 | 78.60%
1000 | 4000 | 64.9% | 32 | 100 | 90.13%

Table 5. Varying CNN batch size and LSTM epochs (CNN epochs fixed at 4000, LSTM batch size fixed at 32)

CNN Batch Size | CNN Epochs | CNN Accuracy | LSTM Batch Size | LSTM Epochs | LSTM Accuracy
1000 | 4000 | 64.9% | 32 | 75 | 78.60%
1000 | 4000 | 64.9% | 32 | 100 | 90.13%
1000 | 7000 | 70.5% | 32 | 75 | 75.65%
1000 | 7000 | 70.5% | 32 | 100 | 87.65%
1000 | 10000 | 79.8% | 32 | 75 | 96.26%
1000 | 10000 | 79.8% | 32 | 100 | 94.26%

Table 6. Varying CNN epochs and LSTM epochs (CNN batch size fixed at 1000, LSTM batch size fixed at 32)

When the batch size is increased, there is a slight improvement in the CNN accuracy, and when the number of epochs is increased, there is a drastic improvement in the CNN accuracy. The same behavior is observed in the LSTM accuracy.

Among all the models, the one trained with the hyperparameters CNN batch size 1000, CNN epochs 10000, LSTM batch size 32 and LSTM epochs 75 is considered the best model.

The CNN training graphs are shown in Fig 10.

Fig 10. CNN training graphs (accuracy and cross-entropy)


The input gesture frame and the frame after preprocessing are as follows:

Fig 11. Input gesture frame        Fig 12. Plotting landmarks

The intermediate results are as follows:

Fig 13. Output of 1st conv layer Fig 14. Output of 2nd conv layer

The LSTM training graphs are shown in Fig 15.

Fig 15. LSTM training graphs (accuracy and loss)

The LSTM result and the confusion matrix are shown in Fig 16 and Fig 17.

Fig 16. LSTM result

Fig 17. Confusion matrix

The final accuracy of the model is 96.26%.

Text to Speech:

The quality of the synthesized speech is measured using Mel-Cepstral Distortion (MCD). The MCD objective measure is used to evaluate the generated TTS voice model; the values obtained are 5.31 for around 3.5 hours of data and 5.24 for around 4 hours of data.
