
Indian Sign Language Recognition

CS4099D Project
End Semester Report

Submitted by

Challa Saketh (B190161CS)


Suddala Varun (B190321CS)
Masina Sai Bhargav Teja (B190432CS)

Under the Guidance of

Dr. Lijiya A
Assistant Professor

Department of Computer Science and Engineering


National Institute of Technology Calicut
Calicut, Kerala, India - 673 601

May 2023
NATIONAL INSTITUTE OF TECHNOLOGY CALICUT
KERALA, INDIA - 673 601

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE
Certified that this is a bonafide report of the project work titled

INDIAN SIGN LANGUAGE RECOGNITION

done by
Challa Saketh
Suddala Varun
Masina Sai Bhargav Teja
of Eighth Semester B. Tech, during the Winter Semester 2022-’23, in
partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology in Computer Science and Engineering of the
National Institute of Technology, Calicut.

(Dr. Lijiya A)
04-05-2023 (Assistant Professor)
Date Project Guide
DECLARATION

We hereby declare that the project titled, Indian Sign Language Recog-
nition, is our own work and that, to the best of our knowledge and belief,
it contains no material previously published or written by another person,
nor material which has been accepted for the award of any other degree or
diploma of the university or any other institute of higher learning, except
where due acknowledgement and reference has been made in the text.

Place : NIT Calicut Name : Challa Saketh


Date : 04-05-2023 Roll. No. : B190161CS

Name : Suddala Varun


Roll. No. : B190321CS

Name : Masina Sai Bhargav Teja


Roll. No. : B190432CS

Abstract

This project explores the design and implementation of an Indian Sign Lan-
guage recognition system for static gestures. The system segments and iden-
tifies the signs from an input video file containing gestures. The project aims
to ease communication between hearing-impaired and hearing people without
the involvement of sophisticated devices. The system takes a video with a
gesture or series of gestures as input and gives the corresponding text as
output.
ACKNOWLEDGEMENT

We would like to express our sincere and heartfelt gratitude to our guide
and mentor Dr. Lijiya A and Renjith P, who have guided us throughout
the course of the final year project. Without their active guidance, help,
cooperation and encouragement, we would not have made headway in the
project. We would like to thank our parents and the faculty members for
motivating us and being supportive throughout our work. We also take this
opportunity to thank our friends who have cooperated with us throughout
the course of the project.

Contents
1 Introduction 2

2 Problem Statement 4

3 Literature Survey 5

4 Proposed Work and Design Overview 8


4.1 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.2 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.3 Work Done . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.3.1 Vision Transformers . . . . . . . . . . . . . . . . . . . . 11
4.3.2 Convolutional Neural Networks . . . . . . . . . . . . . 11
4.3.3 Comparative Study of Vi-Transformers and CNNs . . . 12

5 Experimental Results 14
5.1 Vision Transformers . . . . . . . . . . . . . . . . . . . . . . . 14
5.2 Convolutional Neural Networks . . . . . . . . . . . . . . . . . 17

6 Conclusion 20

References 21

List of Figures

4.1 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 9


4.2 Vision Transformers . . . . . . . . . . . . . . . . . . . . . . . 11
4.3 Convolutional Neural Networks . . . . . . . . . . . . . . . . . 11

5.1 Training loss vs Epoch(Vi-Transformers) . . . . . . . . . . . . 15


5.2 Training accuracy vs Epoch(Vi-Transformers) . . . . . . . . . 15
5.3 Validation loss vs Epoch(Vi-Transformers) . . . . . . . . . . . 16
5.4 Confusion Matrix(Vi-Transformers) . . . . . . . . . . . . . . . 16
5.5 Training loss vs Epoch(CNN) . . . . . . . . . . . . . . . . . . 17
5.6 Train accuracy vs Epoch(CNN) . . . . . . . . . . . . . . . . . 18
5.7 Validation loss vs Epoch(CNN) . . . . . . . . . . . . . . . . . 18
5.8 Confusion Matrix(CNN) . . . . . . . . . . . . . . . . . . . . . 19
5.9 CNN Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 19

List of Tables

5.1 Training accuracy and loss(Vi-Transformers) . . . . . . . . . . 14


5.2 Training accuracy and loss(CNN) . . . . . . . . . . . . . . . . 17

Chapter 1

Introduction

Since communication enables us to transfer information from one person to
another, it is essential in our daily lives. However, interacting with hearing
people can be quite difficult for those who are deaf and mute. The goal
of the current research is to enable communication between hearing persons
and those who have speech and hearing impairments, without the use of a
physical interpreter, while safeguarding the privacy of both parties.

Indian Sign Language (ISL) is the sign language used by the speech and
hearing impaired population in India. It makes use of face, head, arm, and
hand movements to convey linguistic information. ISL includes both isolated
and continuous signs. An isolated sign is portrayed with a precise hand
placement and pose involving only a single hand motion. A continuous sign
is a series of images used to indicate a moving gesture.

However, since the gestures made by the hearing impaired may not always
be directly related to the referent phrase, there may be a significant communi-
cation gap between them and the hearing. As a result, it is essential to
translate sign language into text or speech that everyone can understand.
This can be accomplished by using a system for sign language recognition
(SLR). The SLR system seeks to offer a quick and precise transcription
method that increases ease of use.
Chapter 2

Problem Statement

Our project deals with designing a sign language recognition system that
identifies Indian signs in the given input and converts them to text. The
input is a video with continuous signs, and the output is the text corre-
sponding to those signs.

Chapter 3

Literature Survey

Sign language recognition mainly follows two types of approaches: sensor-
based and vision-based. Within the scope of our research, we discuss only
vision-based approaches, mainly because of the simplicity they offer compared
with attaching sensors to the hands of signers, which is not always feasible.

Although American Sign Language recognition and Arabic Sign Lan-
guage recognition have seen a lot of work, ISL has not received as many
contributions. To the best of our knowledge, one of the early contributions
to this field was a technique for the automatic translation of gestures of the
Arabic Sign Language manual alphabet [4]. Images of the gesture were used
as the system's input, which was then analyzed and turned into a set of
features that included certain length measurements indicating the position
of the fingertips. For classification, the least-squares estimator and the
subtractive clustering algorithm were employed. The accuracy of the
system was 95.55%.

One of the earliest contributions regarding continuous sign recognition
through video was discussed in [5], which employed real-time video captured
by a camera as input. The authors extracted the region of interest using
pre-processing methods including skin detection and size normalization,
after using a Haar-like algorithm to track the hand in the video frames. The
resulting images are then converted into the frequency domain using the
Fourier Transform to produce the feature vectors. The k-Nearest Neighbor
(KNN) algorithm is used to do the classification, and the system obtains a
90.55% accuracy rate. [3] provided a review of techniques used in SLR; the
techniques are categorized into different stages. Further, the challenges and
limitations faced by SLR are also discussed.

Coming to the recent contributions, [1] gives us insight into an approach
named the Signet model, where CNNs are used to train the model. The
model took input from a data set of binary hand-region silhouettes of signer
images from [6], containing 2500 images initially, which were later augmented
to 5157. The pre-processing of these images is done by a series of algorithms
such as Viola-Jones for face detection and elimination, followed by skin-color
segmentation and the largest-component algorithm for hand-region segmen-
tation. The CNN consisted of 9 layers (1 input + 6 hidden + 1 dropout + 1
output). The first convolution layer consisted of 32 filters, followed by pool-
ing, convolution, pooling, and again a convolution layer. The 6th layer was a
fully connected layer of 128 neurons, followed by another layer of 24 neurons
corresponding to the 24 alphabets of the sign language. The former used the
ReLU activation function and the latter used Softmax, which assigned prob-
abilities to the 24 classes, with the relatively highest one getting 1 and the
rest 0 to form a one-hot vector (1xN matrix). This is then given to the output
layer. The model produced a training accuracy of 99.93% and a validation
accuracy of 98.64%, which is far ahead of the other approaches discussed in
the paper, even with a much more complex data set.

One more recent work is in the area of single-hand dynamic gesture
recognition by [2]. The database used here is the data collected from Rah-
maniya HSS special school, Calicut, India. It included 900 static images and
700 videos. During the pre-processing phase of the images, the Viola-Jones
algorithm is used to remove the face, and the standard RGB color space is
converted to YCbCr, which is much less sensitive to light; the largest con-
nected region is assumed to be the hand, thereby eliminating the rest. The
next step involved an ROI (Region Of Interest) extraction algorithm. Its
speciality is that, unlike the other methods discussed in the paper, which
imposed extra conditions on the imagery such as wearing full sleeves or wear-
ing an identification band on the palm, it simply removes all the unwanted
areas of skin and identifies the palm without any such external conditions.
This is achieved by extending the bounding box of the face to the neck to
obtain a completely black space for them. Centroid calculation comes next,
in which the centroid of the minimal bounding box surrounding the palm
area is calculated. Once the movement of this particular point is tracked, the
resulting trajectory can be used in trajectory-based gesture recognition. After
this, a series of steps is involved, such as keyframe extraction (eliminating
uninformative frames that show no significant change in hand position or
shape) and co-articulation detection and resolution, where one gesture is
influenced by another; this can be of three types: static-static, static-dynamic,
and dynamic-dynamic co-articulation. Next is feature extraction, where the
features are hand shape, hand motion, hand location, and orientation; finally
comes the classification part, where the classification/separation of these
gestures is done carefully. This model achieves an accuracy of 89%.
Chapter 4

Proposed Work and Design Overview

4.1 Data Set


The dataset consists of 7114 images divided among 24 letters of the English
alphabet (A-Z, excluding H and J). All the images have been generated from
the video dataset provided by the institute. The images are binarized (white
gesture on black background), using the small static alphabet dataset pro-
vided by the institute as a reference, and have a dimension of 640x360 pixels.
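
The binarized frames described above were generated from the institute's videos. A minimal sketch of such a conversion is given below, assuming OpenCV and a simple skin-color threshold in the YCbCr space; the file names, output directory, and threshold values are illustrative assumptions, not the institute's exact procedure.

import os
import cv2
import numpy as np

# Sketch: extract frames from a gesture video, resize to 640x360, and binarize
# the skin region (white gesture on black background). Thresholds are assumed
# and would need tuning for the actual recording conditions.
os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture("gesture_video.mp4")        # hypothetical input video
lower = np.array([0, 135, 85], dtype=np.uint8)     # assumed Y, Cr, Cb lower bounds
upper = np.array([255, 180, 135], dtype=np.uint8)  # assumed Y, Cr, Cb upper bounds
count = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.resize(frame, (640, 360))
    ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(ycrcb, lower, upper)        # 255 where skin, 0 elsewhere
    cv2.imwrite(f"frames/frame_{count:05d}.png", mask)
    count += 1
cap.release()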


4.2 Design

Figure 4.1: System Design

1. Image preprocessing is done first to obtain noise-free frames, and the
color space is changed to YCbCr, which is much less sensitive to lighting.

2. The palm region is extracted and other skin areas such as the face and
neck are removed.

3. Each frame is XORed with the previous frame to identify keyframes
based on the change in the number of white pixels.

4. Keyframes are extracted, discarding the frames that do not show a
significant change in the number of white pixels compared to the previous
frame (a sketch of steps 3 and 4 is given after this list).

5. The gesture in each keyframe is recognized and assigned to its class
using the classification model.
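
The sketch below illustrates one way steps 3 and 4 could be implemented with OpenCV and NumPy; the white-pixel threshold is an assumption and is not specified in the design above.

import cv2
import numpy as np

def select_keyframes(binary_frames, threshold=2000):
    # Keep frames whose XOR difference with the previous frame contains more
    # than `threshold` changed (white) pixels, i.e. a significant change in
    # hand position or shape.
    keyframes = [binary_frames[0]]
    prev = binary_frames[0]
    for frame in binary_frames[1:]:
        diff = cv2.bitwise_xor(prev, frame)
        if int(np.count_nonzero(diff)) > threshold:
            keyframes.append(frame)
        prev = frame
    return keyframes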

4.3 Work Done


We’ve gone through Research papers regarding Sign language recognition
and got an overall idea about the work that has to be done in the Project
such as steps involved in it like Data Acquisition, Key-frame extraction, Seg-
mentation, Feature Extraction, and Classification. We figured out that using
CNNs would be best for static sign recognition. We have gone through some
CNN models and got an idea about implementation. Later we wanted to
try some other model for the static other than the standard CNN used in
most of the research papers we have gone through. So, after some work, we
figured out Vi-Transformers would be a better alternative. A comparative
study between the performance of Vi-Transformers and Convolutional Neural
Networks has been conducted and CNNs outperformed the Vi-Transformers
considerably in terms of accuracy.

Both models are trained to recognize the letters of static gestures. The
dataset is split in an 80:20 ratio for training and testing respectively, and
labels are assigned accordingly. Necessary image transformations are applied
to make the raw image set robust against overfitting. Batches of images are
created to train the models. Training parameters are defined for each model
and the models are trained accordingly. Lastly, each model is tested against
the test set and the results are analyzed (a sketch of this setup is given below).
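
A minimal sketch of this data pipeline is shown below, assuming a Keras ImageDataGenerator; the dataset directory name, image size, and augmentation parameters are assumptions made for illustration.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# 80:20 train/validation split with light augmentation; labels are assigned
# from the class sub-directory names and images are served in batches.
datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=10,          # assumed augmentation settings
    width_shift_range=0.1,
    height_shift_range=0.1,
    validation_split=0.2,       # 80:20 split
)
train_batches = datagen.flow_from_directory(
    "isl_static_dataset", target_size=(128, 128), color_mode="grayscale",
    class_mode="categorical", batch_size=32, subset="training")
val_batches = datagen.flow_from_directory(
    "isl_static_dataset", target_size=(128, 128), color_mode="grayscale",
    class_mode="categorical", batch_size=32, subset="validation")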

4.3.1 Vision Transformers

Figure 4.2: Vision Transformers

Vision Transformers (ViT) are a neural network architecture that has been
successfully applied to image classification tasks as well. In the context of
image classification, Vision Transformers use the same basic principles as in
natural language processing, but with some modifications to make them
suitable for image data. ViT treats the image as a collection of patches that
are subsequently processed as token sequences resembling text data. The
transformer encoder then receives these patches and learns to extract sig-
nificant features from the input sequence. Vision Transformers are mainly
used in image classification tasks, particularly in cases where the input
images have variations in lighting, color, or texture.
The AutoModelForImageClassification class from the transformers library is
used for the classification of images.
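
A minimal sketch of how the model can be instantiated is given below; the checkpoint name is an assumption, since the report only names the AutoModelForImageClassification class.

from transformers import AutoImageProcessor, AutoModelForImageClassification

checkpoint = "google/vit-base-patch16-224-in21k"   # assumed ViT backbone
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(
    checkpoint,
    num_labels=24,                  # one label per static alphabet sign
    ignore_mismatched_sizes=True,   # swap the pretrained head for a 24-way head
)
# `processor` resizes and normalizes the frames; the model is then fine-tuned
# on the processed image batches (for example with the Hugging Face Trainer).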

4.3.2 Convolutional Neural Networks

Figure 4.3: Convolutional Neural Networks



Convolutional Neural Networks (CNNs) are a form of neural network that


are generally used for image and video recognition but may also be applied
to other types of data having a grid-like structure like audio signals and 3D
data.
Convolutional, pooling, and fully connected layers are among the many layers
that make up CNNs. By applying a series of filters to the input data, the
convolutional layer is responsible for extracting features from the data and
detecting distinct patterns and edges within the image. The representation’s
spatial dimension is decreased by the pooling layer but crucial characteristics
are kept. To carry out classification or regression tasks, the fully connected
layers process the retrieved features from the convolutional and pooling lay-
ers.
The current model employs a total of 9 layers, consisting of 3 convolutional
layers and 3 pooling layers, followed by one flatten layer and 2 dense layers.
The total number of parameters in the model stands at 3,268,632, all of which
are trainable. The output has 24 entries corresponding to the 24 alphabets
that are recognized through this model.
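
A sketch of such an architecture is given below using Keras; the filter counts, kernel sizes, and input resolution are assumptions, since the report only specifies the number and types of layers.

from tensorflow.keras import layers, models

# 9 layers: 3 convolutional + 3 pooling + 1 flatten + 2 dense, ending in a
# 24-way softmax over the recognized letters.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(24, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])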

4.3.3 Comparative Study of Vi-Transformers and CNNs


While ViT divides the input images into visual tokens, a CNN operates on
pixel arrays. The image is divided into fixed-size patches; the vision trans-
former embeds each patch and adds a positional embedding, and these em-
bedded patches form the input to the transformer encoder.
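
To make the patch tokenization concrete, the short sketch below (illustrative only, not taken from the report) splits a 224x224 RGB image into 16x16 patches, producing the 196 tokens a ViT encoder would receive after linear projection and positional embedding.

import numpy as np

image = np.random.rand(224, 224, 3)    # stand-in for an input frame
patch = 16
grid = 224 // patch                    # 14 patches along each axis
patches = image.reshape(grid, patch, grid, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)
print(patches.shape)                   # (196, 768): 196 patch tokens of length 768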

For the current dataset, both models were applied and the results were an-
alyzed. Vision Transformers achieved an accuracy of about 76%, whereas
CNNs outperformed them by achieving an accuracy of about 98%. CNNs
also produced considerably lower training loss and higher training accuracy
when compared to the ViT model. Figures 5.1 and 5.2 show the training
loss and training accuracy for each epoch during training. Similarly,
the graphs Fig 5.5 and Fig 5.6 represent the same data for Convolutional
Neural Networks.
Also, the variation in the accuracy of training with each epoch can be
analyzed with the help of the tables represented in Table 5.1 and Table 5.2.
Finally, it can be concluded that CNNs perform better than Vi-Transformers
in classifying the hand signs accurately for this dataset.
The following section of experimental results will give a much better idea
of the performance of the models.
Chapter 5

Experimental Results

5.1 Vision Transformers

Epoch    Loss        Accuracy    Validation Loss
  1      2.621300    0.612790    2.427532
  2      1.660000    0.711876    1.607804
  3      1.358800    0.735770    1.375403
  4      1.106200    0.738580    1.205604
  5      1.188300    0.744202    1.095763
  6      1.159000    0.750527    1.022064
  7      0.993400    0.767393    0.983076
  8      0.888500    0.743500    1.027860
  9      0.932500    0.768096    0.951064
 10      0.925400    0.754041    0.981194

Table 5.1: Training accuracy and loss(Vi-Transformers)


Figure 5.1: Training loss vs Epoch(Vi-Transformers)

Figure 5.2: Training accuracy vs Epoch(Vi-Transformers)



Figure 5.3: Validation loss vs Epoch(Vi-Transformers)

Figure 5.4: Confusion Matrix(Vi-Transformers)



5.2 Convolutional Neural Networks

Epoch    Loss      Accuracy    Validation Loss
  1      0.7559    0.8472      0.2046
  2      0.0681    0.9834      0.1596
  3      0.0241    0.9934      0.1283
  4      0.0131    0.9965      0.1167
  5      0.0018    0.9995      0.1189
  6      0.0016    0.9995      0.1215
  7      0.0015    0.9995      0.1235
  8      0.0015    0.9998      0.1256
  9      0.0015    0.9998      0.1271
 10      0.0015    0.9998      0.1286

Table 5.2: Training accuracy and loss(CNN)

Figure 5.5: Training loss vs Epoch(CNN)



Figure 5.6: Train accuracy vs Epoch(CNN)

Figure 5.7: Validation loss vs Epoch(CNN)



Figure 5.8: Confusion Matrix(CNN)

Figure 5.9: CNN Architecture


Chapter 6

Conclusion

In this report, we tried to summarize different methodologies proposed for
the recognition of hand gestures, drawn from diverse works. The ultimate
aim of a hand gesture recognition system is to construct an effective human-
computer interaction system while also recognizing the language of speech
and hearing impaired persons. In order to attain such interaction and use-
fulness, hand gesture research must improve on present performance in terms
of accuracy and speed. Extracting characteristics that recognize each sign
precisely regardless of source, color, and lighting conditions requires more
attention. From our findings, we conclude that the CNN achieved better
accuracy than Vision Transformers: static sign recognition is done with an
accuracy of 98 percent.

References

[1] S. C.J. and L. A., "Signet: A Deep Learning based Indian Sign Language
Recognition System," 2019 International Conference on Communication
and Signal Processing (ICCSP), 2019, pp. 0596-0600.

[2] P. K. Athira, C. J. Sruthi, A. Lijiya, A Signer Independent Sign Language
Recognition with Co-articulation Elimination from Live Videos: An In-
dian Scenario, Journal of King Saud University - Computer and Infor-
mation Sciences, Volume 34, Issue 3, 2022.

[3] Cheok, Ming Jin, Omar, Zaid, and Jaward, Mohamed. (2019). A review
of hand gesture and sign language recognition techniques. International
Journal of Machine Learning and Cybernetics, 10. doi: 10.1007/s13042-
017-0705-5.

[4] Al-Jarrah, A. Halawani, "Recognition of gestures in Arabic sign lan-
guage using neuro-fuzzy systems," Artificial Intelligence, 133 (2001),
pp. 117-138.

[5] Nadia R. Albelwi, Yasser M. Alginahi, "Real-Time Arabic Sign Lan-
guage (ArSL) Recognition," International Conference on Communica-
tions and Information Technology, 2012.

[6] P. K. Athira, "Indian sign language recognition," PhD thesis, Dept. of
CSE, NITC, Calicut, India, 2017.


[7] Adithya V., Rajesh R., A Deep Convolutional Neural Network Approach
for Static Hand Gesture Recognition, Procedia Computer Science, Vol-
ume 171, 2020, Pages 2353-236, ISSN 1877-0509.

[8] Das, S., Biswas, S.K., Purkayastha, B. A deep sign language recognition
system for Indian sign language. Neural Computing and Applications
(2022).

[9] E. Abraham, A. Nayak and A. Iqbal, "Real-Time Translation
of Indian Sign Language using LSTM," 2019 Global Conference
for Advancement in Technology (GCAT), 2019, pp. 1-5, doi:
10.1109/GCAT47503.2019.8978343.
