INSTITUTE OF ENGINEERING
LALITPUR ENGINEERING COLLEGE
BY
AMRIT SAPKOTA
ASMIT OLI
NISCHAL MAHARJAN
SAKSHYAM ARYAL
FEBRUARY, 2024
AMERICAN SIGN LANGUAGE USING CNN
Submitted By
Amrit Sapkota (076 BCT 05)
Asmit Oli (076 BCT 43)
Nischal Maharjan (076 BCT 20)
Sakshyam Aryal (076 BCT 29)
Submitted To
Department of Computer Engineering
Institute of Engineering, Lalitpur Engineering College
Tribhuvan University
Lalitpur, Nepal
Project Supervisor
Er. Hemant Joshi
February, 2024
COPYRIGHT ©
The author has agreed that the library, Department of Computer Engineering, Institute of
Engineering, Lalitpur Engineering College, may make this project work freely available
for inspection. Moreover, the author has agreed that permission for extensive copying
of this project work for scholarly purposes may be granted by the professor(s) who
supervised the project work recorded herein or, in their absence, by the Head of the
Department wherein this project work was done. It is understood that the recognition
will be given to the author of this project work and to the Department of Computer
Engineering, Institute of Engineering, Lalitpur Engineering College in any use of the
material of this project work. Copying, publication, or other use of this project work for
financial gain without the approval of the Department of Computer Engineering, Institute of
Engineering, Lalitpur Engineering College and the author's written permission is prohibited.
Request for permission to copy or to make any use of the material in this thesis in whole
or part should be addressed to:
DECLARATION
I declare that the work hereby submitted for Master of Science in Informatics and
Intelligent Systems Engineering (MSIISE) at the Institute of Engineering, Lalitpur
Engineering College entitled "AMERICAN SIGN LANGUAGE USING CNN" is
my own work and has not been previously submitted by me at any university for any
academic award. I authorize the Institute of Engineering, Lalitpur Engineering College
to lend this project work to other institutions or individuals for the purpose of scholarly
research.
February, 2024
CERTIFICATE OF APPROVAL
The undersigned certify that they have read and recommend to the Department of
Computer Engineering for acceptance, a project work entitled “AMERICAN SIGN
LANGUAGE USING CNN”, submitted by Amrit Sapkota (076 BCT 05), Asmit Oli
(076 BCT 43), Nischal Maharjan (076 BCT 20), Sakshyam Aryal (076 BCT 29)
in partial fulfillment of the requirement for the award of the degree of “Bachelor of
Engineering in Computer Engineering”.
Project Supervisor
Er. Hemant Joshi
Head of Department
Department of Computer Engineering, Universal Engineering College
Project Coordinator
Er. Bibat Thokar
Lecturer
Department of Computer Engineering, Lalitpur Engineering College
February, 2024
ACKNOWLEDGEMENT
This project work would not have been possible without the guidance and the help of
several individuals who in one way or another contributed and extended their valuable
assistance in the preparation and completion of this study.
First of all, we would like to express our sincere gratitude to our supervisor, Er. Hemant
Joshi, Head of Department, Department of Computer Engineering, Universal Engineering
College, for providing invaluable guidance, insightful comments, meticulous
suggestions, and encouragement throughout the duration of this project work. Our
sincere thanks also go to the Project Coordinator, Er. Bibat Thokar, for coordinating
the project works, providing astute criticism, and having inexhaustible patience.
We are also grateful to our classmates and friends for offering us advice and moral
support. To our family, thank you for encouraging us in all of our pursuits and inspiring
us to follow our dreams. We are especially grateful to our parents, who supported us
emotionally, believed in us and wanted the best for us.
February, 2024
ABSTRACT
There is an undeniable communication barrier between the Deaf community and the
hearing majority: it is hard for deaf people to communicate because many people do
not understand sign language. Through innovation in sign language recognition, we try
to tear down this barrier. This report shows how Artificial Intelligence can play a key
role in providing the solution. Using the dataset, sign language captured through the
front camera of a laptop is translated to text on the screen in real time, i.e. the input is
in video format whereas the output is in text format. Extracting complex head and hand
movements, along with their constantly changing shapes, for sign language recognition
is considered a difficult problem in computer vision. MediaPipe provides the necessary
key points, or landmarks, of the hand, face, and pose. The model is then trained using a
Convolutional Neural Network (CNN), and the trained model is used to recognize sign
language.
TABLE OF CONTENTS
COPYRIGHT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
DECLARATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
CERTIFICATE OF APPROVAL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
ACKNOWLEDGEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Problem Statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Project Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5 Scope of Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.6 Potential Project Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.7 Originality of Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.8 Organisation of Project Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 LITERATURE REVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3 METHODOLOGY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2.1 Video Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2.2 Video Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.3 Frame Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.4 Preprocessing Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Convolution Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.4 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.5 Long Short Term Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.6 System Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.7 Use Case Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.8 Level 0 DFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.9 Level 1 DFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.10 Activity Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.11 Instrumental Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.11.1 Hardware Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.11.2 Software Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.12 User Requirement Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.13 Dataset Explanation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.14 Functional Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.15 Non-functional Requirement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.16 Elaboration of Working Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.17 Verification and Validation Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4 RESULTS AND ANALYSIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
APPENDIX
A.1 Gantt Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
1 INTRODUCTION
American Sign Language (ASL) is a visual language used by Deaf and hard-of-hearing
communities in the US and Canada. It relies on handshapes, movements, and facial
expressions for communication. ASL plays a vital role in cultural identity and community
cohesion among Deaf individuals. Advancements in technology and education have
increased its recognition and accessibility. Understanding ASL is crucial for promoting
inclusivity and breaking down communication barriers.
1.1 Background
With the rapid growth of technology around us, Machine Learning and Artificial Intelli-
gence have been used in various sectors to support mankind including gesture, object,
face detection, etc. With the help of Deep Learning, a machine imitates the way humans
gain certain types of knowledge. An Artificial Neural Network simulates the human
brain, and convolution layers extract the important parts of an image to make computation
easier. "Sign Language Detection" — the name itself conveys the gist of the project.
Sign language recognition addresses a major communication barrier faced by deaf and
mute people in the community: most people do not understand sign language, and it is
also difficult for them to learn it. Apart from scoring grades for this minor project, the
core idea is to make communication easy for deaf people. We set the bar of the project
such that it would be beneficial to society as well. The main reason for us to choose this
project is to aid people using Artificial Intelligence.
1.2 Motivation
The motivation behind studying American Sign Language (ASL) stems from its profound
impact on communication and inclusivity. ASL serves as a primary means of communi-
cation for Deaf and hard-of-hearing individuals, enabling them to express themselves,
interact with others, and participate fully in society. By learning ASL, individuals can
foster greater understanding, empathy, and connection with the Deaf community, break-
ing down communication barriers and promoting inclusivity. Furthermore, studying ASL
provides insight into the linguistic and cultural richness of sign languages, contributing
to a more diverse and inclusive society. Ultimately, the motivation for studying ASL lies
in its ability to empower individuals, promote communication equality, and celebrate the
unique language and culture of the Deaf community.
• To design and implement a system that can understand the sign language of
hearing-impaired people.
bility and inclusivity by bridging communication gaps between deaf or hard-of-hearing
individuals and the hearing community.
presents a notable research opportunity. Additionally, delving into the socio-cultural
aspects of ASL, such as regional variations and linguistic evolution, alongside addressing
accessibility challenges in ASL interpretation services, offers avenues for original re-
search. Lastly, investigating the interplay between ASL and other languages, modalities,
or communication systems, such as gesture-based interfaces or multimodal platforms,
contributes to the field’s advancement. By addressing these research gaps, a project can
offer invaluable insights, methodologies, and solutions that propel ASL studies forward,
foster inclusivity and accessibility, and ultimately enhance the quality of life for ASL
users.
2 LITERATURE REVIEW
Sign language is the most natural and effective way of communication between deaf and
hearing people. American Sign Language (ASL) alphabet recognition (i.e. fingerspelling)
using a marker-less vision sensor is a challenging task due to the difficulties in hand
segmentation and appearance variations among signers. Existing color-based sign language
recognition systems suffer from many challenges such as complex background, hand
segmentation, large inter-class and intra-class variations. In this paper, we propose a new
user independent recognition system for American sign language alphabet using depth
images captured from the low-cost Microsoft Kinect depth sensor. Exploiting depth
information instead of color images overcomes many problems due to their robustness
against illumination and background variations. Hand region can be segmented by
applying a simple preprocessing algorithm over depth image. Feature learning using
convolutional neural network architectures is applied instead of the classical handcrafted
feature extraction methods. Local features extracted from the segmented hand are ef-
fectively learned using a simple unsupervised Principal Component Analysis Network
(PCANet) deep learning architecture. Two strategies of learning the PCANet model are
proposed, namely to train a single PCANet model from samples of all users and to train
a separate PCANet model for each user, respectively. The extracted features are then
recognized using linear Support Vector Machine (SVM) classifier. The performance of
the proposed method is evaluated using public dataset of real depth images captured
from various users. Experimental results show that the performance of the proposed
method outperforms state-of-the-art recognition accuracy using a leave-one-out evaluation
strategy. A survey published by IEEE in 2019, User-Independent American Sign
Language Alphabet Recognition Based on Depth Image and PCANet Features, used a
Principal Component Analysis Network (PCANet) and achieved an accuracy of 88% [3].
Jyotishman Bora et al. (2023), in their paper Real-time Assamese Sign Language
Recognition using MediaPipe and Deep Learning, used MediaPipe and the Microsoft
Kinect sensor and achieved an accuracy of 96.21%. People lacking the sense of hearing
and the ability to speak have undeniable communication problems in their life. People
with hearing and speech problems communicate using sign language with themselves
and others. Sign language is not essentially known to a more significant portion of the
human population who uses spoken and written language for communication. Therefore,
it is a necessity to develop technological tools for interpretation of sign language. Much
research has been carried out on recognizing sign language using technology for most
global languages, but there is still scope for developing tools and techniques for
local dialects. There are 22 modern Indian languages and more than 19,000 languages
that are spoken regionally as mother tongues. This work
attempts to develop a technical approach for recognizing Assamese Sign Language,
which is one of the 22 modern languages of India. Using machine learning techniques,
this work tried to establish a system for identifying the hand gestures from Assamese
Sign Language. A combination of two-dimensional and three-dimensional images of
Assamese gestures has been used to prepare a dataset. The MediaPipe framework has
been implemented to detect landmarks in the images. The dataset was used for training
of a feed-forward neural network. The results reveal that the method implemented in this
work is effective for the recognition of the other alphabets and gestures in the Assamese
Sign Language. This method could also be tried and tested for the recognition of signs
and gestures for various other local languages of India [4].
A survey conducted in 2016 by Brandon Garcia of Stanford University, Stanford, CA, in
Real-time American Sign Language Recognition with Convolutional Neural Networks,
used the CNN algorithm and achieved an accuracy of 95.72%. A real-time sign
language translator is an important milestone in facilitating communication between
the deaf community and the general public. We hereby present the development and
implementation of an American Sign Language (ASL) fingerspelling translator based on
a convolutional neural network. We utilize a pre-trained GoogLeNet architecture trained
on the ILSVRC2012 dataset, as well as the Surrey University and Massey University
ASL datasets in order to apply transfer learning to this task. We produced a robust model
that consistently classifies letters a-e correctly with first-time users and another that
correctly classifies letters a-k in a majority of cases. Given the limitations of the datasets
and the encouraging results achieved, we are confident that with further research and
more data, we can produce a fully generalizable translator for all ASL letters [5].
C.K.M. Lee et al. (2021), in their paper American sign language recognition and training
method with recurrent neural network, used LSTM, SVM, and RNN and achieved
accuracies of 93.36%, 94.23%, and 95.03% respectively. Though American sign
language (ASL) has gained recognition from the American society, few ASL applications
have been developed with educational purposes. Those designed with real-time sign
recognition systems are also lacking. Leap motion controller facilitates the real-time
and accurate recognition of ASL signs. It allows an opportunity for designing a learning
application with a real-time sign recognition system that seeks to improve the effective-
ness of ASL learning. The project proposes an ASL learning application prototype. The
application would be a whack-a-mole game with a real-time sign recognition system
embedded. Since both static and dynamic signs (J, Z) exist in ASL alphabets, Long-Short
Term Memory Recurrent Neural Network with k-Nearest-Neighbour method is adopted
as the classification method is based on handling of sequences of input. Characteristics
such as sphere radius, angles between fingers and distance between finger positions are
extracted as input for the classification model [6].
Yulius Obi et al. (2023), in their paper Sign language recognition system for communicating
to people with disabilities, used CNN and achieved an accuracy of 95.1%.
Sign language is one of the most reliable ways of communicating with special-needs
people, as it can be done anywhere. However, most people do not understand sign language.
Therefore, we have devised an idea to make a desktop application that can recognize sign
language and convert it to text in real time. This research uses American Sign Language
(ASL) datasets and the Convolutional Neural Networks (CNN) classification system.
In the classification, the hand image is first passed through a filter and after the filter
is applied, the hand is passed through a classifier which predicts the class of the hand
gestures. This research focuses on the accuracy of the recognition [7].
Md. Moklesur Rahman et al. (2019), in their paper A New Benchmark on American Sign
Language Recognition using Convolutional Neural Network, used CNN and achieved an
accuracy of 95.9%. The hearing-impaired (deaf) people use
a set of signs, called sign language instead of speech for communication among them.
However, it is very challenging for non-sign language speakers to communicate with
this community using signs. It is very necessary to develop an application to recognize
gestures or actions of sign languages to make easy communication between the normal
and the deaf community. The American Sign Language (ASL) is one of the mostly
used sign languages in the World, and considering its importance, there are already
existing methods for recognition of ASL with limited accuracy. The objective of this
study is to propose a novel model to enhance the accuracy of the existing methods for
ASL recognition. The study has been performed on the alphabet and numerals of four
publicly available ASL datasets. After preprocessing, the images of the alphabet and
numerals were fed to a newly proposed convolutional neural network (CNN) model, and
the performance of this model was evaluated to recognize the numerals and alphabet of
these datasets. The proposed CNN model significantly (9%) improves the recognition
accuracy of ASL reported by some existing prominent methods [8].
Jungpil Shin et al. (2020), in their paper American Sign Language Alphabet Recognition
by Extracting Feature from Hand Pose Estimation, used SVM and achieved an accuracy
of 87%. Sign language is designed to assist the deaf and hard-of-hearing
community to convey messages and connect with society. Sign language recognition
has been an important domain of research for a long time. Previously, sensor-based
approaches have obtained higher accuracy than vision-based approaches. Due to the
cost-effectiveness of vision-based approaches, research has nevertheless been conducted
on them despite the drop in accuracy. The purpose of this research is to recognize American
sign characters using hand images obtained from a web camera. In this work, the
MediaPipe hands algorithm was used for estimating hand joints from RGB images of hands
obtained from a web camera and two types of features were generated from the estimated
coordinates of the joints obtained for classification: one is the distances between the
joint points and the other one is the angles between vectors and 3D axes. The classifiers
utilized to classify the characters were support vector machine (SVM) and light gradient
boosting machine (GBM). Three character datasets were used for recognition: the ASL
Alphabet dataset, the Massey dataset, and the finger spelling A dataset [9].
Table 2.1: Summary of Related Works on ASL Recognition
3 METHODOLOGY
view of the signer’s upper body, focusing on the hand region.
The convolution layer is the most significant; it builds a convolved feature map by
applying a filter to an array of image pixels. We developed a CNN with three layers,
each using convolution, ReLU, and pooling. Because a CNN does not handle rotation
and scaling by itself, a data augmentation approach was used: a few samples were
rotated, enlarged, shrunk, thickened, and thinned manually.
Convolution filters are applied to the input using 1D convolutions to extract the most
significant features. In a 1D convolution the kernel slides along one dimension, which
suits the spatial properties of the data. Convolutional sparsity and parameter sharing,
combined with pooling for location-invariant feature detection, reduce overfitting.
ReLU layer: as data travels through each layer of the network, the ReLU layer functions
as the activation function, introducing non-linearity. Without it, the network would
collapse to a linear map. ReLU also accelerates training and reduces computation time.
Pooling layer: gradually reduces the dimensionality of the features and the variation of
the represented data. It decreases dimensions and computation, speeds up processing by
reducing the number of parameters the network must compute, reduces overfitting, and
makes the model more tolerant of changes and distortions. Pooling strategies include
max pooling, min pooling, and average pooling; we used max pooling, which keeps the
maximum value of each window of the convolved feature.
Flatten is used to transform the data into a one-dimensional array for input to the next
layer.
Dense layer: the input tensors undergo a matrix-vector multiplication with the layer's
weights, followed by an activation function. Apart from the activation function, the
essential argument defined here is units, an integer that selects the output size.
Dropout layer: a regularisation approach that randomly eliminates neurons from layers,
along with their input and output connections. As a consequence, generalisation is
improved and overfitting is avoided.
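The convolution, ReLU, and max-pooling operations described above can be illustrated in miniature. The following is a plain-NumPy sketch of the mechanics on a toy 1D signal, not the project's actual Keras layers; the function names and the example values are illustrative assumptions.

```python
import numpy as np

def conv1d(x, kernel):
    """Valid 1D convolution: slide the kernel along the single spatial axis."""
    n = len(x) - len(kernel) + 1
    return np.array([np.dot(x[i:i + len(kernel)], kernel) for i in range(n)])

def relu(x):
    """Element-wise rectified linear unit: keep positives, zero out negatives."""
    return np.maximum(x, 0)

def max_pool1d(x, size=2):
    """Keep the maximum of each non-overlapping window of `size` values."""
    n = len(x) // size
    return np.array([x[i * size:(i + 1) * size].max() for i in range(n)])

# Toy input signal and a 2-tap edge-detecting kernel (illustrative values).
x = np.array([1.0, -2.0, 3.0, 0.5, -1.0, 2.0])
k = np.array([1.0, -1.0])
feat = max_pool1d(relu(conv1d(x, k)))  # convolved -> activated -> pooled features
```

Each stage halves or shrinks the representation, which is exactly why pooling reduces the parameter count downstream layers must handle.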
For mutually exclusive classes, categorical cross-entropy measures the dissimilarity between
the predicted probability distribution of classes and the true distribution. In the realm
of ASL detection, where accurate classification of various sign gestures is crucial,
this loss function plays a pivotal role in guiding the training process. By penalizing
deviations from the actual class probabilities, categorical cross-entropy effectively steers
the model towards learning to make more precise predictions. Its implementation ensures
that the model is trained to discern subtle differences among ASL gestures, ultimately
contributing to enhanced accuracy and proficiency in sign language recognition.
Categorical cross-entropy is therefore central to the ASL detection pipeline, optimizing
the model's ability to interpret and classify a diverse range of sign language expressions
accurately.
Categorical Cross-Entropy = −(1/N) ∑_{i=1}^{N} ∑_{j=1}^{M} y_ij log(p_ij)    (3.1)
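Equation 3.1 can be computed directly. The following is a small NumPy sketch; the clipping constant `eps` is an assumption added for numerical safety and is not part of the equation itself.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean over N samples of -sum_j y_ij * log(p_ij), as in Eq. 3.1."""
    p = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(p), axis=1))

# Two samples, three mutually exclusive classes (illustrative values).
y_true = np.array([[1, 0, 0], [0, 1, 0]])               # one-hot labels
y_pred = np.array([[0.9, 0.05, 0.05], [0.2, 0.7, 0.1]])  # predicted distributions
loss = categorical_cross_entropy(y_true, y_pred)
```

Because the labels are one-hot, each sample contributes only −log of the probability assigned to its true class, so confident wrong predictions are penalized heavily.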
The forget gate decides which information to discard from the memory cell. It is trained to open when the information
is no longer important and close when it is. The output gate is responsible for deciding
which information to use for the output of the LSTM. It is trained to open when the
information is important and close when it is not. The gates in an LSTM are trained
to open and close based on the input and the previous hidden state. This allows the
LSTM to selectively retain or discard information, making it more effective at capturing
long-term dependencies.
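The gate behaviour described above can be sketched as a single LSTM time step in NumPy. The weight layout (the four gate blocks stacked into one matrix) is an illustrative assumption, not the project's actual implementation, which uses a library LSTM layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b stack the forget/input/cell/output parameters."""
    z = W @ x + U @ h_prev + b          # pre-activations, shape (4*hidden,)
    H = len(h_prev)
    f = sigmoid(z[0:H])                 # forget gate: what to drop from the cell
    i = sigmoid(z[H:2 * H])             # input gate: what new info to store
    g = np.tanh(z[2 * H:3 * H])         # candidate cell values
    o = sigmoid(z[3 * H:4 * H])         # output gate: what to expose as output
    c = f * c_prev + i * g              # updated memory cell
    h = o * np.tanh(c)                  # updated hidden state
    return h, c

rng = np.random.default_rng(0)
H, D = 4, 3                             # hidden size and input size (toy values)
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, U, b)
```

The sigmoid gates stay in (0, 1), so the cell state is a soft blend of old memory and new candidates; this is what lets the network selectively retain long-term information.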
Structure of LSTM
analyze sequential data over multiple time steps.
The overall workflow of the system is shown in the block diagram above. Datasets act
as the memory of the system: every detection we view in real time is the result of the
dataset. Data are captured in real time from the front camera of the laptop. MediaPipe's
live perception of simultaneous human pose, face landmarks, and hand tracking enables
various modern applications, including sign language detection. With the hand landmarks,
or key points, obtained from MediaPipe, we train our model. All the data collected from
the datasets and the deep learning models are treated as training data and are provided
to the system so that it can detect sign language in real time. The input to the system is
live video from the laptop's front camera; as sign language is performed in front of the
camera, the corresponding output appears on the screen in text format. The system acts
as an interface for the Sign Language System, providing an environment for input data
to be processed and output produced.
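As a sketch of how MediaPipe key points might be turned into training features, the helper below flattens one hand's 21 landmarks into a 63-value vector. The function name and the zero-fill convention for a missing hand are assumptions for illustration; real landmarks come from MediaPipe's hand model (21 points with normalized x, y, z coordinates) rather than the simulated tuples used here.

```python
# MediaPipe's hand model reports 21 landmarks per hand, each with
# normalized (x, y, z) coordinates; here the detection is simulated.
NUM_LANDMARKS = 21

def landmarks_to_features(landmarks):
    """Flatten [(x, y, z), ...] into a 63-value vector; zeros if no hand found."""
    if not landmarks:
        return [0.0] * (NUM_LANDMARKS * 3)
    return [coord for point in landmarks for coord in point]

# Simulated detection: 21 dummy (x, y, z) tuples standing in for real output.
fake_hand = [(0.1 * i, 0.2, 0.3) for i in range(NUM_LANDMARKS)]
features = landmarks_to_features(fake_hand)   # fixed-length training row
empty = landmarks_to_features([])             # zero vector when no hand is visible
```

A fixed-length vector per frame is what lets frames be stacked into the arrays a CNN or LSTM expects.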
3.7 Use Case Diagram
3.8 Level 0 DFD
3.9 Level 1 DFD
3.10 Activity Diagram
3.11 Instrumental Requirements
3.11.1 Hardware Requirements
The hardware required for the project is:
• CPU
• GPU
• Storage
3.11.2 Software Requirements
TensorFlow
TensorFlow is a free, open-source library for machine learning and artificial intelligence.
Among many other tasks, it can be used for training deep learning models.
MediaPipe
MediaPipe offers cross-platform, customizable machine learning solutions for live and
streaming media, i.e. real-time video. Its features include end-to-end acceleration,
build-once-deploy-anywhere portability, ready-to-use solutions, and a free, open-source
license.
Figure 3.9: Hand Landmarks
3.15 Non-functional Requirement
• Performance Requirement.
• Design Constraints.
• Reliability.
• Usability.
• Maintainability.
in recognizing ASL gestures. Additionally, early stopping is implemented using the
EarlyStopping callback to prevent overfitting and improve the generalization of the
model. This technique monitors the validation loss during training and stops training if
the loss fails to improve for a specified number of epochs (patience), thereby preventing
the model from memorizing the training data and ensuring better performance on unseen
data. Finally, the model.fit function is used to train the model on the training set
(x_train) and corresponding labels (Y_tr). The validation_split parameter is set to
0.1, indicating that 10% of the training data will be used for validation during training.
This allows for monitoring the model’s performance on a separate validation set during
training, helping to detect overfitting and adjust hyperparameters accordingly.
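The patience mechanism described above can be illustrated with a minimal re-implementation of the stopping rule. This is a sketch of the behaviour of Keras's EarlyStopping callback (monitor='val_loss' with a patience window), not its actual code.

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch index at which training stops: the first epoch after
    the validation loss has failed to improve for `patience` epochs in a row."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0      # new best: reset the patience counter
        else:
            wait += 1                 # no improvement this epoch
            if wait >= patience:
                return epoch          # patience exhausted: stop here
    return len(val_losses) - 1        # never triggered: train to the end

# Validation loss improves, then stagnates; training stops 3 epochs after the best.
losses = [0.9, 0.5, 0.4, 0.41, 0.42, 0.43, 0.44]
stop = early_stopping(losses, patience=3)
```

Stopping at the point where validation loss plateaus is what prevents the model from continuing to memorize the training data.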
4 RESULTS AND ANALYSIS
After training our model we visualized the accuracy and loss curves. We found that the
accuracy at epoch 1 was 0.7227 with a validation accuracy of 0.9026, rising to 0.9086
accuracy and 0.9407 validation accuracy at epoch 2, and continuing to increase as shown
in figure 4.1. The loss at epoch 1 was 0.9629 with a validation loss of 0.2303, falling to
0.2372 loss and 0.1435 validation loss at epoch 2, and continuing to decrease as shown
in figure 4.2.
Figure 4.2: Loss Graph
After training the LSTM model we visualized its accuracy and loss curves. The accuracy
at epoch 1 was 0.1270 with a validation accuracy of 0.1012, rising to 0.2012 accuracy
and 0.2673 validation accuracy at epoch 2 and continuing to increase, as shown in figure
4.3. The loss at epoch 1 was 2.5367 with a validation loss of 2.3158; at epoch 2 the loss
was 6.2265 with a validation loss of 1.9238, the validation loss continuing to decrease
as shown in figure 4.4.
Figure 4.3: LSTM Accuracy Graph
Accuracy = (TP + TN) / (TP + FN + FP + TN)    (4.1)
Precision is a metric that measures the accuracy of positive predictions made by a system.
It can be obtained by dividing true positives by the sum of true positives and false
positives.
Precision = TP / (TP + FP)    (4.2)
In machine learning, recall, also referred to as sensitivity or true positive rate, represents
the likelihood that the model accurately recognizes the detected anomaly.
Recall = TP / (TP + FN)    (4.3)
The F1-Score is a metric that combines precision and recall using their harmonic mean. It
provides a single value for comparison, with higher values indicating better performance.
F1-Score = (2 × Precision × Recall) / (Precision + Recall)    (4.4)
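Equations (4.1)–(4.4) can be checked with a short, self-contained sketch. The confusion-matrix counts below are made-up illustrative values, not our experimental results:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four metrics of Eqs. (4.1)-(4.4) from raw
    confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)     # Eq. (4.1)
    precision = tp / (tp + fp)                     # Eq. (4.2)
    recall = tp / (tp + fn)                        # Eq. (4.3)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (4.4)
    return accuracy, precision, recall, f1

# Illustrative counts: 8 true positives, 9 true negatives,
# 2 false positives, 1 false negative.
acc, prec, rec, f1 = classification_metrics(tp=8, tn=9, fp=2, fn=1)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
# -> 0.85 0.8 0.8889 0.8421
```

Note that the F1-score (0.8421) sits between precision (0.8) and recall (0.8889) but closer to the smaller of the two, as the harmonic mean always does.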
Figure 4.6: LSTM Confusion Matrix
Figure 4.7: Output from Model
4.6 Model Summary
4.6.1 CNN Model Summary
The following figure 4.9 is our CNN model's summary, which shows the number of
convolutional layers, their kernel sizes and filters, the max-pooling layers, and the number
of units in each layer.
4.6.2 LSTM Model Summary
The following figure 4.10 is the summary of our LSTM model.
5 CONCLUSION
We developed the American Sign Language (ASL) website and integrated Convolutional
Neural Networks (CNNs) to improve its functionality and user experience. CNNs were
used for image recognition and classification: by exploiting their capacity to extract
spatial features from visual data, we performed hand-gesture recognition and ASL sign
interpretation from image and video frames. This allowed visitors to interact with the
website through gestures captured by their device's camera, making learning easier and
more engaging. LSTM networks, on the other hand, were used for sequential data
processing, particularly tasks involving temporal dependencies such as sequence
recognition in sign language. LSTM networks excel at capturing long-range relationships
and temporal patterns, which makes them a good fit for situations where interpreting a
sign accurately requires an understanding of its context. We therefore implemented
LSTM networks to assess user-inputted ASL sign sequences and provide real-time
feedback and corrections during sign language practice sessions.
We also conducted a comparative study between CNNs and LSTM networks in the context
of the ASL website. The results showed that CNNs performed significantly better on
tasks involving static visual data, such as hand-gesture detection from images or video
frames; their ability to learn hierarchies of spatial features makes them especially suitable
for image-based ASL recognition. LSTM networks, by contrast, performed exceptionally
well on tasks requiring sequential data processing, such as deciphering and understanding
ASL sign sequences; their capacity to model temporal dynamics and long-range
relationships allowed the website to offer users contextually relevant feedback and help
during sign language practice sessions. Overall, by combining the strengths of both CNNs
and LSTM networks, we improved the ASL website's functionality and user experience,
giving users a thorough and engaging platform for learning and using American Sign
Language.
APPENDIX
Figure A.2: Home Page
Figure A.4: Tutorials
Figure A.6: Number Sign