INSTITUTE OF ENGINEERING
LALITPUR ENGINEERING COLLEGE
BY
AMRIT SAPKOTA
ASMIT OLI
NISCHAL MAHARJAN
SAKSHYAM ARYAL
FEBRUARY, 2024
AMERICAN SIGN LANGUAGE USING CNN
By
Amrit Sapkota (076 BCT 05)
Asmit Oli (076 BCT 43)
Nischal Maharjan (076 BCT 20)
Sakshyam Aryal (076 BCT 29)
Project Supervisor
Er. Hemant Joshi
February, 2024
ACKNOWLEDGEMENT
This project work would not have been possible without the guidance and the help of
several individuals who in one way or another contributed and extended their valuable
assistance in the preparation and completion of this study.
First of all, we would like to express our sincere gratitude to our supervisor, Er. Hemant
Joshi, Head of Department, Department of Computer Engineering, Universal Engineering
College, for providing invaluable guidance, insightful comments, meticulous suggestions,
and encouragement throughout the duration of this project work. Our sincere thanks also
go to the Project Coordinator, Er. Bibat Thokar, for coordinating the project works,
providing astute criticism, and having inexhaustible patience.
We are also grateful to our classmates and friends for offering us advice and moral
support. To our family, thank you for encouraging us in all of our pursuits and inspiring
us to follow our dreams. We are especially grateful to our parents, who supported us
emotionally, believed in us and wanted the best for us.
February, 2024
ABSTRACT
There is an undeniable communication problem between the Deaf community and the
hearing majority: communication becomes hard for deaf people because many people
do not understand sign language. With the use of innovation in sign language recognition,
we tried to tear down this communication barrier. This report shows how Artificial
Intelligence can play a key role in providing the solution. Using the trained model and
the front camera of a laptop, sign language is translated to text format on the screen in
real time; that is, the input is in video format whereas the output is in text format.
Extracting complex head and hand movements, along with their constantly changing
shapes, for sign language recognition is considered a difficult problem in computer
vision. MediaPipe provides the necessary key points, or landmarks, of the hand, face,
and pose. The model is then trained using a Convolutional Neural Network (CNN), and
the trained model is used to recognize sign language.
TABLE OF CONTENTS
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
1 INTRODUCTION
1.1 Background
1.2 Motivation
1.3 Problem Statement
1.4 Project Objectives
1.5 Scope of Project
1.6 Potential Project Applications
1.7 Originality of Project
1.8 Organisation of Project Report
2 LITERATURE REVIEW
3 METHODOLOGY
3.1 Data Collection
3.2 Data Preprocessing
3.2.1 Video Acquisition
3.2.2 Video Segmentation
3.2.3 Frame Extraction
3.2.4 Preprocessing Technique
3.3 Convolutional Neural Network
3.4 Loss Function
3.5 Long Short-Term Memory
3.6 System Block Diagram
3.7 Use Case Diagram
3.8 Level 0 DFD
3.9 Level 1 DFD
3.10 Activity Diagram
3.11 INSTRUMENTAL REQUIREMENT
3.11.1 Hardware Requirements
3.11.2 Software Requirements
3.12 User Requirement Definition
3.13 Dataset Explanation
3.14 Functional Requirements
3.15 Non-functional Requirement
3.16 Elaboration of Working Principle
3.17 Verification and Validation Procedures
4 RESULTS
4.1 Quantitative Analysis
4.2 Qualitative Analysis
4.3 Comparison of CNN and LSTM
4.4 Model Summary
5 TASK COMPLETED
6 REMAINING TASK
APPENDIX
A.1 Gantt Chart
1 INTRODUCTION
American Sign Language (ASL) is a visual language used by Deaf and hard-of-hearing
communities in the US and Canada. It relies on handshapes, movements, and facial
expressions for communication. ASL plays a vital role in cultural identity and community
cohesion among Deaf individuals. Advancements in technology and education have
increased its recognition and accessibility. Understanding ASL is crucial for promoting
inclusivity and breaking down communication barriers.
1.1 Background
With the rapid growth of technology around us, Machine Learning and Artificial
Intelligence have been used in various sectors to support mankind, including gesture,
object, and face detection. With the help of Deep Learning, a machine imitates the way
humans gain certain types of knowledge: an Artificial Neural Network simulates the
human brain, and convolution layers extract the important parts of an image to make
computation easier. "Sign Language Detection", the name itself, specifies the gist of the
project. Communication through sign language remains a major problem for deaf and
mute people in the community, because most people do not understand sign language
and also find it difficult to learn. Apart from scoring grades in this minor project, the
core idea is to make communication easy for deaf people. We set the bar of the project
such that it would be beneficial to society as well. The main reason for choosing this
project is to aid people using Artificial Intelligence.
1.2 Motivation
The motivation behind studying American Sign Language (ASL) stems from its profound
impact on communication and inclusivity. ASL serves as a primary means of
communication for Deaf and hard-of-hearing individuals, enabling them to express
themselves, interact with others, and participate fully in society. By learning ASL,
individuals can foster greater understanding, empathy, and connection with the Deaf
community, breaking down communication barriers and promoting inclusivity.
Furthermore, studying ASL provides insight into the linguistic and cultural richness of
sign languages, contributing to a more diverse and inclusive society. Ultimately, the
motivation for studying ASL lies in its ability to empower individuals, promote
communication equality, and celebrate the unique language and culture of the Deaf
community.
1.4 Project Objectives
The main objectives of this project are:
• To design and implement a system that can understand the sign language of
hearing-impaired people.
• To train the model with a variety of datasets using MediaPipe and CNN, and
provide the output in real-time.
• To achieve real-time performance in a variety of environments.
1.5 Scope of Project
The project focuses on creating a system to translate American Sign Language gestures
into text, benefiting individuals with hearing impairments. It aims to enhance
accessibility and inclusivity by bridging communication gaps between deaf or
hard-of-hearing individuals and the hearing community.
1.7 Originality of Project
Educational tools built on ASL recognition can also present signing content more
effectively, accommodating diverse learning styles and preferences. Exploring new
applications of ASL technology in domains like healthcare, where effective communication
between deaf or hard-of-hearing individuals and healthcare providers is critical, also
presents a notable research opportunity. Additionally, delving into the socio-cultural
aspects of ASL, such as regional variations and linguistic evolution, alongside addressing
accessibility challenges in ASL interpretation services, offers avenues for original
research. Lastly, investigating the interplay between ASL and other languages, modalities,
or communication systems, such as gesture-based interfaces or multimodal platforms,
contributes to the field’s advancement. By addressing these research gaps, a project can
offer invaluable insights, methodologies, and solutions that propel ASL studies forward,
foster inclusivity and accessibility, and ultimately enhance the quality of life for ASL
users.
2 LITERATURE REVIEW
Table 2.1: Summary of Related Works on ASL Recognition
3 METHODOLOGY
During video acquisition, the camera is positioned to capture a clear view of the
signer's upper body, focusing on the hand region.
3.3 Convolutional Neural Network
The CNN layer is the most significant; it builds a convolved feature map by applying a
filter to an array of picture pixels. We developed a CNN with three layers, each layer
using convolution, ReLU, and pooling. Because a CNN does not handle rotation and
scaling by itself, a data augmentation approach was used: a few samples were rotated,
enlarged, shrunk, thickened, and thinned manually.
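The exact augmentation code is not given in the report; a minimal sketch of such an
augmentation stage, assuming Keras preprocessing layers applied to image-style input,
could look like this:

import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical augmentation pipeline: random rotation and zoom approximate the
# manual rotating, enlarging, and shrinking of samples described above.
augment = tf.keras.Sequential([
    layers.RandomRotation(0.1),  # rotate by up to +/-10% of a full turn
    layers.RandomZoom(0.1),      # zoom in or out by up to 10%
])

# frames: a batch of images with shape (batch, height, width, channels)
# augmented = augment(frames, training=True)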
Convolution filters are applied to the input using 1D convolutions to extract the most
significant characteristics. In a 1D convolution the kernel slides along a single
dimension, which suits the spatial layout of the extracted features. Convolution
sparsity, combined with pooling for location-invariant feature detection and parameter
sharing, lowers overfitting.
The ReLU layer serves as the activation function as data travels through each layer of
the network, ensuring non-linearity. Without ReLU, the network would collapse into a
purely linear model. ReLU introduces non-linearity, accelerates training, and reduces
computation time.
The pooling layer gradually decreases the dimensionality of the feature maps and the
variation of the represented data. It reduces dimensions and computation, speeds up
processing by reducing the number of parameters the network must compute, reduces
overfitting, and makes the model more tolerant of changes and distortions. Pooling
strategies include max pooling, min pooling, and average pooling; we used max pooling,
which takes the maximum value of each region of the convolved feature.
Flatten is used to transform the data into a one-dimensional array for input to the next
layer.
In the Dense layer, the input tensor is multiplied by the weight matrix (a
matrix-vector multiplication), followed by an activation function. Apart from the
activation function, the essential argument we define here is units, an integer that
selects the output size.
The Dropout layer is a regularisation approach that eliminates neurons from layers at
random, along with their input and output connections. As a consequence, generalisation
is improved and overfitting is avoided.
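The report does not list the exact layer configuration; the following is a minimal
Keras sketch of a three-block CNN of the kind described above, where the input shape
(30 frames of 1662 MediaPipe keypoint values) and the number of classes are assumptions:

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 10          # assumption: number of ASL signs in the dataset
INPUT_SHAPE = (30, 1662)  # assumption: 30 frames x 1662 keypoint values per frame

model = models.Sequential([
    layers.Input(shape=INPUT_SHAPE),
    # Three blocks, each: 1D convolution -> ReLU -> max pooling, as described above
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(128, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(256, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),                  # one-dimensional array for the dense layers
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),               # regularisation, as described above
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

The softmax output pairs naturally with the categorical cross-entropy loss discussed in
the next section.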
3.4 Loss Function
For multi-class classification problems with mutually exclusive classes, categorical
cross-entropy measures the dissimilarity between
the predicted probability distribution of classes and the true distribution. In the realm
of ASL detection, where accurate classification of various sign gestures is crucial,
this loss function plays a pivotal role in guiding the training process. By penalizing
deviations from the actual class probabilities, categorical cross-entropy effectively steers
the model towards learning to make more precise predictions. Its implementation ensures
that the model is trained to discern subtle differences among ASL gestures, ultimately
contributing to enhanced accuracy and proficiency in sign language recognition.
Categorical cross-entropy is thus central to the ASL detection pipeline, as it optimizes
the model's ability to interpret and classify a diverse range of sign language
expressions accurately.
Categorical Cross-Entropy $= -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij})$   (3.1)

where N is the number of samples, M is the number of classes, $y_{ij}$ is the true
(one-hot) label, and $p_{ij}$ is the predicted probability of sample i belonging to class j.
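As an illustration (not taken from the report), Equation (3.1) can be evaluated directly
with NumPy:

import numpy as np

# One-hot true labels for N=2 samples over M=3 classes
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
# Predicted class probabilities from the softmax output
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])

# Equation (3.1): average over samples of -sum_j y_ij * log(p_ij)
cce = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
print(cce)  # ~0.2899 = -(log(0.7) + log(0.8)) / 2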
3.5 Long Short-Term Memory
Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) commonly
used in American Sign Language (ASL) recognition due to its ability to effectively
process sequential data.
An LSTM helps a computer keep track of the movements and expressions unfolding in sign
language videos. Having learned from many ASL videos, it becomes good at recognizing
different signs. With LSTM, computers can understand ASL more accurately and help
bridge communication barriers between people who use sign language and those who don't.
LSTM networks play a vital role in ASL recognition systems by effectively capturing
the temporal structure of sign language gestures and enabling accurate classification
of ASL signs. Their ability to handle sequential data makes them well-suited for the
dynamic nature of ASL communication, contributing to the development of accessible
and inclusive technologies for the Deaf and hard-of-hearing communities.
LSTM was introduced for sequential data processing; its suitability here lies in its
ability to capture long-term dependencies in time-series data like ASL gestures.
The LSTM (Long Short-Term Memory) architecture for American Sign Language (ASL)
typically consists of several layers designed to process sequential data effectively. At
its core, an LSTM network comprises LSTM cells, which are specialized units capable
of retaining information over long sequences. In the context of ASL recognition, the
input to the LSTM architecture typically consists of sequential data representing hand
movements or gestures captured over time.
The architecture typically starts with an input layer that receives sequential data, such as
hand pose coordinates or frames from a video sequence of ASL gestures. These inputs
are then passed through one or more LSTM layers. Each LSTM layer contains multiple
LSTM cells, which internally maintain a cell state and several gating mechanisms to
control the flow of information. These mechanisms enable LSTM cells to selectively
remember or forget information based on the input data and the network’s previous state,
making them well-suited for modeling long-range dependencies in sequential data like
ASL gestures.
Additionally, the LSTM architecture may include optional layers such as dropout layers
to prevent overfitting, batch normalization layers to stabilize training, and dense layers
for feature aggregation and classification. The final layer typically outputs predictions
for ASL signs or gestures based on the processed sequential data.
Overall, the LSTM architecture for ASL recognition leverages its ability to capture
long-term dependencies in sequential data to effectively model the temporal dynamics
of ASL gestures, enabling accurate recognition and interpretation of sign language.
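The report does not specify the exact LSTM configuration; a minimal Keras sketch of the
architecture outlined above, with the input shape and class count as assumptions, might be:

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 10          # assumption
INPUT_SHAPE = (30, 1662)  # assumption: 30 frames x 1662 keypoint values per frame

model = models.Sequential([
    layers.Input(shape=INPUT_SHAPE),
    # Stacked LSTM layers; return_sequences=True passes the full sequence onward,
    # preserving the temporal structure between layers
    layers.LSTM(64, return_sequences=True, activation="tanh"),
    layers.LSTM(128, return_sequences=False, activation="tanh"),
    layers.Dropout(0.3),                   # optional regularisation layer
    layers.Dense(64, activation="relu"),   # feature aggregation
    layers.Dense(NUM_CLASSES, activation="softmax"),  # per-sign probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])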
3.6 System Block Diagram
The overall workflow of the system is shown in the block diagram. Datasets are the
memory of the system: every detection viewed in real time is the result of the dataset.
Data are captured in real time from the front camera of the laptop. MediaPipe's live
perception of simultaneous human pose, face landmarks, and hand tracking in real time
enables various modern applications, including sign language detection. The model is
trained on the landmarks, or key points, of these features (face, pose, and hands)
obtained from MediaPipe. All the data collected from the datasets and the deep learning
models are considered training data, and they are provided to the system so that it can
detect sign language in real time. The input to the system is live video from the
laptop's front camera; as sign language is performed in front of the camera, the
corresponding output appears simultaneously on the screen in text format. The screen
acts as an interface for the Sign Language System, providing an environment for input
data to get processed and for the output to be shown.
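As a concrete sketch of this workflow (not the project's actual code), the loop below
captures frames from the front camera, extracts MediaPipe Holistic keypoints, and feeds
a sliding window of 30 frames to a trained model; the model file name and label list
are hypothetical:

import cv2
import mediapipe as mp
import numpy as np
import tensorflow as tf

mp_holistic = mp.solutions.holistic

ACTIONS = ["hello", "no", "iloveyou"]              # hypothetical label list
model = tf.keras.models.load_model("asl_cnn.h5")   # assumed model file name

def extract_keypoints(results):
    # Flatten pose, face, and hand landmarks into one 1662-value feature vector
    pose = np.array([[lm.x, lm.y, lm.z, lm.visibility]
                     for lm in results.pose_landmarks.landmark]).flatten() \
        if results.pose_landmarks else np.zeros(33 * 4)
    face = np.array([[lm.x, lm.y, lm.z]
                     for lm in results.face_landmarks.landmark]).flatten() \
        if results.face_landmarks else np.zeros(468 * 3)
    lh = np.array([[lm.x, lm.y, lm.z]
                   for lm in results.left_hand_landmarks.landmark]).flatten() \
        if results.left_hand_landmarks else np.zeros(21 * 3)
    rh = np.array([[lm.x, lm.y, lm.z]
                   for lm in results.right_hand_landmarks.landmark]).flatten() \
        if results.right_hand_landmarks else np.zeros(21 * 3)
    return np.concatenate([pose, face, lh, rh])

cap = cv2.VideoCapture(0)  # front camera
sequence = []
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        sequence.append(extract_keypoints(results))
        sequence = sequence[-30:]          # keep the last 30 frames
        if len(sequence) == 30:
            probs = model.predict(np.expand_dims(sequence, axis=0))[0]
            cv2.putText(frame, ACTIONS[int(np.argmax(probs))], (10, 40),
                        cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        cv2.imshow("ASL Detection", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
cv2.destroyAllWindows()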
3.7 Use Case Diagram
3.8 Level 0 DFD
3.9 Level 1 DFD
3.10 Activity Diagram
3.11 INSTRUMENTAL REQUIREMENT
3.11.1 Hardware Requirements
The hardware required for the project includes:
• CPU
• GPU
• Storage
3.11.2 Software Requirements
TensorFlow
TensorFlow is a free, open-source library used in the field of machine learning and
artificial intelligence. Among many other tasks, it can be used for training deep
learning models.
Mediapipe
MediaPipe offers cross-platform, customizable machine learning solutions for live and
streaming media, i.e. real-time video. Its key features are end-to-end acceleration,
build-once-deploy-anywhere portability, ready-to-use solutions, and a free and
open-source licence.
Figure 3.9: Hand Landmarks
3.15 Non-functional Requirement
• Performance Requirement.
• Design Constraints.
• Reliability.
• Usability.
• Maintainability.
3.17 Verification and Validation Procedures
Early stopping utilizes the validation dataset within each fold to prevent overfitting,
while k-fold cross-validation provides a systematic approach to validating model
performance across different subsets of the data. Together, these techniques enable the
selection of the best-performing model parameters while minimizing the risk of
overfitting and ensuring generalization to unseen data. In summary, the combination of
epochs, early stopping, and k-fold cross-validation, with k = 5, forms a powerful
framework for training machine learning models effectively and producing reliable
predictions.
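A minimal sketch of this procedure, assuming a build_model() helper that returns a
compiled network (with an accuracy metric) and hypothetical data file names, could
look like this:

import numpy as np
from sklearn.model_selection import KFold
import tensorflow as tf

# X: keypoint sequences, y: one-hot labels (file names are assumptions)
X = np.load("sequences.npy")
y = np.load("labels.npy")

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kfold.split(X):
    model = build_model()  # hypothetical helper returning a compiled model
    # Stop training once validation loss stops improving, keeping the best weights
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True)
    model.fit(X[train_idx], y[train_idx],
              validation_data=(X[val_idx], y[val_idx]),
              epochs=50, callbacks=[early_stop], verbose=0)
    _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    scores.append(acc)
print(f"Mean 5-fold validation accuracy: {np.mean(scores):.4f}")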
4 RESULTS
The CNN model was trained for 8 epochs, reaching an accuracy of 97.52%. We used the
following parameters for training our model.
After training our model, we visualized the accuracy and loss curves. The accuracy at
epoch 1 was 0.7227 with a validation accuracy of 0.9026, rising to an accuracy of
0.9086 and a validation accuracy of 0.9407 at epoch 2, and increasing thereafter as
shown in Figure 4.1. Likewise, the loss at epoch 1 was 0.9629 with a validation loss of
0.2303, falling to a loss of 0.2372 and a validation loss of 0.1435 at epoch 2, and
decreasing thereafter as shown in Figure 4.2.
Figure 4.2: Loss Graph
4.1 Quantitative Analysis
In order to evaluate the effectiveness of the proposed system, we measured its
performance using various metrics, including Accuracy, Precision, Recall, F1-Score, and
Error Rate. Accuracy refers to how closely the system's predictions align with the true
labels, i.e. the proportion of correct predictions among all predictions; the Error
Rate is its complement, 1 − Accuracy.

Accuracy $= \frac{TP + TN}{TP + TN + FP + FN}$   (4.1)
Precision is the proportion of predicted positives that are actually positive.

Precision $= \frac{TP}{TP + FP}$   (4.2)
Recall is the proportion of actual positives that the system correctly detects.

Recall $= \frac{TP}{TP + FN}$   (4.3)
The F1-Score is a metric that combines precision and recall using their harmonic mean.
It provides a single value for comparison, with higher values indicating better
performance.

F1-Score $= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$   (4.4)
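For illustration (the labels below are hypothetical, not the project's test data), these
metrics can be computed with scikit-learn:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true and predicted sign labels for a handful of test clips
y_true = ["hello", "no", "iloveyou", "hello", "no"]
y_pred = ["hello", "no", "hello",    "hello", "no"]

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("Recall:   ", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F1-Score: ", f1_score(y_true, y_pred, average="macro", zero_division=0))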
4.2 Qualitative Analysis
Sample output from our model is shown below: hand gestures for "Hello", "No", and
"I Love You" were provided as input, and the recognized words can be seen on the screen.
4.3 Comparison of CNN and LSTM
The system was evaluated using both the CNN and LSTM models for 13 epochs, and it was
found that the CNN model outperforms the LSTM model, as shown in Figure 4.5. We have
therefore chosen the CNN model over the LSTM model.
4.4 Model Summary
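The summary table itself did not survive in this copy; in Keras it is produced from the
trained model as follows:

# Prints each layer's name, output shape, and parameter count
model.summary()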
5 TASK COMPLETED
Collectively, these refinements, encompassing dataset expansion, augmentation of the
model architecture, and fine-tuning of hyperparameters, resulted in a substantial
improvement in the overall efficiency of both the LSTM and CNN models, positioning them
as formidable tools in predictive analytics and image processing.
6 REMAINING TASK
APPENDIX
Figure A.2: Home Page
Figure A.4: Tutorials
Figure A.6: Number Sign