INSTITUTE OF ENGINEERING
LALITPUR ENGINEERING COLLEGE
BY
AMRIT SAPKOTA
ASMIT OLI
NISCHAL MAHARJAN
SAKSHYAM ARYAL
FEBRUARY, 2024
AMERICAN SIGN LANGUAGE USING CNN
By
Amrit Sapkota (076 BCT 05)
Asmit Oli (076 BCT 43)
Nischal Maharjan (076 BCT 20)
Sakshyam Aryal (076 BCT 29)
Project Supervisor
Er. Hemant Joshi
February, 2024
ACKNOWLEDGEMENT
This project work would not have been possible without the guidance and the help of
several individuals who in one way or another contributed and extended their valuable
assistance in the preparation and completion of this study.
First of all, we would like to express our sincere gratitude to our supervisor, Er. Hemant
Joshi, Head of Department, Department of Computer Engineering, Universal Engineering
College, for providing invaluable guidance, insightful comments, meticulous suggestions,
and encouragement throughout the duration of this project work. Our sincere thanks also
go to the Project Coordinator, Er. Bibat Thokar, for coordinating the project works,
providing astute criticism, and having inexhaustible patience.
We are also grateful to our classmates and friends for offering us advice and moral
support. To our family, thank you for encouraging us in all of our pursuits and inspiring
us to follow our dreams. We are especially grateful to our parents, who supported us
emotionally, believed in us and wanted the best for us.
February, 2024
ABSTRACT
There is an undeniable communication problem between the Deaf community and the
hearing majority: communication becomes hard for deaf people because many people
do not understand sign language. With the use of innovation in sign language recognition,
we tried to tear down this communication barrier. This report shows how Artificial
Intelligence can play a key role in providing the solution. Using the trained model and
the front camera of a laptop, sign language is translated to text format on the screen in
real time; that is, the input is in video format whereas the output is in text format.
Extracting complex head and hand movements, along with their constantly changing
shapes, for sign language recognition is considered a difficult problem in computer
vision. MediaPipe provides the necessary key points, or landmarks, of the hand, face,
and pose. The model is then trained using a Convolutional Neural Network (CNN), and
the trained model is used to recognize sign language.
TABLE OF CONTENTS
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
1 INTRODUCTION
1.1 Background
1.2 Motivation
1.3 Problem Statement
1.4 Project Objectives
1.5 Scope of Project
1.6 Potential Project Applications
1.7 Originality of Project
1.8 Organisation of Project Report
2 LITERATURE REVIEW
3 METHODOLOGY
3.1 Data Collection
3.2 Data Preprocessing
3.2.1 Video Acquisition
3.2.2 Video Segmentation
3.2.3 Frame Extraction
3.2.4 Preprocessing Technique
3.3 Convolutional Neural Network
3.4 Loss Function
3.5 Long Short-Term Memory
3.6 System Block Diagram
3.7 Use Case Diagram
3.8 Level 0 DFD
3.9 Level 1 DFD
3.10 Activity Diagram
3.11 INSTRUMENTAL REQUIREMENT
3.11.1 Hardware Requirements
3.11.2 Software Requirements
3.12 User Requirement Definition
3.13 Dataset Explanation
3.14 Functional Requirements
3.15 Non-functional Requirement
3.16 Elaboration of Working Principle
3.17 Verification and Validation Procedures
4 RESULTS
4.1 Quantitative Analysis
4.2 Qualitative Analysis
4.3 Comparison of CNN and LSTM
4.4 Model Summary
5 TASK COMPLETED
6 REMAINING TASK
APPENDIX
A.1 Gantt Chart
1 INTRODUCTION
American Sign Language (ASL) is a visual language used by Deaf and hard-of-hearing
communities in the US and Canada. It relies on handshapes, movements, and facial
expressions for communication. ASL plays a vital role in cultural identity and community
cohesion among Deaf individuals. Advancements in technology and education have
increased its recognition and accessibility. Understanding ASL is crucial for promoting
inclusivity and breaking down communication barriers.
1.1 Background
With the rapid growth of technology around us, Machine Learning and Artificial
Intelligence have been used in various sectors to support mankind, including gesture,
object, and face detection. With the help of Deep Learning, a machine imitates the way
humans gain certain types of knowledge: an Artificial Neural Network simulates the
human brain, and convolution layers extract the important parts of an image to make
computation easier. "Sign Language Detection", the name itself, specifies the gist of the
project. Communication through sign language remains a major problem for deaf and
mute people in the community, because most people do not understand sign language
and also find it difficult to learn. Apart from scoring grades in this minor project, the
core idea is to make communication easy for deaf people. We set the bar of the project
such that it would be beneficial to society as well. The main reason for choosing this
project is to aid people using Artificial Intelligence.
1.2 Motivation
The motivation behind studying American Sign Language (ASL) stems from its profound
impact on communication and inclusivity. ASL serves as a primary means of
communication for Deaf and hard-of-hearing individuals, enabling them to express
themselves, interact with others, and participate fully in society. By learning ASL,
individuals can foster greater understanding, empathy, and connection with the Deaf
community, breaking down communication barriers and promoting inclusivity.
Furthermore, studying ASL provides insight into the linguistic and cultural richness of
sign languages, contributing to a more diverse and inclusive society. Ultimately, the
motivation for studying ASL lies in its ability to empower individuals, promote
communication equality, and celebrate the unique language and culture of the Deaf
community.
1.4 Project Objectives
The main objectives of this project are:
• To design and implement a system that can understand the sign language of
hearing-impaired people.
• To train the model with a variety of datasets using MediaPipe and CNN, and
provide the output in real-time.
• To achieve real-time performance in a variety of environments.
1.5 Scope of Project
The project focuses on creating a system to translate American Sign Language gestures
into text, benefiting individuals with hearing impairments. It aims to enhance
accessibility and inclusivity by bridging communication gaps between deaf or
hard-of-hearing individuals and the hearing community.
1.7 Originality of Project
Educational tools built on ASL recognition can also present signing content more
effectively, accommodating diverse learning styles and preferences. Exploring new
applications of ASL technology in domains like healthcare, where effective communication
between deaf or hard-of-hearing individuals and healthcare providers is critical, also
presents a notable research opportunity. Additionally, delving into the socio-cultural
aspects of ASL, such as regional variations and linguistic evolution, alongside addressing
accessibility challenges in ASL interpretation services, offers avenues for original
research. Lastly, investigating the interplay between ASL and other languages, modalities,
or communication systems, such as gesture-based interfaces or multimodal platforms,
contributes to the field’s advancement. By addressing these research gaps, a project can
offer invaluable insights, methodologies, and solutions that propel ASL studies forward,
foster inclusivity and accessibility, and ultimately enhance the quality of life for ASL
users.
2 LITERATURE REVIEW
Table 2.1: Summary of Related Works on ASL Recognition
3 METHODOLOGY
During video acquisition, the camera is positioned to capture a clear view of the
signer's upper body, focusing on the hand region.
3.3 Convolutional Neural Network
The CNN layer is the most significant; it builds a convolved feature map by applying a
filter to an array of picture pixels. We developed a CNN with three layers, each layer
using convolution, ReLU, and pooling. Because a CNN does not handle rotation and
scaling by itself, a data augmentation approach was used: a few samples were rotated,
enlarged, shrunk, thickened, and thinned manually.
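The exact augmentation code is not given in the report; a minimal sketch of such an
augmentation stage, assuming Keras preprocessing layers applied to image-style input,
could look like this:

import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical augmentation pipeline: random rotation and zoom approximate the
# manual rotating, enlarging, and shrinking of samples described above.
augment = tf.keras.Sequential([
    layers.RandomRotation(0.1),  # rotate by up to +/-10% of a full turn
    layers.RandomZoom(0.1),      # zoom in or out by up to 10%
])

# frames: a batch of images with shape (batch, height, width, channels)
# augmented = augment(frames, training=True)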
Convolution filters are applied to the input using 1D convolutions to extract the most
significant characteristics. In a 1D convolution the kernel slides along a single
dimension, which suits the spatial layout of the extracted features. Convolution
sparsity, combined with pooling for location-invariant feature detection and parameter
sharing, lowers overfitting.
The ReLU layer serves as the activation function as data travels through each layer of
the network, ensuring non-linearity. Without ReLU, the network would collapse into a
purely linear model. ReLU introduces non-linearity, accelerates training, and reduces
computation time.
The pooling layer gradually decreases the dimensionality of the feature maps and the
variation of the represented data. It reduces dimensions and computation, speeds up
processing by reducing the number of parameters the network must compute, reduces
overfitting, and makes the model more tolerant of changes and distortions. Pooling
strategies include max pooling, min pooling, and average pooling; we used max pooling,
which takes the maximum value of each region of the convolved feature.
Flatten is used to transform the data into a one-dimensional array for input to the next
layer.
In the Dense layer, the input tensor is multiplied by the weight matrix (a
matrix-vector multiplication), followed by an activation function. Apart from the
activation function, the essential argument we define here is units, an integer that
selects the output size.
The Dropout layer is a regularisation approach that eliminates neurons from layers at
random, along with their input and output connections. As a consequence, generalisation
is improved and overfitting is avoided.
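The report does not list the exact layer configuration; the following is a minimal
Keras sketch of a three-block CNN of the kind described above, where the input shape
(30 frames of 1662 MediaPipe keypoint values) and the number of classes are assumptions:

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 10          # assumption: number of ASL signs in the dataset
INPUT_SHAPE = (30, 1662)  # assumption: 30 frames x 1662 keypoint values per frame

model = models.Sequential([
    layers.Input(shape=INPUT_SHAPE),
    # Three blocks, each: 1D convolution -> ReLU -> max pooling, as described above
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(128, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(256, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),                  # one-dimensional array for the dense layers
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),               # regularisation, as described above
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

The softmax output pairs naturally with the categorical cross-entropy loss discussed in
the next section.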
3.4 Loss Function
For multi-class classification problems with mutually exclusive classes, categorical
cross-entropy measures the dissimilarity between
the predicted probability distribution of classes and the true distribution. In the realm
of ASL detection, where accurate classification of various sign gestures is crucial,
this loss function plays a pivotal role in guiding the training process. By penalizing
deviations from the actual class probabilities, categorical cross-entropy effectively steers
the model towards learning to make more precise predictions. Its implementation ensures
that the model is trained to discern subtle differences among ASL gestures, ultimately
contributing to enhanced accuracy and proficiency in sign language recognition.
Categorical cross-entropy is thus central to the ASL detection pipeline, as it optimizes
the model's ability to interpret and classify a diverse range of sign language
expressions accurately.
Categorical Cross-Entropy $= -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij})$   (3.1)

where N is the number of samples, M is the number of classes, $y_{ij}$ is the true
(one-hot) label, and $p_{ij}$ is the predicted probability of sample i belonging to class j.
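As an illustration (not taken from the report), Equation (3.1) can be evaluated directly
with NumPy:

import numpy as np

# One-hot true labels for N=2 samples over M=3 classes
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
# Predicted class probabilities from the softmax output
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])

# Equation (3.1): average over samples of -sum_j y_ij * log(p_ij)
cce = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
print(cce)  # ~0.2899 = -(log(0.7) + log(0.8)) / 2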
3.5 Long Short-Term Memory
Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) commonly
used in American Sign Language (ASL) recognition due to its ability to effectively
process sequential data.
An LSTM helps a computer keep track of the movements and expressions unfolding in sign
language videos. Having learned from many ASL videos, it becomes good at recognizing
different signs. With LSTM, computers can understand ASL more accurately and help
bridge communication barriers between people who use sign language and those who don't.
LSTM networks play a vital role in ASL recognition systems by effectively capturing
the temporal structure of sign language gestures and enabling accurate classification
of ASL signs. Their ability to handle sequential data makes them well-suited for the
dynamic nature of ASL communication, contributing to the development of accessible
and inclusive technologies for the Deaf and hard-of-hearing communities.
LSTM was introduced for sequential data processing; its suitability here lies in its
ability to capture long-term dependencies in time-series data like ASL gestures.
The LSTM (Long Short-Term Memory) architecture for American Sign Language (ASL)
typically consists of several layers designed to process sequential data effectively. At
its core, an LSTM network comprises LSTM cells, which are specialized units capable
of retaining information over long sequences. In the context of ASL recognition, the
input to the LSTM architecture typically consists of sequential data representing hand
movements or gestures captured over time.
The architecture typically starts with an input layer that receives sequential data, such as
hand pose coordinates or frames from a video sequence of ASL gestures. These inputs
are then passed through one or more LSTM layers. Each LSTM layer contains multiple
LSTM cells, which internally maintain a cell state and several gating mechanisms to
control the flow of information. These mechanisms enable LSTM cells to selectively
remember or forget information based on the input data and the network’s previous state,
making them well-suited for modeling long-range dependencies in sequential data like
ASL gestures.
Additionally, the LSTM architecture may include optional layers such as dropout layers
to prevent overfitting, batch normalization layers to stabilize training, and dense layers
for feature aggregation and classification. The final layer typically outputs predictions
for ASL signs or gestures based on the processed sequential data.
Overall, the LSTM architecture for ASL recognition leverages its ability to capture
long-term dependencies in sequential data to effectively model the temporal dynamics
of ASL gestures, enabling accurate recognition and interpretation of sign language.
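The report does not specify the exact LSTM configuration; a minimal Keras sketch of the
architecture outlined above, with the input shape and class count as assumptions, might be:

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 10          # assumption
INPUT_SHAPE = (30, 1662)  # assumption: 30 frames x 1662 keypoint values per frame

model = models.Sequential([
    layers.Input(shape=INPUT_SHAPE),
    # Stacked LSTM layers; return_sequences=True passes the full sequence onward,
    # preserving the temporal structure between layers
    layers.LSTM(64, return_sequences=True, activation="tanh"),
    layers.LSTM(128, return_sequences=False, activation="tanh"),
    layers.Dropout(0.3),                   # optional regularisation layer
    layers.Dense(64, activation="relu"),   # feature aggregation
    layers.Dense(NUM_CLASSES, activation="softmax"),  # per-sign probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])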
3.6 System Block Diagram
The overall workflow of the system is shown in the block diagram. Datasets are the
memory of the system: every detection viewed in real time is the result of the dataset.
Data are captured in real time from the front camera of the laptop. MediaPipe's live
perception of simultaneous human pose, face landmarks, and hand tracking in real time
enables various modern applications, including sign language detection. The model is
trained on the landmarks, or key points, of these features (face, pose, and hands)
obtained from MediaPipe. All the data collected from the datasets and the deep learning
models are considered training data, and they are provided to the system so that it can
detect sign language in real time. The input to the system is live video from the
laptop's front camera; as sign language is performed in front of the camera, the
corresponding output appears simultaneously on the screen in text format. The screen
acts as an interface for the Sign Language System, providing an environment for input
data to get processed and for the output to be shown.
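As a concrete sketch of this workflow (not the project's actual code), the loop below
captures frames from the front camera, extracts MediaPipe Holistic keypoints, and feeds
a sliding window of 30 frames to a trained model; the model file name and label list
are hypothetical:

import cv2
import mediapipe as mp
import numpy as np
import tensorflow as tf

mp_holistic = mp.solutions.holistic

ACTIONS = ["hello", "no", "iloveyou"]              # hypothetical label list
model = tf.keras.models.load_model("asl_cnn.h5")   # assumed model file name

def extract_keypoints(results):
    # Flatten pose, face, and hand landmarks into one 1662-value feature vector
    pose = np.array([[lm.x, lm.y, lm.z, lm.visibility]
                     for lm in results.pose_landmarks.landmark]).flatten() \
        if results.pose_landmarks else np.zeros(33 * 4)
    face = np.array([[lm.x, lm.y, lm.z]
                     for lm in results.face_landmarks.landmark]).flatten() \
        if results.face_landmarks else np.zeros(468 * 3)
    lh = np.array([[lm.x, lm.y, lm.z]
                   for lm in results.left_hand_landmarks.landmark]).flatten() \
        if results.left_hand_landmarks else np.zeros(21 * 3)
    rh = np.array([[lm.x, lm.y, lm.z]
                   for lm in results.right_hand_landmarks.landmark]).flatten() \
        if results.right_hand_landmarks else np.zeros(21 * 3)
    return np.concatenate([pose, face, lh, rh])

cap = cv2.VideoCapture(0)  # front camera
sequence = []
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        sequence.append(extract_keypoints(results))
        sequence = sequence[-30:]          # keep the last 30 frames
        if len(sequence) == 30:
            probs = model.predict(np.expand_dims(sequence, axis=0))[0]
            cv2.putText(frame, ACTIONS[int(np.argmax(probs))], (10, 40),
                        cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        cv2.imshow("ASL Detection", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
cv2.destroyAllWindows()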
3.7 Use Case Diagram
3.8 Level 0 DFD
3.9 Level 1 DFD
3.10 Activity Diagram
3.11 INSTRUMENTAL REQUIREMENT
3.11.1 Hardware Requirements
The hardware required for the project includes:
• CPU
• GPU
• Storage
3.11.2 Software Requirements
TensorFlow
TensorFlow is a free, open-source library used in the field of machine learning and
artificial intelligence. Among many other tasks, it can be used for training deep
learning models.
Mediapipe
MediaPipe offers cross-platform, customizable machine learning solutions for live and
streaming media, i.e. real-time video. Its key features are end-to-end acceleration,
build-once-deploy-anywhere portability, ready-to-use solutions, and a free and
open-source licence.
Figure 3.9: Hand Landmarks
3.15 Non-functional Requirement
• Performance Requirement.
• Design Constraints.
• Reliability.
• Usability.
• Maintainability.
3.17 Verification and Validation Procedures
Early stopping utilizes the validation dataset within each fold to prevent overfitting,
while k-fold cross-validation provides a systematic approach to validating model
performance across different subsets of the data. Together, these techniques enable the
selection of the best-performing model parameters while minimizing the risk of
overfitting and ensuring generalization to unseen data. In summary, the combination of
epochs, early stopping, and k-fold cross-validation, with k = 5, forms a powerful
framework for training machine learning models effectively and producing reliable
predictions.
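A minimal sketch of this procedure, assuming a build_model() helper that returns a
compiled network (with an accuracy metric) and hypothetical data file names, could
look like this:

import numpy as np
from sklearn.model_selection import KFold
import tensorflow as tf

# X: keypoint sequences, y: one-hot labels (file names are assumptions)
X = np.load("sequences.npy")
y = np.load("labels.npy")

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kfold.split(X):
    model = build_model()  # hypothetical helper returning a compiled model
    # Stop training once validation loss stops improving, keeping the best weights
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True)
    model.fit(X[train_idx], y[train_idx],
              validation_data=(X[val_idx], y[val_idx]),
              epochs=50, callbacks=[early_stop], verbose=0)
    _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    scores.append(acc)
print(f"Mean 5-fold validation accuracy: {np.mean(scores):.4f}")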
4 RESULTS
The CNN model was trained for 8 epochs, reaching an accuracy of 97.52%. We used the
following parameters for training our model.
After training our model, we visualized the accuracy and loss curves. The accuracy at
epoch 1 was 0.7227 with a validation accuracy of 0.9026, rising to an accuracy of
0.9086 and a validation accuracy of 0.9407 at epoch 2, and increasing thereafter as
shown in Figure 4.1. Likewise, the loss at epoch 1 was 0.9629 with a validation loss of
0.2303, falling to a loss of 0.2372 and a validation loss of 0.1435 at epoch 2, and
decreasing thereafter as shown in Figure 4.2.
Figure 4.2: Loss Graph
4.1 Quantitative Analysis
In order to evaluate the effectiveness of the proposed system, we measured its
performance using various metrics, including Accuracy, Precision, Recall, F1-Score, and
Error Rate. Accuracy refers to how closely the system's predictions align with the true
labels, i.e. the proportion of correct predictions among all predictions; the Error
Rate is its complement, 1 − Accuracy.

Accuracy $= \frac{TP + TN}{TP + TN + FP + FN}$   (4.1)
Precision is the proportion of predicted positives that are actually positive.

Precision $= \frac{TP}{TP + FP}$   (4.2)
Recall is the proportion of actual positives that the system correctly detects.

Recall $= \frac{TP}{TP + FN}$   (4.3)
The F1-Score is a metric that combines precision and recall using their harmonic mean.
It provides a single value for comparison, with higher values indicating better
performance.

F1-Score $= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$   (4.4)
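For illustration (the labels below are hypothetical, not the project's test data), these
metrics can be computed with scikit-learn:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true and predicted sign labels for a handful of test clips
y_true = ["hello", "no", "iloveyou", "hello", "no"]
y_pred = ["hello", "no", "hello",    "hello", "no"]

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("Recall:   ", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F1-Score: ", f1_score(y_true, y_pred, average="macro", zero_division=0))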
4.2 Qualitative Analysis
Sample output from our model is shown below: hand gestures for "Hello", "No", and
"I Love You" were provided as input, and the recognized words can be seen on the screen.
4.3 Comparison of CNN and LSTM
The system was evaluated using both the CNN and LSTM models for 13 epochs, and it was
found that the CNN model outperforms the LSTM model, as shown in Figure 4.5. We have
therefore chosen the CNN model over the LSTM model.
4.4 Model Summary
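The summary table itself did not survive in this copy; in Keras it is produced from the
trained model as follows:

# Prints each layer's name, output shape, and parameter count
model.summary()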
5 TASK COMPLETED
Collectively, these refinements, encompassing dataset expansion, augmentation of the
model architecture, and fine-tuning of hyperparameters, resulted in a substantial
improvement in the overall efficiency of both the LSTM and CNN models, positioning them
as formidable tools in predictive analytics and image processing.
6 REMAINING TASK
APPENDIX
Figure A.2: Home Page
Figure A.4: Tutorials
Figure A.6: Number Sign