
TRIBHUVAN UNIVERSITY

INSTITUTE OF ENGINEERING
LALITPUR ENGINEERING COLLEGE

AMERICAN SIGN LANGUAGE USING CNN

BY
AMRIT SAPKOTA
ASMIT OLI
NISCHAL MAHARJAN
SAKSHYAM ARYAL

A PROJECT PROGRESS REPORT


SUBMITTED TO THE DEPARTMENT OF COMPUTER ENGINEERING
IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR
THE DEGREE OF BACHELOR OF ENGINEERING IN COMPUTER
ENGINEERING

DEPARTMENT OF COMPUTER ENGINEERING


LALITPUR, NEPAL

FEBRUARY, 2024
AMERICAN SIGN LANGUAGE USING CNN

Submitted By
Amrit Sapkota (076 BCT 05)
Asmit Oli (076 BCT 43)
Nischal Maharjan (076 BCT 20)
Sakshyam Aryal (076 BCT 29)

Submitted To
Department of Computer Engineering
Institute of Engineering, Lalitpur Engineering College
Tribhuvan University
Lalitpur, Nepal

A project submitted in partial fulfillment of the requirements for the degree of


Bachelor of Engineering in Computer Engineering

Project Supervisor
Er. Hemant Joshi

February, 2024

COPYRIGHT ©

The author has agreed that the library, Department of Computer Engineering, Institute of
Engineering, Lalitpur Engineering College, may make this project work freely available
for inspection. Moreover, the author has agreed that permission for extensive copying
of this project work for scholarly purposes may be granted by the professor(s) who
supervised the project work recorded herein or, in their absence, by the Head of the
Department wherein this project work was done. It is understood that recognition
will be given to the author of this project work and to the Department of Computer
Engineering, Institute of Engineering, Lalitpur Engineering College in any use of the
material of this project work. Copying, publication, or other use of this project work for
financial gain without the approval of the Department of Computer Engineering, Institute of
Engineering, Lalitpur Engineering College and the author’s written permission is prohibited.

Request for permission to copy or to make any other use of the material in this project
work, in whole or in part, should be addressed to:

Department of Computer Engineering


Institute of Engineering, Lalitpur Engineering College
Patan, Lalitpur, Nepal

DECLARATION

We declare that the work hereby submitted for the degree of Bachelor of Engineering
in Computer Engineering at the Institute of Engineering, Lalitpur Engineering College,
entitled "AMERICAN SIGN LANGUAGE USING CNN", is our own work and has not
been previously submitted by us at any university for any academic award. We authorize
the Institute of Engineering, Lalitpur Engineering College to lend this project work to
other institutions or individuals for the purpose of scholarly research.

Amrit Sapkota (076 BCT 05)

Asmit Oli (076 BCT 43)

Nischal Maharjan (076 BCT 20)

Sakshyam Aryal (076 BCT 29)

February, 2024

CERTIFICATE OF APPROVAL

The undersigned certify that they have read and recommend to the Department of
Computer Engineering for acceptance, a project work entitled “AMERICAN SIGN
LANGUAGE USING CNN”, submitted by Amrit Sapkota (076 BCT 05), Asmit Oli
(076 BCT 43), Nischal Maharjan (076 BCT 20), Sakshyam Aryal (076 BCT 29)
in partial fulfillment of the requirement for the award of the degree of “Bachelor of
Engineering in Computer Engineering”.

Project Supervisor
Er. Hemant Joshi
Head of Department
Department of Computer Engineering, Universal Engineering College

Project Coordinator
Er. Bibat Thokar
Lecturer
Department of Computer Engineering, Lalitpur Engineering College

Er. Praches Acharya


Head of the Department
Department of Computer Engineering, Lalitpur Engineering College

February, 2024

ACKNOWLEDGEMENT

This project work would not have been possible without the guidance and the help of
several individuals who in one way or another contributed and extended their valuable
assistance in the preparation and completion of this study.
First of all, we would like to express our sincere gratitude to our supervisor, Er. Hemant
Joshi, Head of Department, Department of Computer Engineering, Universal Engineering
College, for providing invaluable guidance, insightful comments, meticulous suggestions,
and encouragement throughout the duration of this project work. Our sincere thanks also
go to the Project Coordinator, Er. Bibat Thokar, for coordinating the project works,
providing astute criticism, and having inexhaustible patience.

We are also grateful to our classmates and friends for offering us advice and moral
support. To our family, thank you for encouraging us in all of our pursuits and inspiring
us to follow our dreams. We are especially grateful to our parents, who supported us
emotionally, believed in us and wanted the best for us.

Amrit Sapkota (076 BCT 05)


Asmit Oli (076 BCT 43)
Nischal Maharjan (076 BCT 20)
Sakshyam Aryal (076 BCT 29)

February, 2024

ABSTRACT

There is an undeniable communication problem between the Deaf community and the
hearing majority. Communication becomes hard for deaf people because many people
do not understand sign language. Using innovation in sign language recognition, we try
to tear down this communication barrier. This report shows how Artificial Intelligence
can play a key role in providing the solution. Using the dataset and the front camera of
a laptop, a translation of sign language into text can be seen on the screen in real time;
the input is in video format, whereas the output is in text format. Extracting complex
head and hand movements, along with their constantly changing shapes, for sign language
recognition is considered a difficult problem in computer vision. MediaPipe provides the
necessary key points, or landmarks, of the hand, face, and pose. The model is then trained
using a Convolutional Neural Network (CNN), and the trained model is used to recognize
sign language.

Keywords: Convolution Neural Network (CNN), Deep Learning, Gesture Recognition, Long Short Term Memory (LSTM), Sign Language Recognition

TABLE OF CONTENTS

COPYRIGHT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

DECLARATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

CERTIFICATE OF APPROVAL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

ACKNOWLEDGEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

TABLE OF CONTENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii

1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Problem Statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Project Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5 Scope of Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.6 Potential Project Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.7 Originality of Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.8 Organisation of Project Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 LITERATURE REVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3 METHODOLOGY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2.1 Video Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2.2 Video Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.3 Frame Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.4 Preprocessing Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Convolution Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.4 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.5 Long Short Term Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.6 System Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.7 Use Case Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.8 Level 0 DFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.9 Level 1 DFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.10 Activity Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.11 Instrumental Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.11.1 Hardware Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.11.2 Software Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.12 User Requirement Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.13 Dataset Explanation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.14 Functional Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.15 Non-functional Requirement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.16 Elaboration of Working Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.17 Verification and Validation Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4 RESULTS AND ANALYSIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24


4.1 CNN Accuracy And Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 LSTM Accuracy And Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Quantitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.4 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.5 Comparison of CNN and LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.6 Model Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.6.1 CNN Model Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.6.2 LSTM Model Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
APPENDIX
A.1 Gantt Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

LIST OF FIGURES

Figure 3.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10


Figure 3.2 CNN Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Figure 3.3 Structure of LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Figure 3.4 System Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Figure 3.5 Use Case Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Figure 3.6 Level 0 DFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Figure 3.7 Level 1 DFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Figure 3.8 Activity Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Figure 3.9 Hand Landmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Figure 4.1 Accuracy Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Figure 4.2 Loss Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Figure 4.3 LSTM Accuracy Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Figure 4.4 LSTM Loss Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Figure 4.5 CNN Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Figure 4.6 LSTM Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Figure 4.7 Output from Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Figure 4.8 Comparison between LSTM and CNN . . . . . . . . . . . . . . . . . . . . 29
Figure 4.9 CNN model summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Figure 4.10 LSTM model summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Figure A.1 Gantt Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Figure A.2 Home Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Figure A.3 Sign and Tutorials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Figure A.4 Tutorials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Figure A.5 Alphabet Sign . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Figure A.6 Number Sign . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

LIST OF TABLES

Table 2.1 Summary of Related Works on ASL Recognition . . . . . . . . . . . . . . . . . . . . 9


Table 4.1 Model Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Table 4.2 Model Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

LIST OF ABBREVIATIONS

ConvNets Convolutional Networks


CNN Convolution Neural Network
CT Computed Tomography
CPU Central Processing Unit
DFD Data Flow Diagram
DTW Dynamic Time Warping
Et al. And Others
FPS Frames Per Second
GPU Graphics Processing Unit
LSTM Long Short Term Memory
ReLU Rectified Linear Unit
RGB Red Green Blue
SVM Support Vector Machine
1D One Dimension

1 INTRODUCTION

American Sign Language (ASL) is a visual language used by Deaf and hard-of-hearing
communities in the US and Canada. It relies on handshapes, movements, and facial
expressions for communication. ASL plays a vital role in cultural identity and community
cohesion among Deaf individuals. Advancements in technology and education have
increased its recognition and accessibility. Understanding ASL is crucial for promoting
inclusivity and breaking down communication barriers.

1.1 Background
With the rapid growth of technology around us, Machine Learning and Artificial Intelli-
gence have been used in various sectors to support mankind, including gesture, object,
and face detection. With the help of Deep Learning, a machine imitates the way humans
gain certain types of knowledge: an Artificial Neural Network simulates the human brain,
and convolution layers extract the most important parts of an image to make computation
easier. "Sign Language Detection", the name itself, conveys the gist of the project.
Communication has long been a problem for people with speech and hearing disabilities,
because most people neither understand sign language nor find it easy to learn. Apart from
the grades awarded for this minor project, the core idea is to make communication easier
for deaf people. We set the bar for the project such that it would also be beneficial to
society. The main reason we chose this project is to aid people using Artificial
Intelligence.

1.2 Motivation
The motivation behind studying American Sign Language (ASL) stems from its profound
impact on communication and inclusivity. ASL serves as a primary means of communi-
cation for Deaf and hard-of-hearing individuals, enabling them to express themselves,
interact with others, and participate fully in society. By learning ASL, individuals can
foster greater understanding, empathy, and connection with the Deaf community, break-
ing down communication barriers and promoting inclusivity. Furthermore, studying ASL
provides insight into the linguistic and cultural richness of sign languages, contributing
to a more diverse and inclusive society. Ultimately, the motivation for studying ASL lies
in its ability to empower individuals, promote communication equality, and celebrate the
unique language and culture of the Deaf community.

1.3 Problem Statement


Though a lot of research is going on in sign language recognition, very little of it has been
implemented in practical life. From the team's research we found that, although many
sign language recognition software and hardware solutions exist, people who do not
understand sign language, or who cannot read, still face communication problems; in
addition, glove-based recognition is neither portable nor affordable in practice. This made
us think about using image processing and deep learning for sign language recognition
and providing the output as voice so that everyone can understand.

1.4 Project Objectives


The following are the objectives for sign language detection.

• To design and implement a system that can understand the sign language of
Hearing-impaired people.

• To recognize sign language and provide the output as text.

1.5 Scope of Project


The field of sign language recognition includes the development and application of
techniques for recognizing and interpreting sign language gestures. This involves using
computer vision and machine learning techniques to analyze video input and identify
gestures of sign language users. Sign language recognition has a wide range of potential
applications, including communication aids for deaf people, automatic translation of
sign language into spoken or written language, and an interactive platform for learning
sign language. The scope also extends to improving the accuracy and efficiency of sign
language recognition systems through advances in algorithms, sensor technology, and
data collection. Additionally, this scope also includes addressing challenges related to
sign language diversity, gestural variation, lighting conditions, and the need for robust
real-time performance in a variety of environments.
The project focuses on creating a system to translate American Sign Language gestures
into text, benefiting individuals with hearing impairments. It aims to enhance accessi-
bility and inclusivity by bridging communication gaps between deaf or hard-of-hearing
individuals and the hearing community.

1.6 Potential Project Applications


Potential project applications of American Sign Language (ASL) encompass a wide
range of fields and industries. These applications include enhancing communication
accessibility for deaf or hard-of-hearing individuals in various settings such as education,
workplaces, and public spaces. ASL recognition systems can be developed to facili-
tate seamless communication, allowing individuals to express themselves effectively
through sign language. Additionally, ASL can be integrated into educational tools and
applications to teach hearing individuals or support ASL learners in improving their
signing skills. ASL translation services can provide accessibility for online content,
video calls, or live events, ensuring that deaf or hard-of-hearing individuals can access
information and participate in activities. In healthcare, ASL recognition systems can
improve communication between medical professionals and patients with hearing impair-
ments, leading to better healthcare outcomes. Furthermore, ASL interpretation services
can be invaluable in emergencies, enabling effective communication and assistance for
individuals who rely on sign language. ASL-accessible entertainment and media content
promote inclusivity and representation, while integrating ASL into customer service
channels and social media platforms fosters greater accessibility and engagement for deaf
or hard-of-hearing individuals. Supporting research initiatives in ASL linguistics, tech-
nology, and accessibility is crucial for advancing the field of sign language recognition
and promoting equality and inclusion for individuals with hearing impairments.

1.7 Originality of Project


The originality of a project centered on American Sign Language (ASL) stems from its
capacity to address prevailing research gaps and unmet needs within the field. These
gaps span a spectrum of areas, including the improvement of ASL recognition systems’
accuracy and robustness, particularly in challenging environments or with complex signs.
Moreover, there’s a need for innovative methodologies and technologies to teach ASL
effectively, accommodating diverse learning styles and preferences. Exploring new appli-
cations of ASL technology in domains like healthcare, where effective communication
between deaf or hard-of-hearing individuals and healthcare providers is critical, also
presents a notable research opportunity. Additionally, delving into the socio-cultural
aspects of ASL, such as regional variations and linguistic evolution, alongside addressing
accessibility challenges in ASL interpretation services, offers avenues for original re-
search. Lastly, investigating the interplay between ASL and other languages, modalities,
or communication systems, such as gesture-based interfaces or multimodal platforms,
contributes to the field’s advancement. By addressing these research gaps, a project can
offer invaluable insights, methodologies, and solutions that propel ASL studies forward,
foster inclusivity and accessibility, and ultimately enhance the quality of life for ASL
users.

1.8 Organisation of Project Report


The material in this project report is organized into five chapters. After this introductory
chapter introduces the problem this project tries to address, chapter 2 contains the
literature review of vital and relevant publications, pointing toward a notable research
gap. Chapter 3 describes the methodology used to implement this project. Chapter 4
provides an overview of what has been accomplished and contains some crucial discussion
of the models and methods used. Chapter 5 concludes the project, summarizing the
accomplishments, comparing them with the main objectives, and mentioning pathways
for future research in the same domain.

2 LITERATURE REVIEW

Abdulhamied et al., in their 2021 paper Real-time recognition of American sign language
using long-short term memory neural network and hand detection, used an LSTM and
reported an accuracy of 93.81% [1]. Rahib H. Abiyev et al., in their 2019 paper Sign
Language Translation Using Deep Convolutional Neural Networks, used a CNN with a
finger-spelling dataset and reported an accuracy of 94.91% [2].

Sign language is the most natural and effective way for communications among deaf and
normal people. American Sign Language (ASL) alphabet recognition (i.e. fingerspelling)
using marker-less vision sensor is a challenging task due to the difficulties in hand seg-
mentation and appearance variations among signers. Existing color-based sign language
recognition systems suffer from many challenges such as complex background, hand
segmentation, large inter-class and intra-class variations. In this paper, we propose a new
user independent recognition system for American sign language alphabet using depth
images captured from the low-cost Microsoft Kinect depth sensor. Exploiting depth
information instead of color images overcomes many problems due to their robustness
against illumination and background variations. Hand region can be segmented by
applying a simple preprocessing algorithm over depth image. Feature learning using
convolutional neural network architectures is applied instead of the classical handcrafted
feature extraction methods. Local features extracted from the segmented hand are ef-
fectively learned using a simple unsupervised Principal Component Analysis Network
(PCANet) deep learning architecture. Two strategies of learning the PCANet model are
proposed, namely to train a single PCANet model from samples of all users and to train
a separate PCANet model for each user, respectively. The extracted features are then
recognized using linear Support Vector Machine (SVM) classifier. The performance of
the proposed method is evaluated using public dataset of real depth images captured
from various users. Experimental results show that the performance of the proposed
method outperforms state-of-the-art recognition accuracy using leave-one-out evaluation
strategy. Walaa Aly and Saleh Aly, in their 2019 paper User-Independent American Sign
Language Alphabet Recognition Based on Depth Image and PCANet Features, used a
Principal Component Analysis Network (PCANet) and reported an accuracy of 88% [3].

Jyotishman Bora et al., in their 2023 paper Real-time Assamese Sign Language Recognition
using MediaPipe and Deep Learning, used MediaPipe and the Microsoft Kinect sensor
and reported an accuracy of 96.21%. People lacking the sense of hearing
and the ability to speak have undeniable communication problems in their life. People
with hearing and speech problems communicate using sign language with themselves
and others. Sign language is not essentially known to a more significant portion of the
human population who uses spoken and written language for communication. Therefore,
it is a necessity to develop technological tools for interpretation of sign language. Much
research have been carried out to acknowledge sign language using technology for most
global languages. But there are still scopes of development of tools and techniques for
sign language development for local dialects. There are 22 modern Indian languages
and more than 19000 languages that are spoken regionally as mother tongue. This work
attempts to develop a technical approach for recognizing Assamese Sign Language,
which is one of the 22 modern languages of India. Using machine learning techniques,
this work tried to establish a system for identifying the hand gestures from Assamese
Sign Language. A combination of two-dimensional and three-dimensional images of
Assamese gestures has been used to prepare a dataset. The MediaPipe framework has
been implemented to detect landmarks in the images. The dataset was used for training
of a feed-forward neural network. The results reveal that the method implemented in this
work is effective for the recognition of the other alphabets and gestures in the Assamese
Sign Language. This method could also be tried and tested for the recognition of signs
and gestures for various other local languages of India [4].
Brandon Garcia and Sigberto Alarcon Viesca of Stanford University, in their 2016 paper
Real-time American Sign Language Recognition with Convolutional Neural Networks,
used a CNN and reported an accuracy of 95.72%. A real-time sign
language translator is an important milestone in facilitating communication between
the deaf community and the general public. We hereby present the development and
implementation of an American Sign Language (ASL) fingerspelling translator based on
a convolutional neural network. We utilize a pre-trained GoogLeNet architecture trained
on the ILSVRC2012 dataset, as well as the Surrey University and Massey University
ASL datasets in order to apply transfer learning to this task. We produced a robust model
that consistently classifies letters a-e correctly with first-time users and another that
correctly classifies letters a-k in a majority of cases. Given the limitations of the datasets
and the encouraging results achieved, we are confident that with further research and
more data, we can produce a fully generalizable translator for all ASL letters [5].
C.K.M. Lee et al., in their 2021 paper American sign language recognition and training
method with recurrent neural network, used LSTM, SVM, and RNN models and reported
accuracies of 93.36%, 94.23%, and 95.03%, respectively. Though American sign
language (ASL) has gained recognition from the American society, few ASL applications
have been developed with educational purposes. Those designed with real-time sign
recognition systems are also lacking. Leap motion controller facilitates the real-time
and accurate recognition of ASL signs. It allows an opportunity for designing a learning
application with a real-time sign recognition system that seeks to improve the effective-
ness of ASL learning. The project proposes an ASL learning application prototype. The
application would be a whack-a-mole game with a real-time sign recognition system
embedded. Since both static and dynamic signs (J, Z) exist in ASL alphabets, Long-Short
Term Memory Recurrent Neural Network with k-Nearest-Neighbour method is adopted
as the classification method is based on handling of sequences of input. Characteristics
such as sphere radius, angles between fingers and distance between finger positions are
extracted as input for the classification model [6].
Yulius Obi et al., in their 2023 paper Sign language recognition system for communicating
to people with disabilities, used a CNN and reported an accuracy of 95.1%. Sign language
is one of the most reliable ways of communicating with special needs people, as it can be
done anywhere. However, most people do not understand sign language.
Therefore, we have devised an idea to make a desktop application that can recognize sign
language and convert it to text in real time. This research uses American Sign Language
(ASL) datasets and the Convolutional Neural Networks (CNN) classification system.
In the classification, the hand image is first passed through a filter and after the filter
is applied, the hand is passed through a classifier which predicts the class of the hand
gestures. This research focuses on the accuracy of the recognition [7].
Md. Moklesur Rahman et al., in their 2019 paper A New Benchmark on American Sign
Language Recognition using Convolutional Neural Network, used a CNN and reported
an accuracy of 95.9%. The listening or hearing impaired (deaf/dumb) people use
a set of signs, called sign language instead of speech for communication among them.
However, it is very challenging for non-sign language speakers to communicate with
this community using signs. It is very necessary to develop an application to recognize
gestures or actions of sign languages to make easy communication between the normal
and the deaf community. The American Sign Language (ASL) is one of the mostly
used sign languages in the World, and considering its importance, there are already
existing methods for recognition of ASL with limited accuracy. The objective of this
study is to propose a novel model to enhance the accuracy of the existing methods for
ASL recognition. The study has been performed on the alphabet and numerals of four
publicly available ASL datasets. After preprocessing, the images of the alphabet and
numerals were fed to a newly proposed convolutional neural network (CNN) model, and
the performance of this model was evaluated to recognize the numerals and alphabet of
these datasets. The proposed CNN model significantly (9%) improves the recognition
accuracy of ASL reported by some existing prominent methods [8].
Jungpil Shin et al., in their 2020 paper American Sign Language Alphabet Recognition by
Extracting Features from Hand Pose Estimation, used an SVM and reported an accuracy
of 87%. Sign language is designed to assist the deaf and hard of hearing
community to convey messages and connect with society. Sign language recognition
has been an important domain of research for a long time. Previously, sensor-based
approaches have obtained higher accuracy than vision-based approaches. Due to the
cost-effectiveness of vision-based approaches, researchers have been conducted here
also despite the accuracy drop. The purpose of this research is to recognize American
sign characters using hand images obtained from a web camera. In this work, the media-
pipe hands algorithm was used for estimating hand joints from RGB images of hands
obtained from a web camera and two types of features were generated from the estimated
coordinates of the joints obtained for classification: one is the distances between the
joint points and the other one is the angles between vectors and 3D axes. The classifiers
utilized to classify the characters were support vector machine (SVM) and light gradient
boosting machine (GBM). Three character datasets were used for recognition: the ASL
Alphabet dataset, the Massey dataset, and the finger spelling A dataset [9].

Table 2.1: Summary of Related Works on ASL Recognition

S.N. | Related Works | Results | Tools Used
1 | Real-time American Sign Language Recognition with Convolutional Neural Networks (2016) | accuracy 95.72% | CNN
2 | Sign Language Translation Using Deep Convolutional Neural Networks, Rahib H. (2019) | accuracy 94.91% | CNN, Fingerspelling dataset
3 | User-Independent American Sign Language Alphabet Recognition Based on Depth Image and PCANet Features, Walaa Aly and Saleh Aly (2019) | accuracy 88% | Principal Component Analysis Network (PCANet)
4 | A New Benchmark on American Sign Language Recognition using Convolutional Neural Network, Md. Rahman (2019) | accuracy 95.9% | CNN
5 | American Sign Language Alphabet Recognition by Extracting Features from Hand Pose Estimation (2020) | accuracy 87% | SVM
6 | American sign language recognition and training method with recurrent neural network, Lee (2021) | accuracy 93.36% (LSTM), 94.23% (SVM), 95.03% (RNN) | LSTM, SVM, RNN
7 | Real-time recognition of American sign language using long-short-term memory neural network and hand detection (2021) | accuracy 93.81% | LSTM
8 | Sign language recognition system for communicating to people with disabilities (2023) | accuracy 95.1% | CNN
9 | Real-time Assamese Sign Language Recognition using MediaPipe and Deep Learning (2023) | accuracy 96.21% | MediaPipe, Microsoft Kinect sensor

3 METHODOLOGY

3.1 Data Collection


In this project, we have collected around 20,000 sign language samples and organized
them into 10 classes. Each class is assigned a label, and predictions are made over these
labels. High-quality video recording tools, including cameras and lighting setups that
allow a good view of hand motions, are used to collect the data.
We have used the MediaPipe library to extract key points from the images, which are
stored as data. A sample of the data collection is shown below.

Figure 3.1: Data Collection

3.2 Data Preprocessing


Data preprocessing is the process of preparing raw data and making it suitable for a
machine-learning model. It is the first and crucial step in creating such a model.
Real-world data generally contains noise and missing values and may be in a form that
cannot be used directly by machine-learning models. Data preprocessing cleans the data
and makes it suitable for a machine-learning model, which also increases the model's
accuracy and efficiency. In our project, the MediaPipe library performs most of this
preprocessing for us.

3.2.1 Video Acquisition


The video data for sign language detection will be captured using a high-definition
camera with a decent resolution. The camera will be positioned to capture the frontal
view of the signer’s upper body, focusing on the hand region.

3.2.2 Video Segmentation


Then the acquired video data will be segmented into individual sign language gestures.
We will employ an automatic gesture detection algorithm based on motion and hand
region analysis. This algorithm will then detect significant changes in motion and will
use hand-tracking techniques to separate consecutive gestures from video sequences.

3.2.3 Frame Extraction


From the segmented video data, frames will be extracted at a rate of one frame per
second to capture key moments of each gesture. This ensures a representative set of
frames for further analysis.
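
As a concrete illustration, the following is a minimal OpenCV sketch of keeping roughly one frame per second from a gesture clip, as described above; the function name and file path are illustrative rather than taken from the project code.

```python
# Minimal sketch: keep roughly one frame per second from a gesture video.
# extract_frames() and "gesture_clip.mp4" are illustrative names, not project code.
import cv2

def extract_frames(video_path, out_prefix="frame"):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back to 30 if FPS is not reported
    step = max(1, int(round(fps)))            # sample roughly once per second
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(f"{out_prefix}_{saved:04d}.png", frame)
            saved += 1
        index += 1
    cap.release()
    return saved

print(extract_frames("gesture_clip.mp4"), "frames saved")
```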

3.2.4 Preprocessing Technique


Noise Reduction
After collecting the data using MediaPipe, each landmark has five attributes: x, y, z,
visibility, and presence. Here, visibility and presence are considered noise. We create a
Python function that removes visibility and presence so that the model is trained only on
the x, y, z coordinates and predicts well.
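
A minimal sketch of this keypoint extraction step is shown below, assuming a single-hand MediaPipe Hands pipeline; only the x, y, z values of the 21 hand landmarks are kept, giving the 63 values per sample that match the model input size used later. The function name extract_keypoints is illustrative, not the project's actual code.

```python
# Minimal sketch of the keypoint extraction described above (illustrative names,
# not the project's actual code): keep only x, y, z for 21 hand landmarks = 63 values.
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_keypoints(frame_bgr, hands):
    """Return a (63,) array of x, y, z per landmark; visibility and presence are dropped."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    result = hands.process(rgb)
    if not result.multi_hand_landmarks:
        return np.zeros(63, dtype=np.float32)      # no hand detected in this frame
    hand = result.multi_hand_landmarks[0]          # first (and only) detected hand
    return np.array([[lm.x, lm.y, lm.z] for lm in hand.landmark],
                    dtype=np.float32).flatten()    # 21 landmarks * 3 = 63 values

with mp_hands.Hands(static_image_mode=False, max_num_hands=1) as hands:
    cap = cv2.VideoCapture(0)                      # front camera of the laptop
    ok, frame = cap.read()
    if ok:
        print(extract_keypoints(frame, hands).shape)   # (63,)
    cap.release()
```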

3.3 Convolution Neural Network

Figure 3.2: CNN Architecture

The convolution layer is the most significant; it builds a convolved feature map by applying
a filter to an array of image pixels. We developed a CNN with three convolutional blocks,
each using convolution, ReLU, and pooling. Because a CNN does not handle rotation and
scaling by itself, a data augmentation approach was used: a few samples were rotated,
enlarged, shrunk, thickened, and thinned manually.
Convolution filters are applied to the input using 1D convolutions to extract the most
significant characteristics. In a 1D convolution the kernel slides along a single dimension,
which suits the shape of our keypoint feature vectors. Convolution sparsity, combined
with pooling for location-invariant feature detection and parameter sharing, lowers
overfitting.
The ReLU layer acts as the activation function as data travels through each layer of the
network, introducing the non-linearity that would otherwise be lost; it also accelerates
training and reduces computation time.
The pooling layer gradually decreases the dimension of the features and the variation of
the represented data. It reduces dimensions and computation, speeds up processing by
reducing the number of parameters the network must compute, reduces overfitting, and
makes the model more tolerant of changes and distortions. Pooling strategies include max
pooling, min pooling, and average pooling; we used max pooling, which keeps the maximum
value of each region of the convolved feature.
The Flatten layer is used to transform the data into a one-dimensional array for input to
the next layer.
In a Dense layer, the input tensor is multiplied by the layer's weight matrix, followed by
an activation function. Apart from the activation function, the essential argument we
define here is units, an integer that selects the output size.
The Dropout layer is a regularisation approach that eliminates neurons from layers at
random, along with their input and output connections. As a consequence, generalization
is improved and overfitting is avoided.
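
The sketch below shows how such a 1D CNN could be assembled in Keras, following the layer sizes listed later in Table 4.1 (Conv1D filters 32/64/128, Dense 128/10, input shape (63, 1), Adam optimizer with learning rate 0.001); the kernel size of 3 and the dropout rate of 0.5 are assumptions, as they are not stated in the report.

```python
# A minimal Keras sketch of the 1D CNN described above, assuming the sizes in
# Table 4.1 (Conv1D 32/64/128, Dense 128/10, input (63, 1), Adam, lr 0.001).
# Kernel size 3 and dropout rate 0.5 are assumptions not stated in the report.
from tensorflow.keras import layers, models, optimizers

def build_cnn(num_classes=10):
    model = models.Sequential([
        layers.Input(shape=(63, 1)),                       # 63 keypoint values per sample
        layers.Conv1D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Flatten(),                                  # to a one-dimensional array
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),                               # regularisation against overfitting
        layers.Dense(num_classes, activation="softmax"),   # one probability per sign class
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn()
model.summary()
```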

3.4 Loss Function


Categorical cross-entropy is a widely used loss function in machine learning, particularly
in the context of multi-class classification tasks such as American Sign Language (ASL)
detection. Specifically tailored for scenarios where instances belong to one of several
mutually exclusive classes, categorical cross-entropy measures the dissimilarity between
the predicted probability distribution of classes and the true distribution. In the realm
of ASL detection, where accurate classification of various sign gestures is crucial,
this loss function plays a pivotal role in guiding the training process. By penalizing
deviations from the actual class probabilities, categorical cross-entropy effectively steers
the model towards learning to make more precise predictions. Its implementation ensures
that the model is trained to discern subtle differences among ASL gestures, ultimately
contributing to enhanced accuracy and proficiency in sign language recognition. In the
ASL detection pipeline, categorical cross-entropy therefore directly shapes the model's
ability to interpret and classify a diverse range of sign language expressions accurately.

\[
\text{Categorical Cross-Entropy} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log(p_{ij}) \tag{3.1}
\]
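
A tiny numerical illustration of Eq. (3.1) is given below, assuming one-hot labels y and predicted class probabilities p; the helper function is illustrative only.

```python
# A tiny numerical illustration of Eq. (3.1); the helper name is illustrative.
import numpy as np

def categorical_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1.0)                      # avoid log(0)
    return -np.mean(np.sum(y * np.log(p), axis=1))

y = np.array([[1, 0, 0], [0, 1, 0]])              # one-hot labels for classes 0 and 1
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])  # predicted probabilities
print(categorical_cross_entropy(y, p))            # (-ln 0.7 - ln 0.8) / 2 ~= 0.29
```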

3.5 Long Short Term Memory


Long Short-Term Memory (LSTM) networks are a type of Recurrent Neural Network
(RNN) that can capture long-term dependencies in sequential data. LSTMs are able to
process and analyze sequential data such as time series, text, and speech. They use a
memory cell and gates to control the flow of information, allowing them to selectively
retain or discard information as needed and thus avoid the vanishing gradient problem
that plagues traditional RNNs. LSTMs are widely used in applications such as natural
language processing, speech recognition, and time series forecasting.
There are three types of gates in an LSTM: the input gate, the forget gate, and the output
gate. The input gate controls the flow of information into the memory cell. The forget
gate controls the flow of information out of the memory cell. The output gate controls
the flow of information out of the LSTM and into the output. All three gates are
implemented using sigmoid functions, which produce an output between 0 and 1, and are
trained using the backpropagation algorithm.
The input gate decides which information to store in the memory cell. It is trained to
open when the input is important and close when it is not. The forget gate decides which
information to discard from the memory cell. It is trained to open when the information
is no longer important and close when it is. The output gate is responsible for deciding
which information to use for the output of the LSTM. It is trained to open when the
information is important and close when it is not. The gates in an LSTM are trained
to open and close based on the input and the previous hidden state. This allows the
LSTM to selectively retain or discard information, making it more effective at capturing
long-term dependencies.
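
For reference, the gates described above follow the standard textbook LSTM formulation (this generic form is not taken from the project code), where \(\sigma\) denotes the sigmoid function and \(\odot\) elementwise multiplication:

\[
\begin{aligned}
f_t &= \sigma(W_f[h_{t-1}, x_t] + b_f) &&\text{(forget gate)}\\
i_t &= \sigma(W_i[h_{t-1}, x_t] + b_i) &&\text{(input gate)}\\
\tilde{C}_t &= \tanh(W_C[h_{t-1}, x_t] + b_C) &&\text{(candidate memory)}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t &&\text{(memory cell update)}\\
o_t &= \sigma(W_o[h_{t-1}, x_t] + b_o) &&\text{(output gate)}\\
h_t &= o_t \odot \tanh(C_t) &&\text{(hidden state)}
\end{aligned}
\]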

Structure of LSTM

Figure 3.3: Structure of LSTM

An LSTM (Long Short-Term Memory) network is a type of recurrent neural network
(RNN) that is capable of handling and processing sequential data. The structure of
an LSTM network consists of a series of LSTM cells, each of which has a set of gates
(input, output, and forget gates) that control the flow of information into and out of the
cell. The gates are used to selectively forget or retain information from the previous time
steps, allowing the LSTM to maintain long-term dependencies in the input data. The
LSTM cell also has a memory cell that stores information from previous time steps and
uses it to influence the output of the cell at the current time step. The output of each
LSTM cell is passed to the next cell in the network, allowing the LSTM to process and
analyze sequential data over multiple time steps.
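
For comparison with the CNN sketch earlier, the following is a minimal Keras sketch of an LSTM classifier using the layer sizes listed later in Table 4.2 (LSTM 64/128/64, Dense 64/32/10); treating the 63 keypoint values as a sequence of length 63 with one feature per step is an assumption based on the stated input size (63, 1).

```python
# A minimal Keras sketch of the LSTM classifier, assuming the sizes in Table 4.2
# (LSTM 64/128/64, Dense 64/32/10) and treating the 63 keypoint values as a
# sequence of length 63 with one feature per step (an assumption from input (63, 1)).
from tensorflow.keras import layers, models, optimizers

def build_lstm(num_classes=10):
    model = models.Sequential([
        layers.Input(shape=(63, 1)),
        layers.LSTM(64, return_sequences=True),    # pass the full sequence onward
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(64),                           # final LSTM returns only the last state
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

lstm_model = build_lstm()
lstm_model.summary()
```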

3.6 System Block Diagram

Figure 3.4: System Block Diagram

The overall workflow of the system is shown in the block diagram above. The data sets
act as the memory of the system: every detection viewed in real time is the result of the
data set. Data sets are captured in real time from the front camera of the laptop.
MediaPipe's live perception of simultaneous human pose, face landmarks, and hand
tracking in real time enables various modern applications, including sign language
detection. With the help of the hand landmarks, or key points, obtained from MediaPipe,
we train our model. All the data collected from the data sets are used as training data for
the deep learning models, and these data are provided to the system so that it can detect
sign language in real time. The input to the system is live video from the front camera of
the laptop; as the sign language input is provided, the corresponding output appears
simultaneously on the screen in text format. The system acts as an interface for the Sign
Language System, providing an environment for the input data to be processed and the
output to be produced.

3.7 Use Case Diagram

Figure 3.5: Use Case Diagram

3.8 Level 0 DFD

Figure 3.6: Level 0 DFD

3.9 Level 1 DFD

Figure 3.7: Level 1 DFD

3.10 Activity Diagram

Figure 3.8: Activity Diagram

3.11 Instrumental Requirements
3.11.1 Hardware Requirements
The hardware required for the projects are:

• CPU

• GPU

• Storage

3.11.2 Software Requirements


The software required for the project is:
Python
Python is a high-level language used for general-purpose programming. It was developed
by Guido van Rossum, and its first release was in 1991 as Python 0.9.0. Programming
paradigms such as structured, object-oriented, and functional programming are supported
in Python.

TensorFlow
TensorFlow is a free, open-source library that can be used in the field of machine learning
and artificial intelligence. Among many other tasks, it can be used for training deep
learning models.

Mediapipe
MediaPipe offers cross-platform, customizable machine learning solutions for live and
streaming media, i.e. real-time video. Its key features are end-to-end acceleration, build
once and deploy anywhere, ready-to-use solutions, and free, open-source availability.

Figure 3.9: Hand Landmarks

3.12 User Requirement Definition


The user requirements for this system are that it should be fast, feasible, less prone to
error, save time, and bridge the communication gap between hearing people and deaf
people.

• The system can translate sign language into text.

• The system should have a user-friendly interface.

3.13 Dataset Explanation


In this project, we’ve gathered data on sign language consisting of approximately 20,000
samples, which have been categorized into 10 distinct classes. Each class has been
assigned a label, and predictions are made based on these labels. To capture the data,
we’re using high-quality video recording equipment including cameras and lighting
setups, ensuring optimal visibility of hand movements. We’re employing the MediaPipe
library to extract key points from the images, which are then stored as data.

3.14 Functional Requirements


• Real-time Output.

• Accurate detection of gestures.

• Data sets comment.

3.15 Non-functional Requirement
• Performance Requirement.

• Design Constraints.

• Reliability.

• Usability.

• Maintainability.

3.16 Elaboration of Working Principle


When communicating solely through hand gestures, individuals rely on a rich vocabulary
of signs and movements to convey meaning effectively. In this mode of communication,
hand gestures become the primary means of expression, encompassing a diverse range
of handshapes, movements, and placements. Each gesture represents specific words,
concepts, or ideas, allowing for the exchange of information without the need for spoken
language. Through the manipulation of handshapes and the fluidity of movements, indi-
viduals can articulate a wide array of thoughts, emotions, and actions. Facial expressions
and body language may still accompany hand gestures, enhancing comprehension and
adding depth to the communication. This form of non-verbal communication transcends
linguistic barriers, enabling people to interact and connect across diverse cultural and
linguistic backgrounds. Whether used in American Sign Language (ASL) or other sign
systems, hand gestures serve as a powerful tool for communication and expression,
fostering understanding and inclusivity in human interaction.

3.17 Verification and Validation Procedures


Verification and validation are crucial steps in ensuring the accuracy and reliability
of a machine learning model, particularly in the context of American Sign Language
(ASL) recognition. Firstly, a validation size of 20% of the actual data is set aside to
evaluate the model’s performance on unseen data. This ensures that the model is tested
on data that it hasn’t been trained on, helping to assess its generalization capabilities.
Next, the model.evaluate function is used to evaluate the model’s performance on the
test set (x_test) and corresponding labels (Y_te). This provides quantitative metrics
such as loss and accuracy, which are essential for assessing the model’s effectiveness
in recognizing ASL gestures. Additionally, early stopping is implemented using the
EarlyStopping callback to prevent overfitting and improve the generalization of the
model. This technique monitors the validation loss during training and stops training if
the loss fails to improve for a specified number of epochs (patience), thereby preventing
the model from memorizing the training data and ensuring better performance on unseen
data. Finally, the model.fit function is used to train the model on the training set
(x_train) and corresponding labels (Y_tr). The validation_split parameter is set to
0.1, indicating that 10% of the training data will be used for validation during training.
This allows for monitoring the model’s performance on a separate validation set during
training, helping to detect overfitting and adjust hyperparameters accordingly.
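
A minimal sketch of this training and evaluation procedure is shown below, assuming the model built earlier and data arrays x_train, Y_tr, x_test, Y_te prepared as described; the patience value and the epoch cap are assumptions.

```python
# Minimal sketch of the procedure above; assumes `model` and the arrays
# x_train, Y_tr, x_test, Y_te already exist. Patience and the epoch cap are assumptions.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss",        # watch validation loss
                           patience=3,                # stop after 3 epochs without improvement
                           restore_best_weights=True)

history = model.fit(x_train, Y_tr,
                    validation_split=0.1,             # 10% of training data for validation
                    epochs=20,
                    batch_size=32,
                    callbacks=[early_stop])

loss, accuracy = model.evaluate(x_test, Y_te)         # held-out 20% test split
print(f"Test loss: {loss:.4f}, test accuracy: {accuracy:.4f}")
```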

4 RESULTS AND ANALYSIS

4.1 CNN Accuracy And Loss


The CNN model was trained for 8 epochs and reached an accuracy of 97.52%. We used
the following parameters for training the model.

Table 4.1: Model Parameters

S.N. Parameter Used Value


1 Number of Convolutional Layers 3 (32,64,128)
2 Number of Dense Layers 2 (128, 10)
3 Activation Functions ReLU
4 Learning Rate 0.001
5 Optimizer Adam
6 Batch Size 32
7 Epochs 8
8 Input Array Size (63,1)
9 Loss Function categorical crossentropy

After training our model, we visualized the accuracy and loss curves. The accuracy at
epoch 1 was 0.7227 with a validation accuracy of 0.9026, rising to an accuracy of 0.9086
and a validation accuracy of 0.9407 at epoch 2, and continuing to increase as shown in
figure 4.1. Likewise, the loss at epoch 1 was 0.9629 with a validation loss of 0.2303,
falling to a loss of 0.2372 and a validation loss of 0.1435 at epoch 2, and continuing to
decrease as shown in figure 4.2.

Figure 4.1: Accuracy Graph

Figure 4.2: Loss Graph

4.2 LSTM Accuracy And Loss


The LSTM model was trained for 13 epochs and reached an accuracy of 97.15%. We
used the following parameters for training the model.

Table 4.2: Model Parameters

S.N. Parameter Used Value


1 Number of LSTM Layers 3 (64,128,64)
2 Number of Dense Layers 3 (64, 32, 10)
3 Activation Functions ReLU
4 Learning Rate 0.001
5 Optimizer Adam
6 Batch Size 32
7 Epochs 13
8 Input Array Size (63,1)
9 Loss Function categorical crossentropy

After training our model, we visualized the accuracy and loss curves. The accuracy at
epoch 1 was 0.1270 with a validation accuracy of 0.1012, rising to an accuracy of 0.2012
and a validation accuracy of 0.2673 at epoch 2, and continuing to increase as shown in
figure 4.3. The loss at epoch 1 was 2.5367 with a validation loss of 2.3158; at epoch 2
the loss was 6.2265 and the validation loss 1.9238, after which both go on decreasing as
shown in figure 4.4.

Figure 4.3: LSTM Accuracy Graph

Figure 4.4: LSTM Loss Graph

4.3 Quantitative Analysis


In order to evaluate the effectiveness of the system being proposed, we have measured
its performance using various metrics, including Accuracy, Precision, Recall, F1-Score,
and Error Rate. Accuracy is the proportion of predictions that are correct out of all
predictions made, and is expressed as follows:

\[
\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{4.1}
\]

Precision is a metric that measures the accuracy of positive predictions made by a system.
It can be obtained by dividing true positives by the sum of true positives and false
positives.
\[
\text{Precision} = \frac{TP}{TP + FP} \tag{4.2}
\]
In machine learning, recall, also referred to as sensitivity or true positive rate, represents
the likelihood that the model accurately recognizes the detected anomaly.

\[
\text{Recall} = \frac{TP}{TP + FN} \tag{4.3}
\]

The F1-Score is a metric that combines precision and recall using their harmonic mean. It
provides a single value for comparison, with higher values indicating better performance.

\[
\text{F1-Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4.4}
\]
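
These metrics, and the confusion matrices shown in Figures 4.5 and 4.6, can be computed, for example, with scikit-learn as in the sketch below; the use of scikit-learn and the names y_true and y_pred are assumptions, not taken from the project code.

```python
# Minimal sketch: computing the metrics above with scikit-learn (assumed library).
# y_true / y_pred are illustrative names for true and predicted class indices.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

def classification_summary(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "precision": precision_score(y_true, y_pred, average="macro"),  # averaged over the 10 classes
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "confusion_matrix": confusion_matrix(y_true, y_pred),  # basis for Figures 4.5 and 4.6
    }

# Tiny worked example with three classes:
print(classification_summary(np.array([0, 1, 2, 2]), np.array([0, 1, 2, 1])))
```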

Figure 4.5: CNN Confusion Matrix

Figure 4.6: LSTM Confusion Matrix

4.4 Qualitative Analysis


Output from our model is shown below: hand gestures for "Hello", "No", and "I Love
You" were provided as input, and the corresponding text output can be seen on screen.

Figure 4.7: Output from Model

4.5 Comparison of CNN and LSTM


Both the CNN and LSTM models have been evaluated for 13 epochs, and it was found
that the CNN model outperforms the LSTM model, as shown in figure 4.8. We have
therefore chosen the CNN model over the LSTM model.

Figure 4.8: Comparison between LSTM and CNN

4.6 Model Summary
4.6.1 CNN Model Summary
The following figure 4.9 is our model's summary, which shows the number of convolutional
layers, kernels, filters, and max-pooling layers, along with their numbers of units.

Figure 4.9: CNN model summary

4.6.2 LSTM Model Summary
The following figure 4.10 is the summary of our LSTM model.

Figure 4.10: LSTM model summary

5 CONCLUSION

We developed the American Sign Language (ASL) website and added Convolutional
Neural Networks (CNNs) to improve the website's functionality and user experience.
CNNs were used for image identification and classification, among other tasks. We used
CNNs to perform hand gesture recognition and ASL sign interpretation from image or
video frames, making use of their capacity to extract spatial features from visual data.
Thanks to this technology, visitors could
interact with the website through movements that were recorded by the camera on
their device, making learning easier and more engaging. LSTM networks, on the other
hand, were used for sequential data processing, especially for tasks involving temporal
dependencies, like sequence recognition in sign language. Long-range relationships and
temporal patterns are excellently captured by LSTM networks, which makes them a
good fit for situations where interpreting a sign accurately requires a comprehension of
its context. In order to assess user-inputted ASL sign sequences and provide real-time
feedback and corrections during sign language practice sessions, we implemented LSTM
networks.
We conducted a comparative study between CNNs and LSTM networks in the context of
the ASL website. The results showed that CNNs performed significantly better for tasks
using static visual data, like hand gesture detection from pictures or videos. Their aptitude
for acquiring knowledge of spatial feature hierarchies made them especially suitable
for image-based ASL recognition applications. However, LSTM networks performed
exceptionally well in tasks that required sequential data processing, like deciphering and
understanding ASL sign sequences. Their ability to accurately comprehend sequences in
sign language was made possible by their capacity to describe temporal dynamics and
long-range relationships. This allowed the website to offer users contextually relevant
feedback and help during sign language practice sessions. Overall, we were able to
improve the ASL website’s functionality and user experience by utilizing the advantages
of both CNNs and LSTM networks. This gave users access to a thorough and engaging
platform for learning and using American Sign Language.

APPENDIX

A.1 Gantt Chart

Figure A.1: Gantt Chart

Figure A.2: Home Page

Figure A.3: Sign and Tutorials

Figure A.4: Tutorials

Figure A.5: Alphabet Sign

Figure A.6: Number Sign

REFERENCES

[1] Reham Mohamed Abdulhamied, Mona M Nasr, and Sarah N Abdulkader. Real-time
recognition of american sign language using long-short term memory neural network
and hand detection. 2023.

[2] Rahib H Abiyev, Murat Arslan, and John Bush Idoko. Sign language translation
using deep convolutional neural networks. KSII Transactions on Internet & Infor-
mation Systems, 14(2), 2020.

[3] Walaa Aly, Saleh Aly, and Sultan Almotairi. User-independent american sign
language alphabet recognition based on depth image and pcanet features. IEEE
Access, 7:123138–123150, 2019.

[4] Jyotishman Bora, Saine Dehingia, Abhijit Boruah, Anuraag Anuj Chetia, and Dikhit
Gogoi. Real-time assamese sign language recognition using mediapipe and deep
learning. Procedia Computer Science, 218:1384–1393, 2023.

[5] Brandon Garcia and Sigberto Alarcon Viesca. Real-time american sign language
recognition with convolutional neural networks. Convolutional Neural Networks for
Visual Recognition, 2(225-232):8, 2016.

[6] Carman KM Lee, Kam KH Ng, Chun-Hsien Chen, Henry CW Lau, SY Chung,
and Tiffany Tsoi. American sign language recognition and training method with
recurrent neural network. Expert Systems with Applications, 167:114403, 2021.

[7] Yulius Obi, Kent Samuel Claudio, Vetri Marvel Budiman, Said Achmad, and Aditya
Kurniawan. Sign language recognition system for communicating to people with
disabilities. Procedia Computer Science, 216:13–20, 2023.

[8] Md Moklesur Rahman, Md Shafiqul Islam, Md Hafizur Rahman, Roberto Sassi,
Massimo W Rivolta, and Md Aktaruzzaman. A new benchmark on american sign
language recognition using convolutional neural network. In 2019 International
Conference on Sustainable Technologies for Industry 4.0 (STI), pages 1–6. IEEE,
2019.

[9] Jungpil Shin, Akitaka Matsuoka, Md Al Mehedi Hasan, and Azmain Yakin Srizon.
American sign language alphabet recognition by extracting feature from hand pose
estimation. Sensors, 21(17):5856, 2021.
