AMERICAN - SIGN - LANGUAGE - DETECTION Mid Term Defence
SUBMITTED BY
Amrit Sapkota (076 BCT 05)
Asmit Oli (076 BCT 43)
Nischal Maharjan (076 BCT 20)
Sakshyam Aryal (076 BCT 29)
SUBMITTED TO:
DEPARTMENT OF COMPUTER ENGINEERING
LALITPUR ENGINEERING COLLEGE
LALITPUR, NEPAL
SUPERVISED BY
Er. Hemant Joshi
December, 2023
TRIBHUVAN UNIVERSITY
INSTITUTE OF ENGINEERING
LALITPUR ENGINEERING COLLEGE
DEPARTMENT OF COMPUTER ENGINEERING
SUBMITTED BY
Amrit Sapkota (076 BCT 05)
Asmit Oli (076 BCT 43)
Nischal Maharjan (076 BCT 20)
Sakshyam Aryal (076 BCT 29)
SUPERVISED BY
Er. Hemant Joshi
December, 2023
ACKNOWLEDGEMENT
First and foremost, we would like to thank our supervisor, Er. Hemant Joshi,
who guided us in doing this project. He provided us with invaluable advice and
helped us through difficult stages. His motivation contributed tremendously to the
successful completion of the project. We are really grateful to our project coordinator,
Er. Bibat Thokar, for advising us and introducing the project to us in an
easy-to-understand way, which has helped us to complete our project easily and
effectively on time. We would like to express our special thanks of gratitude to IOE
as well as our principal, Dr. Surendra Tamrakar, who gave us the golden opportunity
to do this wonderful project on the topic of sign language detection, which also
involved a lot of research through which we came to know about so many new
things. We are really thankful to them. Besides, we would like to thank all the
teachers who helped us by advising us and providing the equipment we needed. We
are overwhelmed in all humbleness and gratefulness to acknowledge our debt to all
those who have helped us to put these ideas, well above the level of simplicity, into
something concrete. Also, we would like to thank our family and friends for their
support. Without their support, we would not have succeeded in completing this
project. Last but not least, we would like to thank everyone who helped and
motivated us to work on this project.
Sincerely,
Amrit Sapkota (076 BCT 05)
Asmit Oli (076 BCT 43)
Nischal Maharjan (076 BCT 20)
Sakshyam Aryal (076 BCT 29)
ABSTRACT
TABLE OF CONTENTS
ACKNOWLEDGEMENT i
ABSTRACT ii
TABLE OF CONTENTS iii
LIST OF FIGURES v
LIST OF ABBREVIATIONS vi
1 INTRODUCTION 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5 System Requirement . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5.1 Functional Requirements . . . . . . . . . . . . . . . . . . . 2
1.5.2 Non-functional Requirement . . . . . . . . . . . . . . . . . . 3
2 LITERATURE REVIEW 4
3 BLOCK DIAGRAM 7
3.1 System Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Use Case Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 Level 0 DFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.4 Level 1 DFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.5 Activity Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4 METHODOLOGY 12
4.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2.1 Video Acquisition . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2.2 Video Segmentation . . . . . . . . . . . . . . . . . . . . . . . 13
4.2.3 Frame Extraction . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2.4 Preprocessing Techniques . . . . . . . . . . . . . . . . . . . 13
4.2.5 Hand Segmentation . . . . . . . . . . . . . . . . . . . . . . . 14
4.2.6 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . 14
4.3 Convolution Neural Network . . . . . . . . . . . . . . . . . . . . . . 14
4.4 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5 IMPLEMENTATION PLAN 17
5.1 Gantt Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
6 REQUIREMENT ANALYSIS 18
6.1 Hardware Requirements . . . . . . . . . . . . . . . . . . . . . . . . 18
6.2 Software Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 18
6.3 User Requirement Definition . . . . . . . . . . . . . . . . . . . . . . 19
7 RESULT AND ANALYSIS 20
7.1 Quantitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 21
7.2 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 22
7.3 Comparison of CNN and LSTM . . . . . . . . . . . . . . . . . . 23
8 EPILOGUE 24
8.1 Task Completed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
8.2 Remaining Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
REFERENCES 27
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1
INTRODUCTION
1.1 Background
With the rapid growth of technology around us, Machine Learning and Artificial
Intelligence have been used in various sectors to support mankind, including gesture,
object, and face detection. With the help of Deep Learning, a machine imitates
the way humans gain certain types of knowledge. Artificial Neural Networks simulate
the human brain, and convolution layers extract the important parts of an image to
make computation easier. "Sign Language Detection": the name itself specifies the
gist of the project. Sign language recognition has been a major communication
barrier between people with hearing and speech disabilities and the rest of the
community. Most people do not understand sign language, and it is also difficult
for them to learn it. Apart from the scoring grades for this minor project, the
core idea is to make communication easy for deaf people. We set the bar of the
project such that it would be beneficial to society as well. The main reason for us
to choose this project is to aid people using Artificial Intelligence.
1.3 Scope
The field of sign language recognition includes the development and application of
techniques for recognizing and interpreting sign language gestures. This involves
using computer vision and machine learning techniques to analyze video input and
identify gestures of sign language users. Sign language recognition has a wide range
of potential applications, including communication aids for deaf people, automatic
translation of sign language into spoken or written language, and an interactive
platform for learning sign language. The scope also extends to improving the
accuracy and efficiency of sign language recognition systems through advances in
algorithms, sensor technology and data collection. Additionally, this scope also
includes addressing challenges related to sign language diversity, gestural variation,
lighting conditions, and the need for robust real-time performance in a variety of
environments.
1.4 Objectives
• To design and implement a system that can understand the sign language of
hearing-impaired people.
• To train the model with a variety of datasets using MediaPipe and CNN,
and provide the output in real-time.
1.5 System Requirement
1.5.1 Functional Requirements
The following is the desired functionality of the new system. The proposed project
would cover:
• Real-time Output.
• Data sets.
1.5.2 Non-functional Requirements
• Performance Requirement.
• Design Constraints.
• Reliability.
• Usability.
• Maintainability.
CHAPTER 2
LITERATURE REVIEW
Rahib H. Abiyev et al. in 2019, in their paper "Sign Language Translation Using
Deep Convolutional Neural Networks", used a CNN with a finger-spelling dataset
and predicted the result with an accuracy of 94.91% [2].
Jungpil Shin et al. in 2020, in their paper "American Sign Language Alphabet
Recognition by Extracting Feature from Hand Pose Estimation", used an SVM and
predicted the result with an accuracy of 87% [5].
C. K. M. Lee et al. in 2021, in their paper "American sign language recognition and
training method with recurrent neural network", used LSTM, SVM, and RNN models
and predicted the result with accuracies of 93.36%, 94.23%, and 95.03% respectively [6].
Yulius Obi et al. in 2023, in their paper "Sign language recognition system for
communicating to people with disabilities", used a CNN and predicted the result
with an accuracy of 95.1% [8].
Jyotishman Bora et al. in 2023, in their paper "Real-time Assamese Sign Language
Recognition using MediaPipe and Deep Learning", used MediaPipe and a Microsoft
Kinect sensor and predicted the result with an accuracy of 96.21% [9].
Table 2.1: Summary of Related Works on ASL Recognition
CHAPTER 3
BLOCK DIAGRAM
3.1 System Block Diagram
The overall workflow of the system is shown in the block diagram above. The
data-set is like the memory of the system: each and every detection that we view in
real time is a result of the data-set. Data-sets are captured in real time from the
front camera of the laptop. Using MediaPipe, live perception of simultaneous
human pose, face landmarks, and hand tracking in real time enables various modern
applications, including sign language detection. With the help of the landmarks, or
key points of features (face, pose, and hands), that we get from MediaPipe, we
train our model. All the data that we collected from the data-sets and from the
deep learning models is considered training data. These data are provided to the
system such that the system can detect sign language in real time. Input to this
system is real-time (live) video from the front camera of the laptop. As the
real-time input, i.e. sign language, is provided using the front camera of the laptop,
the live output can simultaneously be seen on the screen in text format. The system
acts as an interface for the Sign Language System, providing an environment for
input data to be processed and for the output to be provided.
3.2 Use Case Diagram
3.3 Level 0 DFD
3.4 Level 1 DFD
3.5 Activity Diagram
CHAPTER 4
METHODOLOGY
4.1 Data Collection
In this project, we have collected around 20,000 sign language samples and divided
them into 10 classes. These classes are then made into labels, and the predictions
are made from these labels. High-quality video recording tools, including cameras
and lighting setups that allow for good viewing of hand motions, will be used to
collect the data. We have used the MediaPipe library to extract key points from
the images, which are stored as data. A sample of the data collection is shown below.
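As an illustrative sketch of this step (function names are our own; the MediaPipe calls follow its published Hands API, and the `mediapipe` and `opencv-python` packages would be needed at runtime), the key points of one frame can be flattened into a feature vector like this:

```python
from typing import List, Sequence, Tuple

def flatten_landmarks(landmarks: Sequence[Tuple[float, float, float]]) -> List[float]:
    """Flatten (x, y, z) hand landmarks into a single feature vector.

    MediaPipe Hands returns 21 landmarks per detected hand, so one hand
    yields a 63-dimensional vector (21 * 3).
    """
    return [coord for point in landmarks for coord in point]

def extract_keypoints_from_image(image_bgr):
    """Run MediaPipe Hands on one BGR frame and return a feature vector.

    Hypothetical wiring for our pipeline; imports are deferred so the
    helper above stays usable without mediapipe installed.
    """
    import cv2
    import mediapipe as mp

    with mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        results = hands.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
        if not results.multi_hand_landmarks:
            return [0.0] * 63  # no hand detected: zero-filled placeholder
        hand = results.multi_hand_landmarks[0]
        return flatten_landmarks([(lm.x, lm.y, lm.z) for lm in hand.landmark])
```

One vector per frame, stacked over a gesture, is what we store as training data.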
4.2 Data Preprocessing
Data preprocessing is the process of preparing raw data and making it suitable
for a machine-learning model. It is the first and crucial step in creating a
machine-learning model. Real-world data generally contains noise and missing
values and cannot be used directly by machine-learning models. Data preprocessing
is a required task for cleaning the data and making it suitable for a
machine-learning model, which also increases the accuracy and efficiency of the
model. In our project, we have used the MediaPipe library, which does the
preprocessing for our data.
4.2.1 Video Acquisition
The video data for sign language detection will be captured using a high-definition
camera with a decent resolution. The camera will be positioned to capture the
frontal view of the signer's upper body, focusing on the hand region.
4.2.2 Video Segmentation
Then the acquired video data will be segmented into individual sign language
gestures. We will employ an automatic gesture detection algorithm based on motion
and hand region analysis. This algorithm will detect significant changes in
motion and will use hand-tracking techniques to separate consecutive gestures in
the video sequences.
4.2.3 Frame Extraction
From the segmented video data, frames will be extracted at a rate of one frame
per second to capture key moments of each gesture. A representative set of frames
is thus guaranteed for further analysis.
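A minimal sketch of the one-frame-per-second sampling logic (the function name is our own; the frame rate could be obtained, for example, from OpenCV's `CAP_PROP_FPS` property):

```python
def frame_indices_at_one_fps(total_frames: int, fps: float) -> list:
    """Indices of the frames to keep when sampling one frame per second.

    With a 30 fps video we keep every 30th frame, i.e. indices 0, 30, 60, ...
    """
    step = max(1, int(round(fps)))  # frames per sampled frame
    return list(range(0, total_frames, step))
```

The returned indices can then be used to pull the corresponding frames out of the decoded video stream.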
4.2.4 Preprocessing Techniques
Grayscale Conversion
The extracted frames will be converted to grayscale to minimize the impact of
minor variations caused by lighting conditions.
Contrast Enhancements
Histogram equalization will be applied to the grayscale frames to improve the
visibility of hand features. This will enhance the contrast and increase the overall
dynamic range of pixel intensities.
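As a sketch of this step, a plain-numpy histogram equalization for 8-bit grayscale frames could look as follows (the function name is our own; in practice a library routine such as OpenCV's `equalizeHist` would do the same job):

```python
import numpy as np

def equalize_histogram(gray: np.ndarray) -> np.ndarray:
    """Classic histogram equalization for an 8-bit grayscale image.

    Maps pixel values through the normalized cumulative histogram so the
    output intensities spread over the full [0, 255] range.
    """
    hist = np.bincount(gray.ravel(), minlength=256)      # per-level pixel counts
    cdf = hist.cumsum()                                  # cumulative distribution
    cdf_min = cdf[cdf > 0][0]                            # first non-zero CDF value
    scale = max(gray.size - cdf_min, 1)
    lut = np.clip(np.round((cdf - cdf_min) / scale * 255.0), 0, 255).astype(np.uint8)
    return lut[gray]                                     # look-up-table remapping
```

An image with only two gray levels, for instance, is stretched to pure black and pure white.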
Normalization
Min-max scaling has been used to translate the intensity values from the [0, 255]
range to [0, 1], standardizing the pixel values across frames. By ensuring that the
input data has a consistent range, this normalization step helps convergence
during model training.
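The scaling itself is a one-liner; a sketch (function name ours) for 8-bit frames:

```python
import numpy as np

def normalize_frame(frame: np.ndarray) -> np.ndarray:
    """Min-max scale 8-bit pixel intensities from [0, 255] to [0, 1].

    For uint8 input the minimum possible value is 0 and the maximum 255,
    so dividing by 255 is exactly the min-max formula (x - 0) / (255 - 0).
    """
    return frame.astype(np.float32) / 255.0
```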
4.2.5 Hand Segmentation
Hand segmentation techniques based on color and region analysis have been used
because hand movements are important in sign language. To separate the hands
from the background and other unimportant items, this technique will use
background subtraction and skin-color modeling.
4.2.6 Data Augmentation
Data augmentation techniques have been used to broaden the variety and amount
of the training dataset. These consist of randomized cropping, rotation,
translation, and flipping of the frames. Data augmentation strengthens the model's
ability to recognize sign gestures in a variety of situations.
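As a small sketch of two of these transforms (function name ours; a real pipeline would typically combine more operations, e.g. via a library such as Keras' preprocessing layers), a random horizontal flip plus a small random translation can be written with numpy alone:

```python
import numpy as np

def augment_frame(frame: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a random horizontal flip and a small random horizontal shift.

    Both operations only rearrange pixels, so the augmented frame contains
    exactly the same intensity values as the original.
    """
    out = frame.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)                 # mirror left-right
    shift = int(rng.integers(-2, 3))         # translate by up to 2 pixels
    out = np.roll(out, shift, axis=1)        # wrap-around horizontal shift
    return out
```

Applying this with a fresh random generator per epoch yields a slightly different view of each training frame every time.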
4.3 Convolution Neural Network
The CNN layer is the most significant; it builds a convolved feature map by
applying a filter to an array of image pixels. We developed a CNN with three
layers, each using convolution, ReLU, and pooling. Because a CNN does not handle
rotation and scaling by itself, a data augmentation approach was used: a few
samples were rotated, enlarged, shrunk, thickened, and thinned manually.
Convolution filters are applied to the input using 1D Convolutions to extract
the most significant characteristics. The kernel glides in one dimension in 1D
convolution, which suits the spatial properties of the extracted key-point
sequences. Convolution sparsity, when used with pooling for location-invariant
feature detection and parameter sharing, lowers overfitting.
Figure 4.2: CNN Architecture
ReLU layer: as data travels through each layer of the network, the ReLU layer
functions as an activation function, ensuring non-linearity. Without it, the
stacked layers would collapse into a single linear transformation. It introduces
non-linearity, accelerates training, and reduces computation time.
Pooling layer: a layer that gradually decreases the dimension of the features
and the variation of the represented data. It decreases dimensions and
computation, speeds up processing by reducing the number of parameters that the
network must compute, reduces overfitting, and makes the model more tolerant of
changes and distortions. Pooling strategies include max pooling, min pooling,
and average pooling; we used max pooling, which takes the maximum value of each
window of the convolved feature.
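To make these two operations concrete, here is a minimal numpy sketch of ReLU and 2x2 max pooling (function names are our own; in the actual model they are provided by the framework's layers):

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """Element-wise rectified linear unit: negative values become zero."""
    return np.maximum(x, 0.0)

def max_pool_2x2(feature_map: np.ndarray) -> np.ndarray:
    """2x2 max pooling with stride 2 on an (H, W) feature map (H, W even).

    Each non-overlapping 2x2 block is replaced by its maximum, halving
    both spatial dimensions.
    """
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))
```

A 4x4 feature map thus becomes 2x2, keeping only the strongest activation in each region, which is what makes the representation tolerant to small shifts.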
Flatten is used to transform the data into a one-dimensional array for input to
the next layer.
Dense layer: the weights are multiplied by the input tensors (a matrix-vector
multiplication), followed by an activation function. Apart from the activation
function, the essential argument that we define here is units, an integer that
selects the output size.
Dropout layer is a regularisation approach that eliminates neurons from layers
at random, along with their input and output connections. As a consequence,
generality is improved, and overfitting is avoided.
4.4 Loss Function
The model is trained with the categorical cross-entropy loss:
\[
\text{Categorical Cross-Entropy} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log(p_{ij}) \tag{4.1}
\]
where $N$ is the number of samples, $M$ is the number of classes, $y_{ij}$ is the
one-hot ground-truth label, and $p_{ij}$ is the predicted probability.
CHAPTER 5
IMPLEMENTATION PLAN
Weeks
Task 1 2 3 4 5 6 7 8 9 10 11 12
Literature Review
Proposal Defense
Implementation
Mid-term Defense
Testing and Debugging
Validation and Analysis
Final Submission of Project
Documentation
Coordination with Supervisor
CHAPTER 6
REQUIREMENT ANALYSIS
6.1 Hardware Requirements
• CPU
• GPU
• Storage
6.2 Software Requirements
TensorFlow
TensorFlow is a free, open-source library that can be used in the field of machine
learning and artificial intelligence. Among many other tasks, it can be used for
training deep learning models.
Mediapipe
MediaPipe offers cross-platform, customizable machine learning solutions for live
and streaming media, i.e., real-time video. Its features are end-to-end acceleration;
build once, deploy anywhere; ready-to-use solutions; and free and open source.
Figure 6.1: Hand Landmarks
6.3 User Requirement Definition
The user requirement for this system is to make the system fast, feasible, and
less prone to error, to save time, and to bridge the communication gap between
hearing people and deaf people.
CHAPTER 7
RESULT AND ANALYSIS
The CNN model has been trained for 8 epochs with an accuracy of 97.52%. We have
used the following parameters for training our model.
After training our model, we visualized the accuracy and loss curves. We found
that the accuracy at epoch 1 was 0.7227 with a validation accuracy of 0.9026,
rising to 0.9086 accuracy and 0.9407 validation accuracy at epoch 2, and continuing
to increase as shown in figure 7.1. Likewise, the loss at epoch 1 was 0.9629 with
a validation loss of 0.2303, falling to 0.2372 loss and 0.1435 validation loss at
epoch 2, and continuing to decrease as shown in figure 7.2.
Figure 7.2: Epoch Loss
7.1 Quantitative Analysis
The model has been evaluated using Accuracy, Precision, Recall, F1-Score, and
Error Rate. Accuracy refers to how closely the measurements match the true values:
\[
\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{7.1}
\]
\[
\text{Precision} = \frac{TP}{TP + FP} \tag{7.2}
\]
\[
\text{Recall} = \frac{TP}{TP + FN} \tag{7.3}
\]
F1-Score combines Precision and Recall by taking their harmonic mean. It provides
a single value for comparison, with higher values indicating better performance.
\[
\text{F1-Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7.4}
\]
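Equations (7.1) to (7.4) translate directly into code; a small sketch (function name ours) working from the confusion-matrix counts:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, Precision, Recall, and F1-Score per Eqs. (7.1)-(7.4).

    tp/tn/fp/fn are the true-positive, true-negative, false-positive and
    false-negative counts from the confusion matrix.
    """
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For example, tp=8, tn=5, fp=2, fn=1 gives an accuracy of 13/16 and an F1-Score of 16/19.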
7.2 Qualitative Analysis
Output from our model is shown below. Hand gestures such as "Hello", "No", and
"I Love You" were provided as input, and the corresponding output can be seen
on screen.
Figure 7.4: Output from Model
7.3 Comparison of CNN and LSTM
The model has been evaluated using both the CNN and LSTM models for 13 epochs,
and it was found that the CNN model outperforms the LSTM model, as shown in
figure 7.5. We have therefore chosen the CNN model over the LSTM model.
CHAPTER 8
EPILOGUE
Furthermore, a deliberate increase in kernel size and the number of
filters in the CNN model was undertaken. This adjustment facilitated
a broader scope of feature extraction, enabling the model to discern
more complex spatial hierarchies within the input data. As a result,
the model’s capacity for recognizing subtle patterns in images was
greatly amplified. Collectively, these refinements, encompassing
dataset expansion, augmentation of model architecture, and fine-
tuning of hyperparameters, culminated in a substantial increase in
the overall efficiency of both the LSTM and CNN models, positioning
them as formidable tools in the realm of predictive analytics and
image processing.
REFERENCES
[7] R. M. Abdulhamied, M. M. Nasr, and S. N. Abdulkader, "Real-time recognition
of American Sign Language using long short-term memory neural network and hand
detection," 2023.