Project report submitted in partial fulfillment of the requirement for the degree of
BACHELOR OF TECHNOLOGY
IN
By
DECLARATION
ACKNOWLEDGEMENT
LIST OF ACRONYMS AND ABBREVIATIONS
LIST OF SYMBOLS
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
CHAPTER-1: INTRODUCTION
We hereby declare that the work reported in the B.Tech Project Report entitled “SIGN LANGUAGE
authentic record of our work carried out under the supervision of PROF. ANUJ MAURYA. We have
not submitted this work elsewhere for any other degree or diploma.
-------------------------- -------------------------
ARCHITA GUPTA SHRUTI SHARMA
171021 171051
This is to certify that the above statement made by the candidates is correct to the best of my knowledge.
-------------------------
ANUJ MAURYA
We take this opportunity to express our gratitude to our supervisor Prof. Anuj Maurya, for his insightful
advice, motivating suggestions, invaluable guidance, help and support in successful completion of this
project and also for his constant encouragement and advice throughout our project.
The in-house facilities provided by the department throughout the project are also equally
acknowledgeable. We would like to convey our thanks to the teaching and non-teaching staff of the
Electronics and Communication Engineering Department for their invaluable help and support.
LIST OF ACRONYMS AND ABBREVIATIONS
LIST OF SYMBOLS
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
Sign language is a natural language used by hearing- or speech-impaired people to communicate.
It uses hand gestures instead of sound to convey meaning. More than 2 million people in India
are deaf. They find it difficult to communicate with hearing people because most hearing people
cannot understand sign languages. There is therefore a need for sign language translators who can
mediate between signers and non-signers. However, the availability of such translators is limited
and costly, and a translator cannot accompany a deaf person throughout their entire life. This led
to the development of sign language recognition systems, which can automatically translate signs
into text or voice. In our method, the hand image is first passed through a filter, and the
filtered image is then passed through a classifier that predicts the class of the hand gesture.
In our project we focus on producing a model that can recognize fingerspelling-based hand
gestures in order to form a complete word by combining each gesture.
CHAPTER 1
INTRODUCTION
Sign language is a form of communication used by people with impaired hearing and speech. People use
sign language gestures as a means of non-verbal communication to express their thoughts and emotions.
But non-signers find it extremely difficult to understand, hence trained sign language interpreters are
needed during medical and legal appointments, educational and training sessions. Over the past five
years, there has been an increasing demand for interpreting services.
The SLR architecture can be categorized into two main classes based on its input: data-glove-based
and vision-based. Chouhan et al. use smart gloves to acquire measurements such as hand position,
joint orientation, and velocity using microcontrollers and specific sensors, e.g., accelerometers
and flex sensors. There are other approaches to capturing signs that use motion sensors, such as
electromyography (EMG) sensors, RGB cameras, Kinect sensors, Leap Motion controllers, or their
combinations. The advantage of the glove-based approach is higher accuracy, and its weakness is that
it restricts movement. In recent years, vision-based techniques, whose input comes from a camera
(web camera, stereo camera, or 3D camera), have become more popular. Sandjaja and Marcos [10] used
color-coded gloves to make hand detection easier. A combination of both architectures, called the
hybrid architecture, is also possible. While vision-based systems are more affordable and less
constraining than data gloves, their weakness is lower accuracy and high computing power
consumption.
The architecture of these vision-based systems is typically divided into two main parts. The first part is
feature extraction, which extracts the desired features from a video by using image processing or
computer vision techniques. The second part is the recognizer, which applies machine learning
algorithms to the extracted features in order to learn patterns from the training data and correctly
recognize the testing data. Most of the studies mentioned above focus on translating the signs typically
made by the hearing-impaired person, or signer, into word(s) that the hearing majority, or non-signers,
can understand. Although these studies proved that technology is useful in many ways, the proponents
think that such systems are intrusive to some hearing-impaired individuals. Instead, the proponents
propose a system that helps non-signers who want to learn basic static sign language without being
intrusive. It is also important to mention that there are mobile phone applications that help
non-signers learn sign language through several videos installed in the apps. However, most of these
apps require a large amount of storage and a good internet connection.
The proposed study aims to develop a system that recognizes static sign gestures and converts them
into corresponding words. A vision-based approach using a web camera is introduced to obtain the data
from the signer, and the system can be used offline. The system is intended to serve as a learning
tool for those who want to know more about the basics of sign language, such as alphabets, numbers,
and common static signs. The proponents provided a white background and a specific location for image
processing of the hand, thus improving the accuracy of the system, and used a Convolutional Neural
Network (CNN) as the recognizer. The scope of the study includes basic static signs, numbers and ASL
alphabets (A–Z). One of the main features of this study is the ability of the system to create words
by fingerspelling without the use of sensors and other external technologies.
LITERATURE REVIEW
A review of the literature shows that there have been several approaches to gesture recognition in
video using several different methods. One of these used Hidden Markov Models (HMM) to recognize
facial expressions from video sequences, combined with Bayesian Network Classifiers and the Gaussian
Tree Augmented Naive Bayes Classifier. Francois also published a paper on human posture recognition
in a video sequence using methods based on 2D and 3D appearance. The work mentions using PCA to
recognize silhouettes from a static camera and then using 3D to model posture for recognition. This
approach has the drawback of having intermediary gestures, which may lead to ambiguity in training
and therefore lower accuracy in prediction.
Another approach analyzes video segments using neural networks, which involves extracting visual
information in the form of feature vectors. Neural networks do face issues such as tracking of hands,
segmentation of the subject from the background and environment, illumination variation, occlusion,
movement and position. The paper splits the dataset into segments, extracts features and classifies
them using Euclidean distance and K-nearest neighbor.
Another work describes how to perform continuous Indian Sign Language recognition. The paper proposes
frame extraction from video data, preprocessing the data, extracting keyframes from the data followed
by extracting other features, recognition and finally optimization. Preprocessing is done by converting
the video to a sequence of RGB frames, each frame having the same dimensions. Skin color
segmentation is used to extract skin regions with the help of an HSV gradient. The images obtained
were converted to binary form. The keyframes were extracted by calculating a gradient between the
frames, and features were extracted from the keyframes using an orientation histogram. Classification
was done by Euclidean distance, Manhattan distance, chessboard distance and Mahalanobis distance.
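The distance-based classification described above can be illustrated with a minimal sketch. It assumes each gesture is already represented as a fixed-length feature vector; the template vectors, labels and function names below are hypothetical and are not taken from the cited paper. Mahalanobis distance is left out because it additionally needs a covariance estimate.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chessboard(a, b):
    """Largest absolute coordinate difference (Chebyshev distance)."""
    return max(abs(x - y) for x, y in zip(a, b))


def nearest_label(query, templates, distance):
    """Return the label of the template closest to `query` under `distance`."""
    return min(templates, key=lambda t: distance(query, t[0]))[1]


# Hypothetical feature vectors for two gesture classes.
templates = [([0.0, 0.0, 1.0], "A"), ([1.0, 1.0, 0.0], "B")]
query = [0.9, 0.8, 0.1]
label = nearest_label(query, templates, euclidean)
```

Swapping `euclidean` for `manhattan` or `chessboard` changes only how closeness is measured; the nearest-template rule stays the same.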
In a paper by Jie et al. [2], the authors identified problems in SLR, such as recognition errors
when signs are broken down into individual words, and the issues with continuous SLR. They decided
to solve the problem without isolating individual signs, which removes an extra level of preprocessing
(temporal segmentation) and another extra layer of post-processing, because they believed that temporal
segmentation is non-trivial and that its errors propagate into subsequent steps. The strenuous
labelling of individual words adds a further challenge. They addressed this issue with a new
framework called Hierarchical Attention Network with Latent Space (LS-HAN), which eliminates the
preprocessing of temporal segmentation. The framework consists of a two-stream CNN for video feature
representation generation, a Latent Space for semantic gap bridging and a Hierarchical Attention
Network for latent-space-based recognition.
CHAPTER 2
MACHINE LEARNING
Machine learning is an application of artificial intelligence that gives systems the ability to
automatically learn and improve from experience without being explicitly programmed.
Machine learning focuses on the development of computer programs that can access data and
use it to learn for themselves. The process of learning begins with observations and data, such as
examples, direct experience or instruction, in order to look for patterns in data and make better
decisions in the future based on the examples that we provide. The primary aim is to allow computers
to learn automatically without human intervention or assistance and adjust actions accordingly.
The machine learning life cycle is a cyclical process with three phases (pipeline development,
training, and inference) followed by data scientists and data engineers to develop, train and serve
models using the large amounts of data involved in various applications, so that the organization
can take advantage of artificial intelligence.
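SUPERVISED LEARNING:
In supervised learning, a target variable (dependent variable) is predicted from a
given set of predictors (independent variables). Using these sets of variables, we can generate a
function that maps inputs to the desired outputs.
Based on the type of target variable, supervised learning problems can further be divided into
two groups: regression, where the target is continuous, and classification, where the target is
categorical.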
UNSUPERVISED LEARNING:
In this learning we do not have any outcome variable or target to predict. It is mainly used for
grouping similar observations into clusters.
Any machine learning model development can broadly be divided into six steps:
1) The problem is first understood in a comprehensive way. We identify the purpose of the problem
and the prediction target variable.
2) We then derive some essential data parameters that have a significant correlation with the
prediction target.
3) DATA COLLECTION is gathering the data from relevant sources regarding the problem.
4) The data is then explored and converted into the required form. This helps in detecting outliers
and missing values.
a. Reading the data: We read the available raw data into the analysis system or software.
b. Variable identification: We identify whether each variable is dependent or independent, and
continuous or discrete.
c. Univariate analysis: Here we explore one variable at a time and summarize it.
d. Bivariate analysis: Here we study the empirical relationship between two variables.
that variable.
Create dataset for predictive model: we divide the dataset into two groups:
TRAINING DATA
TESTING DATA
TRAINING DATA: The observations in the training set form the experience that the algorithm uses to
learn. This is the part of the data we use to train our model, that is, the data which the model
actually sees and learns from.
TESTING DATA: The test set is a set of observations used to evaluate the performance
of the model using some performance metric. It is important that no observations from the
training set are included in the test set. If the test set does contain examples from the
training set, it will be difficult to assess whether the algorithm has learned to generalize
from the training set or has simply memorized it.
Once our model is completely trained, the testing data provides an unbiased evaluation. When we
feed in the inputs of the testing data, our model predicts some values (without seeing the actual
output). After prediction, we evaluate our model by comparing its predictions with the actual output
present in the testing data. This is how we evaluate how much our model has learned from the
experience fed in as training data.
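The split into training and testing data can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names, the 80/20 ratio and the toy samples are assumptions made for the example, not values from this project.

```python
import random


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of `samples` and split it into disjoint train and test lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predictions, targets):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Toy dataset of (feature, label) pairs.
samples = [(i, i % 2) for i in range(10)]
train, test = train_test_split(samples, test_fraction=0.2, seed=0)
```

Because every sample lands in exactly one of the two lists, no training observation can leak into the test set, which is the property the evaluation relies on.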
CHAPTER 3
NEURAL NETWORKS
A neural network is, simply put, a series of algorithms that is extremely good at recognizing underlying
relationships (correlations) in a set of data through a process that mimics the way the human brain
operates.
As humans, we have the exceptional ability to notice patterns in our everyday lives. Think of every time
you solved a puzzle, or when you instantly recognized a song within a few seconds of it playing, or when
you look anywhere and immediately recognize the thing that you are looking at. Or even when you speak.
How were you able to achieve these extraordinary things without even having to think about it? This is
thanks to our powerful brain, which gives us the ability to recognize patterns and notice correlations and
has been the entire inspiration for the research behind Deep learning, with the hopes that we can create
even more powerful machines by trying to replicate and even improve what humans are already able to
do.
Neural networks have endless applications in today’s world. From solving many business problems such
as sales forecasting, customer research, data validation, and risk management, to image and voice
recognition in the world of medicine, to self-driving cars, the applications are truly endless.
WORKING
An ANN is a model that solves a super complex math problem using a super complex math function. We
give it a problem with a bunch of data describing it (the input layer), and it is able to find out the optimal
solutions (the output layer, it is what you want to predict) by computing a complex function.
between neurons, and are what allows the model to become more accurate over time (by updating
The input layer: What the machine always knows. Ex: The banking behavior of a customer.
The output layer: What the machine will predict Ex: Whether or not the customer will quit within the
next 6 months.
Gradient descent: The algorithm that allows us to get more and more accurate data as the model
Weights: These are the things that get updated by the model to become more accurate after every
iteration. They are represented by the connections formed between each neuron. Each connection has a
different weight.
we get a predicted value y. Forward propagation is the process by which we multiply the input node by a
These steps are repeated until the error is sufficiently minimized, which corresponds to finding the
optimal weights.
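The loop of forward propagation, error measurement and weight update can be illustrated on the smallest possible case. The sketch below assumes a single neuron with one weight and no bias or activation function, and uses mean squared error as the loss; the data, learning rate and step count are chosen only for illustration.

```python
def forward(w, x):
    """Forward propagation for one neuron: multiply the input by the weight."""
    return w * x


def mse(w, xs, ys):
    """Mean squared error between predictions and targets."""
    return sum((forward(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w, xs, ys):
    """Derivative of the mean squared error with respect to the weight."""
    return sum(2 * (forward(w, x) - y) * x for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, w=0.0, learning_rate=0.05, steps=50):
    """Gradient descent: repeatedly move the weight against the gradient."""
    for _ in range(steps):
        w -= learning_rate * gradient(w, xs, ys)
    return w


# Targets follow y = 2x, so the optimal weight is 2.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
w = train(xs, ys)
```

A real network repeats the same idea for every weight at once, with the gradients obtained by backpropagation.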
Image classification is the task of taking an input image and outputting a class, or a probability of
classes, that best describes the image. A CNN takes an image as input, assigns importance to various
aspects/features in the image, and is able to differentiate one from another. The pre-processing
required in a CNN is much lower than in other classification algorithms.
Unlike regular neural networks, in the layers of a CNN the neurons are arranged in 3 dimensions: width,
height, depth. The neurons in a layer are connected only to a small region (window size) of the layer
before it, instead of to all of the neurons in a fully-connected manner. Moreover, the final output
layer has dimensions equal to the number of classes, because by the end of the CNN architecture the
full image is reduced to a single vector of class scores.
1. Convolution Layer: The main objective of convolution is to extract features such as edges, colours,
and corners from the input. As we go deeper inside the network, the network starts identifying more
complex features such as shapes, digits, and face parts as well.
In the convolution layer we take a small window [typically of size 5 × 5] that extends through the
depth of the input matrix. The layer consists of learnable filters of this window size. During every
iteration we slide the window by the stride size and compute the dot product of the filter entries and
the input values at a given position. As we continue this process, we create a 2-dimensional activation
matrix that gives the response of that filter at every spatial position. That is, the network will
learn filters that activate when they see some type of visual feature, such as an edge of some
orientation or a blotch of some color.
At the end of the convolution process, we have a feature matrix which has fewer
parameters (dimensions) than the actual image as well as clearer features. From this point on, we
work with the feature matrix.
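The sliding-window dot product can be written out directly. The sketch below is a simplified illustration: it assumes a single-channel image, one filter, and no padding, and the filter values are fixed by hand rather than learned. As in most CNN libraries, the filter is applied without flipping.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` and return the 2-D activation matrix."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Dot product of the filter entries and the window under it.
            row.append(sum(
                image[top + di][left + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw)
            ))
        output.append(row)
    return output


# A 4 x 4 image whose right half is bright, and a hand-made vertical-edge filter.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [[-1, 1], [-1, 1]]
activation = convolve2d(image, vertical_edge)
```

The activation is non-zero only in the column where the window straddles the dark-to-bright boundary, which is what "a filter that activates on an edge" means in practice.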
2. Pooling Layer: We use a pooling layer to decrease the size of the activation matrix and ultimately
reduce the number of learnable parameters. This layer exists solely to decrease the computational power
required to process the data, which it does by decreasing the dimensions of the feature matrix even
further. In this layer, we try to extract the dominant features from a restricted neighborhood.
There are two types of pooling:
a) Max Pooling: In max pooling we take a window [for example, of size 2 × 2] and keep only the
maximum of its 4 values. We slide this window and continue the process, so we finally get an
activation matrix half of its original size in each dimension.
b) Average Pooling: In average pooling we take the average of all values in the window instead of
the maximum.
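Max pooling with a 2 × 2 window can be sketched as follows. The example assumes the stride equals the window size, so that windows do not overlap; the input values are arbitrary.

```python
def max_pool2d(matrix, size=2, stride=None):
    """Keep the maximum of each `size` x `size` window of `matrix`."""
    stride = stride or size  # non-overlapping windows by default
    out_h = (len(matrix) - size) // stride + 1
    out_w = (len(matrix[0]) - size) // stride + 1
    return [
        [
            max(
                matrix[i * stride + di][j * stride + dj]
                for di in range(size) for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


activation = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 4, 6],
]
pooled = max_pool2d(activation)
```

The 4 × 4 input becomes a 2 × 2 output, so each spatial dimension is halved while the strongest response in every neighborhood is kept.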
3. Fully Connected Layer: In the convolution layer, neurons are connected only to a local region, while
in a fully connected layer all the inputs are connected to every neuron.
4. Final Output Layer: After getting values from the fully connected layer, we connect them to a final
layer of neurons [with a count equal to the total number of classes] that predicts the probability of
the image belonging to each class.
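One common way to turn the final layer's raw scores into class probabilities is the softmax function. The sketch below assumes the scores are already computed; the scores and the three class labels are made up for the example.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores, labels):
    """Return the most probable label together with its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


labels = ["A", "B", "C"]
label, confidence = predict([1.0, 3.0, 0.5], labels)
```

The predicted class is simply the one with the largest probability, and the probability itself can be used as a confidence value.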
1. Provide the input image into convolution layer.
PROPOSED PROJECT
A sign language recognition (SLR) system takes an input expression from the hearing-impaired
person and gives output to the hearing person in the form of text or voice.
Our project goal is to take a simple step towards bridging the social and communication
gap between hearing people and hearing-impaired people with the help of sign language.
Data acquisition
Data preprocessing
Feature extraction
Gesture classification
DATA ACQUISITION: The different approaches to acquire data about the hand gesture can be
grouped as follows:
1. Use of sensory devices: It uses electromechanical devices to provide exact hand configuration
and position. Different glove-based approaches can be used to extract information. But it is
expensive and restricts the user's movement.
2. Vision-based approach: In vision-based methods, a computer camera is the input device for
observing the information of hands or fingers. Vision-based methods require only a camera,
thus realizing a natural interaction between humans and computers without the use of any extra
devices. These systems tend to complement biological vision by describing artificial vision systems
that are implemented in software and/or hardware. The main challenge of vision-based hand
detection is to cope with the large variability of the human hand's appearance due to a huge number
of hand movements, different skin-colour possibilities, and variations in viewpoints.
DATA PREPROCESSING
As images are not captured in a controlled environment and have different resolutions and
sizes, preprocessing of the images is required. It is a method to digitize images and extract
useful information from them.
This phase contains three steps: image segmentation (skin masking), skin detection, and edge
detection. From the raw image, a skin mask is generated by converting the image to the HSV color
space. Using the skin mask, the skin can be segmented. Finally, the Canny edge detection technique
is used to detect sharp discontinuities in the image, thus detecting its edges.
FEATURE EXTRACTION
Feature extraction is one of the most important steps in sign language recognition, because it gives
a feature vector as output, which is used by the classifier as input. Feature extraction techniques
used to find objects and shapes must be reliable and robust, and must not depend on factors such as
orientation.
The features can be obtained using different techniques such as texture features, orientation
histograms, etc. In some cases, Principal Component Analysis (PCA) is used to reduce the
dimensionality of the feature vector.
CLASSIFICATION
Once the dataset is generated, the next step is classification. Before classification, it is
necessary to divide the data into training and testing sets.
Once the data is ready, the training data is fed to the machine learning model. During the
testing phase, the trained model identifies the class corresponding to each sign and gives the
output in text or audio format.
Some of the commonly used classifiers are Artificial Neural Network (ANN), K-Nearest Neighbour
(KNN), and Support Vector Machine (SVM).
An artificial neural network involves artificial neurons that show complex behavior determined by
the connections between elements and their parameters. ANN is used to infer a function from given
inputs. One network type that has been used to classify sign language gestures of the alphabet is
the Self-Organizing Map.
The two most used supervised learning networks are the Feed Forward Back Propagation Network (BPN)
and the Radial Basis Function Neural Network (RBFNN). RBFNN has been used for static gesture
recognition.
The K-nearest neighbor (KNN) classifier classifies objects based on feature space using supervised
learning. An object is assigned to the class which is most common among its K nearest neighbors.
K-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases
based on a similarity measure.
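The majority-vote rule of KNN can be sketched in a few lines. The example assumes Euclidean distance as the similarity measure and uses made-up two-dimensional feature vectors; a real gesture classifier would use the extracted feature vectors instead.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    `train` is a list of (feature_vector, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical stored cases for two gesture classes.
train = [
    ([0.1, 0.2], "A"), ([0.0, 0.3], "A"), ([0.2, 0.1], "A"),
    ([0.9, 0.8], "B"), ([1.0, 0.7], "B"), ([0.8, 0.9], "B"),
]
prediction = knn_predict(train, [0.15, 0.15])
```

There is no training step beyond storing the cases; all of the work happens when a new sample is compared against them.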
Support Vector Machines (SVM) take the input data and predict which of two possible classes each
input belongs to. Support Vector Machines are based on decision hyperplanes that define decision
boundaries. A decision plane separates two sets of objects having different class membership.
TESTING
To verify the accuracy of the letter/number gestures recognition, the number of the correctly
recognized letters/numbers that appeared on the screen was added and divided by the product of the
If the system generates the equivalent letter/number beyond 15 seconds, it is not included in the
METHODOLOGY
The objective of this project is to identify symbolic expressions through images so that the
communication gap between a hearing and a hearing-impaired person can be reduced.
b) To segment the skin part from the image, as the remaining part can be regarded as noise w.r.t
c) To extract relevant features from the skin segmented images which can prove significant for
d) To use the extracted features as input into various supervised learning models for training
PREREQUISITES
First we define the nodes of the computation graph, then inside a session, the
Keras: Keras is a high-level neural networks library written in Python that works as a
wrapper to TensorFlow. It is used in cases where we want to quickly build and test a
neural network with minimal lines of code. It contains implementations of commonly used
neural network elements like layers, objectives, activation functions, optimizers, and related tools.
OpenCV: OpenCV (Open Source Computer Vision) is an open-source library of
programming functions used for real-time computer vision. It is mainly used for image
processing, video capture and analysis for features like face and object recognition. It is
written in C++, which is its primary interface; however, bindings are available for Python,
Java, and MATLAB/OCTAVE.
Jupyter Notebook: The Jupyter Notebook is an open-source web application that allows you to
create and share documents that contain live code, equations, visualizations and narrative text.
Uses include data cleaning and transformation, numerical simulation, statistical modeling, data
visualization, machine learning, and much more.
The system will be implemented on a desktop with a 1080p Full-HD web camera. The camera will
capture images of the hands that will be fed into the system. Note that the signer will adjust to the size of
the frame so that the system is able to capture the orientation of the signer’s hand. Once the camera
has captured the gesture from the user, the system classifies the test sample, compares it with the
stored gestures in a dictionary, and displays the corresponding output on the screen for the user.
A. Data collection
The datasets for static SLR were gathered by continuously capturing images using Python. Images were
automatically cropped and converted to 50 × 50 pixel black-and-white samples. Each class contained
1,200 images, which were then flipped horizontally to account for left-handed signers.
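The horizontal flipping step can be illustrated with a minimal sketch. It assumes each image is stored as a list of pixel rows, which is an assumption made for the example; the helper names and the tiny 2 × 3 image are hypothetical.

```python
def flip_horizontal(image):
    """Mirror an image left to right by reversing every pixel row."""
    return [row[::-1] for row in image]


def augment_with_flips(dataset):
    """Return the dataset with a mirrored copy added after every sample."""
    augmented = []
    for image, label in dataset:
        augmented.append((image, label))
        augmented.append((flip_horizontal(image), label))
    return augmented


# A tiny 2 x 3 black-and-white image standing in for a 50 x 50 sample.
sample = [
    [0, 0, 255],
    [0, 255, 255],
]
dataset = augment_with_flips([(sample, "A")])
```

Each mirrored copy keeps the label of its source image, so the number of samples per class doubles.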
For improved skin color recognition, the signer was advised to have a clear background for the hands, which
makes it easier for the system to detect skin colors. Skin detection began with cv2.cvtColor, which
converted the images from RGB to HSV. The HSV frame was then supplied to the cv2.inRange function,
with the lower and upper ranges as the arguments; the output of cv2.inRange was the mask.
White pixels in the mask were considered to be the skin region of the frame, while black pixels were
disregarded. The cv2.erode and cv2.dilate functions were used to remove small regions that may
represent false-positive skin regions; two iterations of erosion and dilation were done using
a structuring kernel. Lastly, the resulting masks were smoothed using a Gaussian blur.
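The thresholding idea behind the skin mask can be sketched without OpenCV, using only the standard library. The helper names are hypothetical, and the lower and upper HSV bounds are illustrative values, not the ranges used in this project. The conversion follows OpenCV's 8-bit convention, in which hue runs from 0 to 179 and saturation and value run from 0 to 255. Erosion, dilation and blurring are not shown.

```python
import colorsys


def rgb_to_hsv_opencv(r, g, b):
    """Convert 8-bit RGB to HSV in OpenCV's 8-bit ranges (H: 0-179, S and V: 0-255)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return round(h * 180) % 180, round(s * 255), round(v * 255)


def in_range(pixel, lower, upper):
    """True when every channel lies within the inclusive bounds."""
    return all(lo <= c <= hi for c, lo, hi in zip(pixel, lower, upper))


def skin_mask(image_rgb, lower, upper):
    """Return a mask with 255 where the pixel falls in the HSV range, else 0."""
    return [
        [255 if in_range(rgb_to_hsv_opencv(*px), lower, upper) else 0 for px in row]
        for row in image_rgb
    ]


# Illustrative HSV bounds only; suitable values depend on lighting and camera.
LOWER = (0, 40, 60)
UPPER = (25, 255, 255)

image = [
    [(224, 172, 105), (255, 255, 255)],  # skin-like tone, white background
    [(0, 0, 255), (198, 134, 66)],       # pure blue, darker skin-like tone
]
mask = skin_mask(image, LOWER, UPPER)
```

White entries in `mask` mark pixels treated as skin, and black entries mark everything else, mirroring the white and black pixels of the mask described above.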
C. Network Layers
The goal of this study is to design a network that can effectively classify an image of a static sign language
gesture to its equivalent text using a CNN. To this end, we used Keras and a CNN architecture
containing a set of different layers for processing and training on the data. The first convolutional layer
is composed of 16 filters, each of which has a 2 × 2 kernel. Then, a 2 × 2 pooling reduces the spatial
dimensions to 32 × 32. In the second convolutional layer the number of filters is increased from 16 to 32,
and the max pooling window is increased to 5 × 5. Then, the number of filters in the convolutional layer
is increased to 64, while max pooling stays at 5 × 5. Dropout(0.2) randomly disconnects nodes between the
current layer and the next layer with a probability of 0.2. The model is then flattened, that is, converted
into a vector, and a dense layer is added. The dense layer specifies the fully connected layer, along with
rectified linear (ReLU) activation. We finished the model with the softmax classifier, which gives the
predicted probabilities for each class label.
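How the spatial size shrinks through such a stack can be traced with simple arithmetic. The sketch below rests on several assumptions that the text does not state: convolutions use "same" padding with stride 1 (so they keep the size), pooling uses no padding with a stride equal to its window, and the input side is 64 pixels, chosen only because it is consistent with a 32 × 32 result after the first 2 × 2 pooling. The function names are hypothetical.

```python
def conv_same(size):
    """Assumed 'same' padding with stride 1: the spatial size is unchanged."""
    return size


def pool_valid(size, window, stride=None):
    """Unpadded pooling; the stride defaults to the window size."""
    stride = stride or window
    return (size - window) // stride + 1


def trace_shapes(input_size, blocks):
    """Return (side, channels) after each (n_filters, pool_window) block."""
    shapes = []
    side = input_size
    for n_filters, pool_window in blocks:
        side = pool_valid(conv_same(side), pool_window)
        shapes.append((side, n_filters))
    return shapes


# (filters, pooling window) for the three blocks described in the text.
blocks = [(16, 2), (32, 5), (64, 5)]
shapes = trace_shapes(64, blocks)
flattened = shapes[-1][0] ** 2 * shapes[-1][1]  # length of the vector after flattening
```

Under these assumptions the side length goes 64, 32, 6, 1, so the flattened vector has 64 entries. Calling `trace_shapes(50, blocks)` repeats the trace for a 50-pixel input.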
The training for character and SSL recognition was done separately; each dataset was divided into two parts:
training and testing. This was done to evaluate the performance of the algorithm used. The network was
implemented and trained with Keras, using TensorFlow as its backend, on a GT-1030 Graphics Processing
Unit (GPU).
CHAPTER 6
WORK DONE TILL NOW
PUBLICATIONS
PLAGIARISM REPORT