
COMMUNICATION SYSTEM FOR SPECIALLY ABLED PEOPLE

A PROJECT PHASE II REPORT

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS


FOR THE AWARD OF THE DEGREE OF

BACHELOR OF TECHNOLOGY

IN

INFORMATION TECHNOLOGY ENGINEERING

BY

MR. SUBODH RAVINDRA JANGALE (10303320181124610009)


MR. SHREEYANSH DEEPAK BANDAL (10303320181124613001)

UNDER THE GUIDANCE OF

PROF. A. R. BABHULGAONKAR

DEPARTMENT OF INFORMATION TECHNOLOGY


DR. BABASAHEB AMBEDKAR TECHNOLOGICAL UNIVERSITY
LONERE-402103, TAL.-MANGAON, Dist.-RAIGAD (M.S.) INDIA.
2021-2022
DECLARATION

We, the undersigned, hereby declare that the Project Phase II report on COMMUNICATION SYSTEM FOR SPECIALLY ABLED PEOPLE, submitted in partial fulfillment of the requirements for the award of the degree of B. Tech. of the Dr. Babasaheb Ambedkar Technological University, Lonere, is a bonafide work done by us under the supervision of Prof. A. R. Babhulgaonkar. This submission represents our ideas in our own words, and where ideas or words of others have been included, we have adequately and accurately cited and referenced the original sources. We also declare that we have adhered to the ethics of academic honesty and integrity and have not misrepresented or fabricated any data, idea, fact, or source in our submission. We understand that any violation of the above will be a cause for disciplinary action by the University and can also evoke penal action from the sources which have not been properly cited or from which proper permission has not been obtained.

Mr. Subodh Ravindra Jangale (10303320181124610009)


Mr. Shreeyansh Deepak Bandal (10303320181124613001)

Place: Dr Babasaheb Ambedkar Technological University Lonere


Date: 12 July 2022

DR BABASAHEB AMBEDKAR TECHNOLOGICAL UNIVERSITY

CERTIFICATE

This is to certify that the Project Phase II report on "Communication System for Specially Abled People", submitted by Mr. Subodh Ravindra Jangale (10303320181124610009) and Mr. Shreeyansh Deepak Bandal (10303320181124613001) in partial fulfillment of the requirements of the degree of Bachelor of Technology in Information Technology of the Dr. Babasaheb Ambedkar Technological University, Lonere, is a bonafide work carried out during the academic year 2021-2022.

Prof. A. R. Babhulgaonkar Dr. Sanjay R. Sutar


(Guide) (Head of Department)
Department of Information Technology Department of Information Technology

EXAMINER:
1.

2.
Date: 12 July 2022
Place: Lonere

ACKNOWLEDGEMENT

This work is not just an individual contribution; many people helped bring it to completion, and we take this opportunity to thank them all. A special thanks goes to our guide, Prof. A. R. Babhulgaonkar, for leading us to many new insights and for encouraging us and teaching us how to get to the root of a problem.

We wish to express our sincere thanks to the Head of the Department of Information Technology, Dr. S. R. Sutar. We are also grateful to the department faculty and staff members for their support.

We would also like to thank all our friends and well-wishers for their support during the demanding period of this work, and our wonderful colleagues for listening to our ideas, asking questions, and providing feedback and suggestions for improving them.

Mr. Subodh Ravindra Jangale


Mr. Shreeyansh Deepak Bandal

ABSTRACT

According to the World Health Organization (WHO), 466 million people across the world have disabling hearing loss, of whom over 34 million are children. There are only about 250 certified sign language interpreters in India for a deaf population of around 7 million. Given these statistics, the need for a tool that enables a smooth flow of communication between hearing people and people with speech or hearing impairments is very high. Our application supports a two-way conversation: it deploys machine learning and deep learning models to convert sign language into speech and text, while the other party can either speak or type a response, which is then shown to the impaired user as text. The client can also use the tutorials to learn the basic functioning of the application and of ASL. The system eliminates the need for an interpreter, and the traditional pen-and-paper methods can be discarded. By automating communication, the application provides a solution to the hurdles faced by hearing/speech impaired people.

Contents

1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Literature Survey 3

3 Problem Definition 5

4 Proposed System 6

5 Existing Systems 8

6 System Specification 10
6.1 System Requirement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
6.1.1 Hardware Requirements . . . . . . . . . . . . . . . . . . . . . . 10
6.1.2 Software Requirements . . . . . . . . . . . . . . . . . . . . . . . 10
6.2 System Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

7 System Design 12
7.1 Modules in the system . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
7.2 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
7.3 Use Case Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
7.4 Activity Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

8 Implementation 16
8.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
8.2 Proposed Hand Gesture Recognition System . . . . . . . . . . . . . . . . 17
8.2.1 Camera module . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
8.2.2 Detection module . . . . . . . . . . . . . . . . . . . . . . . . . . 18
8.2.3 Interface module . . . . . . . . . . . . . . . . . . . . . . . . . . 19

8.3 Proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
8.3.1 MediaPipe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
8.3.2 Noise removal and Image smoothening . . . . . . . . . . . . . . 21
8.3.3 Long Short Term Memory(LSTM) . . . . . . . . . . . . . . . . . 22
8.3.4 Thresholding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
8.3.5 Contour Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 24
8.3.6 Convex hull and Convexity defects . . . . . . . . . . . . . . . . . 25
8.3.7 Haar Cascade Classifier . . . . . . . . . . . . . . . . . . . . . . 26
8.3.8 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
8.3.9 Firebase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
8.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

9 Conclusion and Future Scope 30


9.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
9.2 Future Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

References 32

List of Figures

4.1 ASL gestures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

7.1 System Architecture for Sign Language Recognition Using Hand Ges-
tures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
7.2 Use Case Diagram for Sign Language Recognition Using Hand Gestures. 14
7.3 Activity Diagram for Sign Language Recognition Using Hand Gestures. . 15

8.1 Back End Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . 18


8.2 Proposed method for our gesture recognition system. . . . . . . . . . . . 20
8.3 Hand landmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
8.4 MediaPipe hands solution graph . . . . . . . . . . . . . . . . . . . . . . 21
8.5 Process of cropping and converting RGB input image to grey scale . . . . 22
8.6 LSTM working . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
8.7 Front end window that shows the thresholded version of the input gesture 25
8.8 Contour extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
8.9 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
8.10 The static gestures used in the gesture recognition system . . . . . . . . . 28

Chapter 1

Introduction

1.1 Overview
One of the major problems faced by people who are unable to speak is that they cannot express their emotions freely. They cannot use the voice recognition and voice search systems available in smartphones, so audio-based results cannot be retrieved. They are also unable to use AI personal assistants such as Google Assistant or Apple's Siri, because those applications are controlled by voice.
There is a need for platforms that serve such people. American Sign Language (ASL) is a complete, complex language that employs signs made by moving the hands combined with facial expressions and postures of the body. It is the primary language of many North Americans who are unable to speak and is one of several communication alternatives used by people who are deaf or hard of hearing.
While sign language is essential for deaf-mute people to communicate, both with hearing people and among themselves, it still receives little attention from the hearing population. The importance of sign language tends to be ignored unless it directly concerns individuals who are deaf-mute. One of the ways to talk with deaf-mute people is to use the mechanisms of sign language.
Hand gestures are one of the methods used in sign language for non-verbal communication. They are most commonly used by people with hearing or speech impairments to communicate among themselves or with others. Various sign language systems have been developed by manufacturers around the world, but they are neither flexible nor cost-effective for the end users.

1.2 Scope
One of the solutions for communicating with deaf-mute people is to use the services of a sign language interpreter, but interpreters can be expensive. A cost-effective solution is therefore required so that deaf-mute and hearing people can communicate normally and easily.
Our strategy is to implement an application that detects predefined American Sign Language (ASL) gestures through hand movements. For detecting the gestures, only a basic hardware component such as a camera and its interfacing is required. The application is a comprehensive, user-friendly system built on the PyQt5 module.
Instead of using technology such as gloves or a Kinect, we try to solve this problem using state-of-the-art computer vision and machine learning algorithms.
The application comprises two core modules: the first simply detects a gesture and displays the appropriate alphabet, while the second stores the scanned frame into a buffer after a certain interval so that a string of characters can be generated, forming a meaningful word.
Additionally, an add-on facility is available where a user can build their own custom gesture for a special character such as a period (.) or any other delimiter, so that whole sentences and paragraphs can be formed. The predicted output is also stored in a .txt file.

1.3 Objectives
1. Eliminate the need for an interpreter.

2. Ease the communication flow for hearing/speech impaired people through our model
predictions and text to speech system.

3. Ability to create new signs for any text or sentence in the browser (client side).

Chapter 2

Literature Survey

Many classification methods are available, and many papers have been published on sign language recognition and gesture classification. Based on these, a literature survey was carried out: we searched for papers in IEEE transactions and went through them.

We have studied the following papers:

[1] S. Suresh, H. T. P. Mithun and M. H. Supriya, "Sign Language Recognition System Using Deep Neural Network," 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS), 2019, pp. 614-618, doi: 10.1109/ICACCS.2019.8728411.

In the current fast-moving world, human-computer interaction (HCI) is one of the main contributors towards the progress of the country. Since conventional input devices limit the naturalness and speed of human-computer interaction, sign language recognition systems have gained a lot of importance. Different sign languages can be used to express intentions and intonations or to control devices such as home robots. The main focus of this work is to create a vision-based system, a Convolutional Neural Network (CNN) model, to identify six different sign languages from the images captured.

[2] Suharjito, R. Anderson, F. Wiryana, M. Ariesta and I. G. P. Kusuma Negara, "Sign Language Recognition Application Systems for Deaf-Mute People: A Review Based on Input-Process-Output," Procedia Computer Science, vol. 116, pp. 441-448, 2017, doi: 10.1016/j.procs.2017.10.028.
Sign Language Recognition is a breakthrough for helping deaf-mute people and has been researched for many years. Unfortunately, every piece of research has its own limitations and is still unable to be used commercially. Some of the research has been successful in recognizing sign language, but requires an expensive setup to be commercialized. Nowadays, researchers are paying more attention to developing Sign Language Recognition that can be used commercially. Researchers carry out their work in various ways, starting from the data acquisition methods. The data acquisition method varies because of the cost of a good device, whereas a cheap method is needed for a Sign Language Recognition system to be commercialized. The methods used in developing Sign Language Recognition also vary between researchers. Each method has its own strengths and limitations compared to the others, and researchers are still using different methods to develop their own Sign Language Recognition systems.

[3] G. A. Rao, K. Syamala, P. V. V. Kishore and A. S. C. S. Sastry, "Deep convolutional neural networks for sign language recognition," 2018 Conference on Signal Processing And Communication Engineering Systems (SPACES), 2018, pp. 194-197, doi: 10.1109/SPACES.2018.8316344.

Extraction of complex head and hand movements along with their constantly changing shapes for recognition of sign language is considered a difficult problem in computer vision. This paper proposes the recognition of Indian sign language gestures using a powerful artificial intelligence tool, convolutional neural networks (CNN). Selfie-mode continuous sign language video is the capture method used in this work, where a hearing-impaired person can operate the SLR mobile application independently. Due to the non-availability of datasets on mobile selfie sign language, the authors created a dataset with five different subjects performing 200 signs from 5 different viewing angles under various background environments. Each sign occupied 60 frames or images in a video. CNN training is performed with 3 different sample sizes, each consisting of multiple sets of subjects and viewing angles. The remaining 2 samples are used for testing the trained CNN. Different CNN architectures were designed and tested with the selfie sign language data to obtain better recognition accuracy.

Chapter 3

Problem Definition

”To design a system for sign language recognition using hand gestures.”

The traditional methods of communicating with deaf and mute people are not convenient in many respects, and the alternatives available to break this barrier have definite flaws. An interpreter is not always available, and that method is not cost-efficient either. The pen-and-paper method is unprofessional and time consuming. Texting and messaging work to a certain extent but still do not tackle the bigger problem at hand. This has created a grave need to develop a solution that breaks the communication barrier effectively.
Given a hand gesture, the aim is to implement an application that detects predefined American Sign Language (ASL) gestures in real time, provides the facility to store the detected characters in a .txt file, and allows users to build customized gestures, so that people who are unable to speak can communicate with technological assistance and the barrier to expressing themselves can be overcome.

Chapter 4

Proposed System

The proposed study aims to develop a system that will recognize static sign gestures and convert them into corresponding words. A vision-based approach using a web camera is introduced to obtain the data from the signer, and the system can be used offline. The purpose of the system is to serve as a learning tool for those who want to know more about the basics of sign language, such as alphabets, numbers, and common static signs. A white background and a specific location for image processing of the hand are used, thus improving the accuracy of the system, and a Convolutional Neural Network (CNN) and Long Short Term Memory (LSTM) network are used as the recognizers of the system. The scope of the study includes basic static signs, numbers, ASL alphabets (A-Z) and gestures. One of the main features of this study is the ability of the system to create simple words by fingerspelling and to understand gestures without the use of sensors and other external technologies.

For the purpose of the study, we used some of the gestures in ASL. Fig. 4.1 shows the ASL gestures that are fed to the system; it includes gestures such as hello, bye, and how are you. ASL is also strict about the angle of the hands while gesturing, and the accompanying facial movement may differ for another gesture, which can affect the accuracy of the system.

Figure 4.1: ASL gestures

Chapter 5

Existing Systems

1. The first sign-language glove to gain any notoriety came out in 2001. A high-school
student from Colorado, Ryan Patterson, fitted a leather golf glove with 10 sensors
that monitored finger position, then relayed finger spellings to a computer which
rendered them as text on a screen. In 2002, the public-affairs office of the National
Institute on Deafness and Other Communicative Disorders effused about Patterson's
glove. However, the glove does not translate anything beyond individual letters, certainly
not the full range of signs used in American Sign Language, and it works only with the
American Manual Alphabet.

2. MotionSavvy is building a tablet which detects when a person is using ASL and
converts it to text or voice. The software also has voice recognition through the
tablet’s mic, which allows a hearing person to respond with voice to the person
signing. It then converts their voice into text, which the hearing-impaired receiver
can understand.

3. The application LingoJam translates only alphabetically. Manual sign language, i.e.
fingerspelling, is followed rather than the full sign language; each letter is translated
as it is and displayed as text only.

4. A Netherlands-based start-up has developed an artificial intelligence (AI) powered


smartphone app for deaf and mute people, which it says offers a low-cost and su-
perior approach to translating sign language into text and speech in real time. The
easy-to-use innovative digital interpreter dubbed as ”Google translator for the deaf
and mute” works by placing a smartphone in front of the user while the app trans-
lates gestures or sign language into text and speech.

5. The CAS-PEAL face database [14] was developed by the Joint Research and Development Laboratory (JDL) for Advanced Computer and Communication Technologies of the Chinese Academy of Sciences (CAS), under the support of the Chinese National Hi-Tech Program and ISVISION Tech. Co. Ltd. The CAS-PEAL face database was constructed to provide researchers with a large-scale Chinese face database for studying, developing, and evaluating their algorithms. The CAS-PEAL large-scale face images with different sources of variation, namely Pose, Expression, Accessories, and Lighting (PEAL), were used to advance state-of-the-art face recognition technologies. The database contains 99,594 images from 1,040 individuals (595 males and 445 females). For each subject, nine equally spaced cameras arranged in a horizontal semicircular layout were set up to capture images across different poses in one shot. Each subject was also asked to look up and down so that 18 further images could be captured in another two shots. The developers also considered five kinds of expressions, six kinds of accessories (three goggles and three caps), and fifteen lighting directions, along with varying backgrounds, distances from the cameras, and aging.

Chapter 6

System Specification

6.1 System Requirement

6.1.1 Hardware Requirements

• Intel Core i3 3rd gen processor or later

• 512 MB disk space.

• 512 MB RAM.

• Any external or built-in camera with a minimum resolution of 200 x 200 pixels (300 ppi
or 150 lpi); 4-megapixel cameras and up are recommended.

6.1.2 Software Requirements


• Microsoft Windows XP or later / Ubuntu 12.04 LTS or later / Mac OS X 10.1 or later.

• Python Interpreter (3.6).

• TensorFlow framework, Keras API.

• PyQt5, Tkinter modules.

• Python OpenCV2, scipy, qimage2ndarray, winGuiAuto, pypiwin32, sys, keyboard,
pyttsx3, pillow libraries.

6.2 System Features

• User-friendly GUI built using the industry-standard PyQt5 framework.

• Real-time detection of American Sign Language characters based on the gestures made by the user.

• Customized gesture generation.

• Formation of a stream of sentences based on the gestures made after a certain interval of
time.

Chapter 7

System Design

7.1 Modules in the system

• Data Pre-Processing – In this module, a binary image of the object detected in front of
the camera is generated: the object is filled with solid white and the background with
solid black. Based on the pixel regions, numerical values in the range 0 or 1 are passed
on to the next module.

• Scan Single Gesture – A gesture scanner is presented to the end user, who performs a
hand gesture in front of it. Based on the output of the pre-processing module, the user
sees the label associated with each hand gesture, according to the predefined American
Sign Language (ASL) standard, inside the output window.

• Create Gesture – A user gives a desired hand gesture as input to the system and types
the text they wish to associate with that gesture into the text box available at the bottom
of the screen. This customized gesture is then stored for future use and will be detected
from then on.

• Formation of a Sentence – A user can select a delimiter; until that delimiter is
encountered, every scanned gesture character is appended to the previous results,
forming a stream of meaningful words and sentences.

• Exporting – A user can export the results of the scanned characters into an ASCII
standard text file.

7.2 System Architecture

Figure 7.1: System Architecture for Sign Language Recognition Using Hand Gestures.

7.3 Use Case Diagram

Figure 7.2: Use Case Diagram for Sign Language Recognition Using Hand Gestures.

7.4 Activity Diagram

Figure 7.3: Activity Diagram for Sign Language Recognition Using Hand Gestures.

Chapter 8

Implementation

The basic goal of Human Computer Interaction is to improve the interaction between users
and computers by making the computer more receptive to user needs. Human Computer
Interaction with a personal computer today is not just limited to keyboard and mouse
interaction. Interaction between humans comes from different sensory modes like gesture,
speech, facial and body expressions. Being able to interact with the system naturally is
becoming ever more important in many fields of Human Computer Interaction. Both non-
vision and vision based approaches have been used to achieve hand gesture recognition.
An example of a non-vision based approach is the detection of finger movement with a
pair of wired gloves. In general vision based approaches are more natural as they require
no hand devices. Theoretically, the literature classifies hand gestures into two types: static
and dynamic gestures. Static hand gestures can be defined as gestures where the position
and orientation of the hand in space do not change for an amount of time; if there are
changes within the given time, the gestures are called dynamic gestures. Dynamic hand
gestures include gestures like waving of the hand, while static hand gestures include
joining the thumb and the forefinger to form the "Ok" symbol.

8.1 Related work


The literature survey conducted provides an insight into the different methods that can
be adopted and implemented to achieve hand gesture recognition. It also helps in under-
standing the advantages and disadvantages associated with the various techniques. The
literature survey is divided into two main phases i.e. the camera module and the detection
module. The camera module identifies the different cameras and markers that can be used.
The detection module deals with the pre-processing of the image and feature extraction.
The commonly used methods of capturing input from the user are data gloves, hand belts and cameras. One approach to gesture recognition extracts input through data gloves. In another, a hand belt with a gyroscope, an accelerometer and Bluetooth was deployed to read hand movements. Other authors used a Creative Senz3D camera to capture both colour and depth information, a Bumblebee2 stereo camera, or a monocular camera. Cost-efficient models like [8] and [9] have implemented their systems using simple web cameras. Some methods make use of a Kinect depth/RGB camera to capture the colour stream, since depth cameras provide additional depth information for each pixel (depth images) at frame rate along with the traditional images. Most technologies allow a hand region to be extracted robustly by utilizing the colour space, but these do not fully solve the background problem. The background problem has been resolved by using a black and white pattern of augmented reality markers (a monochrome glove). While inbuilt webcams do not give depth information, they require less computing cost. Hence, in our model we used the webcam available in the laptop without any additional cameras or hand markers such as gloves.
A large number of methods have been utilized for pre-processing the image, including algorithms and techniques for noise removal, edge detection and smoothening, followed by different segmentation techniques for boundary extraction, i.e. separating the foreground from the background. Some authors used a morphology algorithm that performs image erosion and image dilation to eliminate noise, and a Gaussian filter to smoothen the contours after binarization. To perform segmentation, in [6] a depth map was calculated by matching the left and right images with the SAD (Sum of Absolute Differences) algorithm. In [6], the Theo Pavlidis algorithm, which visits only the boundary pixels, was used to find the contours; this method brings down the computational cost. The biggest contour was then chosen as the contour of the hand palm, after which the contour was simplified using polygonal approximation. Classification is a process in which individual items are grouped based on the similarity between them. One approach uses a Euclidean distance based classifier to recognise 25 hand postures; a Support Vector Machine (SVM) classifier has also been used. We deviate from these traditional methods by not using any hand markers such as gloves for gesture recognition. In our model, we used the webcam available in the laptop without any additional cameras, making the system cost-effective. Thus our system finds applications in day-to-day use.

8.2 Proposed Hand Gesture Recognition System


The overall system consists of two parts, the back end and the front end. The back-end system
consists of three modules: the camera module, the detection module and the interface module, as
shown in Fig. 8.1. They are summarized as follows:

Figure 8.1: Back End Architecture.

8.2.1 Camera module


This module is responsible for connecting to and capturing input through different types of image detectors and sends the captured images to the detection module for processing in the form of frames. The commonly used methods of capturing input are data gloves, hand belts and cameras. In our system, we use the inbuilt webcam, which is cost-efficient, to recognize both static and dynamic gestures. The system also allows input from a USB-based webcam, but this would require some expenditure from the user. The image frames obtained form a video stream.
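A minimal sketch of this capture loop is shown below, assuming OpenCV's VideoCapture with device index 0 for the inbuilt webcam (a USB webcam would typically appear at index 1):

import cv2

cap = cv2.VideoCapture(0)          # 0 = inbuilt webcam; an external USB camera is usually index 1
while cap.isOpened():
    ret, frame = cap.read()        # one BGR frame of the video stream
    if not ret:
        break
    # each frame would be handed to the detection module here
    cv2.imshow("Camera module", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()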

8.2.2 Detection module


This module is responsible for the image processing. The output from the camera module is subjected to different image processing techniques such as colour conversion, noise removal and thresholding, after which the image undergoes contour extraction. If the image contains convexity defects, they are used to detect the gesture; if there are no defects, the image is classified using a Haar cascade to detect the gesture. In the case of dynamic gestures, the detection module does the following: if Microsoft PowerPoint has been launched with a slideshow enabled and the webcam detects a palm in movement for 5 continuous frames, the dynamic swipe gesture is detected.

8.2.3 Interface module
This module is responsible for mapping the detected hand gestures to their associated ac-
tions. These actions are then passed to the appropriate application. The front end consists
of three windows. The first window consists of the video input that is captured from the
camera with the corresponding name of the gesture detected. The second window dis-
plays the contours found within the input images. The third window displays the smooth
thresholded version of the image. The advantage of adding the threshold and contour
window as a part of the Graphical User Interface is to make the user aware of the back-
ground inconsistencies that would affect the input to the system and thus they can adjust
their laptop or desktop web camera in order to avoid them. This would result in better
performance.

8.3 Proposed method
We propose a markerless gesture recognition system that follows the methodology shown in Fig. 8.2.

Figure 8.2: Proposed method for our gesture recognition system.

8.3.1 MediaPipe
Hand signs are tracked with the help of MediaPipe. MediaPipe powers many products and services we use daily. Unlike power-hungry machine learning frameworks, MediaPipe requires minimal resources; it is so small and efficient that even embedded IoT devices can run it.
MediaPipe is a framework for building machine learning pipelines for processing time-series data such as video and audio. This cross-platform framework works on desktop/server, Android, iOS, and embedded devices like the Raspberry Pi and Jetson Nano.

Graphs

The MediaPipe perception pipeline is called a graph. Take the example of the first solution, Hands: we feed a stream of images as input, and the graph outputs the images with the hand landmarks rendered on them.
The flow chart below represents the MediaPipe Hands solution graph.

Figure 8.3: Hand landmarks

Figure 8.4: MediaPipe hands solution graph

Calculators

Packets of data (video frames or audio segments) enter and leave through the ports of a calculator. When a calculator is initialized, it declares the packet payload type that will traverse each port. Every time a graph runs, the framework calls the Open, Process and Close methods of its calculators: Open initializes the calculator, Process runs repeatedly whenever a packet enters, and Close is called after an entire graph run. For example, a calculator such as ImageTransform takes an image at its input port and returns a transformed image at its output port, while a calculator such as ImageToTensor takes an image as input and outputs a tensor.
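As an illustration of how the system consumes this solution, the sketch below feeds webcam frames to MediaPipe Hands and draws the returned landmarks; the single-hand limit and the confidence threshold are assumed values, not ones prescribed in this report.

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB images, while OpenCV captures BGR frames.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            for hand in results.multi_hand_landmarks:
                # draw the 21 landmarks and their connections on the frame
                mp_draw.draw_landmarks(frame, hand, mp_hands.HAND_CONNECTIONS)
        cv2.imshow("MediaPipe Hands", frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
cap.release()
cv2.destroyAllWindows()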

8.3.2 Noise removal and Image smoothening


The input image, which is in RGB color space, is cropped to a size of 300 x 300 pixels and then converted into a grayscale image. This process is shown in Fig. 8.5.

Figure 8.5: Process of cropping and converting RGB input image to grey scale

Noise in images can be defined as a random variation of brightness or colour information that is usually produced during image acquisition from the webcam. This noise is an undesirable aspect of the image and needs to be removed. To do this, a Gaussian filter is applied. Gaussian filtering is performed by convolving a Gaussian kernel with each point in the input array and summing the results to produce the output array. A 2D Gaussian kernel can be represented mathematically as:

G_0(x, y) = A \exp\!\left( -\frac{(x-\mu_x)^2}{2\sigma_x^2} - \frac{(y-\mu_y)^2}{2\sigma_y^2} \right)
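A minimal OpenCV sketch of this preprocessing step is given below; the crop coordinates for the 300 x 300 region and the 5 x 5 kernel size are assumptions made only for illustration.

import cv2

def preprocess(frame):
    """Crop the capture region, convert it to grayscale and smooth it."""
    roi = frame[0:300, 0:300]                     # assumed location of the 300 x 300 region
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)  # colour image -> grayscale
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)   # Gaussian kernel suppresses acquisition noise
    return blurred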

8.3.3 Long Short Term Memory(LSTM)


LSTMs are a type of recurrent neural network, but instead of simply feeding its output into the next part of the network, an LSTM performs a set of additional operations that give it a better memory. For example, if a neural net predicted an output (y) based on an input (x), normally (y) would be output and never used again by the network. RNNs (recurrent neural networks), in contrast, continue using past information to help increase the performance of the model.

Figure 8.6: LSTM working

An LSTM has four "gates": forget, remember, learn and use (or output).
Step 1: When the inputs enter the LSTM, they go into either the forget gate or the learn gate. The long-term information goes into the forget gate, where the irrelevant parts are forgotten. The short-term information and the current event go into the learn gate, which decides what information will be learned.
Step 2: Information that passes the forget gate (i.e. is not forgotten) and information that passes the learn gate (i.e. is learned) go to the remember gate, which makes up the new long-term memory, and to the use gate, which updates the short-term memory and is the output of the network.
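A minimal Keras sketch of such an LSTM-based recogniser is given below; the sequence length, the per-frame feature size (21 MediaPipe landmarks x 3 coordinates) and the number of gesture classes are assumed values used only to make the example concrete.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Assumed shapes: 30 frames per gesture, 63 features per frame, 7 gesture classes.
NUM_FRAMES, NUM_FEATURES, NUM_CLASSES = 30, 63, 7

model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(NUM_FRAMES, NUM_FEATURES)),
    LSTM(32),                                   # final hidden state summarises the sequence
    Dense(32, activation="relu"),
    Dense(NUM_CLASSES, activation="softmax"),   # one probability per gesture class
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Training would use arrays shaped (num_samples, NUM_FRAMES, NUM_FEATURES)
# with one-hot encoded labels, e.g. model.fit(X_train, y_train, epochs=50).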

8.3.4 Thresholding
Thresholding, which is a simple segmentation method, is then carried out. Thresholding is applied to obtain a binary image from the grayscale image. The thresholding technique compares each pixel intensity value (I) with a threshold value (T): if I < T, the pixel is replaced with a black pixel, and if I > T, it is replaced with a white pixel. A threshold value T of 127 is used in our work to classify the pixel intensities in the grayscale image, and a maximum value of 255 is assigned to any pixel that passes the threshold. The two types of thresholding implemented are Inverted Binary Thresholding and Otsu's Thresholding. Inverted Binary Thresholding inverts the colors so that the object appears white on a black background. This thresholding operation can be expressed as shown in Eqn.

\mathrm{dst}(x, y) = \begin{cases} 0, & \text{if } \mathrm{src}(x, y) > T \\ \mathrm{maxVal}, & \text{otherwise} \end{cases}

So, if the pixel intensity src(x, y) is greater than the threshold value T, the new intensity of the pixel is set to 0; otherwise, the pixel is set to maxVal. Otsu's method [20], proposed by Nobuyuki Otsu, performs clustering-based image thresholding: Otsu binarization automatically calculates a threshold value from the image histogram of a bimodal image, i.e. an image whose histogram has two peaks. In Otsu's method we try to find the threshold that minimizes the intra-class variance (the variance within each class), defined as a weighted sum of the variances of the two classes as seen in Eqn.

\sigma_w^2(t) = \omega_0(t)\,\sigma_0^2(t) + \omega_1(t)\,\sigma_1^2(t),
\qquad \omega_0(t) = \sum_{i=0}^{t-1} p(i), \quad \omega_1(t) = \sum_{i=t}^{L-1} p(i)

Otsu shows that minimizing the intra-class variance and maximizing the inter-class variance generate the same result, as seen below in Eqn.

\sigma_b^2(t) = \sigma^2 - \sigma_w^2(t)
= \omega_0(t)\,(\mu_0(t) - \mu_T)^2 + \omega_1(t)\,(\mu_1(t) - \mu_T)^2
= \omega_0(t)\,\omega_1(t)\,[\mu_0(t) - \mu_1(t)]^2

This is expressed in terms of the class probabilities \omega and the class means \mu. The class means \mu_0(t), \mu_1(t) and \mu_T can be expressed as shown in Eqn.

\mu_0(t) = \sum_{i=0}^{t-1} \frac{i\,p(i)}{\omega_0(t)},
\qquad \mu_1(t) = \sum_{i=t}^{L-1} \frac{i\,p(i)}{\omega_1(t)},
\qquad \mu_T = \sum_{i=0}^{L-1} i\,p(i)

The following relations in Eqn. can be easily verified:

\omega_0\,\mu_0 + \omega_1\,\mu_1 = \mu_T, \qquad \omega_0 + \omega_1 = 1

The class probabilities and means can be computed iteratively, which yields an effective algorithm. Before finding contours, thresholding is applied to obtain the binary image and achieve higher accuracy. Fig. 8.7 shows the front-end window that displays the thresholded version of the user's gesture input.
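A short OpenCV sketch of the two thresholding variants described above (a fixed inverted-binary threshold at T = 127 and Otsu's automatically selected threshold) might look as follows:

import cv2

def threshold_hand(gray):
    """Binarise the smoothed grayscale ROI so the hand appears white on black."""
    # Fixed inverted-binary threshold at T = 127, maximum value 255.
    _, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY_INV)
    # Alternatively, let Otsu's method pick the threshold from the image histogram.
    _, otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    return binary, otsu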

8.3.5 Contour Extraction


Contours are a useful tool for object detection and recognition in image processing. In our
work, we have used contours to detect and recognize the hand against the background. The
curves that link continuous points of the same color are called contours. Finding the contours
is the first step, which in OpenCV amounts to finding a white object on a black background;
hence, Inverted Binary Thresholding has been utilized during thresholding.

Figure 8.7: Front end window that shows the thresholded version of the input gesture

The second step is to draw the contours, which can be used to draw any shape provided
the boundary points are known. Some gestures in our recognition system with their
corresponding contours are shown in Fig. 8.8 below.

Figure 8.8: Contour extraction
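A brief sketch of this contour step, assuming OpenCV 4.x where findContours returns the contours and their hierarchy, is:

import cv2

def largest_hand_contour(binary):
    """Find contours in the thresholded image and keep the biggest one as the hand."""
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    hand = max(contours, key=cv2.contourArea)   # assume the largest blob is the hand
    return hand

# Drawing the result: cv2.drawContours(frame, [hand], -1, (0, 255, 0), 2)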

8.3.6 Convex hull and Convexity defects


Mathematically, the convex hull of a set X of points in any affine space is defined as the smallest convex set that contains X. Any deviation of the object from this convex hull can be considered a convexity defect. The convex hull of a finite point set S can be defined as the set of all convex combinations of its points. In a convex combination, each point x_i in S is assigned a weight \alpha_i, and these weights are used to compute a weighted average of the points. For each choice of weights, the resulting convex combination is a point in the convex hull. The convex hull can be represented mathematically as shown in Eqn.

\mathrm{Conv}(S) = \left\{ \sum_{i=1}^{|S|} \alpha_i x_i \;\middle|\; (\forall i : \alpha_i \ge 0) \wedge \sum_{i=1}^{|S|} \alpha_i = 1 \right\}
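The sketch below shows how the convex hull and its convexity defects can be obtained with OpenCV for the extracted hand contour; the depth cut-off used to count only deep defects (finger gaps) is an assumed tuning value.

import cv2

def count_convexity_defects(hand_contour):
    """Count deep convexity defects (gaps between fingers) of a hand contour."""
    hull_idx = cv2.convexHull(hand_contour, returnPoints=False)
    defects = cv2.convexityDefects(hand_contour, hull_idx)
    if defects is None:
        return 0
    count = 0
    for i in range(defects.shape[0]):
        start, end, far, depth = defects[i, 0]
        # depth is the fixed-point distance from the farthest point to the hull;
        # the 10000 cut-off is an assumed value that keeps only deep finger gaps.
        if depth > 10000:
            count += 1
    return count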

8.3.7 Haar Cascade Classifier


For gestures like palm and fist, where there are no convexity defects, a Haar cascade classifier is used. A collection of positive images, a minimum of 10 original images taken under different lighting conditions and angles, is used, and each original image is cropped to contain only the object of interest. A collection of negative images, which do not contain the object of interest, is also required, with a minimum of 1000 images. A description file for the negative images is created using the create-samples utility. Each positive image is superimposed on a minimum of 200 negative images, and a vector file is created from the superimposed images (the vector file should contain a minimum of 1500 images). Haar training then uses a minimum of 100 images of size 20 x 20 and can consist of 15 or more stages. The generated XML file is used as a cascade classifier to detect objects in OpenCV.
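Once training has produced the XML file, it can be loaded and applied per frame roughly as follows; the file name and detection parameters are placeholders, not values taken from this report.

import cv2

# Load the trained cascade (the file name here is an assumed placeholder).
palm_cascade = cv2.CascadeClassifier("palm_cascade.xml")

def detect_palm(gray_frame):
    """Return bounding boxes of palms found by the trained Haar cascade."""
    return palm_cascade.detectMultiScale(gray_frame, scaleFactor=1.1, minNeighbors=5)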

8.3.8 User Interface


The user interface is the connecting medium between the user and the machine; it should be simple, minimalist, attractive and easy to use.
Python provides the standard library Tkinter for creating graphical user interfaces for desktop applications. Tkinter has been used in our project for the desktop application, and it provides all the basic elements needed for the program's user interface.
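A minimal Tkinter sketch of such a desktop window is shown below; the widget layout and names are assumptions made for illustration only.

import tkinter as tk

root = tk.Tk()
root.title("Sign Language Recognition")

# Label that displays the most recently recognised character.
prediction_label = tk.Label(root, text="Detected: -", font=("Arial", 24))
prediction_label.pack(pady=10)

# Text box where the user types the text to associate with a custom gesture.
gesture_name = tk.Entry(root, width=30)
gesture_name.pack(pady=5)

def update_prediction(char):
    """Called by the detection loop whenever a new character is recognised."""
    prediction_label.config(text=f"Detected: {char}")

root.mainloop()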

8.3.9 Firebase
Firebase is a toolset to "build, improve, and grow an app", and the tools it provides cover a large portion of the services that developers would normally have to build themselves but would rather not, because they prefer to focus on the app experience itself. This includes analytics, authentication, databases, configuration, file storage, push messaging, and more. The services are hosted in the cloud and scale with little to no effort on the part of the developer. In our project we used Firebase as the database for storing the results of the program.
Firebase Realtime Database is a cloud-hosted database. Realtime means that any
changes in data are reflected immediately across all platforms and devices within millisec-
onds. Most traditional databases make you work with a request/response model, but the

Figure 8.9: User Interface

Realtime Database uses data synchronization and subscriber mechanisms instead of typical
HTTP requests, which allows you to build more flexible real-time apps, easily, with less
effort and without the need to worry about networking code. Many apps become unre-
sponsive when you lose the network connection. Realtime Database provides great offline
support because it keeps an internal cache of all the data you’ve queried. When there’s
no Internet connection, the app uses the data from the cache, allowing apps to remain re-
sponsive. When the device connects to the Internet, the Realtime Database synchronizes
the local data changes with the remote updates that occurred while the client was offline,
resolving any conflicts automatically. Realtime Database is a NoSQL database. NoSQL
stands for “Not only SQL”. The easiest way to think of NoSQL is that it’s a database
that does not adhere to the traditional relational database management system (RDBMS)
structure. As such, the Realtime Database has different optimizations and functionality
compared to a relational database. It stores the data in the JSON format. The entire
database is a big JSON tree with multiple nodes.
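A minimal sketch of writing a recognised sentence to the Realtime Database with the firebase_admin SDK is shown below; the service-account file, database URL and node names are placeholders.

import firebase_admin
from firebase_admin import credentials, db

# Placeholder credentials file and database URL for the project.
cred = credentials.Certificate("serviceAccountKey.json")
firebase_admin.initialize_app(cred, {
    "databaseURL": "https://example-project.firebaseio.com/"
})

def save_result(session_id, sentence):
    """Store the recognised sentence under a session node in the Realtime Database."""
    db.reference(f"results/{session_id}").set({"sentence": sentence})

save_result("demo-session", "HELLO HOW ARE YOU")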

8.4 Results
In our gesture recognition system we have included a total of seven gestures: six static gestures and one dynamic gesture. The static gestures are shown in Fig. 8.10 below. The captions written at the top of each gesture, i.e. "1", "2", denote the number of convexity defects in each gesture. For gestures that do not have any defects, i.e. fist and palm, the gesture name is written as the caption instead.

Figure 8.10: The static gestures used in the gesture recognition system

The first gesture from the left is a “V” sign or a number two sign which launches the VLC
Media Player application as shown in Fig. The second is a number three gesture and it
launches Google home page within the user’s default browser as shown in Fig. and the
third gesture which is a number four gesture launches YouTube home page. The fourth
gesture is a number five gesture or an open palm gesture which in our system closes the
application that is running in the foreground. The fifth gesture in the above image is a
closed fist that launches Microsoft PowerPoint. The sixth and final static gesture is a
closed palm which toggles the Wi-Fi of the computing apparatus.
In addition to the above-mentioned static gestures, the model also has provision for a dy-
namic gesture. When a moving closed palm gesture is recognized for 5 continuous frames,
it is considered to be a dynamic swipe motion. It is used when Microsoft PowerPoint is
running in the foreground, to swipe to the next slide within the presentation. Our first
approach to create a gesture recognition system was through the method of background
subtraction. Background subtraction, as the name suggests, is the process of separating
foreground objects from the background in a sequence of video frames. It is a widely
used approach for detecting moving objects from static cameras. When implementing
the recognition system using background subtraction, we encountered several drawbacks
and accuracy issues. Background subtraction cannot deal with sudden, drastic lighting
changes leading to several inconsistencies. This method also requires relatively many
parameters, which needs to be selected intelligently. Due to these complications faced,
we made a decision to utilize contours, convexity defects and Haar cascade to detect the
object (hand). The combination of these methods enabled us to achieve a greater range
of accuracy and overcome the challenges faced during the use of background subtrac-
tion. To compute the accuracy of our system, we conducted two sets of evaluations. In the first set, we used environments with different kinds of plain backgrounds without any inconsistencies; in the second, we used backgrounds with several inconsistencies. Each gesture was performed 10 times in both environmental setups. The average of the number of times a particular gesture was recognized correctly was taken as its accuracy in percentage, and the accuracy obtained is shown in Table 1. When implemented against a plain background, the gesture recognition system was robust and performed with good accuracy. This accuracy was maintained irrespective of the colour of the background, provided it was a plain, solid-colour background devoid of any inconsistencies. In cases where the background was not plain, objects in the background introduced inconsistencies into the image capture process, resulting in faulty outputs; thus, the accuracy was not as good as in the scenarios with a plain background. After observing the results produced by the gesture recognition system in different backgrounds, it is recommended that the system be used with a plain background to produce the best possible results and accuracy.

Chapter 9

Conclusion and Future Scope

9.1 Conclusion
In this project we have tried to address some of the major problems faced by people who are unable to speak. We found that the root cause of why they cannot express themselves freely is that the other side of the audience is not able to interpret what they are trying to say or what message they want to convey. This application therefore serves anyone who wants to learn and talk in sign language. With this application a person will quickly learn various gestures and their meanings as per the ASL standard, and which alphabet is assigned to which gesture. In addition, a custom gesture facility is provided along with sentence formation. A user need not be literate: if they know how to perform a gesture, they can quickly form it and the assigned character will be shown on the screen. Concerning the implementation, we have used the TensorFlow framework with the Keras API, and for user convenience the complete front end is designed using PyQt5. Appropriate user-friendly messages are prompted according to the user's actions, along with a window showing which gesture corresponds to which character.

9.2 Future Scope


• The system can be integrated with various search engines and texting applications such as
Google and WhatsApp, so that even illiterate people could chat with other persons or query
something from the web with the help of gestures alone.

• The project currently works on images; further development can lead to detecting the
motion of video sequences and mapping it to meaningful sentences with TTS assistance.

• The model and text to speech can be embedded into a video calling system. Thereby
allowing the user to show the gestures and the receiver on the call will receive the
message in the form of text or speech. While the receiver responds, the message
will be relayed to the hearing/speech impaired user via text (subtitles).

References

[1] Shobhit Agarwal, “What are some problems faced by deaf and dumb people while using today's common tech like phones and PCs”, 2017 [Online]. Available: https://www.quora.com/What-are-some-problems-faced-by-deaf-and-dumb-people-while-using-todays-common-tech-like-phones-and-PCs, [Accessed April 06, 2019].

[2] NIDCD, “American Sign Language”, 2017 [Online]. Available: https://www.nidcd.nih.gov/health/american-sign-language, [Accessed April 06, 2019].

[3] M. Ibrahim, “Sign Language Translation via Image Processing”, [Online]. Available: https://www.kics.edu.pk/project/startup/203 [Accessed April 06, 2019].

[4] NAD, “American Sign Language - Community and Culture Frequently Asked Questions”, 2017 [Online]. Available: https://www.nad.org/resources/american-sign-language/community-and-culture-frequently-asked-questions/ [Accessed April 06, 2019].

[5] Sanil Jain and K. V. Sameer Raja, “Indian Sign Language Character Recognition”, [Online]. Available: https://cse.iitk.ac.in/users/cs365/2015/submissions/vinsam/report.pdf [Accessed April 06, 2019].

[6] C. Hardie and D. Fahim, “Sign Language Recognition Using Temporal Classification,” arXiv, 2017.

Appendices

