
Abilities Infinite: We Choose Not to Put ‘DIS’ in Your Ability

Project Report

Submitted to
University School of Financial Studies,
Guru Nanak Dev University, Amritsar

In partial fulfilment of
Master of Business Administration (Finance)
(July, 2018 to June, 2020)

Submitted by: Charu Verma, MBA (Finance) Semester IV, USFS
Supervised by: Dr. Aparna Bhatia, Assistant Professor, USFS

Guru Nanak Dev University, Amritsar

INDEX

Chapter        Particulars

Chapter I      Introduction; Profile of Organization; Problem Definition; Rationale

Chapter II     Literature Support

Chapter III    Research Methodology

Chapter IV     Data Collection of Indian Sign Language

Chapter V      Prototype Modelling

Chapter VI     Summary, Conclusion and Recommendations

               Bibliography

ACKNOWLEDGEMENT

Working at Sabudh Foundation was a thought-provoking and very captivating experience. During these six months of internship, I came across the concepts of Machine Learning and Artificial Intelligence in Data Science.

I have to thank Dr Aparna Bhatia for supervising and advising me throughout this project report.

I am grateful to the people at Sabudh Foundation for giving me this opportunity to make this project a success. I especially want to thank Dr Sarabjot Singh Anand, Mr Taranjeet Singh and Mr Mandeep Singh for being my guides throughout the project. Further, I thank Dr Jaspal Singh and Bhavneet Bhalla for giving me the opportunity to join Sabudh Foundation. I would also like to express my special gratitude to our Head of Department, Dr Mandeep Kaur, for her immense support on behalf of our university.

Lastly, I want to thank my fellow team members and interns at Sabudh Foundation who made this demanding time joyful yet productive.

Charu Verma

Chapter I

INTRODUCTION
Stephen Hawking, Arunima Sinha, Helen Keller, Nick Vujicic, Alex Zanardi, Ralph Braun, Marlee Matlin, Jean-Dominique, Tom Shakespeare and many more set us an example by not allowing the 'dis' to be a curse in their success stories, for their ABILITIES are INFINITE.

As per the census of 2011 (updated 2016), the population of specially abled people in India is approximately 2.68 Cr out of 121 Cr, which amounts to 2.21% of the total population. Among the specially abled, about 1.5 Cr (56%) are males and about 1.18 Cr (44%) are females. In terms of area, about 1.86 Cr (69%) of the specially abled population resides in rural areas whereas only 0.81 Cr (31%) resides in urban areas. (Enabled.in, August 28, 2017)

Further, specially abled/disabled people can be visualized and analysed on the basis of their educational level, their age group, and the type of disability along with sex; finally, data on workers and non-workers with different types of disability doing either economic or non-economic activities gives a better picture of the status of specially abled people in India.

Educational level of disabled people

In the visualization below, one will see the seven bases on which the population is divided as per the census of 2011 (updated by 2016).

Table 1.1

Different level of education of Disabled people

Education Level: Description

Illiterate: person who is unable to read and write

Literate: person who is able to read and write without proper education facility

Literate but below primary: person who has studied till primary class (below first standard)

Primary but below middle: person who has studied below or till fifth standard

Middle but below matric/secondary: person who has studied above fifth but below tenth (secondary) class of high school

Matric/Secondary but below graduate: person who has studied till secondary class but below any graduation degree

Graduate and above: person who is doing graduation or any further study like post-graduation, PhD and others

With the help of the different education levels, one can analyse which sector to tap. In today's era, the technology world is working for specially abled literate people, as technology can make their lives better. It may bring a better standard of living, as they will be able to earn their own living or even help their families with it. This whole analysis helps capture new opportunities for business houses to enhance their place in the rat race while taking the societal impact into consideration. The human spirit is one's ability, perseverance and courage that no disability can take away. As it is rightly said, "Whether someone is useful only matters if you value people by their use." (Corinne Duyvis, March 2016)

Figure 1.1: Education level of disabled people

Source – enabled.in, 2017

The same is presented in a tabular format as follows:

Table 1.2

Education level of disabled people

Education level Population

Illiterate 12196641

Literate 14618353

Literate but below primary 2840345

Primary but below middle 3554858

Middle but below matric/secondary 2448070

Matric/Secondary but below graduate 3448650

Graduate and above 1246857

Source – enabled.in, 2017


As per the calculations, around 12 lac people among the specially abled population are graduates or above, further joined by about 34 lac with the diligent potential to earn their living. About 1.21 Cr disabled people are illiterate and can neither read nor write, but there is a possibility of their having a skill set. Around 24 lac people are below matric or secondary education. As Dianne Feinstein once said of 'No Child Left Behind', states and schools in the district are required to ensure that all students are learning and reading to the highest point of their potential. Here is where technology and the state could play their part.

Disabled population by Age Group in India

Internationally and nationally, age groups are generally divided into five categories, as given in Table 1.3.

Table 1.3

Division of Disabled People into different age groups (generalized)

Age Group Description

0-14 years Children

15-24 years Early working age

25-54 years Prime working age

55-65 years Mature working age

65 years and above Elderly

In our visualization, however, we divided the population into 12 age groups, giving more precise calculations for the analysis. The majority of the focus lies in the 10 to 49 age range, as a huge share of youth and prime working-age people fall in it.

Figure 1.2: Disabled population by age group in India

Source – enabled.in, 2017

As per the graph's analysis, we can come up with the following figures for the total number of persons, further broken down into males and females. The same is summarised in tabular format in Table 1.4.

Table 1.4

Disabled people’s population on the Basis of Age Group

Age-Group Persons Male Female

Total 26810557 14986202 11824355

0-4 1291332 690351 600981

5-9 1955539 1081598 873941

10-19 4616050 2610174 2005876

20-29 4189839 2418974 1770865

30-39 3635722 2112791 1522931

40-49 3115651 1851640 1264011

50-59 2492429 1430762 1061667

60-69 2657679 1394306 1263373

70-79 1769370 884872 884498

80-89 723585 337170 386415

90+ 225571 97409 128162

Age Not Stated 137790 76155 61635

Source- enabled.in, 2017

As per the statistics, the highest number of disabled people is in the age group of 10 to 19 years, which comes out to 46.2 lac, about 17% of the disabled population. Second on the list is the age group of 20 to 29 years, with about 16% of the total disabled population. The elderly age groups, mainly 60 and above (60 to 90 and above), together come to about 20% of the population. Hence, as per the analysis, work could be done for the youth so that they can take up economically gainful activities, benefiting both the economy and themselves.
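A minimal illustrative calculation of these shares, using the figures given in Table 1.4 (Python):

```python
# Shares of the disabled population by age group, taken from Table 1.4.
total = 26810557

groups = {
    "10-19": 4616050,
    "20-29": 4189839,
    # 60-69 + 70-79 + 80-89 + 90 and above
    "60+": 2657679 + 1769370 + 723585 + 225571,
}

for name, count in groups.items():
    print(f"{name}: {count / total:.1%}")
# 10-19: 17.2%   20-29: 15.6%   60+: 20.1%
```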

Disabled non-workers by type of disability and by major non-economic activity in India

In India, people are involved in various economic and non-economic activities, with titles ranging from data scientists, doctors, advocates, chartered accountants and engineers to farmers and beggars. For this report we will take into consideration specially abled people involved in major non-economic activities in India. The preponderance of specially abled people in our population can be divided into the following seven groups involved in major non-economic activities.

Table 1.5

Non-economic activities and disabled people

Non-economic activity: Description

Students: person who is studying at a school, university or other place of higher education

Household duties: person doing tasks such as cooking, cleaning, washing and ironing that have to be done regularly at home, mainly the household chores

Dependent: person requiring someone or something for financial or other support

Pensioner: person who receives a pension, especially the retirement pension

Rentier: person who is living on income from investment or property

Beggar, vagrants, etc.: a person without a settled home or regular work who wanders from place to place and lives by asking for food or money

Other: all the rest of the persons distinct from the ones already mentioned

Figure 1.3

Graph by amCharts

The meaning of disability as per the Cambridge English Dictionary is "an illness, injury or condition that makes it difficult for someone to do the things that other people do."

Disability can be divided into a number of categories; for our analysis we will be taking into account eight groups along with people involved in major non-economic activities.

Table 1.6

Types of disabilities

Type of Disability: Description

In Seeing: a person who is not able to see / is blind (partially or fully)

In Hearing: a person who cannot hear / is deaf (partially or fully)

In Speech: a person who cannot speak / is mute

In Movement: a person who is not able to move, or who has a kind of disability because of which he cannot move independently without someone's or something's support

Mental Retardation: a kind of developmental disability in which the person's abilities are considered below average, with significant limitations in daily living skills (example – IQ level low for the age, no growth in body)

Mental Illness: a condition that causes serious disorder in a person's behaviour or thinking (example – depression, anxiety disorder, etc.); any kind of disorder that affects one's mood, thinking and behaviour

Multiple Disability: a person with multiple disabilities has a combination of two or more serious disabilities (example – cognitive, movement, sensory), such as mental retardation with cerebral palsy, which is a group of disorders affecting movement and muscle tone or posture, caused by damage to the brain as it develops

Taking into consideration both curable and incurable disabilities, a person can still come up with some kind of skill needed by someone, be it an intellectual skill or a skill in the world of art.

Table 1.7

Disabled non-workers by types of disability and major non-economic activities

Disability              Student %   Household duties %   Dependent %   Pensioner %   Rentier %   Beggar/vagrant etc. %   Other %

In seeing               28          17.3                 42.7          6.7           0.2         0.4                     5.6
In hearing              32.5        18.9                 38.7          4.9           0.2         0.4                     4.7
In speech               37.2        20                   33.5          3.4           0.2         0.2                     4.6
In movement             19.7        13.4                 49.8          8.8           0.3         0.2                     5.4
Mental retardation      24.5        9.6                  57.7          2.1           0.2         0.6                     7.4
Mental illness          9.3         11.9                 66.6          2.8           0.2         0.5                     5.4
Multiple disability     37.4        17.7                 35.4          3.2           0.2         1                       8.2
Others                  15          7.3                  65.9          6.8           0.2         0.3                     5.8

Source – calculations by author

Among specially abled non-workers with a disability in seeing, 42.7% are dependents whereas 28% are students, and these two groups have potential to be tapped; among those with a disability in hearing, 32.5% are students and 38.7% are dependents. Similarly, 37.2% of those with a disability in speech and 19.7% of those with a disability in movement are students. Hence the student base reflects potential for new investments and societal benefit.

Disability Is Not the Problem, Accessibility Is

Taking into consideration the different disabilities and the economic and non-economic activities of disabled people, work needs to be done keeping pace with technologies like machine learning and artificial intelligence. Advances in artificial intelligence have spurred the development of many smart devices to help people overcome cognitive and physical challenges. Artificial intelligence has not been around for long enough to tackle every opportunity, but one of the leading names in the IT industry, Microsoft, wants to speed up this innovation process with a program called "AI for Accessibility", announced in 2018 with $25 million in funding. Its goals include developing more AI within the company and giving out grants to others who want to build tools for disabled communities. (Snow Jackie, January 2019)

If one's disability cannot be treated, at least we can try to make every single thing accessible to each specially abled individual, so that he does not feel the 'DIS' to be a hurdle in his journey of having many abilities. Technology has been opening doors for individuals with disabilities for a long time, from motorized scooters and hearing aids to artificial limbs. Artificial Intelligence is now taking that ladder to another level, from talking machines to reading aids for blind people, and Machine Learning is turning that tomorrow into today. For business houses, with a target group of more than one billion people with disabilities around the globe, there is plenty of work to be done and a large market to tap.

"They are our customers, our friends, everybody," said Jenny Lay-Flurrie, the Chief Accessibility Officer at Microsoft, who herself is deaf. Many big names like Microsoft have started working for the specially abled sections of society using Machine Learning and Artificial Intelligence, coming up with innumerable opportunities and a mixed bag of products for them.

Talking about products launched by Microsoft for specially abled people, one of the most used is SEEING AI.

Figure 1.4: Seeing AI

Source – Microsoft, 2017

Seeing AI is an application for visually impaired people that narrates the world around the person. With this intelligent camera application, one just needs to hold up the phone to hear information about the scene around, just like a scene description for a blind person. The application can speak short text as soon as it appears in front of the camera, provides audio guidance to capture a printed page, and recognizes and narrates the text along with its original formatting. It also comes with additional features like scanning barcodes, identifying products, describing people around you along with their facial expressions, recognizing currency and describing photos on screen, to add to the list.

It is not just big companies innovating and investing in this space: over 100,000 deaf and hard of hearing individuals have used AVA, an application that allows them to participate in group conversations in English or French (with limited support for Spanish, Italian, German and Russian). Everyone engaged in the conversation opens AVA on their phone and then speaks as usual while the application listens in. AVA converts the spoken words into text in near real time so that each person can read what is being said by the others, and it renders each speaker's words in a different colour so that the person finds it easy to read along. The AVA application is operated by a group of 10 people named 'Transcene.Inc'.

AVA came up with the motto: "We are AVA and we believe total accessibility is possible."

They are a small team from around the world working to break down communication barriers between the deaf and hearing worlds. They think of their product, their business and their company with empathy, and they aim to go beyond surface-level interactions, beyond the barriers.

PROFILE OF ORGANIZATION
Sabudh Foundation came up as a seedling organization parented by Tatras Data Consultancy in collaboration with BML Munjal University, Guru Nanak Dev Engineering College, Punjab University, Prodigy Numbers (IT and Analytics), BJFI – Bhai Jaitajee Foundation India, Punjab Police, Time sys, Experfy, Innsential, Twistle India, PEC – explore innovate excel, Punjab Engineering College (deemed to be university), Chandigarh, and Punjab University, Patiala. Sabudh Foundation is a non-profit organization formed by leading Data Scientists of the industry in association with the Punjab Government, with an objective to bring together data and young data scientists to work on focused, collaborative projects for social benefit. (Sabudh.org, 2017)

They came up with an idea to enable the youth and coming generations to use the power of AI technologies for the greater good of our society by working on real-life problems and projects in partnership with non-profit organizations and governmental agencies, in order to tackle data-intensive, high-impact problems in fields like education, public policy, disability, agriculture, fashion recommender systems and many more aspects that are in the pipeline. Sabudh Foundation is currently working on five projects, some of which were already in process whereas others started from ideas and the needs of our customers/society. One such project is 'EDUCOLLAB', a mobile application promoting peer-to-peer learning and coming up with new ideas to give a tough fight to existing players in the market like Brainly, BYJU's, Khan Academy, Chegg and many more. Educollab is India's first AI-driven peer-to-peer learning application for the age group of 6 to 16 years, and it was a finalist project at the Young Founder Summit at Beijing last November. Another project in the line is 'LINGUA FRANCA', which focuses on diminishing language barriers while preserving and promoting cultural heritage through native languages. The interns were expected to build a rich reservoir of painti akhri in conjunction with the painti akhri team; the main focus was to build an application with features like scene description and object detection, giving the response in different languages with the main focus on Punjabi, using a multilingual aspect. The project on the agriculture side is 'SAT SRI AKAL', bringing prosperity to the bottom of the pyramid in Punjab – it mainly works out how the input from the field staff at the village level should be gathered and what the design of the output should be in order to help farmers. Fourth in the line is the 'INTELLIGENT TRANSPORTATION SYSTEM' to increase road safety. The project was linked with the Traffic Specialist cell at the traffic HQ, Government of Punjab, headed by an alumnus of IIT Delhi. It was to come up with an application, with the support of machine learning and AI, for improving road mapping like side lanes, cycle lanes, parking areas, walking lanes and other aspects of transportation systems.

Data Science is being used across a number of industries and companies to benefit society and to reach all the available opportunities. For example, in agriculture alone there are now agrobots, and drones are being used to gauge the health of the harvest, which can help farmers improve their crop yield and ultimately reduce costs. With the help of such technologies, a state like Punjab, which has been recognised as the food basket of India, can rehabilitate food security while improving crop quality and health. (sabudh.org, 2017) Medicine is another vertical where Artificial Intelligence has been progressing, to make the right diagnosis and detect disease at the right time for it to be cured. Punjab has the highest rate of cancer in India: 18 people succumb to the disease every day, according to a recent report published by the State Government. Having machine learning and AI algorithms to diagnose this fatal disease at an early stage can significantly decrease the mortality rate from where it is currently.

PROGRAM UNDERTAKEN

Sabudh Foundation is a non-profit organization welcoming aspiring Data Scientists to undergo a six-month internship and become part of the Sabudh Fellowship or alumni team. Interns get to work on real-life problems and projects having a real social impact on society, and they get a chance to learn Artificial Intelligence and machine learning algorithms from the leading lights of industry and academia. This time, in 2020, they planned to add six MBA students as project managers, each given a team of B.Tech students from different universities and colleges (PEC, Chandigarh; Punjab University, Patiala; and Guru Nanak Dev Engineering College, Ludhiana). All these students, as interns, get a chance to manage the projects and apply the algorithms they study on a regular basis in the classroom study program organised by SABUDH fellows to real-life projects as per the need of the project. Interns are expected to work closely and collaboratively with the respective team members onsite for the duration of the program. Further, interns are given highly intensive training in advanced technologies such as Machine Learning, AI, IoT and cyber security. Interns are also provided with a chance to build an extensive network with worldwide academics and industry leaders at Sabudh, leading to further employment opportunities.

PROBLEM DEFINITION

India is recognised for its blend of cultures coming together in balanced harmony. In our diversified country, different regions come up with entirely different languages, scripts and traditions. Sign language, similarly, is a language used primarily by deaf and mute people in order to communicate using signs made with the hands and other body movements like facial expression and body posture.

There are about three hundred sign languages in use around the globe today. Accurate numbers are not known, as new sign languages emerge frequently and occasionally through language planning. (en.wikipedia.org) Further, the list of deaf sign languages is sorted by region, covering contemporary deaf sign languages in regions like Africa, the Americas, Asia/Pacific, Europe and the Middle East.

The Asia/Pacific list of deaf sign languages includes:

Table – 2.1

List of Asia/Pacific sign languages

Language: Origin

Japanese Sign Language: Japanese

Chinese Sign Language: Chinese

Indo-Pakistani Sign Language: Indian (conflicting reports on whether Indian and Pakistani sign language are one or two different languages)

Philippine Sign Language: French

Thai Sign Language: American Sign Language

Amami Oshima Sign Language: village sign language or idioglossia of Japan

Ghandruk Sign Language: village of Nepal

Hong Kong Sign Language: Shanghai Sign Language

Huay Hai Sign Language: village of Thailand

Source – en.wikipedia.org

Sign languages use the visual-manual modality to convey meaning. They are full-fledged natural languages with their own lexicon and grammar, and, importantly, they are not mutually intelligible with each other. (Bahadur Akshay, 2019)

Sign languages differ around the globe. For example, ASL (American Sign Language) is used throughout the United States and is standardized as well.

India, however, is altogether a different case: along with its diversified cultures, Indian Sign Language differs across regions, with multiple different signs depicting the same meaning. "Indian Sign Language Is a Human Right of the Deaf", as per the ISLRTC (Indian Sign Language Research and Training Centre under the Department of Empowerment of Persons with Disabilities).

Indian Sign Language (ISL) is used by the deaf and mute community all over India, but ISL is not used by deaf and mute schools to teach deaf and mute children, as there was no program to orient teachers towards training with specific teaching methods using ISL. Even parents of deaf and mute children are not aware of Indian Sign Language and do not have the ability and material to remove these hindrances in the way of communication with and for their children. ISL interpreters are an urgent requirement at institutes and places where deaf and mute people gather, but India has fewer than 300 certified interpreters. Therefore, an institute that met all these needs was a necessity. Finally, after a long struggle by the deaf community, the Ministry approved the establishment of ISLRTC at New Delhi on 28th September, 2015. (islrtc.nic.in, 2015)

ISL DICTIONARY LAUNCH BY ISLRTC

First Edition: ISLRTC launched the first Indian Sign Language Dictionary of 3000 terms on 23rd March, 2018 at the India International Centre, New Delhi. The dictionary was released in DVD form, containing signs of everyday use and their corresponding English and Hindi words, and specialized terms from legal, academic, medical and technical fields are explained in ISL. This dictionary could benefit interpreters, teachers of the deaf and parents of deaf and mute children, and will also help adult deaf and mute people to learn English and Hindi. (islrtc.nic.in, 2018)

Second Edition: ISLRTC launched the second edition of the Indian Sign Language Dictionary on 27th February, 2019 at the C.D. Deshmukh Auditorium, India International Centre, 40, Max Mueller Marg, Delhi. The second edition was launched with 6000 words under the categories of academic, legal, medical, technical and everyday terms. (islrtc.nic.in, 2019)

ISLRTC is working with the following objectives:

1) To develop manpower for using ISL (Indian Sign Language), for teaching and for conducting research in ISL.
2) To promote the use of Indian Sign Language as an educational mode for deaf and mute students at primary, secondary and higher education levels.
3) To orient and train various government officials, teachers, professionals, community leaders and the public at large in understanding and using ISL.
4) To collaborate with different organizations for deaf and mute people, along with other institutes in the field of disability, for promoting and propagating ISL, among many more objectives to accomplish.

In the coming age of AI (Artificial Intelligence), where we can come up with an application recognizing signs of Indian Sign Language, people will no longer need human interpreters to interpret for them, as they can carry their own interpreter in their pocket. With Abilities Infinite, we are trying to come up with our own implementation for building an Indian Sign Language Recognition Application at SABUDH FOUNDATION.

LIMITATIONS

While working on this project we encountered some issues. ISL is not standardized yet, as it differs from area to area; taking Maharashtra as an example, the sign language differs even between the two cities of Mumbai and Pune, which are hardly 80 km away from each other. There was no labelled and well-diversified dataset that could be used directly for model training, though there were a couple of videos on YouTube and some smartphone applications with pre-recorded gesture videos of ISL. Also, Indian Sign Language is not only about hand movements and gestures but also about facial expression, body movement and body posture. Hence the project was divided into broader phases to depict the life of the project: data collection from scratch; trying different algorithms and models to find the best fit for the project; building a prototype model, testing its accuracy and proceeding with the model with maximum accuracy on training and validation data; after reaching the maximum accuracy achievable through hyper-parameter tuning, going down the deployment road of integrating the model and the application; and finally, testing the application with a smaller group before releasing it for public use.

Figure 2.1

Project Roadmap

Source – Author

RATIONALE

Availability and affordability of interpreters as a service has always been in demand in the deaf and mute community. Many big business houses face problems while providing such services to deaf and mute people, given the high costs and maintenance charges. As per the NAD (National Association of the Deaf), there are around 18 million people who are deaf in India. (cio.economictimes, 2018) Our project could be used in business-to-business settings, where big companies that want to give employment to capable deaf and mute people could use our application for conveying messages to employees, and further for conveying employees' messages to the end consumer. Usually deaf and mute people do not have many options for communicating with a hearing person, and the majority of the alternatives come with flaws. Interpreters are not easily available or accessible and can be an expensive option. Even pen and paper sometimes become very uncomfortable, time-consuming and messy for both deaf and hearing people. Using a messenger or texting hardly solves the problem either, as these do not offer an easy approach, confidence and comfort for the person while communicating. As for translation software, it is either slow or too expensive. Secondly, in a growing world our dictionary never comes to an end; one cannot rely on an old system without any updates of new signs and words.

With our project we are giving this business hub a Pocket Interpreter running on the superior technology of AI and machine learning. One only needs a camera on his/her device facing the person communicating using sign language.

Some companies plan to do their part for Indian Sign Language, but the main problem they face is data. It is indeed a difficult task to collect data in India, where there is a blend of cultures and a variety of sign languages. There is no proper dataset already in existence the way ASL (American Sign Language) has; ASL is recognized as a standardized language in most countries around the globe. Sabudh Foundation is a non-profit organization promoting AI and machine learning in different sectors of our society like healthcare, education, agriculture and, similarly, disability. Through Abilities Infinite, Sabudh Foundation is playing its part in improving the lives of deaf and mute people using AI and machine learning.

An application that can recognize and convert sign language into text in real time, built by gathering a corpus of data, is critical and the need of the hour. It could tap the target market of 18 million people who are deaf in India. The main purpose of coming up with this application is to remove the communication barriers faced by deaf and mute people in their daily routine, in a country like India where deaf and mute people can hardly afford to hire an interpreter and where even the government has no provision regarding interpreters' fees. The application could solve many problems, such as the basic communication of a child with his/her parents who lack knowledge of Indian Sign Language. By having our application on multiple platforms like Android, iOS and Windows, our customers can use it at their comfort, free of cost. They would only need a camera on their device facing the person doing sign language, and our application will recognise the signs and convert them into text on screen in real time. Just to add a little more comfort, our execution phase includes multilingual features in this application: targeting the Punjab area, text will be provided in three major languages – English, Punjabi and Hindi. The application can be used for educational purposes for deaf and mute people in India. Using this application, these children can easily communicate and can even work for their desired companies. Business hubs can use this application to communicate with their deaf and mute employees, and those employees can in turn use it to convey their messages to the end customer with confidence and without any hesitation.

CHAPTER 2

REVIEW OF LITERATURE
Sign language recognition/detection has gained reasonable interest from many data scientists and researchers in the last decade. In order to facilitate a more accurate communication system for deaf and mute people, one needs to come up with a more accurate sign language recognition system. This literature review considers work done on Indian Sign Language and on foreign sign languages, mainly American Sign Language; due to the easy availability of ASL data, it has been the most popular among researchers. The process can be tackled using different algorithms and models like Neural Networks (NN), Genetic Algorithms (GA), Evolutionary Algorithms (EA), Multi-Layer Perceptron Feed-Forward Networks (MLP-FFN), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and many more, as per the need of the market.

Hence, in this report the empirical research on sign languages is classified into two categories – work done on the recognition of:

1- Foreign sign languages

2- Indian Sign Language

As there is no universal sign language, different countries have different sign languages like British Sign Language, American Sign Language, Chinese Sign Language, Australian Sign Language, Indian Sign Language and many more. In India, professionals believe that there is an acute shortage of special schools for deaf people in most areas. The reality is that deaf schools mainly do not use Indian Sign Language, and hardly 5% of deaf and mute people attend deaf schools. (Zeshan et al. 2005, Vasishta et al. 1998)

1. FOREIGN SIGN LANGUAGE RECOGNITION

Kramer and Leifer (1990), using the Cyber Glove, developed an American Sign Language (ASL) finger-spelling system. They first used neural networks for data segmentation, then a feature classifier for each sign, and finally sign recognition. They trained their model using a tree-structured neural network classifying vector quantizer, and for the ASL alphabet they developed a large neural network with 51 nodes. Their research and experiments ended with an accuracy of 98.9% for the system. (shodhganga.inflibnet.ac.in) The Cyber Glove was first established in 1990 and, with updated versions, is still widely used as a data glove solution in the market.

Murakami and Taguchi (1991) worked on Japanese Sign Language by investigating a Recurrent Neural Network (RNN) approach. They developed a posture recognition system which could recognize finger alphabets of 42 symbols using neural networks. Later they developed a system in which each gesture reflected a word. As gesture recognition is more difficult than posture recognition, due to the dynamic processes to be handled, they used an RNN to recognize continuous gestures. They trained their model on 10 words – father, mother, brother, sister, memorize, forget, like, hate, skilled and unskilled – and came up with a recognition rate of 96%, and 80% without the data being augmented and filtered.

Figure- 2.2

Sign Language words used by Murakami and Taguchi

Source – dl.acm.org, 1991

Takahashi and Kishino (1991), using the VPL Data Glove, recognized 46 Japanese Kana manual alphabets. Hand gestures were encoded and recognized with the help of data ranges for joint angles and hand orientations, based fully on experiments. The model was trained on 46 hand gestures in Japanese Sign Language, of which, on testing, it could recognize only 30; the remaining 16 hand gestures/signs were not reliably recognized. Fig. 2.3 shows the VPL Data Glove, which launched in the market in 1987 and at that time also supported a full body motion tracking system called DataSuit.

Figure 2.3

Front cover of Scientific American, October 1987, featuring the VPL Data Glove

Source – Britannica (scientific American), 1987

Starner (1995) came up with an unencumbered way of recognizing sign language, ASL in particular, with the use of a video camera. Using Hidden Markov Models (HMM), they achieved a low error rate on both the training dataset and an independent dataset without invoking complex models of the hands. The system attained an accuracy of 99.2% for recognizing sentence-level American Sign Language. They used a traditional dual-camera setup, one camera mounted on the desk and the other for tracking the user's hands.

Rung-Huei Liang and Ming Ouhyoung (1998) came up with a large-vocabulary sign language interpreter recognizing real-time continuous gestures of sign language using a DataGlove. By solving end-point detection in a stream of gesture input, statistical analysis was done according to four parameters of a gesture: posture, position, motion and orientation. (researchgate.net, 1998) They implemented a prototype with a vocabulary of 250 signs in Taiwanese Sign Language (TWL) and achieved an average recognition rate of 80.4% on continuously recognized vocabularies in real time, using Hidden Markov Models (HMM) for 6 orientations, 8 motion primitives and 51 fundamental postures.

Bungeroth, Stein, Dreuw, Zahedi and Ney (2005) came up with a suitable corpus of German Sign Language (DGS) to train a statistically capable system for automatic recognition of signs and statistical machine translation of sign language, which was a challenge at that time for the field of Natural Language Processing (NLP). They presented an appearance-based German Sign Language recognition system using a weighted combination of different geometric features of the hands – such as the length of the hand border and the x and y coordinates of the centre of gravity – along with different appearance-based features.

Rybach (2006) presented a vision-based approach to continuous Automatic Sign Language Recognition (ASLR). The system produced very promising results on a publicly available database recorded from numerous speakers without any special data acquisition tools. The database had 201 sentences and 3 signers, and on this data they could achieve a 17% WER (Word Error Rate). WER is a common tool for comparing the performance of different machine translation and speech recognition systems.
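For reference, WER is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the recognized sentence and the reference sentence, divided by the number of reference words; a minimal sketch in Python:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("good morning to you", "good evening you"))  # 0.5
```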

Mohandes et al. (2007) proposed an image-based system for recognizing Arabic Sign Language. They used a Gaussian skin colour model that could detect the face of the person signing. The centroid of the face is detected and then used as a reference to track the hand movements. A Hidden Markov Model (HMM) was used at the recognition stage for a dataset of 300 signs, giving about 93% accuracy.

Hongmo Je, Jiman Kim and Daijin Kim (2007) came up with a vision-based hand gesture recognition system dealing with the understanding of three tempos and four musical patterns by an operator of a computer-based music play system and a human conductor of a robot orchestra, with the help of hand gesture recognition using Conducting Feature Points (CFP), motion history matching and motion direction codes. The research and experiments resulted in about 86% accuracy on the test dataset.

Chen Qing et al. (2007) presented a 3-D hand tracking and analysis system using methods like Haar-like features and AdaBoost algorithms based on stochastic parsing. Given an input gesture, based on the extracted posture the composite gesture was parsed and recognized with a set of production and primitive rules. The system could achieve better accuracy for the researchers even against complex and cluttered backgrounds.

Al-Rousan, Khaled Assaleh and A. Tala'a (2009) presented a system for recognition of Arabic Sign Language (ArSL) using a large dataset of ArSL to recognize 30 isolated words, based on Hidden Markov Models (HMMs). The system worked in different modes: online, offline, signer-independent and signer-dependent. For the signer-dependent mode, they achieved an accuracy of 96.74% on offline test data and 93.8% on online test data, whereas for the signer-independent mode the system achieved word recognition rates of 94.2% for the offline mode and 90.6% for the online mode. This system does not rely on any kind of input device like a data glove, as it allows the signer to perform freely and directly.

Kim and Cipolla (2009) presented dynamic learning of subspaces in a model based on Canonical Correlation Analysis (CCA), a principal tool for inspecting linear relations between two sets of vectors and tensors. They came up with a self-recorded hand gesture dataset having 900 image sequences for 9 gesture classes with high image quality, which resulted in a significant model compared to other state-of-the-art methods with respect to accuracy, with low time complexity and without any major hyper-parameter tuning required.

Lahamy and Lichti (2009) presented a designed application demonstrating the capability of a range camera for real-time applications. The application could automatically recognize gestures using range cameras. The results, as per the confusion matrix showing predicted outcomes against actual values, gave a proportion of 38%. Their primary objective was to bring in a dynamic sign recognition application.

Youssif, Hamdy Ali and Aboutabl (2011) presented automatic Arabic Sign Language (ArSL) recognition using Hidden Markov Models (HMMs). They used a dataset of 20 isolated words from standard Arabic Sign Language, conducting experiments that took into consideration parameters like different skin tones and different clothes. They used a signer-independent mode, with the signer being glove-free, in different clothes and with different skin tones, giving an average sign recognition rate of 82.22%.

Patil, Pendharkar and Gaikwad (2014) designed a system using a sensor glove to capture signs of American Sign Language performed by the signer and then show the translated text on a screen, as shown in Figure 2.4. The system proposed a prototype for further experiments, trained on a dataset of 26 alphabets and an additional gesture for 'space'. The sensor glove was made of latex and detected the position of each finger by monitoring bending flex sensors on it. Giving a prototype model for further experiments was the primary objective of this work.

Figure 2.4

Sensor glove doing the gesture of 'A' (on left) and the sensor glove (on right)

Source- ijsrp.org (research paper), 2014

Ahmed and Aly (2014) presented a system combining Local Binary Patterns (LBP) and Principal Component Analysis (PCA) to extract features that were fed to Hidden Markov Models (HMMs) for recognizing a lexicon of 23 isolated Arabic Sign Language words. In signer-dependent mode they achieved a 99.97% recognition rate, but the approach somewhat failed in areas of constant grey level due to the thresholding scheme of the operator.

Ibrahim, Selim and Zayed (2017) presented an automatic visual sign language recognition system that translated isolated Arabic Sign Language (ArSL) words from their standard version into text. The system works in signer-independent mode and uses only a single camera, without any input device like gloves or markers. The research had four stages – hand segmentation, hand tracking, hand feature extraction and classification – giving an average recognition rate of 97% with a low misclassification rate for similar signs.

Vishisht M. Tiwari (2017) presented an affordable and portable solution for sign language recognition by coming up with an Android application for recognizing alphabets of American Sign Language. He used image processing techniques to distinguish the hand from the background and then to identify fingertips, so as to recognize the gesture made by the signer using the distances between the fingers and the centroid of the hand, and finally to show the gesture in textual form. He used about 100 gestures for each alphabet, which helped him achieve the highest recognition rate of 100% for alphabets 'E' and 'V' and the lowest recognition rate of 85% for alphabets 'I' and 'U'. The average recognition rate came out to be 90.19%.

Figure 2.5

Snapshot of final system detecting Gesture ‘B’ of ASL

Source – slideshare.net (research paper), 2017

Suharjito, Anderson, Wiryana, Ariesta and Kusuma (2017) tried to develop a purely vision-based sign language recognition system by trying out multiple methods and then comparing their respective accuracies. They used a camera as the input device and a Microsoft Kinect to capture images, as it removes the need for any sensory glove and results in a reduction of cost. They tried different methods: HMMs (Hidden Markov Models) gave an accuracy of 86.7%, a 3-D CNN (Convolutional Neural Network) gave an accuracy of 94.2%, a simple ANN (Artificial Neural Network) gained an accuracy of 95.1%, whereas an SVM (Support Vector Machine) was able to achieve 97.5% accuracy on the same dataset.

Yangho Ji, Sunmok Kim, Young-Joo Kim and Ki-Baek Lee (2018) presented a human-like sign-language learning method adopting a deep learning methodology. For other models and systems, the required data size had to be large, as dozens of images were required for a single sign. In this study, a human-like learning process is used as much as possible by getting the model to identify the sign after watching fewer sequential images of the gesture. The model was trained on a dataset of 12 gestures, each having 500 images, giving a recognition rate of 99% on images captured by a low-cost RGB camera.

Dias, Fontenot, Grant, Henson and Sood (2019) presented a model trained using a CNN (Convolutional Neural Network) to recognise American Sign Language gestures. They worked on both dynamic (sign in motion) and static (stationary or fixed) signs. The dynamic dataset was collected using an LMC (Leap Motion Controller). The presented model was able to recognize static signs at an accuracy rate of 94.33%. Under dynamic signs they had two further categories, one-handed and two-handed signs; the accuracy rate for dynamic one-handed signs was 88.9% and for two-handed signs 79.0%, with a dataset of 60 classes overall.

2. INDIAN SIGN LANGUAGE RECOGNITION

India's 11th Five Year Plan (2007-2012) acknowledged the needs of deaf and mute people, which had been neglected for a long time, and finally in 2011 contemplated the need to develop "The Indian Sign Language Research and Training Centre (ISLRTC) as an autonomous centre of the Indira Gandhi National Open University (IGNOU), Delhi." (islrtc.nic)

Joyeeta Singha and Karen Das (2013) presented a system with a classification technique based on Eigen Value Weighted Euclidean Distance for recognition of Indian Sign Language. They considered skin filtering, feature extraction, hand cropping and classification. The dataset considered had 24 signs, each with 10 sample images; hence, on a total of 240 images, the system achieved a recognition rate of 97% on two-hand gestures with low computational time and no input devices such as sensor hand gloves. The presented system considered 24 alphabets of Indian Sign Language, excluding 'H' and 'J'; images were captured using a camera.

Tripathi, Baranwal and Nandi (2015) presented a Continuous Indian Sign Language Recognition System using ten Indian Sign Language sentences performed by five different people. Each sentence was recorded ten times: six times for training and four for testing. The system was developed using Principal Component Analysis with Euclidean distance, City Block distance, Chess Board distance and Correlation distance, giving recognition rates for each sentence varying from 67% to 93%. Classification accuracy was measured by taking into consideration the maximum number of frames matched.

Rokade and Jadav (2017) presented an Automatic Sign Language Recognition system using an Artificial Neural Network and a Support Vector Machine on seventeen letters of the English alphabet, using a skin segmentation technique to obtain the shape of the hand region. The ANN (Artificial Neural Network) gained an average accuracy of 94.37% and the SVM (Support Vector Machine) an average accuracy of 92.12%. Both models gave the highest accuracy when taking thirteen features in the feature extraction process. On comparison, the ANN model performed better than the SVM.

Rishabh Gupta (2018) presented research in the field of Indian Sign Language Recognition using different machine learning algorithms, namely SVM (Support Vector Machine), Logistic Regression, K-Nearest Neighbours and CNN (Convolutional Neural Networks), for the detection of gestures. On the dataset he tried SURF (Speeded Up Robust Features) for feature extraction; it is a feature detector used in object detection, image registration, 3D reconstruction and classification. Accuracy was tested with and without the SURF technique for comparison. The highest accuracy without SURF was achieved by the CNN (Convolutional Neural Network) at 77%, and with SURF the SVM (Support Vector Machine) ruled the leaderboard with a 92% accuracy rate.

CHAPTER 3

RESEARCH METHODOLOGY
Numerous efforts have been made with respect to foreign sign languages and Indian Sign Language; for ISL, most of the work done is on static/stationary signs or on the alphabet. In this research, videos are taken as the dataset for categorical data. Many videos were available on online platforms like YouTube and on some mobile applications helping people learn ISL. Unfortunately, that data was not enough to train and test a model for recognizing a gesture and giving its label as output. During our literature review we observed that to train a model you need at least 50 videos for each sign, and that to achieve optimum results a considerably larger number of videos is preferable.

PROBLEMS ENCOUNTERED

1- Classes had to be defined; no random signs could be taken for training purposes. ISL has a huge dictionary of words and one cannot directly train the model on the whole set, so classes had to be defined and taken into consideration in a timely manner.
2- Data had to be self-created/recorded as per the categories/classes taken on a timely basis, according to the need of the model and the required output.
3- The model had to take into account not only hand movement but also facial expression and full body pose estimation, as ISL is not only about two-hand movements but also involves the face and pose estimation of the human body.

The whole process is divided into different phases, as was shown in Figure 2.1. The first ship to sail was Data Collection, going through the ports of data creation, data pre-processing and data transformation. Before that, our project had two main objectives. The first was to collect data, as for Indian Sign Language we did not have compiled data classified into classes; so our first objective was to record and collect data as per the defined classes. The second objective was to build a prototype model for recognizing Indian Sign Language gestures and giving the result back as text.

OBJECTIVES

1. DATA COLLECTION OF INDIAN SIGN LANGUAGE – Unlike ASL (American Sign Language), Indian Sign Language faces a huge problem in having a synchronized, labelled dataset available as an open source for researchers to contribute to this challenging learning process, and for every researcher the dataset is the foundation of the whole project. Our initial objective is to collect and create data for ISL in a categorical way – week days, greeting words, regularly used terms, legal terms, etc. For the purpose of data collection, videos were to be recorded and also collected from videos available on mobile applications, YouTube and other social media platforms. The dataset collected will be in the possession of Sabudh Foundation, and any interested researcher can use it with their permission as it will be available on their website.
2. PROTOTYPE MODEL – Prototyping, in brief, is a system development method in which a prototype (a rudimentary working model of an information system or a product, limited to basic principles) is built, tested and then reworked until an acceptable outcome is achieved; through this prototype model the product is then formulated. (Searchcio, 2005) In this process we will be trying out different algorithms related to neural networks and hand gesture recognition models available in the market. For the development of a prototype model, we were required to try and test multiple tools available in the market for pre-processing of data, and different algorithms for training and testing were to be applied to the pre-processed data.

Through these objectives we will be able to come up with a prototype, using evolutionary prototype modelling, running in a pipeline alongside the addition of new categories of data. Artificial Intelligence and machine learning have been gaining a strong foothold in the fields of hand gesture recognition, facial expression and pose estimation. This project integrates all of this work in AI to its best use, giving deaf and mute people a boon for communicating with comfort and confidence.

CHAPTER 4

DATA COLLECTION OF INDIAN SIGN LANGUAGE


Initially, we needed to articulate the problem and then define the type of data required. For static signs we articulated the problem by defining different classes, after which data was created and recorded according to the timely needs of the project. The Indian Sign Language dictionary given by the Indian Sign Language Research and Training Centre has 6000 words under categories like academic, medical, technical, legal and everyday words.

Table – 4.1

Classes for the project

CLASSES      SIGNS

WEEK DAYS    Monday; Tuesday; Wednesday; Thursday; Friday; Saturday; Sunday

GREETINGS    Hi/Hello; How are you?; Good Morning; Good Afternoon; Good Night; Hello, what is your name?; What is your age?; Had your lunch?; I am fine; What is your job?

Initially we worked on 'Week Days' as a class; for each sign in it we recorded more than a hundred videos, all thanks to our colleagues for helping us create data even during the times of COVID-19.

DATA CREATION

Videos were recorded manually as raw videos at 720p resolution, where 720 stands for the number of horizontal scan lines in the video display, also known as 720 pixels of vertical resolution, and 'p' stands for progressive scan, a format for displaying, storing and transmitting moving images in which all the lines of each frame are drawn in sequence. Videos were recorded at a frame rate of 30 fps, where 'fps' stands for frames per second, the frequency at which consecutive images (frames) appear on the display screen.
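As an illustration (not the exact script used during data creation; the file paths are hypothetical), the resolution and frame rate of a recorded clip can be verified, and its frames dumped for later pre-processing, with OpenCV:

```python
import os
import cv2

# Hypothetical path to one recorded gesture video.
video = cv2.VideoCapture("weekdays/monday/monday_001.mp4")

width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))    # expected 1280 for 720p
height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))  # expected 720
fps = video.get(cv2.CAP_PROP_FPS)                   # expected ~30
print(f"{width}x{height} @ {fps:.0f} fps")

# Dump every frame as an image for later pre-processing.
os.makedirs("frames", exist_ok=True)
count = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    cv2.imwrite(f"frames/monday_001_{count:04d}.png", frame)
    count += 1
video.release()
print(f"{count} frames extracted")
```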

Figure 4.1

Research Methodology Process

Source – Author

PRE-PROCESSING OF DATA

After creating and gathering the data, the next port for the ship to sail through was pre-processing the videos before training learning algorithms on them. Under data pre-processing, raw data is cleaned and transformed using formatting, cleaning and sampling techniques for data cleaning, and scaling/normalization, decomposition and aggregation techniques for transformation. Data pre-processing is an integral part of machine learning for improving the quality of data and deriving useful information, directly affecting our model's ability to learn.

NEED OF DATA PRE-PROCESSING

Data can come in many forms like images, structured tables, audio files, videos, etc., whereas machines do not understand textual form, audio or video directly, as they only take input in the form of '0' and '1'. Feature extraction is a branch of data pre-processing under which data is transformed and encoded and features are extracted so that the model can learn from it easily.

1) For Feature Extraction: Data is described as a number of features having the basic characteristics of an object, like variables, fields, dimensions, attributes, points, records, patterns, events, vectors, cases, observations, samples or entities.
2) For Quality Assessment of Data: Many datasets are not very accountable or reliable, so data pre-processing is important in order to improve their quality. Pre-processing is a necessity to deal with missing data, inconsistent values and duplicate values.
3) Helpful in Feature Sampling: A sample is a representative of the whole data, and analysing a sample ultimately helps in reducing the size of the dataset, so that one can use better machine learning algorithms and can even change pre-processing techniques on a timely basis to save time and effort.
4) Dimensionality Reduction: Most real-world data have a large number of features. In our videos, for some signs we only need to consider hand movement, and the background is noise for our model. Dimensionality reduction aims at removing features that are not required.

Once the data is well prepared, it is ready for applying learning algorithms. For our project, videos were pre-processed by taking into consideration the key-points of the hands and body, as this makes it much easier for the model to learn signs and gestures rapidly; a sketch of this idea is shown below.

ANALYSIS AND RESULT REPORTING

In order to fulfil our first objective we recorded our data; the table below gives the number of videos recorded for each gesture.

Table 4.2

Number of videos recorded for each gesture in both classes

Classes     Signs                        Number of videos recorded

WEEKDAYS    MONDAY                       229
            TUESDAY                      264
            WEDNESDAY                    212
            THURSDAY                     204
            FRIDAY                       204
            SATURDAY                     205
            SUNDAY                       255

GREETINGS   HI/HELLO                     42
            HOW ARE YOU?                 34
            GOOD MORNING                 68
            GOOD AFTERNOON               57
            GOOD NIGHT                   53
            HELLO, WHAT IS YOUR NAME?    29
            WHAT IS YOUR AGE?            36
            HAD YOUR LUNCH?              12
            I AM FINE                    12
            WHAT IS YOUR JOB?            31

Source - Author

These are the total numbers of videos recorded for each class. The prototype model in our project was trained on one class only, i.e. weekdays. In further stages more data can be added as per the classes and categories defined.

CHAPTER 5

PROTOTYPE MODELLING
For prototype modelling we followed the research methodology process shown in figure 3.1.

PRE-PROCESSING USING KEYPOINT DETECTION TECHNIQUES

Our problem statement involves detection of hand, body and facial expressions, so to build a better model it was convenient to use key-point detection techniques as pre-processing to prepare our data well. Key-point detection here means detecting people and localising their key-points as the interest points in our data. In our project two techniques were used –

1) OPENPOSE 2) MEDIAPIPE

1. OPENPOSE

OpenPose is the first real-time multi-person system to jointly detect human body, face, hand and foot keypoints, giving a total of 135 keypoints in a single image. OpenPose is authored by Gines Hidalgo, Zhe Cao, Tomas Simon, Shih-En Wei, Hanbyul Joo and Yaser Sheikh. (Github.com, 2018)

Figure 5.1

Logo of OpenPose

Source – github.com/openpose

FUNCTIONALITY

1. 2D Real-Time Multi-Person Keypoint Detection: it detects 15, 18 or 25 keypoints for body-and-foot estimation, with running time invariant to the number of people detected. For foot estimation, 6 keypoints are detected and integrated into the 25-keypoint body-and-foot model. For hands, 21 keypoints are detected per hand, giving 21*2 = 42 keypoints in total, with running time depending on the number of people detected. For face keypoint detection, 70 keypoints are detected in total, again with running time depending on the number of people detected.
2. 3D Real-Time Single-Person Keypoint Detection: it performs 3-D triangulation from multiple single views and is compatible with Flir/Point Grey cameras, including handling the synchronisation of Flir cameras.
3. Calibration Toolbox: it acts as a calibration toolbox with easy estimation of distortion and of intrinsic and extrinsic camera parameters. (github/openpose, 2018)
4. Single-Person Tracking: it further improves visual smoothing and speed for single-person use.

INPUT and OUTPUT

In OpenPose the user can supply input as images, videos, a real-time webcam feed, Flir/Point Grey cameras or an IP camera. As output, a basic image with keypoints for display and saving is given in PNG (Portable Network Graphics), JPG (Joint Photographic Experts Group) or AVI (Audio Video Interleave) form. Keypoints can be saved as array classes or in JSON (JavaScript Object Notation), XML (Extensible Markup Language) or YML (YAML Ain’t Markup Language) form.

Figure 5.2

Face output format with 70 keypoints: 69 for the face and the 70th signifies the background

Source- github/openpose, 2018

Figure 5.3

Pose output format with 25 keypoints: 24 for the body and the 25th signifies the background

Source - github/openpose, 2018


Figure 5.4
Real-Time Multi person Keypoint Detection

Source – github/openpose, 2018

Figure 5.5
Hand Output Format with 21-Keypoints: 20 for hand & 21st signifies background

Source-github/openpose, 2018
OpenPose is a library for real-time multi-person keypoint detection with multi-threading, written in C++ using OpenCV and Caffe. (Github/CMU-Perceptual-Computing-Lab/openpose)

PROGRAMMING LANGUAGE USED

PYTHON: Python is a high-level, interpreted, dynamically typed, multi-paradigm (imperative, functional, object-oriented, structured and reflective) general-purpose programming language. Python allows its users to express powerful ideas in very few lines of code while remaining easily readable and accessible. Python version 3 was used for this project. Python is released under the Python Software Foundation License.

LIBRARIES USED FOR PRE-PROCESSING IN PYTHON FOR OPENPOSE

OpenPose was used to convert our raw videos into well-prepared videos with key-points detected. In this process of converting raw data into well-prepared data, the following Python libraries were used:

1. OPERATING SYSTEM (OS): We used version 2.1.2 of the OperatingSystem library, a Robot Framework standard library that enables various system-related tasks to be performed. It serves purposes such as creating a file, creating a directory, copying a file, appending to a file, copying directories given a source and destination, counting directories, files or items in a directory, joining paths, listing the directories and files in a directory, moving a file or directory, reading process output, removing files and many more. For the project, ‘os’ was used while installing OpenPose, and the main functions used were ‘exists’, ‘join’, ‘basename’ and ‘splitext’.

2. SHUTIL: The shutil module offers a number of high-level operations on individual files and on collections of files. The ‘os’ library mainly supports individual files, so ‘shutil’ came into action for dealing with collections of files and data. Its functions cover copying the content of a source file to a destination file or directory, ignore arguments, ignoring files and directories, auditing an event, changing the owner ‘user’/‘group’ of a given path and many more archiving operations. For the project, shutil served the purpose of moving files from the drive into working storage, moving videos back after conversion with OpenPose, and collectively moving the converted videos from the output folder back to the drive (a short sketch after this list shows these path and file utilities working together).

3. OPENCV: OpenCV is a Python-accessible library designed to solve computer vision problems. OpenCV stands for Open Source Computer Vision Library, built to accelerate the use of machine learning in commercial products. The library contains more than 2,500 optimised algorithms, covering both classic and state-of-the-art computer vision and machine learning techniques. These algorithms serve purposes such as detecting and recognising faces, identifying objects, extracting 3-D models of objects, classifying human actions in videos, tracking moving objects, producing 3-D point clouds using stereo cameras, stitching images together to produce high-resolution images, following eye movement, recognising scenery and establishing markers for overlaying it with augmented reality.
For the project we used the ‘cv2’ module for the working of OpenPose.

4. GLOB: Glob is a module used to retrieve files and pathnames matching a specified pattern. The module is built into Python, so no external installation is needed. Benchmarks also suggest it is faster than other methods for matching pathnames in directories.

5. NUMPY: NumPy is recognised as the core library for scientific computing in Python. It provides high-performance multi-dimensional array objects and tools for working with these arrays. (Github.io/python_numpy_tutorial). A NumPy array is a grid of values, all of the same type, indexed by a tuple of non-negative integers. It supports array indexing with integers, booleans and other data types. For the project, NumPy was used for reshaping the videos at the end, before applying the learning algorithms.

6. PANDAS: Pandas is a high-level data manipulation tool developed by Wes McKinney. It is built on the NumPy package and its key data structure is the ‘DataFrame’. Functions offered by pandas include indexing data, selecting data, reshaping and pivot tables, working with missing data, working with text data, visualising, splitting, applying and combining data, enhancing performance and scaling to large datasets. For the project, the ‘train_test_split’ function (provided by scikit-learn and applied to our pandas data) was used for splitting the data into training, validation and test sets.
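As referenced in the shutil description above, the sketch below shows how glob, os.path and shutil fit together when collecting the OpenPose-converted clips; all directory names are hypothetical assumptions, not the project's actual paths.

    import glob
    import os
    import shutil

    src_dir = "/content/openpose/output"              # assumed output folder of the conversion step
    dst_dir = "/content/drive/MyDrive/converted"      # assumed destination on the drive
    os.makedirs(dst_dir, exist_ok=True)

    for path in glob.glob(os.path.join(src_dir, "*.avi")):
        name, ext = os.path.splitext(os.path.basename(path))   # e.g. ("monday_001", ".avi")
        target = os.path.join(dst_dir, name + "_keypoints" + ext)
        if not os.path.exists(target):
            shutil.move(path, target)                 # collect converted videos back onto the drive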

PIPELINE FOR PRE-PROCESSING OF DATA IN OPENPOSE

Our data is a collection of videos. We first dealt with category 1, the 7 days of the week, each sign having more than 100 videos, giving a total of 788 videos recorded manually at 720p resolution and 30 frames per second using a smartphone camera. A recorded video is initially taken as a raw video, which is then processed before applying learning algorithms. Figure 5.6 shows the pipeline followed.

Figure 5.6

Pipeline of Pre-Processing a Video using OpenPose

Source – Author

The whole process was run in ‘Google Colab’ as a Jupyter notebook. After installing and importing the needed libraries, OpenPose was installed for converting the videos, detecting the hands and arms along with their key-points, and turning the background colour to black.

First, alternate frames of the videos are taken and combined in order to increase the dataset: the even and odd frames are separated and recombined to obtain two videos from one, which doubled our dataset. The next step was to apply OpenPose in order to extract the hands and arms from the videos along with their keypoints. In our project the initial focus was on hand keypoint detection, the process of detecting the finger joints and finger tips in the given dataset. This produces 21 keypoints: 20 keypoints belong to the hand while the 21st signifies the background. Figure 5.7 shows the raw video, before any pre-processing, for ‘MONDAY’ as a sign.
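A minimal sketch of the frame-doubling step described above, using OpenCV to split a clip into its even- and odd-indexed frames; the input file name is hypothetical, and writing the two halves back to disk as separate clips is left out for brevity.

    import cv2

    def split_alternate_frames(path):
        """Return two frame lists (even-indexed, odd-indexed) from one raw video."""
        cap = cv2.VideoCapture(path)
        even, odd = [], []
        index = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            (even if index % 2 == 0 else odd).append(frame)
            index += 1
        cap.release()
        return even, odd

    even_frames, odd_frames = split_alternate_frames("monday_001.mp4")  # hypothetical file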

Figure 5.7

Snapshot of video before any Pre-Processing

Source- Author

Figure 5.8 shows a snapshot of the same video after using OpenPose for the detection of keypoints on the hands, arms, body and face. Each recorded video was named with the label of the sign it contained. Using commands from the libraries, the videos were renamed in the same pattern so that the commands and processing could be run uniformly.

Figure 5.8

Snapshot of video after applying OpenPose

Source- Author

In figure 5.8 all the keypoints are detected, but given the classes selected, before going to the next step we removed the background (turning it black) and dropped the facial key-point detection, as shown in figure 5.9.

Figure 5.9

Snapshots of video after removing background & facial keypoints

Source- Author

After the removal of the background and facial keypoints, the signs were mainly a matter of the movement of hands and arms, while everything else was just noise for the model given the classes selected.

OpenPose was then used again, this time only to extract the hands and arms in the videos, as needed for better results and to make it easier for the learning algorithms to recognise the signs precisely.

Figure 5.10

Snapshot of video depicting ‘Monday’ as a sign with extracted hands only

Source - Author

After the removal of noise and background (in order to avoid over-fitting), the next step was to apply morphological operations on the extracted frames. Morphology is a broad set of image processing operations that process images based on their shape. In a morphological operation the value of each pixel in the output image is based on a comparison of the corresponding pixel in the input image with its neighbours. The most basic morphological operations are dilation and erosion: dilation adds pixels to the boundaries of objects in an image, whereas erosion removes pixels from object boundaries. (mathworks.com)

Under dilation the value of the output pixel is the maximum value of all pixels in the neighbourhood; morphological dilation makes objects more visible and fills in small holes in objects. Under erosion the value of the output pixel is the minimum value of all pixels in the neighbourhood; morphological erosion removes islands and small objects so that only substantive objects remain. (mathworks.com). For our project dilation was used, as the basic effect of the operator on the extracted frames is to gradually enlarge the boundaries of regions of foreground white pixels.
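A minimal sketch of the dilation step on one extracted frame; the file names and the 5x5 structuring element are illustrative assumptions, not the project's exact values.

    import cv2
    import numpy as np

    frame = cv2.imread("monday_001_frame_0001.png", cv2.IMREAD_GRAYSCALE)   # hypothetical frame
    kernel = np.ones((5, 5), np.uint8)                   # structuring element
    dilated = cv2.dilate(frame, kernel, iterations=1)    # enlarges white foreground regions
    cv2.imwrite("monday_001_frame_0001_dilated.png", dilated)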

After dilation, the frames were aggregated back into videos using an FFmpeg command. FFmpeg is a free and open-source project consisting of a vast software suite of libraries and programs for handling video, audio and other multimedia files and streams. FFmpeg is designed for command-line processing of video and audio files and is widely used for format transcoding, basic editing such as trimming and concatenation, video scaling, video post-production effects and standards compliance. In our project the FFmpeg command was used for aggregating and concatenating frames into a video. After the aggregation of frames into a video, a zoom with more focus on the hands was applied by tuning weights in the videos. Once the videos were ready, they were converted into JSON (JavaScript Object Notation) files; JSON is a lightweight format used for data interchange. It is based on a subset of JavaScript, and the file is realised as an array, vector, list or sequence. A computer does not understand sentences, so JSON is an efficient way of communicating using key-value pairs and arrays.
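A minimal sketch of stitching the dilated frames back into a 30 fps clip by calling FFmpeg from Python; the frame-naming pattern, codec choice and output file name are assumptions rather than the project's exact command.

    import subprocess

    subprocess.run([
        "ffmpeg",
        "-framerate", "30",                    # match the original recording rate
        "-i", "frames/frame_%04d.png",         # assumed naming: frame_0001.png, frame_0002.png, ...
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",
        "monday_001_processed.mp4",
    ], check=True)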

Figure 5.11

Snapshot of JSON file having keypoints for both the hands

Source- Author
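For illustration, a per-frame JSON file like the one in figure 5.11 can be read back as follows; the key names follow OpenPose's --write_json output and the file name is hypothetical, and the sketch assumes at least one person was detected in the frame.

    import json

    with open("monday_001_000000000000_keypoints.json") as f:
        data = json.load(f)

    person = data["people"][0]                       # first detected person
    left  = person["hand_left_keypoints_2d"]         # flat list: x1, y1, c1, x2, y2, c2, ...
    right = person["hand_right_keypoints_2d"]
    left_points  = [(left[i],  left[i + 1])  for i in range(0, len(left),  3)]
    right_points = [(right[i], right[i + 1]) for i in range(0, len(right), 3)]
    print(len(left_points), "left-hand keypoints,", len(right_points), "right-hand keypoints")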

Finally, a Keras video generator was applied to read the final, well-prepared videos for the subsequent modelling and testing purposes.

2. MEDIAPIPE

MediaPipe is a graph-based framework for building multimodal (video, audio and sensor) machine learning pipelines. It is multi-platform, running on mobile devices, workstations and servers, and supports mobile GPU acceleration. Sensory data such as video and audio streams enter the graph, and perceived descriptions such as object localisation and face landmark streams exit the graph. MediaPipe is designed for machine learning (ML) practitioners, including students, researchers and software developers, implementing production-ready machine learning applications, accompanying research work, publishing code and building technology prototypes. (mediapipe.readthedocs.io)

Figure 5.12

Hand Tracking – 21 landmarks in 3D with multi hand support

Source- mediapipe.dev

Figure 5.12 depicts hand tracking under MediaPipe with 21 landmarks in 3D and multi-hand support, based on a high-performance palm detection and hand landmark model.

MediaPipe is free and open source, available under Apache 2.0, with an ecosystem of reusable components. It is also among the most widely reused and shared libraries for media processing within Google.

FUNCTIONALITY

1. Multi-Hand Tracking: MediaPipe performs multi-hand tracking with the help of TensorFlow Lite on the GPU (Graphics Processing Unit), with 21 landmarks in 3D on each hand.
2. Object Detection and Tracking: object detection is performed using TensorFlow Lite on the GPU while tracking uses the CPU. It provides instance-based tracking, i.e. the object ID is maintained across frames. This enables running heavier, more accurate detection models while keeping the pipeline lightweight and real-time on mobile devices, as shown in figure 5.13.
Figure 5.13
Object Detection and Tracking

Source – github.com/mediapipe

3. Face Detection: it works as an ultra-lightweight face detector with 6 landmarks and multi-face support. Face detection is performed using TensorFlow Lite on the GPU (Graphics Processing Unit).
4. Hair Segmentation & 3D Object Detection: MediaPipe performs super-realistic real-time hair colouring, the way it happens in ‘Snapchat’. Detection and 3D pose estimation of everyday objects such as shoes and chairs is also possible using MediaPipe.

In our project we use multi-hand tracking with MediaPipe. The ability to perceive the shape and motion of hands can be a vital component in improving the user experience across a variety of technological platforms and domains; for example, sign language understanding and hand gesture control enable overlaying digital content and information in augmented reality. (ai.googleblog.com)

MEDIAPIPE – MULTI-HAND TRACKING

MediaPipe is an open-source, cross-platform framework used for building pipelines to process perceptual data of different modalities such as video and audio. Its multi-hand tracking approach provides high-fidelity hand and finger tracking, employing machine learning (ML) to infer 21 3-D key-points of a hand from a single frame.

Figure 5.14

Computing 3D Keypoints of hand

Source- ai.googleblog.com, 2019

LIBRARIES USED UNDER MEDIAPIPE:

The basic libraries are the same as those used for OpenPose: NumPy, pandas, shutil, OpenCV and the operating-system utilities. Only TensorFlow was added for the working of MediaPipe.

TENSORFLOW: TensorFlow helps in implementing, training, developing, deploying, evaluating and productionising machine learning models. It was developed by the Google Brain team and became an open-source platform in November 2015. The library provides a collection of tools and functions that enable a machine learning model to be implemented for a variety of environments and purposes. The tools include TensorBoard, used for visualising and tracking the metrics and data associated with a machine learning model (towarddatascience.com). Another is Colab, an online Jupyter notebook environment used for implementing machine learning models; this is the environment in which our whole project was carried out before being published on GitHub.

TENSORFLOW LITE: To access machine learning capabilities on devices such as mobile/smart phones or other embedded devices, machine learning models normally have to be hosted on a cloud server and accessed through RESTful API (Application Program Interface) services. TensorFlow Lite is a solution to this problem which enables machine learning models to run on mobile devices. TensorFlow Lite takes an existing TensorFlow model and converts it into an optimised, efficient version; the resulting model is small enough to be stored on the device and sufficiently accurate to conduct suitable inference.

Advantages of TensorFlow Lite:

1. It easily converts TensorFlow models into mobile-optimised TensorFlow Lite models.
2. It enables offline inference on mobile devices.
3. By training a TensorFlow model and then converting it into a TensorFlow Lite model, the model can be embedded into an Android or iOS application.
4. Typical machine learning models can be run on these embedded devices without using any external API or servers.

For our project a certain pipeline was followed in order to prepare the dataset for modelling and further testing. Figure 5.15 shows the pipeline followed. The MediaPipe multi-hand tracking function was used for detecting the key-points of the hands in each frame.

Figure 5.15

Pipeline of pre-processing a video using MediaPipe

Source - Author
The initial step is the same as for OpenPose: the dataset is doubled by taking alternate frames using OpenCV. The next step was to apply the MediaPipe multi-hand tracking function to the doubled dataset. Figure 5.16 shows the output of MediaPipe on a video. Multi-hand tracking detects the hands in the input video stream and returns 3-dimensional landmarks locating features in each hand. In the hand detector, bounding boxes are placed around rigid objects such as palms and fists, which ultimately makes it easier to detect hands with articulated fingers.

Figure 5.16
Snapshot of video after applying MediaPipe

Source - Author
After the hand and palm detector, the hand landmark model performs precise key-point localisation of 21 3D hand coordinates. This methodology follows the hand landmark model developed by Valentin Bazarevsky and Fan Zhang, research engineers at Google Research. (ai.googleblog.com, 2019)
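A minimal sketch of pulling the 21 (x, y, z) landmarks per detected hand from every frame of a clip, using MediaPipe's Python "Hands" solution; the project itself used the multi-hand tracking graph, and the file name here is hypothetical.

    import cv2
    import mediapipe as mp

    hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2)
    cap = cv2.VideoCapture("monday_001.mp4")        # hypothetical pre-processed clip
    all_frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        frame_landmarks = []
        if result.multi_hand_landmarks:
            for hand in result.multi_hand_landmarks:
                for lm in hand.landmark:            # 21 landmarks per detected hand
                    frame_landmarks.extend([lm.x, lm.y, lm.z])
        all_frames.append(frame_landmarks)          # empty list when no hand is visible
    cap.release()
    hands.close()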

After extraction of the 42 landmarks (21*2) for each frame, the landmarks were combined into a text file for each video, as only the text-file data was needed as input for the next step. The [INPUT+PATH] function was used for the creation of a ‘pickle’ file. A pickle file is used for processing a lot of data at once: pickling is a way of converting any Python object into a character stream that contains all the information needed to reconstruct the object in any Python script. Pickle serialises the object before writing it to the file; it converts a Python object into a byte stream for storing it in a file or database, maintaining program state across sessions and transporting data over a network. After extraction, the text-file data was normalised to 100 frames per video using padding. Data normalisation is often defined as ‘structuring a relational database in accordance with a series of normal forms in order to reduce data redundancy and improve data integrity’; here it simply means bringing every video to a uniform length of 100 frames. Padding in general adds extra space around content (string padding in Python can be done with the ‘ljust()’, ‘rjust()’ and ‘center()’ functions); in our pipeline it means appending filler values so that shorter videos reach 100 frames. After normalisation the data was converted into ‘csv’ files. CSV (Comma Separated Values) files allow data to be saved in tabular form; a CSV file looks like a garden-variety spreadsheet but with a ‘.csv’ extension. Figure 5.17 shows a snapshot of the CSV file for one of our ‘Monday’ videos from the dataset, with the hand landmarks detected using MediaPipe.
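A minimal sketch of the length-normalisation step, padding (or truncating) every video to 100 frames; the per-frame feature count of 126 (2 hands x 21 landmarks x 3 coordinates) is an assumed layout rather than the project's exact one.

    import numpy as np

    def normalise_length(video_landmarks, target_frames=100, features=126):
        # Pad with zero-frames (or truncate) so every video has target_frames rows.
        out = np.zeros((target_frames, features), dtype=np.float32)
        for i, frame in enumerate(video_landmarks[:target_frames]):
            row = np.asarray(frame, dtype=np.float32)[:features]
            out[i, :len(row)] = row                # frames with missing hands stay zero-padded
        return out

    # e.g. a 73-frame clip becomes a fixed (100, 126) array
    dummy_clip = [np.random.rand(126).tolist() for _ in range(73)]
    print(normalise_length(dummy_clip).shape)      # (100, 126)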

Figure 5.17

Snapshot of CSV file for ‘Monday’ video from our database

Source - Author

CSV files are widely used by consumer, business and scientific applications, as CSV is the most common format for exchanging data between applications that otherwise operate on incompatible formats. Finally, once the whole dataset had been converted to CSV in the last pre-processing step with MediaPipe, the data was well prepared for the next phase of model execution.

MACHINE LEARNING ALGORITHMS USED:

Once the data was prepared we applied it to two different learning algorithms: an RNN with LSTM and an RCNN with a ConvLSTM2D layer. Hand gesture recognition and detection for human-computer interaction is an active area of research in machine learning and computer vision. The basic function of gesture recognition is to create a model which can identify specified human signs/gestures and can be used to convey the required information. “Machine learning is the task of programming computers to optimize a performance criterion using data or past experience” – machine learning is used to solve many real-life computer vision problems, with tasks such as recognition, prediction, classification and detection. For our methodology we used neural networks and deep learning algorithms.

NEURAL NETWORKS: A neural network is a set of algorithms, modelled loosely after the human brain, designed to recognise patterns. Sensory data is interpreted through machine perception by labelling or clustering raw input. Neural networks help in clustering and classifying data, which can be either labelled or unlabelled for training. On a labelled dataset we apply supervised learning: for example, the input data is a set of questions and the labels are the correct answers to those questions. The algorithm takes a guess at each answer, and that guess is checked against the correct answer; whenever an incorrect guess is made, the algorithm adjusts itself in order to guess better, and this learning is what makes it machine learning. Deep learning is also known as ‘stacked neural networks’, meaning networks composed of several layers (Pathmind.com/neural-network). The layers of a neural network are made of nodes; a node is a place where computation happens, similar to a neuron in the human brain, which fires when it encounters sufficient stimuli. “A node combines input from the data with a set of coefficients, or weights, that either amplify or dampen that input, thereby assigning significance to inputs with regard to the task the algorithm is trying to learn. A node layer is a row of those neuron-like switches that turn on or off as the input is fed through the net. Each layer’s output is simultaneously the subsequent layer’s input, starting from an initial input layer receiving your data.” Under neural networks we will look at CNN (Convolutional Neural Network), RNN (Recurrent Neural Network) and R-CNN (Region-based CNN).

1. CNN (Convolutional Neural Network) is a neural network that can learn the given weights and biases. ‘Convolutional’ means to convolve, or roll together. It is primarily used for the classification of images, i.e. to “name what it sees”, and is widely used in the field of computer vision. These networks are usually composed of an input layer, several hidden layers (some of which are convolutional) and an output layer. CNNs are recognised nowadays for giving impressive results in image recognition; they allow the user to encode required properties into the architecture in order to recognise specific elements in images. A CNN applies a series of filters to the raw pixel data of the input image to extract and learn high-level features, which are then used for classification. A convolutional layer detects visual features such as lines, edges and colour drops; with each convolutional layer in the network, it learns more complex patterns and abstract visual concepts efficiently.
Figure 5.18
A Neural Network predicting ‘dog’ and ‘cat’ images

Source – medium.com
2. RNN (Recurrent Neural Networks) are state-of-the-art algorithms that can memorise and remember previous inputs when a large set of sequential data is fed in. In a traditional neural network such as a CNN it is assumed that all inputs and outputs are independent of each other, but for videos, audio or similar formats this is a poor assumption: for example, to predict the next word in a sentence, the model must know which words came before. RNNs were made to use sequential information; the word ‘recurrent’ means they perform the same task for each element in a sequence, with the output depending on previous computations and on the memory capturing the information calculated so far.
In a traditional neural network the input layer, hidden layer and output layer process the information independently, with no relation to previous inputs, and the weights and biases of the hidden layers give them no way to memorise information. In an RNN the hidden layers share the same weights and biases throughout the process, which gives them the chance to memorise information calculated as it passes through the model.
ADVANTAGES OF RNNs
1. An RNN models the sequence of the data, where each sample is assumed to be dependent on the previous one.
2. RNNs can even use convolutional layers in order to extend the effective pixel neighbourhood.

DISADVANTAGES OF RNNs

1. RNNs suffer from vanishing and exploding gradient problems.
2. Training a recurrent neural network is very difficult.
3. An RNN cannot process very long sequences when ‘tanh’ or ‘relu’ is used as the activation function.

Recurrent neural networks suffer from short-term memory: if a sequence is long enough, they cannot carry information from earlier time steps to later ones. This is where LSTM (Long Short-Term Memory) comes into the picture; it is a kind of RNN capable of learning long-term dependencies. By default, an LSTM is capable of remembering information for a long period of time. Typical sequential data can be found on social networks and in formats such as video calls, movies and trailers, satellite pictures and security-camera footage. LSTM was created as a solution to the short-term-memory problem, using an internal mechanism known as ‘gates’ which regulates the flow of information. (towardsdatascience.com)

3. R-CNN (Region-based Convolutional Neural Network): R-CNN was proposed by Ross Girshick and others in 2013 to deal with the challenge of object detection. The CNN acts as a feature extractor, and the extracted features are fed to an SVM for classifying the object within the proposed region. R-CNN identifies the different regions that form an object, such as varying colours, scales, enclosure and textures. It uses a selective search algorithm which generates approximately 2,000 region proposals; these are smaller regions of the input image that may contain the objects the user is searching for. The 2,000 region proposals are then passed to a CNN that computes feature vectors representing the input in smaller dimensions. R-CNN takes a lot of time to train, the reason being the 2,000 candidate proposals, and since the user has to train multiple stages separately the implementation is slow.
Figure 5.19
Workflow of R-CNN Model

Source- towardsdatascience

ConvLSTM2D LAYER: the LSTM part is used to detect correlations over time, while the 2-D convolution serves the purpose of capturing images and spatial features.
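A minimal sketch of a video classifier built around Keras's ConvLSTM2D layer; the input shape (100 frames of 64x64 grayscale images), filter counts and dense sizes are illustrative assumptions, not the project's exact architecture.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import ConvLSTM2D, BatchNormalization, Flatten, Dense

    model = Sequential([
        ConvLSTM2D(filters=16, kernel_size=(3, 3), activation="relu",
                   input_shape=(100, 64, 64, 1), return_sequences=False),
        BatchNormalization(),
        Flatten(),
        Dense(64, activation="relu"),
        Dense(7, activation="softmax"),            # 7 weekday classes
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    model.summary()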

LIBRARIES USED UNDER THE MODEL TRAINING PHASE:

TensorFlow, NumPy and pandas are the same as those used for pre-processing.

1. KERAS: Keras is an open-source neural network library written in Python, designed to enable fast experimentation with deep neural networks while being user-friendly, modular and extensible. It promotes easy and fast prototyping and supports recurrent neural networks, convolutional neural networks and combinations of the two. With Keras, components such as NumPy, the Sequential model type, core layers, CNN layers and utilities are imported to help build the neural network architecture. Keras supports many functions used in neural network building, such as layers, activation functions, objectives, optimisers and tools for working with text and image data, simplifying deep neural network code.
2. SCIKIT-LEARN: Scikit-learn, also known as sklearn, is an open-source machine learning library written in Python. It includes various regression, clustering and classification algorithms such as random forests, support vector machines, gradient boosting and k-means. Scikit-learn was initially developed by David Cournapeau as a Google Summer of Code project. (en.wikipedia.org/Scikit-learn). Sklearn is written in Python and uses NumPy for high-performance linear algebra and array operations.
3. MATPLOTLIB: Matplotlib is a plotting library written in Python for use with its numerical mathematics extension NumPy. Matplotlib was originally written by John D. Hunter in 2003. The library is mainly used for visualising data using graphs such as line plots, histograms, scatter plots, 3D plots, image plots, contour plots and polar plots. Several toolkits are available to extend Matplotlib’s functionality, such as Basemap, Cartopy, Excel tools, GTK tools, mplot3d, Natgrid and matplotlib2tikz.

ALGORITHMS USED FOR MODEL TRAINING IN OUR PROJECT:

For training purposes we applied two algorithms – RNN with LSTM and R-CNN with a ConvLSTM2D layer.

1. RNN with LSTM


2. R-CNN with ConvLSTM2D

In our project the data was in the form of videos, further taken as frames, from which features were extracted so that the model could learn and classify the signs according to the given labels. An RNN (recurrent neural network) was used on top of a CNN (convolutional neural network) because, to predict a sign, the model needed to memorise the previous part as well, which a CNN alone cannot do: a CNN only considers the current input, whereas an RNN considers the current input and previously received inputs, since it can memorise previous inputs in its internal memory. An RNN with LSTM cells is primarily designed to recognise patterns in sequential data such as text, handwriting, numerical time series and the spoken word. Our problem statement concerns sign classification based on videos/frames. For better classification, R-CNN was proposed; it was designed primarily for object detection and extracts about 2,000 regions for each input image via selective search. The features are extracted using a CNN, after which a linear SVM is applied for object identification and a regression model for tightening the bounding boxes.
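For comparison, a minimal sketch of the RNN-with-LSTM variant operating directly on the padded keypoint sequences; the input shape of 100 frames x 126 features and the layer sizes are assumptions rather than the project's exact configuration.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    rnn_model = Sequential([
        LSTM(64, input_shape=(100, 126)),          # summarises the whole gesture sequence
        Dense(32, activation="relu"),
        Dense(7, activation="softmax"),            # one output per weekday sign
    ])
    rnn_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])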

Figure 5.20

Visualization of Images getting trained under RCNN model

Source - Author

ANALYSIS & RESULT REPORTING

For testing and training purposes our prototype model’s data was broken down into three distinct datasets: a training set, a validation set and a test set.

Training Dataset: the training dataset is the set of data used to train the model. During each epoch, our model is trained on this dataset only and learns the features extracted from it. It is the data used to fit the model; in other words, the model sees and learns from the training data only. The training and validation datasets are used in the model-building process.

Test Dataset: the test dataset is used to evaluate the model once it is completely trained using the training and validation data. It is independent of the training dataset and is used to obtain performance characteristics such as accuracy.

Validation Dataset: this is a sample of data used to provide an unbiased evaluation of a model fitted on the training dataset while tuning the model’s hyperparameters. The model occasionally sees this data but never learns from it. The validation set is regarded as part of the training data because it is used in building the model: it is mainly used for tuning the parameters of the model, whereas the test dataset is used for the performance evaluation.
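A minimal sketch of producing such a three-way split (matching the 66 : 32 : 2 proportions used later for the RCNN model) with scikit-learn's train_test_split; X and y below are random stand-ins for the pre-processed videos and their labels, and the array shape is an assumption.

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(1550, 100, 126)          # stand-in: 1550 videos, 100 frames, 126 features
    y = np.random.randint(0, 7, size=1550)      # stand-in labels for the 7 weekday classes

    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.02, stratify=y, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.32 / 0.98, stratify=y_trainval, random_state=42)
    print(len(X_train), len(X_val), len(X_test))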

Figure 5.21

Visualisation of dataset splits

Source – towardsdatascience, 2017

Different metrics are used to evaluate a classification model; our models are evaluated on the basis of a classification report computed from the confusion matrix. The confusion matrix summarises the numbers of correct and incorrect predictions with count values, broken down by each class classified by the model. The classification report is a visualiser computing the precision, recall, F1 and support scores for the trained model.

CONFUSION MATRIX: a confusion matrix is a summary of the predicted results of a classification problem. It is often used to evaluate and describe the model’s performance on test data for which the true values are known, and it gives a visualisation of the performance of the model. The metrics in the matrix are defined as true and false positives and true and false negatives.
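A minimal sketch of how these evaluation artefacts are produced with scikit-learn; y_true and y_pred below are small illustrative stand-ins for the test labels and the model's predictions, not the project's results.

    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

    y_true = [0, 0, 1, 2, 3, 4, 5, 6, 6, 4]
    y_pred = [0, 3, 1, 2, 3, 4, 5, 6, 6, 1]

    print(confusion_matrix(y_true, y_pred))        # counts per (true class, predicted class) pair
    print(classification_report(y_true, y_pred))   # precision, recall, F1 and support per class
    print("accuracy:", accuracy_score(y_true, y_pred))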

Figure 5.22

Confusion Matrix

Source – geeksforgeeks.org

The confusion matrix above is for a binary classification model, where Class 1 is positive and Class 2 is negative. The terms in the matrix are defined as:

TP (True Positive): Observation is positive and is predicted to be positive.

FN (False Negative): Observation is positive but predicted negative.

TN (True Negative): Observation is negative and predicted to be negative.

FP (False Positive): Observation is negative but predicted positive.

Using confusion matrix one can compute following metrics for further evaluation of the model.

Classification Accuracy Rate: it is calculated by dividing the number of correct predictions by the total number of predictions.

Accuracy Rate = (TP + TN) / (TP + TN + FP + FN)

The best accuracy rate is 1.0, whereas the worst is 0.0.

Precision: it answers the question, when the model predicts yes, how often is it correct? For each class it is defined as the ratio of true positives to the sum of true positives and false positives.

Precision = TP / (TP + FP)

Recall: it answers the question, when the class is actually yes, how often does the model predict yes? For each class it is defined as the ratio of true positives to the sum of true positives and false negatives.

Recall = TP / (TP + FN)

High Recall and Low Precision: most of the positive observations are correctly recognised (low FN), but there are a lot of false positives (high FP).

Low Recall and High Precision: the model misses a lot of positive examples, predicting them as negative (high FN), but those that are predicted as positive are indeed positive (low FP).

F-measure / F1 score: it is not easy to compare two models when one has high precision with low recall and the other low precision with high recall. To compare models we therefore use the F-score, which measures precision and recall at the same time. The F-measure uses the harmonic mean in place of the arithmetic mean, as it punishes extreme values more and stays closer to the smaller of precision and recall.

Figure 5.23

F- Measure formula

Source – towardsdatascience.com
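For reference, the F-measure shown in the figure is the harmonic mean of precision and recall:

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)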

For analysing and evaluating our models, as mentioned above, we used the accuracy score for each dataset (train, test and validation) on both models, RCNN and RNN, on weekdays (seven signs for seven days). Before applying the algorithms, the two pre-processing methods used (OpenPose and MediaPipe) were compared, and OpenPose came out as the more suitable approach, since MediaPipe was not able to detect arms and other body key-points as OpenPose does.

Analysis for Model 1: RCNN with ConvLSTM2D

Data was split into 3 parts: training, validation and testing datasets.

Table 5.2

Dataset Splits along with its percentage and number of videos

Datasets              Training    Validation    Testing
Split Percentage      66%         32%           2%
Number of videos      1019        496           35

Source – Author

66% of the data was used for training and 32% for validation, while 35 videos (about 2% of the data) were used for testing. Figure 5.24 shows the number of videos used for each class.

Figure 5.24

Number of videos used for training the model for each class

Source – Author
Confusion Matrix: we had 7 classes for the 7 days of the week, so the confusion matrix has N*N dimensions, i.e. a 7*7 matrix, with the left axis showing the true class and the bottom axis showing the class assigned to an item with respect to the true class. The diagonal elements are the correctly predicted gestures. For each gesture we had 5 test videos, coded as ‘Monday’: 0, ‘Tuesday’: 1, ‘Wednesday’: 2, ‘Thursday’: 3, ‘Friday’: 4, ‘Saturday’: 5 and ‘Sunday’: 6.

Figure 5.25

Computation of the confusion matrix for the RCNN model

Source - Author

In the confusion matrix we can see that for Monday the model predicted 4 videos correctly, except for one which was predicted to be Thursday although it was actually Monday. For Tuesday, Wednesday and Thursday the model predicted all 5 videos correctly. For Friday the model performed poorly, as only 2 videos out of 5 were predicted correctly and the other three were not. Based on the confusion matrix, accuracy was computed.

Accuracy on the test dataset came out to be 83%, as shown in figure 5.26.

As per the classification report, the weighted average accuracy is 82% for the RCNN model for Indian sign language detection.

Figure 5.26

Accuracy on test dataset for RCNN model

Source - Author

Figure 5.27

Classification Report for RCNN model

Source – Author

From the classification report we can see that ‘Tuesday’: 1 has been classified much better than the others in terms of F1-score, while ‘Friday’: 4 has the lowest F1-score and so sits lowest on the model’s classification scale. The weighted average accuracy for the model came out to be 83%.

Analysis for Model 2: RNN with LSTM

For this model the dataset was split into 2 parts, training and testing/validation; testing and validation are the same set for this model.

Table 5.3

Dataset Splits along with its percentage and number of videos

Datasets Training Testing/ Validation

Split Percentage 80% 20%

Number of videos 1246 312

Source – Author

80% of the data was used for training and the remaining 20% for testing, which comes to 1,246 videos in the training dataset and 312 videos in the testing dataset.

Figure 5.28

Number of videos used for training and testing the model for each class

Source – Author

Under this model each class is given a code: ‘Sunday’: 1, ‘Monday’: 2, ‘Tuesday’: 3, ‘Wednesday’: 4, ‘Thursday’: 5, ‘Friday’: 6, ‘Saturday’: 7. The confusion matrix again has 7*7 dimensions, similar to the RCNN model.

In the confusion matrix we can see that for ‘Sunday’ the model predicted 48 videos correctly out of 49, for ‘Monday’ 42 out of 50, for ‘Tuesday’, out of 59 videos, 77 were correctly recognised, for ‘Wednesday’ only 12 out of 40, for ‘Thursday’ 25 out of 41, for ‘Friday’ only 27 out of 40, and for ‘Saturday’ only 21 out of 33.

The diagonal of the confusion matrix depicts the correctly predicted videos. In the confusion matrix the left axis shows the true class, and the bottom axis shows the class assigned to an item with respect to the true class.

Figure 5.29

Confusion Matrix for RNN model

Source – Author

Figure 5.30

Accuracy on Test dataset for RNN model

Source- Author

Accuracy on the test dataset for the RNN model came out to be 74.36%. From the classification report we can see that ‘Tuesday’: 3 is at the top of the classification scale based on F1-score, which came to 0.95, and ‘Wednesday’: 4 is the lowest, at 0.44.

The final accuracy computed came to 74% for the RNN model with an LSTM layer.

Figure 5.31

Classification Report for RNN model

Source – Author

Final Interpretation: based on the accuracy score and the F1-score classification scale, the RCNN model with a ConvLSTM2D layer performed better than the RNN model with LSTM, as the accuracy for the RCNN model came out to be 83% while the RNN model reached 74%. The RNN model was trained and tested without hyper-parameter tuning, which is still in progress, whereas the RCNN model is our evolutionary prototype model, all set to be deployed in the next phase.

CHAPTER - 6

SUMMARY, CONCLUSION AND RECOMMENDATIONS

SUMMARY

Sign language is the visual language used by people with hearing and speech disabilities for their daily conversation and communication. It enables deaf and mute people to communicate during their daily routine: instead of speech or sound, gestures are transmitted visually by combining hand shapes, hand orientation and movement, arm and body movement and facial expressions to express the speaker’s thoughts. In an Indian Sign Language (ISL) gesture recognition system, both hands are used for performing gestures, and recognising sign language gestures from continuous gestures is a very challenging research issue. Our team came up with a solution using a key-frame extraction method. In this report the Sign Language Recognition System (SLRS) is taken as a human-computer interaction application in which sign language is converted to the text or voice of an oral language.

The proposed work had four main stages: data collection, data pre-processing, prototype modelling and, finally, model evaluation. Our major objective was to develop a prototype model for transliteration of the 7 weekday signs and to present a statistical evaluation of the results with the techniques used. For data pre-processing, the MediaPipe and OpenPose techniques were used; of the two, OpenPose proved handier for this research, whereas MediaPipe is carried forward to the recommendations for future scope since it takes less run time than OpenPose. OpenPose was used to detect body key-points and finally to give the algorithm its input as a JSON file. In the prototype modelling phase, two algorithms were used, RCNN with ConvLSTM2D and RNN with LSTM, giving average accuracy scores of 83% and 74% respectively. For RCNN with ConvLSTM2D the dataset was split in the ratio 66:32:2 for training, validation and testing respectively, and for RNN with LSTM the dataset was split 80:20 for training and testing only. As per the final interpretation, on the basis of accuracy score and the F1-score classification scale, the RCNN model with the ConvLSTM2D layer outperformed the RNN model with LSTM. A confusion matrix and classification report were used for the analysis. Finally, our first prototype model was developed and is all set for the deployment phase, which will be taken up as future scope of the project.

CONCLUSION AND RECOMMENDATIONS

Our team has demonstrated that the Indian Sign Language Recognition System is effective, considering the nature of the data used (videos) and the number of features considered. Sign language uses the visual-manual modality to convey meaning; it is a full-fledged natural language with its own grammar and lexicon. We built our dataset by recording videos for the seven weekdays, and later for greetings, at 720p and 30 fps (frames per second). Our prototype model works glove-free and with varying clothes and skin colours, since OpenPose and MediaPipe detect key-points of the body. For prototype modelling, RCNN and RNN were taken up as machine learning techniques, giving an average of 83% accuracy with hyper-parameter tuning and 74% accuracy for RNN without hyper-parameter tuning. Comparatively, RCNN outperformed, with reasonably high accuracy considering the number of features used.

In the near future, this project will be taken forward by the next team at Sabudh Foundation with the aim of achieving higher recognition rates, probably with a larger dataset. MediaPipe will also be explored further for the next evolutionary prototype model, and after the needed changes this evolutionary prototype will finally be deployed as a mobile application. Another aspect to be taken up is building a real-time Indian Sign Recognition model on the same dataset used earlier.

Recommendation:

1. Trying new algorithms under the umbrella of RCNN and RNN machine learning.
2. Adding new datasets according to the need of the hour.
3. Trying a real-time sign recognition prototype.
4. Working on sentence formation along with gesture recognition.

Bibliography

James P. Kramer, Larry J. Leifer (1990). A 'talking glove' for nonverbal deaf individuals. https://www.semanticscholar.org/paper/A-''''talking-glove''''-for-nonverbal-deaf-Kramer-Leifer/152d83f35b17a1e8b7782138d4c806fb53d190d4

Kouichi Murakami, Hitomi Taguchi (1991). Gesture recognition using recurrent neural networks. https://www.semanticscholar.org/paper/Gesture-recognition-using-recurrent-neural-networks-Murakami-Taguchi/d6d5da11a092dbbbf240273c988b39c824ff791c

Takahashi and Kishino (1991). Recognition of sign language gestures using neural networks. https://www.researchgate.net/publication/2309382_Recognition_of_Sign_Language_Gestures_Using_Neural_Networks

Thad Starner (1995). Visual recognition of American Sign Language using Hidden Markov Models. https://www.researchgate.net/publication/33835956_Visual_Recognition_of_American_Sign_Language_Using_Hidden_Markov_Models

Rung Huei Liang and Ming Ouhyoung (1998). Interactive hand pose estimation using a stretch-sensing soft glove. https://cims.nyu.edu/gcl/papers/2019-Glove.pdf

Bungeroth, Stein, Dreuw, Zahedi and Ney (2005). A German sign language corpus of the domain weather report. http://www.lrec-conf.org/proceedings/lrec2006/pdf/673_pdf.pdf

David Rybach (2006). Appearance-based features for automatic continuous sign language recognition. http://thomas.deselaers.de/teaching/files/rybach_diploma.pdf

Mohamed Ahmed Mohandes, S.I. Quadri, Mohamed Deriche (2007). Arabic Sign Language recognition: an image-based approach. https://www.researchgate.net/publication/4250210_Arabic_Sign_Language_Recognition_an_Image-Based_Approach

Hongmo Je, Jiman Kim and Daijin Kim (2007). Vision-based hand gesture recognition for understanding musical time pattern and tempo. https://www.researchgate.net/publication/224307321_Vision-Based_Hand_Gesture_Recognition_for_Understanding_Musical_Time_Pattern_and_Tempo

Chen Qing, Nicolas D. Georganas and Emil M. Petriu (2007). Hand gesture recognition using Haar-like features and a stochastic context-free grammar. http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=0F2EE010D9481584DFC1ECDADD30814A?doi=10.1.1.384.1888&rep=rep1&type=pdf

Al-Roussan, Khaled Assaleh and A. Tala'a (2009). Video-based signer-independent Arabic sign language recognition using hidden Markov models. https://www.researchgate.net/publication/223400372_Video-based_signer-independent_Arabic_sign_language_recognition_using_hidden_Markov_models

Kim and Cipolla (2009). Canonical correlation analysis of video volume tensors for action categorization and detection. http://mi.eng.cam.ac.uk/~cipolla/publications/article/2008-PAMI-CCA-action-recognition.pdf

Lahamy and Litchi (2009). Real-time hand gesture recognition using range cameras. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.222.3074&rep=rep1&type=pdf

Youssif, Hamdy Ali and Aboutabl (2011). Arabic Sign Language (ArSL) recognition system using HMM. https://www.researchgate.net/publication/267410465_Arabic_Sign_Language_ArSL_Recognition_System_Using_HMM

Patil, Pendharkar and Gaikwad (2014). American Sign Language detection. http://www.ijsrp.org/research-paper-1114/ijsrp-p3566.pdf

Ahmed and Saleh Aly (2014). Arabic sign language recognition using spatio-temporal local binary patterns and support vector machine. https://www.researchgate.net/publication/269763215_Arabic_Sign_Language_Recognition_Using_Spatio-Temporal_Local_Binary_Patterns_and_Support_Vector_Machine

Ibrahim, Selim and Zayed (2017). An automatic Arabic sign language recognition system (ArSLRS). https://www.sciencedirect.com/science/article/pii/S1319157817301775

Vishisht M. Tiwari (2017). Android application for sign language recognition. https://www.slideshare.net/VishishtTiwari/android-application-for-american-sign-language-recognition

Suharjito, Ricky Anderson, Fanny Wiryana, Meita Chandra Ariesta and Kusuma (2017). Sign language recognition application systems for deaf-mute people: a review based on input-process-output. https://www.researchgate.net/publication/320402323_Sign_Language_Recognition_Application_Systems_for_Deaf-Mute_People_A_Review_Based_on_Input-Process-Output

Yangho-Ji, Sunmok Kim, Young-Joo Kim and Ki-Baek Lee (2018). Human-like sign language learning method using deep learning. https://onlinelibrary.wiley.com/doi/full/10.4218/etrij.2018-0066

Dias, Fontenot, Grant, Henson and Sood (2019). American Sign Language gesture recognition. https://towardsdatascience.com/american-sign-language-hand-gesture-recognition-f1c4468fb177

Joyeeta Singha and Karen Das (2013). Recognition of Indian Sign Language in live video. https://www.researchgate.net/publication/237054175_Recognition_of_Indian_Sign_Language_in_Live_Video

Tripathi, Baranwal and Nandi (2015). Continuous Indian sign language gesture recognition and sentence formation. https://www.researchgate.net/publication/283184719_Continuous_Indian_Sign_Language_Gesture_Recognition_and_Sentence_Formation

Yogeshwar Ishwar Rokade and Prashant Jadav (2017). Indian sign language recognition system. https://www.researchgate.net/publication/318656956_Indian_Sign_Language_Recognition_System

Rishabh Gupta (2018). Indian-Sign-Language-Recognition. https://github.com/imRishabhGupta/Indian-Sign-Language-Recognition/commits

https://github.com/imRishabhGupta/Indian-Sign-Language-Recognition

Continuous Indian Sign Language Gesture Recognition and Sentence Formation

Sudha.pdf

(PDF) American Sign Language Recognition System: An Optimal Approach

An Automatic Arabic Sign Language Recognition System (ArSLRS) - ScienceDirect

(PDF) Arabic Sign Language (ArSL) Recognition System Using HMM

Prototyping Model in Software Engineering: Methodology, Process, Approach

Human pose estimation using OpenPose with TensorFlow (Part 1)

Preparing datasets for machine learning | machine learning |

Keras Documentation

Data Preprocessing: Concepts - Towards Data Science

1912-ROS-wrapper-for-real-time-multi-person-pose-estimation-with-a-single-camera.pdf

Recurrent Neural Networks and LSTM explained - purnasai gudikandula - Medium

https://github.com/google/mediapipe/blob/master/mediapipe/docs/hand_tracking_mobile_gpu.md

https://mediapipe.readthedocs.io/en/latest/

https://www.learnopencv.com/tag/openpose/

https://github.com/CMU-Perceptual-Computing-Lab/openpose

https://github.com/ildoonet/tf-pose-estimation

https://github.com/aarac/openpose

