CLASSIFICATION TECHNIQUES

Vinothini A 1, Prathiksha M 2, Padmashree J 3

Computer Science and Engineering Department, Rajalakshmi Engineering College,
Rajalakshmi Nagar, Thandalam, Chennai.

1 vinothini.a@rajalakshmi.edu.in
2 prathiksha.m.2018.cse@rajalakshmi.edu.in
3 padmashree.j.2018.cse@rajalakshmi.edu.in
Abstract— The goal of this study is to conduct a comparative experimental evaluation of computer vision-based techniques for sign language recognition. A thorough experimental investigation has been carried out by looking at the most promising machine learning and deep neural network approaches in this field. Each of the papers in this study has its own set of advantages and disadvantages. Hand gestures are the most popular means of communication for the speech and hearing-impaired population to transmit their thoughts to regular people in public places, and the ordinary community finds it difficult to interpret the conveyed information. This problem can be solved by developing a real-time hand gesture recognition system that converts sign language to text on a word-by-word basis. Vision-based sensors, motion-based sensors, image recognition techniques, object detection algorithms, and other methods are used to accomplish this. The goal of this study is to focus on various methods for classifying sign language.

Keywords— computer vision, sign language recognition, deep neural networks, image recognition

I. INTRODUCTION

Touch, signalling, and even smell are used by every species in the world to interact with their counterparts. In the case of humans, it is speech. For deaf and dumb individuals, however, this is not the case. Signing is the only means of communication for the deaf. For all those who rely on it to give wings to their thoughts, it is a sign of support, cultural and linguistic identity. People with hearing and speech difficulties use sign language all around the world. Each country has its own sign language, such as American sign language (ASL), British sign language (BSL), Chinese sign language (CSL), French sign language (LSF), Indian sign language (ISL), and so on. The unfortunate aspect of sign language is that it is unknown to the general public and has never been viewed as a skill to be learned. There are around 63 million persons in India who are deaf or have speech difficulties. Although sign language interpreters can translate sign language, their scarcity is a drawback: only 250 qualified sign language interpreters are available to assist a deaf population of 7 million people, and others lack the means to use these interpreters. Hand movements, facial expressions, and body language make up sign language. Machine learning, deep learning, and artificial intelligence (AI) can considerably assist in bridging the gap and allowing these impaired persons to communicate with others more readily.

II. METHODOLOGY

This study examines and reviews the many methods used for recognising and translating sign language, for improved comprehension.

A. SENSORS

Dynamic hand gesture recognition can be either vision based or motion based. Motion based hand gesture recognition can be achieved easily using sensors like the Leap Motion sensor, data gloves and the Microsoft Kinect sensor. Usually, techniques like SVM, HMM or neural networks are used on the data collected by the sensors for classification. Glove-based sensors and vision-based sensors are the two major types of sensors being used for hand gesture recognition [14].

1) Leap Motion Sensor: A Leap Motion Controller with a two-layer Bidirectional Recurrent Neural Network was introduced in [16]. As its name suggests, the Leap Motion controller is used for motion based dynamic gesture recognition; it also has infrared cameras. The stages are: feature extraction, data collection and processing, and then a Bidirectional Recurrent Neural Network (BRNN). The Leap Motion sensor can note details about the hand, palm and wrist if kept at a distance of 25 mm to 600 mm from the sensor, hence extracting the hand features. About 26 features related to angles, positions, and distances between fingers are extracted and used. The Leap Motion sensor does not capture images but records the positions and features in the form of values, helping it achieve better prediction levels. During data collection, the binocular RGB high-definition camera comes into play as it filters images and eliminates the background. It is important to specify the start and end of the gesture during data collection. Inconsistent shapes and range sizes due to interference from external elements are eliminated using this preprocessing method. A bidirectional recurrent neural network is used to train and predict hand gestures. An LSTM provides input to an RNN; a single directional RNN can be either forward or backward, and here the forward and backward structures are combined to form a bidirectional RNN. The model is trained with various ASL datasets using the cross entropy loss function, Adam gradient descent and variable learning rates. The model is verified using 5-fold, 10-fold and leave-one-out cross validation techniques, and an average accuracy of 98% is obtained.
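As a rough illustration of the kind of two-layer bidirectional recurrent network described in [16], the following Python sketch classifies gesture sequences of 26 per-frame Leap Motion features with a bidirectional LSTM trained using cross-entropy loss and the Adam optimiser. The hidden size, sequence length, class count and dummy tensors are illustrative assumptions, not values taken from [16].

import torch
import torch.nn as nn

class BiLSTMGestureClassifier(nn.Module):
    # Two stacked bidirectional LSTM layers over per-frame Leap Motion
    # features (26 angles/positions/distances), then a linear classifier.
    def __init__(self, n_features=26, hidden=64, n_layers=2, n_classes=26):
        super().__init__()
        self.rnn = nn.LSTM(n_features, hidden, num_layers=n_layers,
                           batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)  # 2x: forward + backward

    def forward(self, x):              # x: (batch, frames, n_features)
        out, _ = self.rnn(x)           # (batch, frames, 2 * hidden)
        return self.fc(out[:, -1, :])  # logits from the final time step

model = BiLSTMGestureClassifier()
criterion = nn.CrossEntropyLoss()                           # cross entropy loss, as in [16]
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam optimiser

x = torch.randn(8, 40, 26)                 # dummy batch: 8 gestures, 40 frames each
y = torch.randint(0, 26, (8,))             # dummy gesture labels
loss = criterion(model(x), y)
loss.backward()
optimizer.step()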
2) Data Gloves: H. S. Anupama et al. [8] introduce how
physical properties like gloves can be used to recognize sign
language. A number of sensors are sewed onto a glove which
is connected to a computer system. The sensors are connected
to an Arduino UNO board through the breadboard using
jumper wires. There are 5 Flex sensors and aluminium foil is
used as a contact sensor. Flex and contact sensors collect
values of the finger position, the bend of fingers and contact
between fingers. A gyroscope senses rotation. This setup is
used and the input data set is collected in the form of
numerical values. Hence, no images are used. The model is
trained using KNN algorithm with k value as 4. It could detect
alphabets, numbers and certain basic words. The output is
displayed in the form of text on the monitor, and a Google
speech-to-text API is used as well. The accuracy achieved here
is 93%. Even though this method is cheaper than using a
Kinect camera, not everyone can use a physical device
connected to a computer all the time to translate sign
language.

Fig. 1 Data glove method
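A minimal sketch of the classification step described in [8] is given below, assuming each glove reading has already been flattened into one numeric vector of flex-sensor, contact-sensor and gyroscope values. The feature count, label set and random data are placeholders; only the choice of KNN with k = 4 comes from [8].

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder training data: each row stands for one glove reading
# (flex-sensor bends, contact flags, gyroscope rotation), already numeric.
X_train = np.random.rand(200, 9)
y_train = np.random.randint(0, 26, 200)    # e.g. one class per letter

knn = KNeighborsClassifier(n_neighbors=4)  # k = 4, as reported in [8]
knn.fit(X_train, y_train)

new_reading = np.random.rand(1, 9)         # one live reading from the glove
print(knn.predict(new_reading)[0])         # predicted sign, shown as text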
3) RGB-D Vision based Camera: Vision-based techniques utilise only a camera to provide human–computer interaction without the use of any other equipment. The standard camera (RGB)-based systems and the depth sensor (RGB-D)-based systems are the two different forms of vision-based sensors. [14] utilises the Microsoft Kinect v2, an RGB-D camera, to capture the gestures and a 3D Convolutional Neural Network to train and test the data. The centre of the palm and the hand region of interest are retrieved from depth data given by the Kinect skeletal tracker, then converted to binary images. A border-tracing technique is used to extract and characterise the hand shapes. Median filtering and morphological processing are used to reduce unwanted noise. The hand contours are generated using the Moore–Neighbor technique once the binary pictures of the hand areas have been discovered. The K-cosine corner detection algorithm computes the fingertip points based on the coordinate values of the identified hand contours after retrieving the hand contour. The 3D CNN model achieves 92.6% accuracy, while SVM and CNN achieve 60.50% and 64.28%, respectively. This paper also concludes that the ensemble learning method outperforms the single 3D CNN in terms of video classification, with an ensemble of 15 3D CNN models achieving 97.12% accuracy.

Fig. 2 Sensor based techniques
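To make the 3D CNN idea concrete, here is a toy Python/PyTorch model that classifies short clips of binarised hand frames. The clip length, frame size, filter counts and class count are illustrative assumptions, not the architecture used in [14]; the ensemble in [14] would combine the votes of 15 such models.

import torch
import torch.nn as nn

class Small3DCNN(nn.Module):
    # Two 3D convolution blocks over clips shaped (batch, 1, frames, H, W),
    # then a linear classifier on the flattened feature volume.
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                       # halves time and space
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
        )
        self.classifier = nn.Linear(32 * 4 * 16 * 16, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

clips = torch.randn(2, 1, 16, 64, 64)   # 2 dummy clips: 16 frames of 64x64
print(Small3DCNN()(clips).shape)        # torch.Size([2, 10])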
B. MACHINE LEARNING

1) Data Pre-processing: K. Shenoy et al. [1] proposed that, for eliminating background details, the pre-processing begins with face removal and stabilisation using HOG features and a linear SVM classifier to reduce false positive rates, and skin colour segmentation using the YUV and RGB colour spaces, followed by morphological operations to minimise noise. Lastly, a grid-based fragmentation technique was used for feature extraction. The benefit of this method is that the characteristics created change depending on the orientation of each hand posture.
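The skin colour segmentation and morphological clean-up steps can be sketched with OpenCV roughly as below. The YUV threshold values, kernel size and file name are assumptions for illustration only; the face removal with HOG features and a linear SVM, and the grid-based fragmentation used in [1], are omitted here.

import cv2
import numpy as np

def segment_skin(frame_bgr):
    # Convert to YUV and keep pixels whose chrominance falls inside a rough
    # skin range, then clean the mask with morphological opening and closing.
    yuv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YUV)
    lower = np.array([0, 85, 135], dtype=np.uint8)     # illustrative bounds,
    upper = np.array([255, 130, 180], dtype=np.uint8)  # not the paper's values
    mask = cv2.inRange(yuv, lower, upper)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # remove specks
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill small holes
    return mask

frame = cv2.imread("gesture_frame.jpg")   # hypothetical input frame
if frame is not None:
    hand_region = cv2.bitwise_and(frame, frame, mask=segment_skin(frame))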
In [2], stages such as dataset collection, segmentation and feature extraction are involved. An image dataset of 9 numbers and 26 English alphabets is collected. After dataset collection, image pre-processing involves various stages: the images are resized, converted to grayscale from RGB, median blur is applied, skin masking and detection are done, Canny edge detection is used to detect sharp edges in an image, and SURF (speeded up robust features) is used for feature extraction.
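A compressed OpenCV version of this pre-processing chain might look as follows. The target size, median-blur aperture and Canny thresholds are illustrative guesses, and the skin-masking and SURF steps are omitted (SURF ships only with the opencv-contrib build).

import cv2

def preprocess(image_bgr, size=(128, 128)):
    # Resize, convert to grayscale, median-blur, then detect sharp edges.
    resized = cv2.resize(image_bgr, size)
    gray = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)
    blurred = cv2.medianBlur(gray, 5)        # 5x5 median filter
    return cv2.Canny(blurred, 50, 150)       # edge map of the hand

img = cv2.imread("sign_A_001.jpg")           # hypothetical dataset image
if img is not None:
    edges = preprocess(img)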
Subhalaxmi et al. [19] created two self-made datasets, one for single handed gestures and another for double handed gestures. The captured image is stored in the form of coordinate values for each landmark in a CSV file; hence, the image data is stored in the form of numbers and labels. For the double handed dataset, the Euclidean distances between the corresponding landmarks of the two hands are also required. The Mediapipe API is a high resolution finger and hand tracking and mapping tool which provides 21 3D landmarks on the hand and palm (numbered 0 to 20). This is used for hand detection in the webcam feed, and the obtained datasets are trained using various machine learning models such as an SVM model, a Random Forest classifier, a KNN classifier, a Decision Tree, Naive Bayes and Logistic Regression. [19]
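A small sketch of how the 21 Mediapipe hand landmarks can be turned into the numeric feature vectors and inter-hand distances described in [19] is shown below. The function names are mine, the webcam loop and CSV writing are omitted, and the single-hand versus two-hand handling is simplified.

import cv2
import mediapipe as mp
import numpy as np

mp_hands = mp.solutions.hands

def landmark_vector(image_bgr):
    # Return the 21 (x, y, z) landmarks of the first detected hand as a
    # flat 63-value vector, or None if no hand is found.
    with mp_hands.Hands(static_image_mode=True, max_num_hands=2) as hands:
        results = hands.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None
    hand = results.multi_hand_landmarks[0]
    return np.array([[p.x, p.y, p.z] for p in hand.landmark]).flatten()

def paired_distances(hand_a, hand_b):
    # Euclidean distance between matching landmarks of two hands, the extra
    # features used for the double handed dataset in [19].
    a, b = hand_a.reshape(21, 3), hand_b.reshape(21, 3)
    return np.linalg.norm(a - b, axis=1)     # 21 distances

Vectors produced this way can be written to the CSV file and fed to the SVM, Random Forest, KNN and other classifiers listed above.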
These various pre-processing techniques prepare the data to be trained using different machine learning algorithms for classification.

2) Classification: In [1], for classification, an algorithm that can effectively identify clustered data is required. The K-Nearest Neighbour (K-NN) algorithm was discovered to be suitable for this type of data distribution. Using the previously stated grid-based fragmentation, hand features from each frame are recovered in real-time. The Hidden Markov Model (HMM) is used to handle the variations in dynamic gestures.
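The HMM stage can be sketched with the hmmlearn package: one Gaussian HMM is fitted per dynamic gesture, and an unseen feature sequence is assigned to the gesture whose model scores it highest. This is a generic HMM recipe under assumed state counts, not the exact configuration used in [1].

import numpy as np
from hmmlearn import hmm

def train_gesture_hmms(sequences_by_gesture, n_states=5):
    # sequences_by_gesture: {gesture label: list of (frames x features) arrays}
    models = {}
    for gesture, seqs in sequences_by_gesture.items():
        X = np.vstack(seqs)                    # all frames stacked together
        lengths = [len(s) for s in seqs]       # per-sequence frame counts
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=50)
        m.fit(X, lengths)
        models[gesture] = m
    return models

def classify(models, sequence):
    # Pick the gesture whose HMM gives the highest log-likelihood.
    return max(models, key=lambda g: models[g].score(sequence))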
3) Results: After classification, the result is sent back to the user as text. Although this method [1] provides high accuracy for real-time gesture recognition, it can only detect single-handed gestures. It has a 99.7% accuracy rate in classifying all 33 ISL hand poses. With an average accuracy of 97.23%, the algorithm was also able to classify 12 motions.
Currently, skin colour segmentation is used to extract the hand
gestures from each frame. For accurate recognition, the
subject must be wearing a full-sleeved shirt which is not
feasible in all situations. This method also necessitates that the
lighting be ideal – neither too dark nor too bright.
The worst performers are Naive Bayes and Logistic Regression,
while the best ML model is the Support Vector Machine, which
gave an accuracy of 99%. The significant advantage is that
there are no background restrictions and it can be used in the
future in a smartphone as well, because of lower
computational complexity.[19]
On the screen, the output is shown as text. After testing the
models in real-time live recognition, the SVM classifier
achieved an accuracy of 99.5%, CNN produced an accuracy of
88.89% and RNN produced a maximum average testing
accuracy of 82.3%.[2]
Precision: 94.88%
Recall: 98.66%