CLASSIFICATION TECHNIQUES

Vinothini A 1, Prathiksha M 2, Padmashree J 3

Computer Science and Engineering Department, Rajalakshmi Engineering College,
Rajalakshmi Nagar, Thandalam, Chennai.

1 vinothini.a@rajalakshmi.edu.in
2 prathiksha.m.2018.cse@rajalakshmi.edu.in
3 padmashree.j.2018.cse@rajalakshmi.edu.in
Abstract— The goal of this study is to conduct a comparative experimental evaluation of computer vision-based techniques for sign language recognition. A thorough experimental investigation has been carried out by looking at the most promising machine learning and deep neural network approaches in this field. Each of the papers in this study has its own set of advantages and disadvantages. Hand gestures are the most popular means of communication for the speech and hearing-impaired population to transmit their thoughts to regular people in public places, and the ordinary community finds it difficult to interpret the conveyed information. This problem can be solved by developing a real-time hand gesture recognition system that converts sign language to text on a word-by-word basis. Vision-based sensors, motion-based sensors, image recognition techniques, object detection algorithms, and other methods are used to accomplish this. The goal of this study is to focus on various methods for classifying sign language.

Keywords— computer vision, sign language recognition, deep neural networks, image recognition

I. INTRODUCTION

Touch, signalling, and even smell are used by every species in the world to interact with their counterparts. In the case of humans, it is speech. For deaf and dumb individuals, however, this is not the case. Signing is the only means of communication for the deaf. For all those who rely on it to give wings to their thoughts, it is a sign of support, cultural and linguistic identity. People with hearing and speech difficulties use sign language all around the world. Each country has its own sign language, such as American sign language (ASL), British sign language (BSL), Chinese sign language (CSL), French sign language (LSF), Indian sign language (ISL), and so on. The unfortunate aspect of sign language is that it is unknown to the general public and has never been viewed as a skill to be learned. There are around 63 million persons in India who are deaf or have speech difficulties. Although sign language interpreters can translate sign language, their scarcity is a drawback: only 250 qualified sign language interpreters are available to assist a deaf population of 7 million people, and others lack the means to use these interpreters. Hand movements, facial expressions, and body language make up sign language. Machine learning, deep learning, and artificial intelligence (AI) can considerably assist in bridging the gap and allowing these impaired persons to communicate with others more readily.

II. METHODOLOGY

This study examines and reviews the many methods used for recognising and translating sign language, for improved comprehension.

A. SENSORS

Dynamic hand gesture recognition can be either vision based or motion based. Motion based hand gesture recognition can be achieved easily using sensors like the Leap Motion sensor, data gloves and the Microsoft Kinect sensor. Usually, techniques like SVM, HMM or neural networks are used on the data collected by the sensors for classification. Glove-based sensors and vision-based sensors are the two major types of sensors being used for hand gesture recognition [14].

1) Leap Motion Sensor: A Leap Motion Controller with a two-layer Bidirectional Recurrent Neural Network was introduced in [16]. As its name suggests, the Leap Motion controller is used for motion based dynamic gesture recognition; it also has infrared cameras. The stages are: feature extraction, data collection and processing, and then a Bidirectional Recurrent Neural Network (BRNN). The Leap Motion sensor can note details about the hand, palm and wrist if kept at a distance of 25 mm to 600 mm from the sensor, hence extracting the hand features. About 26 features related to angles, positions, and distances between fingers are extracted and used. The Leap Motion sensor does not capture images but records the positions and features in the form of values, helping it achieve better prediction levels. During data collection, the binocular RGB high-definition camera comes into play as it filters images and eliminates the background. It is important to specify the start and end of the gesture during data collection. Inconsistent shapes and range sizes due to interference from external elements are eliminated using this preprocessing method. A bidirectional recurrent neural network is used to train and predict hand gestures. An LSTM provides input to an RNN; a single directional RNN can be either forward or backward, and here the forward and backward structures are combined to form a bidirectional RNN. The model is trained with various ASL datasets using the cross entropy loss function, Adam gradient descent and variable learning rates. The model is verified using 5-fold, 10-fold and leave-one-out cross validation techniques, and an average accuracy of 98% is obtained.
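As a rough illustration of the kind of two-layer bidirectional recurrent network described in [16], the following Python sketch classifies gesture sequences of 26 per-frame Leap Motion features with a bidirectional LSTM trained using cross-entropy loss and the Adam optimiser. The hidden size, sequence length, class count and dummy tensors are illustrative assumptions, not values taken from [16].

import torch
import torch.nn as nn

class BiLSTMGestureClassifier(nn.Module):
    # Two stacked bidirectional LSTM layers over per-frame Leap Motion
    # features (26 angles/positions/distances), then a linear classifier.
    def __init__(self, n_features=26, hidden=64, n_layers=2, n_classes=26):
        super().__init__()
        self.rnn = nn.LSTM(n_features, hidden, num_layers=n_layers,
                           batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)  # 2x: forward + backward

    def forward(self, x):              # x: (batch, frames, n_features)
        out, _ = self.rnn(x)           # (batch, frames, 2 * hidden)
        return self.fc(out[:, -1, :])  # logits from the final time step

model = BiLSTMGestureClassifier()
criterion = nn.CrossEntropyLoss()                           # cross entropy loss, as in [16]
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam optimiser

x = torch.randn(8, 40, 26)                 # dummy batch: 8 gestures, 40 frames each
y = torch.randint(0, 26, (8,))             # dummy gesture labels
loss = criterion(model(x), y)
loss.backward()
optimizer.step()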
2) Data Gloves: H. S. Anupama et al. [8] introduce how
physical properties like gloves can be used to recognize sign
language. A number of sensors are sewed onto a glove which
is connected to a computer system. The sensors are connected
to an Arduino UNO board through the breadboard using
jumper wires. There are 5 Flex sensors and aluminium foil is
used as a contact sensor. Flex and contact sensors collect
values of the finger position, the bend of fingers and contact
between fingers. A gyroscope senses rotation. This setup is
used and the input data set is collected in the form of
numerical values. Hence, no images are used. The model is
trained using KNN algorithm with k value as 4. It could detect
alphabets, numbers and certain basic words. The output is
displayed in the form of text on the monitor, and a Google
speech-to-text API is used as well. The accuracy achieved here
is 93%. Even though this method is cheaper than using a
Kinect camera, not everyone can use a physical device
connected to a computer all the time to translate sign
language.

Fig. 1 Data glove method
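A minimal sketch of the classification step described in [8] is given below, assuming each glove reading has already been flattened into one numeric vector of flex-sensor, contact-sensor and gyroscope values. The feature count, label set and random data are placeholders; only the choice of KNN with k = 4 comes from [8].

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder training data: each row stands for one glove reading
# (flex-sensor bends, contact flags, gyroscope rotation), already numeric.
X_train = np.random.rand(200, 9)
y_train = np.random.randint(0, 26, 200)    # e.g. one class per letter

knn = KNeighborsClassifier(n_neighbors=4)  # k = 4, as reported in [8]
knn.fit(X_train, y_train)

new_reading = np.random.rand(1, 9)         # one live reading from the glove
print(knn.predict(new_reading)[0])         # predicted sign, shown as text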
3) RGB-D Vision based Camera: Vision-based techniques utilise only a camera to provide human–computer interaction without the use of any other equipment. The standard camera (RGB)-based systems and the depth sensor (RGB-D)-based systems are the two different forms of vision-based sensors. [14] utilises the Microsoft Kinect v2, an RGB-D camera, to capture the gestures and a 3D Convolutional Neural Network to train and test the data. The centre of the palm and the hand region of interest are retrieved from depth data given by the Kinect skeletal tracker, then converted to binary images. A border-tracing technique is used to extract and characterise the hand shapes. Median filtering and morphological processing are used to reduce unwanted noise. The hand contours are generated using the Moore–Neighbor technique once the binary pictures of the hand areas have been discovered. The K-cosine corner detection algorithm computes the fingertip points based on the coordinate values of the identified hand contours after retrieving the hand contour. The 3D CNN model achieves 92.6% accuracy, while SVM and CNN achieve 60.50% and 64.28%, respectively. This paper also concludes that the ensemble learning method outperforms the single 3D CNN in terms of video classification, with an ensemble of 15 3D CNN models achieving 97.12% accuracy.

Fig. 2 Sensor based techniques
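To make the 3D CNN idea concrete, here is a toy Python/PyTorch model that classifies short clips of binarised hand frames. The clip length, frame size, filter counts and class count are illustrative assumptions, not the architecture used in [14]; the ensemble in [14] would combine the votes of 15 such models.

import torch
import torch.nn as nn

class Small3DCNN(nn.Module):
    # Two 3D convolution blocks over clips shaped (batch, 1, frames, H, W),
    # then a linear classifier on the flattened feature volume.
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                       # halves time and space
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
        )
        self.classifier = nn.Linear(32 * 4 * 16 * 16, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

clips = torch.randn(2, 1, 16, 64, 64)   # 2 dummy clips: 16 frames of 64x64
print(Small3DCNN()(clips).shape)        # torch.Size([2, 10])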
B. MACHINE LEARNING

1) Data Pre-processing: K. Shenoy et al. [1] proposed that, for eliminating background details, the pre-processing begins with face removal and stabilisation using HOG features and a linear SVM classifier to reduce false positive rates, and skin colour segmentation using the YUV and RGB colour spaces, followed by morphological operations to minimise noise. Lastly, a grid-based fragmentation technique was used for feature extraction. The benefit of this method is that the characteristics created change depending on the orientation of each hand posture.
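The skin colour segmentation and morphological clean-up steps can be sketched with OpenCV roughly as below. The YUV threshold values, kernel size and file name are assumptions for illustration only; the face removal with HOG features and a linear SVM, and the grid-based fragmentation used in [1], are omitted here.

import cv2
import numpy as np

def segment_skin(frame_bgr):
    # Convert to YUV and keep pixels whose chrominance falls inside a rough
    # skin range, then clean the mask with morphological opening and closing.
    yuv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YUV)
    lower = np.array([0, 85, 135], dtype=np.uint8)     # illustrative bounds,
    upper = np.array([255, 130, 180], dtype=np.uint8)  # not the paper's values
    mask = cv2.inRange(yuv, lower, upper)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # remove specks
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill small holes
    return mask

frame = cv2.imread("gesture_frame.jpg")   # hypothetical input frame
if frame is not None:
    hand_region = cv2.bitwise_and(frame, frame, mask=segment_skin(frame))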
In [2], stages such as dataset collection, segmentation and feature extraction are involved. An image dataset of 9 numbers and 26 English alphabets is collected. After dataset collection, image pre-processing involves various stages: the images are resized, converted to grayscale from RGB, median blur is applied, skin masking and detection are done, Canny edge detection is used to detect sharp edges in an image, and SURF (speeded up robust features) is used for feature extraction.
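A compressed OpenCV version of this pre-processing chain might look as follows. The target size, median-blur aperture and Canny thresholds are illustrative guesses, and the skin-masking and SURF steps are omitted (SURF ships only with the opencv-contrib build).

import cv2

def preprocess(image_bgr, size=(128, 128)):
    # Resize, convert to grayscale, median-blur, then detect sharp edges.
    resized = cv2.resize(image_bgr, size)
    gray = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)
    blurred = cv2.medianBlur(gray, 5)        # 5x5 median filter
    return cv2.Canny(blurred, 50, 150)       # edge map of the hand

img = cv2.imread("sign_A_001.jpg")           # hypothetical dataset image
if img is not None:
    edges = preprocess(img)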
Subhalaxmi et al. [19] created two self-made datasets, one for single handed gestures and another for double handed gestures. The captured image is stored in the form of coordinate values for each landmark in a CSV file; hence, the image data is stored in the form of numbers and labels. For the double handed dataset, the Euclidean distances between the corresponding landmarks of the two hands are also required. The Mediapipe API is a high resolution finger and hand tracking and mapping tool which provides 21 3D landmarks on the hand and palm (numbered 0 to 20). This is used for hand detection in the webcam feed, and the obtained datasets are trained using various machine learning models such as an SVM model, a Random Forest classifier, a KNN classifier, a Decision Tree, Naive Bayes and Logistic Regression. [19]
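A small sketch of how the 21 Mediapipe hand landmarks can be turned into the numeric feature vectors and inter-hand distances described in [19] is shown below. The function names are mine, the webcam loop and CSV writing are omitted, and the single-hand versus two-hand handling is simplified.

import cv2
import mediapipe as mp
import numpy as np

mp_hands = mp.solutions.hands

def landmark_vector(image_bgr):
    # Return the 21 (x, y, z) landmarks of the first detected hand as a
    # flat 63-value vector, or None if no hand is found.
    with mp_hands.Hands(static_image_mode=True, max_num_hands=2) as hands:
        results = hands.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None
    hand = results.multi_hand_landmarks[0]
    return np.array([[p.x, p.y, p.z] for p in hand.landmark]).flatten()

def paired_distances(hand_a, hand_b):
    # Euclidean distance between matching landmarks of two hands, the extra
    # features used for the double handed dataset in [19].
    a, b = hand_a.reshape(21, 3), hand_b.reshape(21, 3)
    return np.linalg.norm(a - b, axis=1)     # 21 distances

Vectors produced this way can be written to the CSV file and fed to the SVM, Random Forest, KNN and other classifiers listed above.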
These various pre-processing techniques prepare the data to be trained using different machine learning algorithms for classification.

2) Classification: In [1], for classification, an algorithm that can effectively identify clustered data is required. The K-Nearest Neighbour (K-NN) algorithm was discovered to be suitable for this type of data distribution. Using the previously stated grid-based fragmentation, hand features from each frame are recovered in real-time. The Hidden Markov Model (HMM) is used to handle the variations in dynamic gestures.
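The HMM stage can be sketched with the hmmlearn package: one Gaussian HMM is fitted per dynamic gesture, and an unseen feature sequence is assigned to the gesture whose model scores it highest. This is a generic HMM recipe under assumed state counts, not the exact configuration used in [1].

import numpy as np
from hmmlearn import hmm

def train_gesture_hmms(sequences_by_gesture, n_states=5):
    # sequences_by_gesture: {gesture label: list of (frames x features) arrays}
    models = {}
    for gesture, seqs in sequences_by_gesture.items():
        X = np.vstack(seqs)                    # all frames stacked together
        lengths = [len(s) for s in seqs]       # per-sequence frame counts
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=50)
        m.fit(X, lengths)
        models[gesture] = m
    return models

def classify(models, sequence):
    # Pick the gesture whose HMM gives the highest log-likelihood.
    return max(models, key=lambda g: models[g].score(sequence))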
3) Results: After classification, the result is sent back to the user as text. Although this method [1] provides high accuracy for real-time gesture recognition, it can only detect single-handed gestures. It has a 99.7% accuracy rate in classifying all 33 ISL hand poses. With an average accuracy of 97.23%, the algorithm was also able to classify 12 motions.
Currently, skin colour segmentation is used to extract the hand
gestures from each frame. For accurate recognition, the
subject must be wearing a full-sleeved shirt which is not
feasible in all situations. This method also necessitates that the
lighting be ideal – neither too dark nor too bright.
The worst performers are Naive Bayes and Logistic Regression,
while the best ML model is the Support Vector Machine, which
gave an accuracy of 99%. The significant advantage is that
there are no background restrictions and it can be used in the
future in a smartphone as well, because of lower
computational complexity.[19]
On the screen, the output is shown as text. After testing the
models in real-time live recognition, the SVM classifier
achieved an accuracy of 99.5%, CNN produced an accuracy of
88.89% and RNN produced a maximum average testing
accuracy of 82.3%.[2]
Precision: 94.88%
Recall: 98.66%