
PAPER

Indonesian Sign Language (BISINDO) Real-Time Detection using MediaPipe and Computer Vision

Najma
Anindya Apriliyanti Pravitasari

Universitas Padjadjaran
Jl. Dipati Ukur No.35, Bandung City, West Java
najma19001@mail.unpad.ac.id

STATISTICS STUDY PROGRAM
PADJADJARAN UNIVERSITY
2022
Indonesian Sign Language (BISINDO) Real-Time Detection using
MediaPipe and Computer Vision

Najma1, Anindya Apriliyanti Pravitasari2


1,2 Department of Statistics, Universitas Padjadjaran
Jl. Dipati Ukur No.35, Bandung City, West Java

najma19001@mail.unpad.ac.id, anindya.apriliyanti@unpad.ac.id

*Corresponding Author
Received: day month 202x, Revised: day month 202x, Accepted: day month 202x
Published online: day month 202x

Abstract: Barriers to communication are generally experienced by the deaf. In communicating, the deaf need a
language that suits their needs, usually sign language. This study aims to help the deaf by building a neural
network system that translates sign language gestures into text, enabling one-way communication between the deaf
and other people. The purpose of this study is to determine the optimal neural network architecture for classifying
sign language gestures. MediaPipe Holistic was used to detect body and hand landmarks and to extract key points
from the detected body and hands. The Long Short-Term Memory (LSTM) model was compared with the Gated
Recurrent Unit (GRU) model, each trained with three optimizers (AdaGrad, Adam, AdaMax). From the experimental
work, it was found that AdaMax is the best optimizer for the LSTM model and Adam is the best optimizer for the
GRU model. Comparing the two models, the LSTM model with the AdaMax optimizer is the most powerful, with a
validation accuracy of 100% and a loss of 3.22e-04. For further research, the researcher recommends exploring
other methods for classifying sign language gestures, such as the Convolutional Neural Network (CNN) method.

Keywords: Deaf; Sign language; MediaPipe; Long Short-Term Memory; Gated Recurrent Unit

1. Introduction
A deaf person is someone who has lost the ability to hear due to partial or complete damage to hearing
function, which has a complex impact on his or her life [1]. Barriers to communication are generally
experienced by the deaf. The inability of the deaf to hear, combined with a limited mastered vocabulary, causes
difficulties in understanding and conveying a message. The inability to convey thoughts, feelings, ideas,
needs, and desires to other people also leaves those needs unmet and triggers stress [2].
In communicating, the deaf need a language that suits their needs, usually sign language. In its
development among the deaf in Indonesia, sign language is divided into two systems, namely SIBI
(the Indonesian Language Sign System) and BISINDO (Indonesian Sign Language) [3]. BISINDO is a
language promoted by Gerakan Kesejahteraan Tunarungu Indonesia (Gerkatin) and developed by the
deaf community themselves, so BISINDO has become a practical and effective communication system
for deaf people in Indonesia [4]. There are pros and cons in choosing between the two types of
sign language, but BISINDO is believed to have its own advantages in building language understanding
among students with hearing impairments [5].
To take advantage of technological advances, this research focuses on building a neural network
that can help translate BISINDO gestures into text for one-way communication between the deaf and
other people. The purpose of this study is to determine the optimal neural network architecture for
classifying BISINDO gestures, so that the model can be used by someone to communicate with the deaf.
The data used are videos of five BISINDO gestures, namely aku, kamu, maaf, terima kasih, and tolong.
Based on this explanation, the researcher created a neural network, with the help of MediaPipe, to
detect the five BISINDO gestures using action recognition. To realize the benefits of a BISINDO gesture
detection system, research is needed. Tests in this study are used to determine the accuracy of the system
in detecting the gestures. From the problems described above, the research questions are how to integrate
action recognition as a gesture classifier in a BISINDO detection system and how accurate the resulting
system is.

2. Related Works/Literature Review


Putri et al [6] proposed the Long Short-Term Memory (LSTM) and MediaPipe Holistic methods to
detect skeletons on hands and bodies. The objects used in the study are 30 BISINDO sign vocabularies
that are often used by the deaf community. From the evaluation of real-time detection, the study obtained
an accuracy of 65% for the 30-class model with two LSTM layers, 500 epochs, 64 hidden units, and a
batch size of 64.
Anam [7] proposed the Pre-trained Convolutional Neural Network ResNet-50 model for the symbol
classification of the Indonesian Language Sign System (SIBI). The author also uses the MediaPipe
Holistic technology developed by Google to detect the pose and hand position of each gesture symbol.
Using 1,200 images consisting of six classes as a dataset, the addition of MediaPipe resulted in 88%
training accuracy and 87% validation accuracy. The calculation of performance metrics yields
86% precision, 86% recall, and an 85% F1-score.
Lahiani et al [8] proposed a study entitled "Hand gesture recognition method based on HOG-LBP
features for mobile devices" which combines the Histograms of Oriented Gradient (HOG) feature to
obtain texture information, and Local Binary Pattern (LBP) to obtain contour information, so that it can
recognize hand movements accurately. The test was carried out using the front camera under different
lighting conditions and backgrounds. Each gesture uses 50 images as its dataset. The average recognition
rate is 89.4% using HOG, 88% using LBP, and 91.6%, the highest, using the combined HOG-LBP features.
Kurnia et al [9] proposed the Deep Gated Recurrent Unit (GRU) method that can read hand gestures
in Indonesian sign language videos. This study uses three classes and includes a video-processing stage.
The experiment, carried out on 45 training videos and 36 testing videos, resulted in an accuracy of 88%.
Rao et al [10] proposed the Convolutional Neural Networks (CNN) method in classifying Indian sign
languages. Using videos of 200 sign language gestures recorded in selfie mode, the CNN model
training resulted in an accuracy of 92.88%.
Dong et al [11] proposed the Random Forest on a segmented hand configuration. The study classified
24 American sign language (ASL) alphabets using a low-cost depth camera, the Microsoft Kinect. The
segmented hand configuration was first obtained using the depth contrast feature based on a per-pixel
classification algorithm. Then, a hierarchical mode search method was developed and implemented to
localize the hand joint position under kinematic constraints. Finally, Random Forest was built to
recognize ASL signs using the joint angles. This study achieved above 90% accuracy in recognizing 24 static
ASL alphabet signs.
Das et al [12] proposed the Convolutional Neural Network (CNN) method with the Inception V3
model in classifying 24 characters from the American Sign Language (ASL) alphabet. The dataset consists
of static sign language images captured with an RGB camera. Pre-processing is applied to the images used
as model input. The model consists of several convolutional layers and produces a validation accuracy
above 90%.
Rastgoo et al [13] proposed the Restricted Boltzmann Machine (RBM) method to perform hand sign
language recognition automatically from visual data. The study builds a model with two modalities,
RGB and depth, and with three input forms: the original image, the cropped image, and a noisy version.
The cropped image contains only the hand region, detected using a Convolutional Neural Network (CNN).
The three input types are used for each modality and fed to the RBM. The proposed multi-modal
model was trained on all American alphabets and achieved classification accuracies of 99.31%, 97.56%,
90.01%, and 98.13% on the Massey dataset, the ASL Fingerspelling A dataset, the NYU dataset, and the
ASL fingerspelling dataset from the University of Surrey, respectively.
Konstantinidis et al [14] proposed a method of extracting and processing hand and body skeleton data
from video sequences of LSA64, a dataset for Argentine Sign Language. The VGG-19 network and a
multi-stage CNN were used as feature extractors for the detection of hand and body skeletons. Long
short-term memory (LSTM) is also used to classify 64 sign language classes. This study resulted in an
accuracy of 93.91% for the model using the body frame, 91.64% for the model using the hand frame, and
98.09% for the model using both the hand and body skeletons. The body skeleton is a slightly better
representation than the hand skeleton for sign language recognition, achieving a 2.27% increase in
recognition accuracy on the LSA64 dataset. This is because the body joints are detected more reliably
and robustly than the hand joints. However, using the skeletal features of the hand and body together
is the most beneficial.
Kadhim et al [15] introduced a real-time ASL fingerspelling recognition system built as a multi-class
classifier based on convolutional neural networks (CNNs), using the VGG-Net architecture on real color
images. The dataset comprises 26 alphabet classes plus two classes for space and delete. The system
achieved a maximum accuracy of about 98.53% on the training set and 98.84% on the validation set.

The related works are summarized below (No., researcher, year, title, data, method, results):

1. Putri et al [6], 2022, "PENDETEKSIAN BAHASA ISYARAT INDONESIA SECARA REAL-TIME MENGGUNAKAN LONG SHORT-TERM MEMORY (LSTM)". Data: primary and secondary data in the form of 3,000 samples of 30 BISINDO vocabulary gestures used daily; 95% of the data is used for training and the rest for testing. Method: Long Short-Term Memory (LSTM) with MediaPipe Holistic. Results: MediaPipe improves accuracy, reaching 65% for the 30-class model.

2. Anam, Nofal [7], 2022, "Sistem Deteksi Simbol pada SIBI (Sistem Isyarat Bahasa Indonesia) Menggunakan Mediapipe dan Resnet-50". Data: 1,200 images of the Indonesian Language Sign System (SIBI) in six classes; 960 for training, 180 for validation, and 60 for testing. Method: pre-trained Convolutional Neural Network with the ResNet-50 model. Results: 88% training accuracy and 87% validation accuracy, with 86% precision, 86% recall, and an 85% F1-score.

3. Lahiani et al [8], 2018, "Hand gesture recognition method based on HOG-LBP features for mobile devices". Data: the NUS I hand posture dataset with 10 posture classes and 24 image samples per class. Method: Histograms of Oriented Gradient (HOG) and Local Binary Pattern (LBP). Results: combining the HOG and LBP features gives a detection rate of 91.6%, better than LBP alone (88%) or HOG alone (89.4%).

4. Kurnia et al [9], 2022, "Deteksi Tangan Otomatis Pada Video Percakapan Bahasa Isyarat Indonesia Menggunakan Metode Deep Gated Recurrent Unit (GRU)". Data: 81 BISINDO videos (45 training, 36 testing) of three daily sign language gestures, recorded from student volunteers at Nurul Jadid University. Method: Deep Gated Recurrent Unit (GRU). Results: the testing phase with 10 epochs resulted in an accuracy of 88%.

5. Rao et al [10], 2018, "Deep convolutional neural networks for sign language recognition". Data: five subjects performing 200 signs from 5 viewing angles under various background environments in selfie mode. Method: Convolutional Neural Networks (CNN). Results: different CNN architectures were designed and tested; the study achieved a 92.88% recognition rate, higher than other classifiers reported on the same dataset.

6. Dong et al [11], 2015, "American Sign Language alphabet recognition using Microsoft Kinect". Data: 24 ASL fingerspelling signs performed by 5 subjects, with 500 samples of each sign per subject, recorded with Kinect V1 while the subjects moved their hand around with a fixed hand shape while facing the sensor. Method: Random Forest. Results: a mean accuracy of 92% on 24 static alphabet signs.

7. Das et al [12], 2018, "Sign language recognition using deep learning on custom processed static gesture images". Data: static images of 24 American Sign Language (ASL) alphabet characters captured with an RGB camera. Method: Convolutional Neural Network (CNN) with the Inception V3 model. Results: validation accuracy above 90%.

8. Rastgoo et al [13], 2018, "Multi-modal deep hand sign language recognition in still images using a Restricted Boltzmann Machine". Data: the Massey University Gesture Dataset 2012, the ASL Fingerspelling A dataset, the NYU dataset, and the ASL Fingerspelling dataset from the University of Surrey; three input types are used (original data, trimmed data without a background, and trimmed data with a background). Method: Restricted Boltzmann Machine (RBM) with Convolutional Neural Networks (CNN). Results: accuracies of 99.31%, 97.56%, 90.01%, and 98.13% on the four datasets, respectively.

9. Konstantinidis et al [14], 2018, "SIGN LANGUAGE RECOGNITION BASED ON HAND AND BODY SKELETAL DATA". Data: video sequences from LSA64, a dataset for Argentine Sign Language. Method: VGG-19, multi-stage CNN, and Long Short-Term Memory (LSTM). Results: 93.91% accuracy using the body skeleton, 91.64% using the hand skeleton, and 98.09% using both.

10. Kadhim et al [15], 2020, "A Real-Time American Sign Language Recognition System using Convolutional Neural Network for Real Datasets". Data: American Sign Language alphabets from A to Z plus classes for space and delete. Method: Convolutional Neural Networks (CNN) with the VGG-Net model. Results: 98.53% training accuracy and 98.84% validation accuracy for 26 alphabets and two additional classes.

3. Materials & Methodology


3.1. Data
The data used in this study are primary data. Primary data are data taken directly by the researchers
without intermediaries, so the data obtained are in raw form [16]. The data were obtained from Special
School (SLB) teachers in Sukabumi City in the form of videos of five BISINDO vocabulary gestures used
daily, with 50 videos per label. The sign vocabulary is aku, kamu, maaf, terima kasih, and tolong. The
dataset is divided into 90% training data and 10% testing data.

3.2. Method
Classification is a technique in which data are categorized into a given number of classes. The main goal
of a classification problem is to identify the category/class into which new data will fall [17]. In this
research design, the researcher applies a classification method to a dataset of five BISINDO gestures and
their categories; the flow of the methodology used in this study can be seen in Figure 1 as follows.

Figure 1. Methodology Flow

It can be seen that this research began with data collection in the form of BISINDO gesture videos.
The videos are processed using MediaPipe Holistic to detect body/pose and hand landmarks and then
extract the detected body and hand keypoints. Data preprocessing collects the key points into arrays,
after which the data are divided into a train dataset and a test dataset. The train dataset is used to build
the models, and the test dataset is used to test/validate whether a model gives good predictive results.
The data then enter the model training process, where the model is trained to recognize patterns in the
data. After that, validation is carried out by predicting on the test dataset. If the results are not optimal,
hyperparameter tuning is carried out and the training process is repeated to obtain the model with the
best performance. The model with the best performance is selected as the final model.
In this study, the researchers used MediaPipe Holistic to detect body and hand landmarks and to
extract key points from the detected body and hands. Since these key points form sequence data, the
researcher uses the Long Short-Term Memory (LSTM) model and the Gated Recurrent Unit (GRU) model.
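To make the preprocessing step concrete, the following is a minimal Python sketch of how the extracted keypoint sequences could be collected into arrays and split 90%/10%. The file layout, the 30 frames per video, and the 258 keypoint values per frame (pose plus both hands, face excluded) are assumptions consistent with the description above, not the exact code used in this study.

```python
import glob

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

GESTURES = ["aku", "kamu", "maaf", "terima kasih", "tolong"]

# Assumed layout: one pre-extracted keypoint array per video, saved as
# data/<gesture>/<video>.npy with shape (30, 258): 30 frames per video and
# 258 keypoint values per frame (pose + both hands, face excluded).
sequences, labels = [], []
for class_id, gesture in enumerate(GESTURES):
    for path in sorted(glob.glob(f"data/{gesture}/*.npy")):
        sequences.append(np.load(path))
        labels.append(class_id)

X = np.array(sequences)                                # shape: (250, 30, 258)
y = to_categorical(labels, num_classes=len(GESTURES))  # one-hot labels

# 90% training / 10% testing, stratified so every gesture appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=labels, random_state=42)
```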
3.3. Classification
In statistics, classification is the problem of identifying which of a set of categories (sub-populations)
an observation (or observations) belongs to [18]. In machine learning, classification refers to a predictive
modeling problem where a class label is predicted for a given example of input data. Classification
requires a training dataset with many examples of inputs and outputs from which to learn. A model will
use the training dataset and will calculate how to best map examples of input data to specific class labels.
As such, the training dataset must be representative of the problem and have many examples of each class
label. Classification accuracy is a popular metric used to evaluate the performance of a model based on
the predicted class labels. Instead of class labels, some tasks may have categorical labels. Categorical
accuracy is a metric that is suitable for tasks to predict categorical labels. There are three main types of
classification tasks; they are binary classification, multi-class classification, and multi-label classification
[19]. In this study, the data used are gestures of five sign language vocabularies which will be classified
as belonging to one of the five known vocabularies. Thus, the researcher decided to use the Multi-Class
Classification method to classify sign language. The label owned is a categorical label so that categorical
accuracy is the right metric in this case.
Multi-class classification refers to those classification tasks that have more than two class labels. The
data are classified as belonging to one among a range of known classes. It is common to model a multi-
class classification task with a model that predicts a Multinoulli probability distribution for each example.
The Multinoulli distribution is a discrete probability distribution that covers the case where an event has
one of K categorical outcomes, e.g., in {1, 2, 3, …, K}. For classification, this means that the model predicts
the probability of an example belonging to each class label. Many algorithms used for binary
classification can be used for multi-class classification. Popular algorithms that can be used for multi-
class classification include k-Nearest Neighbors, Decision Trees, Naive Bayes, Random Forest, Gradient
Boost, and others [19].
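As a small illustration of multi-class classification with categorical labels, the sketch below shows how the five gesture labels can be one-hot encoded and how a softmax output forms a Multinoulli distribution over the classes; the example values are hypothetical.

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

classes = ["aku", "kamu", "maaf", "terima kasih", "tolong"]

# Integer labels for three example videos: "maaf", "aku", "tolong".
y_int = np.array([2, 0, 4])
y_onehot = to_categorical(y_int, num_classes=len(classes))
# [[0. 0. 1. 0. 0.]
#  [1. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1.]]

# A softmax output for one example is a Multinoulli distribution over the five classes;
# the predicted class is the one with the highest probability.
p = np.array([0.05, 0.10, 0.70, 0.05, 0.10])
predicted = classes[int(np.argmax(p))]   # "maaf"
```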

3.4. MediaPipe
MediaPipe is a framework that allows developers to build multi-modal ML pipelines (video, audio,
any time series). Represented as a graph of nodes and edges, its landmark solutions trace key points in
different parts of the body. All point coordinates are normalized in three dimensions [6]. MediaPipe
makes it possible to translate using machine learning in real time (ML solutions for live and streaming
media). The advantages of using MediaPipe include [20]:
1. End-to-end acceleration: able to process and run inference on incoming data even on commonly used
hardware.
2. Build once, deploy anywhere: One solution can run on iOS, Android, web, desktop/cloud, and
IoT.
3. Ready to use solution: A sophisticated, ready-to-use machine learning framework solution.
4. Free and open source: The MediaPipe framework and solution code are under the Apache 2.0
license, and are completely free for developers to develop and customize.

Some of the machine learning solutions provided by MediaPipe are [20]:


1. Face Detection: is a machine learning solution provided by MediaPipe to detect faces with 6
landmarks (signs) and enable multi-face detection. An example of face detection can be seen in
Figure 2.
Figure 2. Example of Face Detection MediaPipe [20]

2. Face Mesh: is a machine learning solution provided by MediaPipe to detect face geometry in real
time and generate 468 facial landmarks (signs) in 3D. An example of face mesh detection can be
seen in Figure 3.

Figure 3. MediaPipe's Face Mesh Example [20]

3. Iris: is a machine learning solution provided by MediaPipe to track landmarks including the iris,
pupil, and eye contour. With an error of less than 10%, MediaPipe is very suitable for use. An
example of an iris can be seen in Figure 4.
Figure 4. An example of an Iris MediaPipe [20]

4. Hands: is a machine learning solution provided by MediaPipe to track hands and create 21 three-
dimensional hand landmarks. The detection process involves two models, namely palm detection
to detect the palm and hand landmark detection to place the 21 landmark points on the hand. This
solution allows two-handed detection. Examples of hand landmarks can be seen in Figure 5 and Figure 6.

Figure 5. Example of Hands MediaPipe [20]

Figure 6. Hand Landmarks MediaPipe [20]


5. Pose: is a machine learning solution provided by MediaPipe to track body poses and create 33
two-dimensional landmarks for the full body, and 25 landmarks for the upper body. Examples of
poses can be seen in Figure 7 and Figure 8.

Figure 7. Example Pose MediaPipe [20]

Figure 8. Pose Landmarks MediaPipe [20]

6. Holistic: is a machine learning solution provided by MediaPipe to track poses, faces and hand
components. Holistic forms 543 landmarks consisting of 33 pose landmarks, 468 facial
landmarks, and 21 hand landmarks per hand. Holistic examples can be seen in Figure 9.
Figure 9. Holistic MediaPipe examples [20]

7. Other machine learning solutions, such as selfie segmentation, hair segmentation, object
detection, box tracking, instant motion tracking, Objectron, KNIFT, etc.

3.5. MediaPipe Holistic


The MediaPipe Holistic pipeline integrates separate models for pose, face, and hand components, each
of which is optimized for its particular domain. MediaPipe Holistic is designed as a multi-stage
pipeline, which treats the different regions using a region-appropriate image resolution. First, the
MediaPipe Holistic estimates the human pose with BlazePose's pose detector and subsequent landmark
model. Then, using the inferred pose landmarks, it derives three regions of interest (ROI) for each hand
(2x) and the face, and employs a re-crop model to improve the ROI. MediaPipe Holistic then crops the
full-resolution input frame to these ROIs and applies task-specific face and hand models to estimate their
corresponding landmarks. Finally, it merges the landmarks from the pose, face, and hand models to
generate a total of 543 landmarks (33 pose landmarks, 468 facial landmarks, and 21 hand landmarks per
hand) [21]. The flow of how MediaPipe
Holistic works can be seen in Figure 10 as follows.

Figure 10. MediaPipe Holistic Pipeline Overview [21]

The output of MediaPipe Holistic is pose landmarks, face landmarks, left hand landmarks, and right
hand landmarks [21]. In this study, researchers only extracted poses, left hand, and right hand landmarks.
Face landmarks are not used because sign language detection does not require face detection.
● Pose Landmarks
A list of pose landmarks. Each landmark consists of the following:
- x and y: Landmark coordinates normalized to [0.0, 1.0] by the image width and height
respectively.
- z: Should be discarded as currently the model is not fully trained to predict depth, but this
is something on the roadmap.
- visibility: A value in [0.0, 1.0] indicating the likelihood of the landmark being visible
(present and not occluded) in the image.
● Left Hand Landmarks
A list of 21 hand landmarks on the left hand. Each landmark consists of x, y and z. x and y are
normalized to [0.0, 1.0] by the image width and height respectively. z represents the landmark
depth with the depth at the wrist being the origin, and the smaller the value the closer the
landmark is to the camera. The magnitude of z uses roughly the same scale as x.
● Right Hand Landmarks
A list of 21 hand landmarks on the right hand, in the same representation as left hand landmarks.

MediaPipe Holistic is also equipped with supported configuration options, including [21]:
● static_image_mode
If set to false, the solution treats the input images as a video stream. It will try to detect the most
prominent person in the very first images, and upon successful detection further localize the pose
and other landmarks. In subsequent images, it then simply tracks those landmarks without
invoking another detection until it loses track, reducing computation and latency. If set to true,
person detection runs every input image, ideal for processing a batch of static, possibly unrelated,
images. Defaults to false.
● model_complexity
Complexity of the pose landmark model: 0, 1 or 2. Landmark accuracy as well as inference
latency generally goes up with the model complexity. Default to 1.
● smooth_landmarks
If set to true, the solution filters pose landmarks across different input images to reduce jitter, but
is ignored if static_image_mode is also set to true. Defaults to true.
● enable_segmentation
If set to true, in addition to the pose, face and hand landmarks, the solution also generates the
segmentation mask. Defaults to false.
● smooth_segmentation
If set to true, the solution filters segmentation masks across different input images to reduce jitter.
Ignored if enable_segmentation is false or static_image_mode is true. Defaults to true.
● refine_face_landmarks
Whether to further refine the landmark coordinates around the eyes and lips, and output
additional landmarks around the irises. Defaults to false.
● min_detection_confidence
Minimum confidence value ([0.0, 1.0]) from the person-detection model for the detection to be
considered successful. Defaults to 0.5.
● min_tracking_confidence
Minimum confidence value ([0.0, 1.0]) from the landmark-tracking model for the pose landmarks
to be considered tracked successfully, or otherwise person detection will be invoked
automatically on the next input image. Setting it to a higher value can increase robustness of the
solution, at the expense of a higher latency. Ignored if static_image_mode is true, where person
detection simply runs on every image. Defaults to 0.5.

This study uses the default configuration.
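The following is a minimal sketch of how pose, left-hand, and right-hand keypoints could be extracted per frame with MediaPipe Holistic under the default configuration, discarding the face landmarks as described above. The video file name and the zero-filling for undetected landmarks are assumptions, not the study's exact implementation.

```python
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def extract_keypoints(results):
    """Flatten pose, left-hand, and right-hand landmarks into one vector; face is skipped."""
    pose = (np.array([[lm.x, lm.y, lm.z, lm.visibility]
                      for lm in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    lh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    rh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, lh, rh])        # 132 + 63 + 63 = 258 values per frame

cap = cv2.VideoCapture("video_aku_01.mp4")       # hypothetical input video
frames = []
with mp_holistic.Holistic() as holistic:         # default configuration, as in this study
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # MediaPipe expects RGB input
        results = holistic.process(image)
        frames.append(extract_keypoints(results))
cap.release()
keypoint_sequence = np.array(frames)             # shape: (n_frames, 258)
```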

3.6. Recurrent Neural Network (RNN)


A recurrent neural network (RNN) is a class of artificial neural networks where connections between
nodes form a directed or undirected graph along a temporal sequence. This allows it to exhibit temporal
dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state
(memory) to process variable length sequences of inputs. The term "recurrent neural network" is used to
refer to the class of networks with an infinite impulse response [22]. It is the first algorithm that
remembers its input, thanks to an internal memory, which allows it to be very precise in predicting what
comes next and makes it well suited to machine learning problems involving sequential data such as
time series, speech, text, audio, video, and more [23].
In an RNN, information cycles through a loop. When the network makes a decision, it considers the
current input as well as what it has learned from the inputs it received previously.

Figure 11. Information Flow in RNN [23]

A standard RNN has only a short-term memory; in combination with an LSTM, it also has a long-term
memory. An RNN can map one input to many outputs, many inputs to many outputs (e.g., translation),
and many inputs to one output (e.g., classifying a voice).

Figure 12. RNNs Input to Output Map [23]

RNNs can be seen as a sequence of neural networks with backpropagation. It can be seen in Figure
13, the “rolled” visual of the RNN represents the whole neural network, or rather the entire predicted
phrase. The “unrolled” visual represents the individual layers, or time steps, of the neural network.
Information is passed from one time step to the next. This illustration also shows why an RNN can be
seen as a sequence of neural networks [24].
Figure 13. Rolled & Unrolled Version of RNN [24]
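As a minimal illustration of this recurrence, the sketch below runs a vanilla RNN cell over a toy sequence: the hidden state is the internal memory carried from one time step to the next. The dimensions and weights are arbitrary assumptions for illustration only.

```python
import numpy as np

def rnn_forward(x_seq, W_x, W_h, b):
    """Run a vanilla RNN over a sequence: the hidden state h is the internal
    memory carried from one time step to the next."""
    h = np.zeros(W_h.shape[0])
    for x_t in x_seq:                      # one pass through the loop per time step
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h                               # final hidden state summarises the whole sequence

# Toy example: 5 time steps with 3 features each, 4 hidden units.
rng = np.random.default_rng(0)
h_final = rnn_forward(rng.normal(size=(5, 3)),
                      rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4))
```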

3.7. Long Short-term Memory (LSTM)


Long short-term memory networks (LSTMs) are an extension for recurrent neural networks (RNN),
which basically extends the memory. The main difference between an LSTM unit and a standard RNN
unit is that the LSTM unit is more sophisticated. LSTMs can process not only single data points (such as
images), but also entire sequences of data (such as speech or video). The name of LSTM refers to the
analogy that a standard RNN has both "long-term memory" and "short-term memory". The connection
weights and biases in the network change once per episode of training, analogous to how physiological
changes in synaptic strengths store long-term memories; the activation patterns in the network change
once per time-step, analogous to how the moment-to-moment change in electric firing patterns in the
brain store short-term memories. The LSTM architecture aims to provide a short-term memory for RNN
that can last thousands of timesteps, thus "long short-term memory" [25].
A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell
remembers values over arbitrary time intervals and the three gates regulate the flow of information into
and out of the cell. LSTM networks are well-suited to classifying, processing and making predictions
based on time series data, since there can be lags of unknown duration between important events in a time
series. LSTMs were developed to deal with the vanishing gradient problem that can be encountered when
training traditional RNNs. Relative insensitivity to gap length is an advantage of LSTM over RNNs,
hidden Markov models and other sequence learning methods in numerous applications [25].
Figure 14. Long Short-term Memories Structure [26]

The LSTM model consists of a memory cell with a gate structure that replaces the hidden layer
neurons of the RNN [27]. The LSTM structure contains input, forget, and output gates. The function of a
gate is to deny or allow access to the LSTM memory. In the diagram above, the input gate blocks all
small values (close to 0) from entering memory, the forget gate removes values from memory, and the
output gate determines whether the values stored in the LSTM memory should be output. Each memory
cell has three sigmoid layers and one tanh layer [28]. The LSTM calculation process follows the formulas
below [29]:

1) The previous output value and the current input value become the input of the forget gate. The
output value of the forget gate is obtained using formula (1):

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$  (1)

2) The previous output value and the current input value are entered into the input gate. Formulas (2)
and (3) give the output value of the input gate and the candidate cell state:

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$  (2)

$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$  (3)

3) The current cell state is updated using formula (4):

$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$  (4)

4) The output value $h_{t-1}$ and the input value $x_t$ are accepted as inputs of the output gate at time t.
The result $o_t$ of the output gate is obtained using formula (5):

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$  (5)

5) The LSTM output is obtained using formula (6):

$h_t = o_t * \tanh(C_t)$  (6)

The data used in this study are videos of sign language gestures. Videos contain a large amount of
visual information in scenes as well as profound dynamic changes in motions. As video is a kind of
spatio-temporal sequence, recurrent neural networks (RNNs), especially Long Short-Term Memory
(LSTM) and Gated Recurrent Unit (GRU) have been widely applied to video prediction [30].
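Formulas (1)-(6) can be translated almost line by line into code. The sketch below implements a single LSTM time step with toy weight matrices; it illustrates the gate computations only and is not the implementation used in this study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step following formulas (1)-(6); [h_prev, x_t] is the concatenated input."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # (1) forget gate
    i_t = sigmoid(W_i @ z + b_i)          # (2) input gate
    C_tilde = np.tanh(W_c @ z + b_c)      # (3) candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde    # (4) updated cell state
    o_t = sigmoid(W_o @ z + b_o)          # (5) output gate
    h_t = o_t * np.tanh(C_t)              # (6) new hidden state / LSTM output
    return h_t, C_t

# Toy dimensions: 3 input features, 4 hidden units; every W has shape (4, 4 + 3).
rng = np.random.default_rng(0)
W = {name: rng.normal(size=(4, 7)) for name in "fico"}
b = {name: np.zeros(4) for name in "fico"}
h, C = np.zeros(4), np.zeros(4)
h, C = lstm_step(rng.normal(size=3), h, C,
                 W["f"], W["i"], W["c"], W["o"], b["f"], b["i"], b["c"], b["o"])
```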

3.8. Gated Recurrent Unit (GRU)


The Gated Recurrent Unit (GRU) is a type of Recurrent Neural Network (RNN) that, in certain
cases, has advantages over long short term memory (LSTM). GRU uses less memory and is faster than
LSTM, however, LSTM is more accurate on datasets with longer sequences [31]. In short, a GRU is like
a standard RNN, but it differs in the operations and gates associated with each GRU unit. To solve the
problems faced by the standard RNN, the GRU incorporates two gating mechanisms, called the update
gate and the reset gate.

Figure 15. Gated Recurrent Units (GRU) Structure [32]

The update gate is responsible for determining the amount of previous information that needs to pass
along to the next state. This is really powerful because the model can decide to copy all the information
from the past and eliminate the risk of vanishing gradients. The reset gate is used by the model to decide
how much of the past information to neglect; in short, it decides whether the previous cell state is
important or not. First, the reset gate comes into action and stores relevant information from the past
time step into the new memory content. Then it multiplies the input vector and hidden state by their
weights. Next, it computes the element-wise product between the reset gate and the previous hidden
state. After summing up these terms, a non-linear activation function is applied and the next state is
generated [32].
The ability of the GRU to hold on to long-term dependencies or memory stems from the
computations within the GRU cell to produce the hidden state. While LSTMs have two different states
passed between the cells — the cell state and hidden state, which carry the long and short-term memory,
respectively — GRUs only have one hidden state transferred between time steps. This hidden state is able
to hold both the long-term and short-term dependencies at the same time due to the gating mechanisms
and computations that the hidden state and input data go through.
Figure 16. GRU vs LSTM [33]

The difference between GRU and LSTM [33]:


- The LSTM stores its longer-term dependencies in the cell state and short-term memory in the
hidden state, while the GRU stores both in a single hidden state.
- GRUs are faster to train as compared to LSTMs due to the fewer number of weights and
parameters to update during training.
- The accuracy of a model, whether it is measured by the margin of error or proportion of correct
classifications, is usually the main factor when deciding which type of model to use for a task.
Both GRUs and LSTMs are variants of RNNs and can often be plugged in interchangeably to achieve
similar results.
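For comparison with the LSTM step above, the sketch below implements one GRU time step using one common formulation of the update and reset gates described in this section; the weight shapes are toy assumptions, not the study's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU time step: a single hidden state regulated by update (z) and reset (r) gates.
    Weight shapes are (hidden, hidden + input), mirroring the LSTM sketch above."""
    zx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ zx + b_z)                 # update gate: how much past information to keep
    r_t = sigmoid(W_r @ zx + b_r)                 # reset gate: how much past information to neglect
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde   # single hidden state passed to the next step
```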

3.9. Optimizer
There are several ways to improve learning in deep learning neural networks such as improving the
architecture (for example by making it deeper), finding the optimal parameters, playing with the data
representation, choosing the best optimization algorithm etc. To date, there are no guidelines for setting
up an optimal deep learning architecture. In this paper, the researcher will be interested in a way to
optimize the learning process in deep learning architecture using optimization algorithms based on the
gradient descent method. The main goal is to get a better solution as quickly as possible. The goal of an
optimizer is to minimize an objective function (generally called the loss), which is intuitively the
difference between the predicted data and the expected values. The minimization consists of finding the
set of parameters of the architecture that give best results in the targeted tasks such as classification [35].
Some of the most well-known optimizers are AdaGrad, Adam, and AdaMax. This research will compare
the performance of the model with these three optimizers.
Adaptive Gradient Algorithm (AdaGrad) is a stochastic optimization method that adapts the learning
rate to the parameters. AdaGrad algorithm proposes to adjust the learning rate for each parameter during
the learning phase based on historical information. The objective of this adaptation is to improve the
convergence of the algorithm and its prediction accuracy [36].
Adaptive Moment Estimation (Adam) is a gradient-based optimization algorithm that makes use of
the stochastic gradient extensions of AdaGrad and RMSProp to deal with machine learning problems
involving large datasets and high-dimensional parameter spaces [37]. The adaptive moment estimation
(ADAM) was invented by Kingma and Ba [38] and is nowadays one of the most used optimizer
algorithms. This algorithm also calculates adaptive learning rates for each parameter. Adam stores an
exponentially decaying average of the previous gradient squares like AdaDelta and RMSprop and keeps
an exponentially decaying average of the past gradients as for Momentum or AdaGrad [36].
The AdaMax algorithm is an extension of the Adam algorithm based on the infinity norm [36]. Adam
can be understood as updating weights inversely proportionally to the scaled L2 norm (squared) of past
gradients; AdaMax extends this to the so-called infinity norm (max) of past gradients. Like Adam,
AdaMax automatically adapts a separate step size (learning rate) for each parameter in the optimization
problem [39].
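In Keras, the three optimizers compared in this study can be instantiated as follows; the hyperparameters shown are the library defaults, since the exact learning rates used in the experiments are not stated here and are therefore an assumption.

```python
from tensorflow.keras.optimizers import Adagrad, Adam, Adamax

# The three optimizers compared in this study, instantiated with Keras defaults
# (the exact learning rates used in the experiments are an assumption).
optimizers = {
    "AdaGrad": Adagrad(),
    "Adam": Adam(),
    "AdaMax": Adamax(),
}

# Each candidate model is compiled once per optimizer, e.g.:
# model.compile(optimizer=optimizers["AdaMax"],
#               loss="categorical_crossentropy",
#               metrics=["categorical_accuracy"])
```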

3.10. Evaluation
● Categorical Accuracy
The categorical accuracy metric measures how often the model gets the prediction right. It
calculates the percentage of predicted values (yPred) that match with actual values (yTrue) for
one-hot labels [39]. In a multiclass classification problem, we consider that a prediction is correct
when the class with the highest score matches the class in the label. The formula for categorical
accuracy is:

Figure 17. Formula of categorical accuracy [40]

● Confusion Matrix
The Confusion Matrix is used to know the performance of a Machine Learning classification. It is
represented in a matrix form [41]. The confusion matrix produces accuracy scores, error rate,
specificity, precision, recall, and F1 scores. These scores help evaluate the performance or
feasibility of the model used. An nxn confusion matrix displays the predicted and actual
classification, where n is the number of different classes [42].

                     Actual Values
Predicted Values     Positive                Negative
Positive             True positive (TP)      False positive (FP)
Negative             False negative (FN)     True negative (TN)

Table 1. Confusion Matrix

Figure 18. Formula of Confusion Matrix [43]

Table 1 shows the structure of a 2x2 confusion matrix where [43]:


- True Positive (TP) — model correctly predicts the positive class (prediction and actual both are
positive).
- True Negative (TN) — model correctly predicts the negative class (prediction and actual both are
negative).
- False Positive (FP) — model gives the wrong prediction of the negative class (predicted-positive,
actual-negative). FP is also called a TYPE I error.
- False Negative (FN) — model wrongly predicts the positive class (predicted-negative, actual-
positive). FN is also called a TYPE II error.

The formulas for accuracy, error rate, specificity, precision, recall, and F1-score are [44]:

$accuracy = \frac{TP + TN}{TP + TN + FP + FN}$  (6)

$error = \frac{FP + FN}{TP + TN + FP + FN}$  (7)

$specificity = \frac{TN}{TN + FP}$  (8)

$precision = \frac{TP}{TP + FP}$  (9)

$recall = \frac{TP}{TP + FN}$  (10)

$F1\text{-}score = \frac{2 \times TP}{2 \times TP + FP + FN}$  (11)

- Accuracy: The overall accuracy of the model. Calculated as the number of all correct predictions
divided by the total number of the dataset. The best accuracy is 1.0, whereas the worst is 0.0.
- Error rate: It tells you what fraction of predictions were incorrect. Calculated as the number of all
false predictions divided by the total number of the dataset. The best error rate is 0.0, whereas the
worst is 1.0.
- Specificity: It tells you what fraction of all negative samples are correctly predicted as negative.
Calculated as the number of correct negative predictions divided by the total number of negatives.
It is also called true negative rate (TNR). The best specificity is 1.0, whereas the worst is 0.0.
- Precision: It tells you what fraction of predictions of the positive class were actually positive.
Calculated as the number of correct positive predictions divided by the total number of positive
predictions. It is also called positive predictive value (PPV). The best precision is 1.0, whereas
the worst is 0.0.
- Recall: It tells you what fraction of all positive samples were correctly predicted as positive.
Calculated as the number of correct positive predictions divided by the total number of positive
samples. It is also called sensitivity or true positive rate (TPR). The best recall is 1.0, whereas the
worst is 0.0.
- F1-score: It combines precision and recall into a single measure.
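As a sketch of how these metrics can be computed in practice, the snippet below builds a confusion matrix and the per-class precision, recall, and F1-score with scikit-learn. The predictions shown are hypothetical; in this study, y_true and y_pred would come from the one-hot test labels and the model's softmax outputs via argmax.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

classes = ["aku", "kamu", "maaf", "terima kasih", "tolong"]

# Hypothetical predictions for a small 5-class test set.
y_true = np.array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4])
y_pred = np.array([0, 1, 2, 3, 4, 0, 1, 2, 4, 4])

cm = confusion_matrix(y_true, y_pred)      # n x n confusion matrix
accuracy = np.trace(cm) / cm.sum()         # correct predictions / all predictions
print(cm)
print(f"categorical accuracy: {accuracy:.2f}")
print(classification_report(y_true, y_pred, target_names=classes))  # precision, recall, F1 per class
```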

4. Results & Discussion


4.1. Results
In this section, we discuss the results obtained for classifying sign language gestures into five classes:
aku, kamu, maaf, terima kasih, and tolong. We split the videos into 90% for training and 10% for
testing and compared two deep learning models, LSTM and GRU. The researcher used the same settings
for both models, namely three LSTM/GRU layers, three dense layers, and 20 epochs.
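A sketch of an architecture consistent with these settings is shown below: three recurrent layers followed by three dense layers, a softmax output over the five gestures, and categorical accuracy as the metric. The layer widths and the input shape (30 frames x 258 keypoints) are assumptions, not the exact configuration used; swapping the LSTM layers for GRU layers gives the GRU variant.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Sketch of the LSTM variant; only the layer counts (3 recurrent + 3 dense layers)
# and the 20 epochs follow the settings above, the widths are assumptions.
model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(30, 258)),
    LSTM(128, return_sequences=True),
    LSTM(64, return_sequences=False),
    Dense(64, activation="relu"),
    Dense(32, activation="relu"),
    Dense(5, activation="softmax"),          # one probability per gesture class
])
model.compile(optimizer="adamax", loss="categorical_crossentropy",
              metrics=["categorical_accuracy"])
# model.fit(X_train, y_train, epochs=20, validation_data=(X_test, y_test))
# Replacing the LSTM layers with GRU layers gives the model compared in Table 3.
```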

4.1.1. Long Short-term Memory (LSTM)


With the same model settings, that is 20 epochs, three LSTM layers, and three dense layers, the
comparison of optimizers for the LSTM model is as follows:

Optimizer   Training Time   Train Accuracy   Train Loss   Test Accuracy   Test Loss   Error Rate   Specificity   Precision   Recall   F1-Score
AdaGrad     25 s            0.951            0.665        0.760           0.748       0.096        0.940         0.850       0.760    0.724
Adam        24 s            0.916            0.213        0.960           0.104       0.016        0.990         0.966       0.960    0.960
AdaMax      25 s            1.000            1.99e-04     1.000           3.22e-04    0.000        1.000         1.000       1.000    1.000

Table 2. Evaluation of the LSTM Model


It can be seen in Table 2 that the model with the AdaMax optimizer produces the best performance.
The following are the training loss and training accuracy curves of the model with the AdaMax optimizer:

Figure 18. Training loss and accuracy LSTM with AdaMax

The training loss graphs demonstrate that the training loss rapidly drops until the 5th epoch, then
steadily decreases until the 20th epoch. The training accuracy rapidly grows until the 6th epoch, then
steadily increases until the 20th epoch.

4.1.2. Gated Recurrent Unit (GRU)


With the same model settings, that is 20 epochs, three GRU layers, and three dense layers, the
comparison of optimizers for the GRU model is as follows:

Optimizer   Training Time   Train Accuracy   Train Loss   Test Accuracy   Test Loss   Error Rate   Specificity   Precision   Recall   F1-Score
AdaGrad     26 s            0.551            1.533        0.440           1.529       0.224        0.860         0.402       0.440    0.324
Adam        26 s            1.000            0.0075       1.000           0.0068      0.000        1.000         1.000       1.000    1.000
AdaMax      28 s            1.000            0.2444       1.000           0.2379      0.000        1.000         1.000       1.000    1.000

Table 3. Evaluation of the GRU Model


It can be seen in Table 3 that the model with the Adam optimizer produces the best performance.
The following are the training loss and training accuracy curves of the model with the Adam optimizer:

Figure 19. Training loss and accuracy GRU with Adam

The training loss graphs demonstrate that the training loss continuously decreases until the 20th
epoch. The training accuracy grows rapidly but dips several times until the 15th epoch, then steadily
increases until the 20th epoch.

4.2. Discussion
In this study, the researcher compared two powerful deep learning models, namely Long Short-term
Memory (LSTM) and Gated Recurrent Unit (GRU) to classify five classes of sign language gesture (aku,
kamu, maaf, terima kasih, tolong). The results show that with 20 epochs, the LSTM model with the
AdaMax optimizer performs better than the GRU model with the Adam optimizer. Although the
accuracy of the two models is the same, the loss and training time of the LSTM model are better.
This is notable because the GRU is less complex than the LSTM, having fewer gates [45].
Despite the positive findings, there are some limitations to this study. First, the researcher only
used 250 videos divided into five categories. Second, this study only uses one type of video processing,
where each video is split into 30 frames and the hand and body skeleton is detected in every frame with
MediaPipe Holistic. Future research can use a variety of methods, such as a CNN that classifies every
frame into a class.

5. Conclusion
Barriers to communication are generally experienced by the deaf. In communicating, the deaf need a
language that suits their needs, usually sign language, one of which is BISINDO (Indonesian Sign
Language). In this study, the Long Short-Term Memory (LSTM) model was compared with the Gated
Recurrent Unit (GRU) model. Both models were trained with three optimizers (AdaGrad, Adam, AdaMax),
which were also compared. From the experimental work, it was found that the LSTM model with the
AdaMax optimizer is the most powerful model, with a validation accuracy of 100% and a loss of 3.22e-04.
For further research, the researcher recommends exploring other methods for classifying sign language
gestures, such as the Convolutional Neural Network (CNN) method.

References
[1] Rahmah F. (2018). PROBLEMATIKA ANAK TUNARUNGU DAN CARA MENGATASINYA.
QUALITY. 6. 1. 10.21043/quality.v6i1.5744.
[2] Damayanti I., Purnamasari SH (2019). HAMBATAN KOMUNIKASI DAN STRES ORANGTUA.
Jurnal Psikologi Insight.
[3] Mursita RA (2015). RESPON TUNARUNGU TERHADAP PENGGUNAAN SISTEM BAHASA
ISYARAT INDONESIA (SIBI) DAN BAHASA ISYARAT INDONESIA (BISINDO) DALAM
KOMUNIKASI. INKLUSI, 2(2), 221–232. https://doi.org/10.14421/ijds.2202
[4] Borman R., Priopradono B., Syah A. (2019). KLASIFIKASI OBJEK KODE TANGAN PADA
PENGENALAN ISYARAT ALPHABET BAHASA ISYARAT INDONESIA (BISINDO). SNIA
(Seminar Nasional Informatika Dan Aplikasinya), 3, D 1-4. Retrieved from
https://snia.unjani.ac.id/web/index.php/snia/article/view/87
[5] Putri AM (2020) PERBANDINGAN PENGGUNAAN BISINDO DAN SIBI DALAM
MENINGKATKAN KEMAMPUAN MENULIS LANJUT SISWA DENGAN HAMBATAN
PENDENGARAN. S2 thesis, Universitas Pendidikan Indonesia.
[6] Putri HM, Fadlisyah F., Fuadi W. (2022). PENDETEKSIAN BAHASA ISYARAT INDONESIA
SECARA REAL-TIME MENGGUNAKAN LONG SHORT-TERM MEMORY (LSTM). Jurnal
Teknologi Terapan & Sains. https://doi.org/10.1976/tts%204.0.v3i1.6853
[7] Anam N. (2022) TA : Sistem Deteksi Simbol pada SIBI (Sistem Isyarat Bahasa Indonesia)
Menggunakan Mediapipe dan Resnet-50. Undergraduate thesis, Universitas Dinamika.
[8] Lahiani H., Neji M. (2018). Hand gesture recognition method based on HOG-LBP features for mobile
devices. Procedia Computer Science, 126, 254-263.
[9] Pratamasunu KS, Fajri FN, Sari PK (2022). Deteksi Tangan Otomatis Pada Video Percakapan Bahasa
Isyarat Indonesia Menggunakan Metode Deep Gated Recurrent Unit (GRU). Jurnal Komputer Terapan ,
8(1), 186–193. https://doi.org/10.35143/jkt.v8i1.4901
[10] Rao, K. Syamala, PVV Kishore and ASCS Sastry. (2018). Deep convolutional neural networks for
sign language recognition. Conference on Signal Processing And Communication Engineering Systems
(SPACES), pp. 194-197, doi: 10.1109/SPACES.2018.8316344
[11] Dong C., Leu MC, Yin Z. (2015). American sign language alphabet recognition using microsoft
kinect; Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition
workshops; Boston, MA, USA; pp. 44–52.
[12] Das A., Gawde S., Suratwala K., Kalbande D. (2018). Sign language recognition using deep learning
on custom processed static gesture images; Proceedings of the 2018 International Conference on Smart
City and Emerging Technology (ICSCET); pp. 1–6.
[13] Rastgoo R., Kiani K., Escalera S. (2018). Multi-modal deep hand sign language recognition in still
images using restricted Boltzmann machine. Entropy;20:809. doi: 10.3390/e20110809.
[14] Konstantinidis D., Dimitropoulos K., Daras P. (2018). SIGN LANGUAGE RECOGNITION
BASED ON HAND AND BODY SKELETAL DATA. 3DTV-Conference: The True Vision - Capture,
Transmission and Display of 3D Video (3DTV-CON), 2018, pp. 1-4, doi: 10.1109/3DTV.2018.8478467.
[15] Kadhim RA, Khamess M. (2020). A Real-Time American Sign Language Recognition System using
Convolutional Neural Network for Real Datasets. TEM Journal. 9. 10.18421/TEM93-14.
[16] Khasanah LU (2022). Empat Sumber Data Sekunder dan Primer. Retrieved from DQLab:
https://www.dqlab.id/empat-sumber-data-sekunder-dan-primer
[17] Garg R. (2018). 7 Types of Classification Algorithms. Retrieved from Analytics India Mag:
https://analyticsindiamag.com/7-types-classification-algorithms
[18] Statistical classification. (2022). Retrieved from Wikipedia:
https://en.wikipedia.org/wiki/Statistical_classification
[19] Brownlee J. (2020). 4 Types of Classification Tasks in Machine Learning. Retrieved from Machine
Learning Mastery: https://machinelearningmastery.com/types-of-classification-in-machine-learning/
[20] MediaPipe. (2020). Retrieved from MediaPipe: https://google.github.io/mediapipe
[21] MediaPipe Holistic. (2020). Retrieved from MediaPipe:
https://google.github.io/mediapipe/solutions/holistic.html
[22] Recurrent neural network. (2022). Retrieved from Wikipedia:
https://en.wikipedia.org/wiki/Recurrent_neural_network
[23] Donges N. (2022). A Guide to RNN: Understanding Recurrent Neural Networks and LSTM
Networks. Retrieved from builtin: https://builtin.com/data-science/recurrent-neural-networks-and-lstm
[24] IBM Cloud Education. (2020). Recurrent Neural Networks. Retrieved from IBM:
https://www.ibm.com/cloud/learn/recurrent-neural-networks
[25] Long short-term memory. (2022). Retrieved from Wikipedia:
https://en.wikipedia.org/wiki/Long_short-term_memory
[26] Goyal P., Hossain KSMT, Deb A., Tavabi N., Bartley N., Abeliuk A., Ferrara E., Lerman K. (2018).
Discovering Signals from Web Sources to Predict Cyber Attacks.
[27] Aldi MWP, Jondri., Aditsania A. (2018). Analisis dan Implementasi Long Short Term Memory
Neural Network untuk Prediksi Harga Bitcoin, e-Proceeding of Engineering 5 (2), 3548-3555.
[28] Liu Y., Yu X., Wu Y., Song S. (2021). Forecasting Variation Trends of Stocks via Multiscale
Feature Fusion and Long Short-Term Memory Learning, Scientific Programming, 1-9.
[29] Qiu J., Wang B., Zhou C. (2020). Forecasting Stock Prices With Long-Short Term Memory Neural
Network Based On Attention Mechanism, PLoS ONE 15 (1), e0227222.
[30] Fan H., Zhu L., & Yang Y. (2019). Cubic LSTMs for Video Prediction. Proceedings of the AAAI
Conference on Artificial Intelligence, 33(01), 8263-8270. https://doi.org/10.1609/aaai.v33i01.33018263
[31] Gated Recurrent Unit (GRU). (2022). Retrieved from MarketMuse:
https://blog.marketmuse.com/glossary/gated-recurrent-unit-gru-definition
[32] Lendave V. (2021). LSTM Vs GRU in Recurrent Neural Network: A Comparative Study. Retrieved
from Analytics India Mag: https://analyticsindiamag.com/lstm-vs-gru-in-recurrent-neural-network-a-
comparative-study/
[33] Loye G. (2019). Gated Recurrent Unit (GRU) With PyTorch. Retrieved from FloydHub:
https://blog.floydhub.com/gru-with-pytorch
[34] Tato A., Nkambou R. (2018). IMPROVING ADAM OPTIMIZER, ICLR 2018 Workshop.
[35] Aatila Mustapha et al. (2021). Comparative study of optimization techniques in deep learning:
Application in the ophthalmology field, J. Phys.: Conf. Ser. 1743 012002.
[36] Ruiz M., Tota K. (2020). Adam: The Birthchild of AdaGrad and RMSProp. Retrieved from Medium:
https://medium.com/@kaitotally/adam-the-birthchild-of-adagrad-and-rmsprop-b5308b24b9cd
[37] Kingma DP, Ba JL. (2015). Adam: A Method for Stochastic Optimization. San Diego: The
International Conference on Learning Representations (ICLR).
[38] Brownlee J. (2021). Gradient Descent Optimization With AdaMax From Scratch. Retrieved from
Machine Learning Mastery: https://machinelearningmastery.com/gradient-descent-optimization-with-
adamax-from-scratch
[39] Dommaraju G. (2020). Keras' Accuracy Metrics. Retrieved from Towards Data Science:
https://towardsdatascience.com/keras-accuracy-metrics-8572eb479ec7
[40] Categorical accuracy. (2022). Retrieved from Peltarion:
https://peltarion.com/knowledge-center/documentation/evaluation-view/classification-loss-metrics/
categorical-accuracy
[41] Bharathi. (2021). Confusion Matrix for Multi-Class Classification. Retrieved from Analytics Vidhya:
https://www.analyticsvidhya.com/blog/2021/06/confusion-matrix-for-multi-class-classification
[42] Visa S., Ramsay B., Ralescu AL, Van Der Knaap E. (2011). Confusion matrix-based feature
selection, MAICS 710 (1), 120-127.
[43] Jayaswal V. (2020). Performance Metrics: Confusion matrix, Precision, Recall, and F1 Score.
Retrieved from Towards Data Science: https://towardsdatascience.com/performance-metrics-confusion-
matrix-precision-recall-and-f1-score-a8fe076a2262
[44] Basic evaluation measures from the confusion matrix. (2016). Retrieved from Classifier evaluation
with imbalanced datasets: https://classeval.wordpress.com/introduction/basic-evaluation-measures
[45] Vogt M. (2022). Comparison of GRU and LSTM in keras with an example. Retrieved from
ProjectPro: https://www.projectpro.io/recipes/what-is-difference-between-gru-and-lstm-explain-with-
example
