

Sign System for Bahasa Indonesia Known as SIBI (Sistem Isyarat Bahasa Indonesia) Recognizer using TensorFlow and Long Short-Term Memory

Kustiawanto Halim and Erdefi Rakun
Universitas Indonesia

Conference Paper, October 2018. DOI: 10.1109/ICACSIS.2018.8618134

Abstract—SIBI is formally used as the Sign Language System for Bahasa Indonesia. SIBI follows Bahasa Indonesia's grammatical structure, which makes it a unique and complex sign language system. The current state of SIBI research makes it possible to translate the alphabet, root words, and numbers into text. This research focuses on recognizing inflectional words, which are root words combined with prefixes, infixes, and suffixes. By separating the root words from the prefixes, infixes, and suffixes, it was possible to use minimal feature sets. SIBI sequence data contains temporal dependencies, so Long Short-Term Memory (LSTM) is used as the neural network model. The entire sequence of feature sets extracted from the SIBI inflectional word gestures is used as input. TensorFlow is used as the development framework to ensure the model can easily be deployed to a variety of devices, including smartphones. The best results were obtained using a 2-layer LSTM, with 96.15% accuracy on root words. The same model obtained an accuracy of 78.38% on inflectional words. The model, however, still struggles to recognize prefixes and suffixes correctly.

Keywords—SIBI, Sign Language, LSTM, TensorFlow, Smartphone, Recurrent Neural Network

Fig. 1. SIBI sign gesture for "perasaan" (= feeling)
I. INTRODUCTION

People with hearing impairments have difficulty communicating face-to-face, as well as when the desired form of communication requires them to use an intermediary device. Devices that people take for granted, such as phones and the interfaces of various systems, still require a modicum of hearing ability, which poses a problem for anyone with a hearing impairment.

Moreover, hearing disabilities from birth greatly affect a person's oral communication ability. Teaching people with hearing impairments to read lip movements in order to communicate with the non-hearing-impaired has proved ineffective, since lip motion can only be observed clearly when the speakers face each other at close range. In addition, many different words have the same lip movements [1]. Therefore, sign language, with its combination of finger, hand/arm, and facial-expression movements, is the best way for people with hearing impairments to communicate [1].

In 1994, the Ministry of Education and Culture of the Republic of Indonesia issued a dictionary of the Sign Language System for Bahasa Indonesia (SIBI). Through this dictionary, SIBI became the official communication medium for hearing-impaired people in Indonesia, both in the educational process and as an official means of communication among the hearing-impaired, as well as between hearing people and those with hearing impairments [1].

As an Indonesian language expressed through gestures, SIBI follows the rules of Indonesian grammar. For example, a SIBI inflectional word is signed by adding the affix signs (prefix, suffix, and particle) to the root word gesture. The word "perasaan", for instance, does not have a special sign, but is formed by the prefix "pe", followed by the root word "rasa", and then the suffix "an", as in Fig. 1. This signing of inflectional words is the distinguishing feature of SIBI relative to other sign languages.

The rapid pace of technological advancement means the ubiquity of electronic devices that are able to communicate with each other and facilitate their users to do the same. It also opens opportunities to use these devices as a bridge to overcome communication problems between the hearing-impaired and those without hearing impairments. An electronic translation system from SIBI sign language to text is one of the possible solutions to overcome oral communication constraints [2].
In previous studies [2][3][4][5], Microsoft Kinect was used to record the gesture movements used by the SIBI-to-text translator system. While it performed fine in translating most gestures, a Kinect-based system must be connected to a computer running Microsoft Windows in order to perform inference. This constraint limits the benefits of the SIBI translator system, because a person with a hearing disability must be in front of a computer to translate their gestures into text.

In order to address these shortcomings, we want to develop a SIBI translator system that is more flexible and portable, which means harnessing the power of the now-ubiquitous smartphone. The contemporary garden-variety smartphone contains a plethora of features useful for a SIBI translation system, such as a music player, a video player, and a camera. The smartphone camera can serve as the gesture recorder for the SIBI translator system. If the smartphone can act as both gesture recorder and gesture interpreter, it may be possible to make this system even more accessible than the Kinect-based system before it.

This study is part of the SIBI-to-text translator system being developed on smartphones. It focuses on the development of models using TensorFlow for the SIBI-to-text translator. TensorFlow was chosen as the deep-learning framework because the models it produces can be deployed on a variety of devices, including CPUs, GPUs, Android devices, and iOS devices. This paper examines the model that is intended to be used in a SIBI-to-text translator Android smartphone app.

II. METHOD AND RESEARCH DESIGN

A. Scope of Research

This research is part of a bigger research project entitled "Parsing dan Penerjemahan Citra Gerakan Isyarat SIBI menjadi Model Machine Learning" (Parsing and Translation of SIBI Sign Gesture Images into a Machine Learning Model). As part of this larger project, this research is also tied to patents for the development of a SIBI translator application on smartphone devices.

The dataset used in this study is the same dataset already used in extant studies [2]. The inflectional word gestures used in this study only include: a prefix + a root word; a root word + a suffix; and a prefix + a root word + a suffix. Each of these components was recorded in isolation, as opposed to being recorded in one go within a sentence.

This study also focuses on the TensorFlow framework, with the development of a model based on the Recurrent Neural Network (RNN) using Long Short-Term Memory (LSTM). This study includes neither the feature extraction techniques nor the segmentation of the SIBI gestures into sequences of frames; both are treated as given input preprocessing.

B. Datasets

The data was obtained from recordings of SIBI word gestures exhibited by two SLB (Sekolah Luar Biasa, special-needs school) teachers who were fluent in SIBI and used it actively. The data used in this study consists of 21 root words and 155 inflectional words covering the entire set of prefixes and suffixes in Bahasa Indonesia. Each word was recorded five times for each teacher, resulting in a total of 1760 SIBI gesture sequences.

The recordings were first preprocessed, and a feature extraction step was then performed. As mentioned before, the feature extraction process is not discussed in this study. Figure 2 gives an overview of the feature extraction steps performed in the previous study [2]: part (a) is the image feature extraction, followed by the skeleton feature extraction in part (b), and part (c) is the process of obtaining the combined image and skeleton features.

Fig. 2. Feature Extraction Steps

Sign language gesture recognition systems are built using supervised learning. Consequently, each frame of the gesture movement must be labeled. The data containing the feature vectors extracted from the videos is divided into four groups to be tested. The four groups are: the combined data (consisting of all root words, prefixes, and suffixes), hereinafter referred to as DATA_ALL; the root word data, hereinafter called DASAR_ALL; the prefix data, hereinafter called AWALAN_ALL; and the suffix data, hereinafter called AKHIRAN_ALL. All groups use the combined image and skeleton features. Each of these data groups is tested to find out which model performs best in translating that group into text.

C. Methodology

The model-related steps taken in this study are the design, training, and evaluation of the model. The LSTM model developed using TensorFlow has two independent variables that are measured to determine the best training result: the number of units in the hidden layer (the Hidden Unit) and the Batch Size. These parameters were determined in the design phase, with execution time used to strike a reasonable compromise between the two.

In the training phase, the LSTM models (1-layer LSTM, 2-layer LSTM, and Bidirectional LSTM) pass through the training process using training datasets in the form of gestures segmented into sequences of frames, together with their corresponding labels (a minimal sketch of this input format is given at the end of this section).

The determination of the best model was done in the evaluation phase. Test scenarios were designed to determine which models have the highest accuracy. Evaluation was done by comparing the accuracy of the different LSTM architectures on the testing datasets, under the agreed-upon design parameters associated with each architecture, namely the number of hidden units and the batch size. To determine the relationship between the hidden unit parameter and the batch size parameter in each model, we used a two-way ANOVA statistical test. Tukey's post-hoc test was used to determine the best hidden unit value and the best batch size for each model. The results of this statistical test point towards the model with the right parameter values, i.e. the model with the highest accuracy score.
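The paper does not include the input-pipeline code itself. The sketch below illustrates, with dummy data, the training input just described: per-clip sequences of per-frame feature vectors plus one label per clip, padded so the clips can be batched. The array names and the 128-dimensional feature size are illustrative assumptions, not values from the paper.

```python
import numpy as np
import tensorflow as tf

# Dummy stand-ins: one 2-D array of per-frame feature vectors per gesture
# clip (the real dimensionality comes from the extraction step in [2]),
# plus one class label per clip.
gesture_sequences = [np.random.rand(np.random.randint(30, 90), 128)
                     for _ in range(8)]            # 8 clips, 128-D frames
labels = np.array([0, 1, 2, 0, 1, 2, 3, 3])        # dummy class ids

# Pad every clip to a common length so clips can be batched; a Masking
# layer downstream lets the LSTM ignore the zero padding.
max_len = max(len(seq) for seq in gesture_sequences)
x = tf.keras.preprocessing.sequence.pad_sequences(
    gesture_sequences, maxlen=max_len, dtype="float32", padding="post")
y = tf.keras.utils.to_categorical(labels)          # one-hot targets

print(x.shape, y.shape)    # (8, max_len, 128) and (8, 4)
```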

III. THEANO AND TENSORFLOW


In related works [2][3][4][5], the models were built using Theano. Theano is a mathematical Python library with a NumPy-esque syntax. In September 2017, Theano officially halted further development [6]. This lack of support is the main reason why the present work was done using TensorFlow. In contrast to Theano, TensorFlow is still under active development, and its models can be served and deployed on a multitude of devices [7]. The ease with which TensorFlow models can be deployed to mobile devices greatly helps the development of the SIBI-to-text translator mobile app.
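The paper states only that TensorFlow models can be deployed to mobile devices; it does not show the deployment path. As a hedged illustration using today's tooling (which postdates the paper), the sketch below converts a stand-in Keras LSTM model into a TensorFlow Lite flatbuffer that an Android or iOS app can bundle; recurrent models often need the TF-op fallback shown here.

```python
import tensorflow as tf

# Stand-in for a trained SIBI classifier (see the sketches in Section IV).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 128)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(21, activation="softmax"),
])

# Convert the model to a TensorFlow Lite flatbuffer for on-device inference.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,   # regular TFLite kernels
    tf.lite.OpsSet.SELECT_TF_OPS,     # fallback that some LSTM ops require
]
tflite_model = converter.convert()

with open("sibi_lstm.tflite", "wb") as f:
    f.write(tflite_model)
```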

IV. LONG SHORT-TERM MEMORY


Arrays consisting of sequence chunks of the affix word components and the root word components serve as the inputs to the LSTM model [8]. This study performed training using three different LSTM architectures: the 1-layer LSTM, the 2-layer LSTM, and the Bidirectional LSTM. The architectural differences and the predictive results generated by these three models are used to determine which model is the most effective for SIBI translation.

A. 1-layer LSTM
The design of this model is shown in Fig. 3. Recall the input features explained in [2]. The output of the final layer at the final time step of each sequence ought to be the same label given for that sequence.

Fig. 3. 1-layer LSTM Architecture
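The paper gives no model code; the following is a minimal tf.keras sketch of the 1-layer architecture of Fig. 3. The per-frame feature size, the class count, and the optimizer are assumptions for illustration; the hidden-unit count is one value from the experiment grid. The softmax reads the LSTM output at the final time step, matching the description above.

```python
import tensorflow as tf

NUM_FEATURES = 128   # per-frame feature size (assumed; set by the extraction in [2])
NUM_CLASSES = 21     # e.g. the 21 root-word labels of DASAR_ALL
HIDDEN_UNITS = 512   # one value from the grid {64, 128, 256, 512}

model_1layer = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, NUM_FEATURES)),
    tf.keras.layers.Masking(mask_value=0.0),    # skip zero-padded frames
    tf.keras.layers.LSTM(HIDDEN_UNITS),         # emits the final time step only
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model_1layer.compile(optimizer="adam",          # optimizer not stated in the paper
                     loss="categorical_crossentropy",
                     metrics=["accuracy"])
```

Training then reduces to `model_1layer.fit(x, y, batch_size=50, epochs=...)` with the padded batches sketched at the end of Section II-C.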

B. 2-layer LSTM
The design of the 2-layer LSTM is shown in Fig. 4. The difference from the 1-layer model is the addition of one LSTM layer, bringing the total to two layers. The output of the first layer is forwarded to the second layer before the final output is produced.

Fig. 4. 2-layer LSTM Architecture
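Continuing the sketch above (same assumed constants), the 2-layer variant differs only in stacking: `return_sequences=True` makes the first LSTM emit its full output sequence, which is the "forwarded to the second layer" step described here.

```python
import tensorflow as tf

NUM_FEATURES, NUM_CLASSES, HIDDEN_UNITS = 128, 21, 512   # as assumed above

model_2layer = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, NUM_FEATURES)),
    tf.keras.layers.Masking(mask_value=0.0),
    tf.keras.layers.LSTM(HIDDEN_UNITS, return_sequences=True),  # layer 1: per-step outputs
    tf.keras.layers.LSTM(HIDDEN_UNITS),                         # layer 2: final output only
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
```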

C. Bidirectional LSTM
In the Bidirectional LSTM model, there are two LSTM layers. The first layer runs from the start of the input vector sequence to its end, while the second layer runs in the opposite direction. The outputs of both layers go into a merge node to be concatenated, before going into the softmax. The design of the Bidirectional LSTM is shown in Fig. 5.

Fig. 5. Bidirectional LSTM Architecture
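A corresponding sketch of the bidirectional variant: the `Bidirectional` wrapper runs one LSTM forward and one backward over the sequence, and `merge_mode="concat"` plays the role of the merge node of Fig. 5, concatenating both final outputs before the softmax.

```python
import tensorflow as tf

NUM_FEATURES, NUM_CLASSES, HIDDEN_UNITS = 128, 21, 512   # as assumed above

model_bidir = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, NUM_FEATURES)),
    tf.keras.layers.Masking(mask_value=0.0),
    tf.keras.layers.Bidirectional(               # forward + backward passes
        tf.keras.layers.LSTM(HIDDEN_UNITS),
        merge_mode="concat"),                    # the merge node before softmax
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
```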
V. EXPERIMENTAL RESULTS

This section describes how each model behaved when trained and tested. For each model, experiments were performed using 64, 128, 256, and 512 hidden units and batch sizes of 50, 100, and 200. Ten experiments were performed for each parameter combination, with epoch counts ranging from 100 to 1000. Table I shows the full results of the experiments.

TABLE I. EXPERIMENT RESULTS (accuracy in %; the three values in each hidden-unit column correspond to batch sizes 50 / 100 / 200)

| Dataset | LSTM Model | Best Epoch | 64 HU | 128 HU | 256 HU | 512 HU |
|---|---|---|---|---|---|---|
| DATA_ALL | 1-layer LSTM | 900 | 60.2 / 58.5 / 59.2 | 67.2 / 69.3 / 68.3 | 72.7 / 75.6 / 73.6 | 75.7 / 77.7 / 76.1 |
| DATA_ALL | 2-layer LSTM | 600 | 64.6 / 63.8 / 67.9 | 69.8 / 68.0 / 71.4 | 73.9 / 73.3 / 73.7 | 76.8 / 75.7 / 78.4 |
| DATA_ALL | Bidirectional LSTM | 1000 | 61.6 / 60.6 / 65.5 | 66.3 / 66.7 / 69.8 | 72.0 / 74.2 / 69.8 | 74.5 / 73.9 / 76.4 |
| DASAR_ALL | 1-layer LSTM | 800 | 88.0 / 86.8 / 89.9 | 93.5 / 93.3 / 91.8 | 94.5 / 92.8 / 93.8 | 94.7 / 95.2 / 95.0 |
| DASAR_ALL | 2-layer LSTM | 400 | 88.2 / 90.9 / 86.8 | 91.1 / 91.6 / 92.1 | 94.7 / 93.3 / 94.7 | 96.2 / 96.2 / 94.5 |
| DASAR_ALL | Bidirectional LSTM | 200 | 82.0 / 83.9 / 80.5 | 88.5 / 88.0 / 85.8 | 91.1 / 90.1 / 90.6 | 93.0 / 93.8 / 91.6 |
| AKHIRAN_ALL | 1-layer LSTM | 600 | 56.8 / 52.6 / 55.4 | 63.9 / 61.0 / 60.6 | 59.6 / 62.9 / 64.3 | 69.0 / 64.3 / 67.6 |
| AKHIRAN_ALL | 2-layer LSTM | 500 | 46.9 / 45.1 / 51.6 | 52.6 / 51.2 / 56.3 | 61.5 / 55.4 / 59.6 | 54.0 / 32.4 / 55.4 |
| AKHIRAN_ALL | Bidirectional LSTM | 300 | 50.2 / 48.8 / 54.0 | 56.8 / 55.4 / 56.8 | 62.0 / 62.0 / 59.6 | 63.4 / 63.4 / 65.7 |
| AWALAN_ALL | 1-layer LSTM | 800 | 43.9 / 47.0 / 45.8 | 45.8 / 49.6 / 54.5 | 57.6 / 62.1 / 60.6 | 63.3 / 65.5 / 67.8 |
| AWALAN_ALL | 2-layer LSTM | 800 | 58.7 / 53.8 / 67.8 | 58.7 / 65.2 / 63.6 | 72.3 / 64.4 / 67.4 | 66.7 / 64.8 / 66.3 |
| AWALAN_ALL | Bidirectional LSTM | 1000 | 38.6 / 37.5 / 46.6 | 48.1 / 45.8 / 47.0 | 55.3 / 51.9 / 48.5 | 56.4 / 57.2 / 61.4 |

On DATA_ALL, an ANOVA test was performed across the three models (1-layer LSTM, 2-layer LSTM, and Bidirectional LSTM) and showed the effect of model architecture on accuracy to be insignificant, F(2, 27) = 2.816, p = 0.078. By comparing the accuracy of the three models, the 2-layer LSTM is the best choice for DATA_ALL. Table II shows the results of the DATA_ALL test.

TABLE II. DATA_ALL (Tukey HSD, homogeneous subsets)

| LSTM Model | N | Subset for alpha = 0.05 |
|---|---|---|
| Bidirectional | 10 | 72.77 |
| 1-layer | 10 | 74.35 |
| 2-layer | 10 | 76.27 |
| Sig. | | .063 |

Means for groups in homogeneous subsets are displayed. a. Uses Harmonic Mean Sample Size = 10.000.

The ANOVA test conducted on DASAR_ALL showed a significant effect of the LSTM model on accuracy, F(2, 27) = 10.058, p = 0.001. Comparing the results of Tukey's post-hoc test for all models, the 2-layer LSTM is the best choice for the DASAR_ALL dataset, its mean difference being significant at the .000 level. Table III shows the results of the DASAR_ALL test.

TABLE III. DASAR_ALL (Tukey HSD, pairwise comparisons)

| LSTM Model | Model Compared | Mean Diff. | Std. Error | Sig. | 95% CI Lower Bound | 95% CI Upper Bound |
|---|---|---|---|---|---|---|
| 2-layer | 1-layer | 1.08 | .59 | .175 | -.37 | 2.54 |
| 2-layer | Bidirectional | 2.62* | .59 | .000 | 1.16 | 4.08 |

*. The mean difference is significant at the 0.05 level.

On the AKHIRAN_ALL dataset, the ANOVA test showed a significant effect of the model on accuracy, F(2, 27) = 25.068, p = 0.000. Comparing the results of Tukey's post-hoc test for all models, the 1-layer LSTM is the best choice for the AKHIRAN_ALL dataset, its mean difference being significant at the .000 level. Table IV shows the results of the AKHIRAN_ALL test.

TABLE IV. AKHIRAN_ALL (Tukey HSD, pairwise comparisons)

| LSTM Model | Model Compared | Mean Diff. | Std. Error | Sig. | 95% CI Lower Bound | 95% CI Upper Bound |
|---|---|---|---|---|---|---|
| 1-layer | 2-layer | 9.15* | 1.29 | .000 | 5.96 | 12.35 |
| 1-layer | Bidirectional | 1.88 | 1.29 | .327 | -1.32 | 5.07 |

*. The mean difference is significant at the 0.05 level.

The ANOVA test conducted on AWALAN_ALL showed a significant effect of the LSTM model on accuracy, F(2, 27) = 25.094, p = 0.000. Comparing the results of Tukey's post-hoc test for all models, the 2-layer LSTM is the best choice for the AWALAN_ALL dataset, its mean differences being significant at the .026 and .000 levels. Table V shows the results of the AWALAN_ALL test.

TABLE V. AWALAN_ALL (Tukey HSD, pairwise comparisons)

| LSTM Model | Model Compared | Mean Diff. | Std. Error | Sig. | 95% CI Lower Bound | 95% CI Upper Bound |
|---|---|---|---|---|---|---|
| 2-layer | 1-layer | 3.83* | 1.38 | .026 | .41 | 7.24 |
| 2-layer | Bidirectional | 9.7* | 1.38 | .000 | 6.28 | 13.12 |

*. The mean difference is significant at the 0.05 level.
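The paper reports SPSS-style ANOVA and Tukey HSD outputs. As a hedged illustration of the same procedure, the sketch below runs a two-way ANOVA over the two design parameters and a Tukey post-hoc test using statsmodels; the accuracy values are random stand-ins, not the paper's data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# One row per experiment: the two independent variables and the accuracy.
df = pd.DataFrame({
    "hidden_unit": np.repeat([64, 128, 256, 512], 6),
    "batch_size":  np.tile(np.repeat([50, 100, 200], 2), 4),
    "accuracy":    rng.uniform(55, 80, size=24),   # random stand-in values
})

# Two-way ANOVA: do hidden units and batch size affect accuracy?
fit = ols("accuracy ~ C(hidden_unit) * C(batch_size)", data=df).fit()
print(sm.stats.anova_lm(fit, typ=2))

# Tukey HSD post-hoc test on the factor the ANOVA flags as significant.
print(pairwise_tukeyhsd(df["accuracy"], df["hidden_unit"].astype(str)))
```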
VI. CONCLUSION AND FUTURE WORK

• In all of the tested datasets, with the three models used, changes to the Batch Size variable had no significant effect on the model's predictive accuracy. Therefore, the Batch Size selected for each model is the one that produced the highest accuracy in the experimental group.

• In all of the tested datasets, with the three models used, changes to the Hidden Unit variable had a significant effect on the predictive accuracy of the model. The best Hidden Unit value for each model was selected via the ANOVA test, by looking at the Tukey post-hoc results.

• In all of the tested datasets, with the three models used, there was no significant difference during testing, so the training-time differences between the models were negligible and worth the increased accuracy they produced.

• For DATA_ALL, the model that performed best was a 2-layer LSTM using 512 hidden units and a batch size of 200 at the 600th epoch. The accuracy attained was 78.387%.

• For DASAR_ALL, the model that performed best was a 2-layer LSTM using 512 hidden units and a batch size of 50 at the 400th epoch. The accuracy attained was 96.154%.

• For AKHIRAN_ALL, the model that performed best was a 1-layer LSTM using 512 hidden units and a batch size of 50 at the 600th epoch. The accuracy attained was 69.014%.

• For AWALAN_ALL, the model that performed best was a 2-layer LSTM using 256 hidden units and a batch size of 50 at the 800th epoch. The accuracy attained was 72.348% (per Table I and the ANOVA result above).

• Although the accuracy of the resulting models is good enough, the SIBI application can still make prediction errors. Applying Indonesian grammar rules (before displaying text on the smartphone screen) is expected to help resolve such predictive errors.

• The models' inability to recognize prefix and suffix gestures correctly can be improved by adding a finger-recognition feature at the feature extraction stage. This is being examined and is part of the upcoming agenda for this SIBI translation project.

VII. REFERENCES

[1] S. Siswomartono, Cara Mudah Belajar SIBI (Sistem Isyarat Bahasa Indonesia), Jakarta: Federasi Nasional untuk Kesejahteraan Tunarungu, 2007.
[2] E. Rakun, "Pengenalan Komponen Imbuhan dan Kata Dasar pada Isyarat Kata Berimbuhan dalam SIBI (Sistem Isyarat Bahasa Indonesia) dengan menggunakan Probabilistic Graphical Model," Fakultas Ilmu Komputer, Universitas Indonesia, Depok, 2016.
[3] E. Rakun, M. Febrian Rachmadi, Andros and K. Danniswara, "Spectral domain cross correlation function and generalized learning vector quantization for recognizing and classifying Indonesian sign language," in International Conference on Advanced Computer Science and Information Systems (ICACSIS), 2012.
[4] E. Rakun, M. Andriani, I. W. Wiprayoga, K. Danniswara and A. Tjandra, "Combining depth image and skeleton data from Kinect for recognizing words in the sign system for Indonesian language (SIBI [Sistem Isyarat Bahasa Indonesia])," in International Conference on Advanced Computer Science and Information Systems (ICACSIS), 2013.
[5] E. Rakun, M. I. Fanany, I. Wisesa and A. Tjandra, "A heuristic Hidden Markov Model to recognize inflectional words in sign system for Indonesian language known as SIBI (Sistem Isyarat Bahasa Indonesia)," in International Conference on Technology, Informatics, Management, Engineering & Environment (TIME-E), 2015.
[6] University of Montreal, "Theano 1.0.0 documentation," November 21, 2017. [Online]. Available: http://www.deeplearning.net/software/theano/
[7] Google Brain Team, "TensorFlow," 2018. [Online]. Available: https://www.tensorflow.org/
[8] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

