
2020 6th International Conference on Interactive Digital Media (ICIDM)

Automatic Segmented-Syllable and Deep Learning-Based Indonesian Audiovisual Speech Recognition

Suyanto Suyanto
School of Computing, Telkom University, Bandung, Indonesia
suyanto@telkomuniversity.ac.id

Kurniawan Nur Ramadhani
School of Computing, Telkom University, Bandung, Indonesia
kurniawannr@telkomuniversity.ac.id

Satria Mandala
School of Computing, Telkom University, Bandung, Indonesia
satriamandala@telkomuniversity.ac.id

Adriana Kurniawan
School of Computing, Telkom University, Bandung, Indonesia
adriana@student.telkomuniversity.ac.id

Abstract—Many studies have proven that audiovisual speech recognition systems perform better than audio-only or visual-only ones. Nevertheless, three crucial aspects should be carefully designed: the combination of the audio and visual modalities, the acoustic model, and the feature extraction. In this paper, a deep learning-based Indonesian audiovisual speech recognition (INAVSR) system is developed. It is a combination of two models: Indonesian audio speech recognition (INASR) and Indonesian visual speech recognition (INVSR). The INASR is built using the Mel frequency cepstral coefficient (MFCC) together with Mozilla DeepSpeech (MDS) and the Kaituoxu Speech-Transformer (KST), whereas the INVSR is implemented using LipNet. A simple procedure for automatic syllable segmentation in the visual data is proposed. It serves to solve the out-of-vocabulary (OOV) word problem in recognizing speech in a sentence-level video. Evaluation on a small dataset shows that the developed speech-based INASR produces a relatively low word error rate (WER) of 22.0%, while the developed LipNet-based INVSR gives a slightly higher WER of 30.8%. The proposed automatic syllable segmentation is able to tackle the OOV word problem. Finally, an evaluation on the video dataset shows that the INAVSR system is capable of recognizing audiovisual speech in a sentence-level video. Compared to the INASR, the INAVSR provides slightly higher performance: it gives an absolute WER reduction of up to 2.0%.

Keywords—audiovisual speech recognition, automatic segmented-syllable, deep learning, Indonesian, out of vocabulary word

I. INTRODUCTION

A number of experts have proven that audiovisual speech recognition systems provide higher accuracy and are more resistant to noise than audio-only or visual-only ones [1], [2]. However, there are three issues that should be carefully addressed and solved.

The first issue is how to combine the audio and the visuals. Some experts propose two fusion models: feature fusion and decision fusion [2]. As the name implies, feature fusion combines the audio and visual features, which are then recognized together using a single classification method. Meanwhile, in decision fusion, two decisions are separately generated first and then combined to produce a final decision. Research on this integration model is still open, and many approaches, techniques, and methods can be used.
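To make the contrast concrete, the following minimal sketch illustrates the two fusion strategies; the feature vectors and the classifier objects (assumed to expose scikit-learn-style predict and predict_proba methods) are hypothetical stand-ins, not the components used later in this paper.

```python
import numpy as np

def feature_fusion_predict(audio_feat, visual_feat, joint_clf):
    # Feature fusion: concatenate both feature vectors and let a single
    # classifier make one decision on the combined representation.
    combined = np.concatenate([audio_feat, visual_feat])
    return joint_clf.predict([combined])[0]

def decision_fusion_predict(audio_feat, visual_feat, audio_clf, visual_clf):
    # Decision fusion: each classifier decides on its own modality first;
    # the two per-class score vectors are then combined (here, summed).
    audio_scores = audio_clf.predict_proba([audio_feat])[0]
    visual_scores = visual_clf.predict_proba([visual_feat])[0]
    return int(np.argmax(audio_scores + visual_scores))
```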
The second problem is that the acoustic model, which is generally based on triphones, turns out to be less resistant to high noise and dialect variations. Since the 2010s, many researchers have indicated that a syllable-based system is better than a phoneme-based one. Two advantages of syllable-based speech recognition systems are that they are more resistant to noise and adapt better to various dialect variations [3].

The third problem is the feature extraction method, which is generally based on the Mel frequency cepstral coefficient (MFCC) or the Gaussian Mixture Model (GMM), has high complexity, and therefore does not support real-time systems. Some researchers have started to use the Convolutional Neural Network (CNN) [4] to extract visual features and the Deep Neural Network (DNN) to extract audio features [2]. Both CNN and DNN belong to deep learning. Since the inception of deep learning, ASR technology has made significant advances [5], [6], [7]. A number of deep learning-based ASR models are moving toward end-to-end ASR (E2EASR), where the three main components of ASR are learned simultaneously. This makes the E2EASR capable of handling a large vocabulary while requiring less memory and processing power.

Three models that are popular today are the Sequence-to-Sequence Model (S2SM) [7], [8]; Listen, Attend and Spell (LAS) [9]; and Latent Sequence Decompositions (LSD) [10]. Since 2017, ASR has begun to be built using sub-word-based models, i.e., series of characters rather than single characters as in earlier E2EASR models [7], [10]. This allows the E2EASR to handle words that are not in the dictionary, known as out-of-vocabulary (OOV) words. Another notable development in 2017 was data augmentation for building massive acoustic models. As described in [11], given a sound corpus of 100 hours of recordings, we can augment the data by synthesizing new data from the original recordings by adding noise and modifying the pitch, formants, and other speaker characteristics [10].

In the latest developments in 2018, many researchers have been trying to develop E2EASR models capable of receiving the raw speech signal as input, without extraction by a high-complexity MFCC [12]. Of course, this simplifies and speeds up both learning and recognition in the E2EASR. However, this approach is still not mature enough, and the resulting performance is still relatively low.

In this research, an Indonesian audiovisual speech recognition system called INAVSR is developed using deep learning models. It is a combination of Indonesian audio speech recognition (INASR) and Indonesian visual speech recognition (INVSR). The INASR is implemented using feature extraction based on MFCC and classification based on MDS and KST. Next, the INVSR is built using LipNet. An automatic visual syllable segmentation model is proposed to solve the OOV word problem in recognizing speech in a sentence-level video.
II. RELATED WORK

Recent studies show that combining two ASR models based on audio-only and visual-only inputs gives better accuracy. Mroueh et al. proved it by studying the unimodal approaches separately and then combining them. The audio-based model produces a phone error rate (PER) of 41% at low noise levels, while the audiovisual fusion model produces a PER of 35.83%, even at high noise levels. The research also produced a new network architecture, namely the bilinear softmax layer, which serves to capture the specific correlations between modalities. In [13], utilizing the bilinear softmax reduces the PER to 34.03%.

Tamura et al. used a DNN to develop an AVSR system [14]. The DNN is applied to speech recognition in two ways: hybrid and tandem. In the first way, the probability of each Hidden Markov Model (HMM) state is computed using the DNN. In contrast, the second approach applies the DNN to the feature extraction. Using the CENSREC-1-AV corpus and several experiments with the DNN method, the best results were achieved using Deep Bottle-Neck Features (DBNFs) for audio and visuals with a multi-stream HMM.

After that, AVSR systems began to be developed in a multimodal manner. However, most of them are still based on 2D audiovisual data with low video sampling rates. In [15], Su et al. developed a 3D model with a dataset that has a high sampling rate (up to 100 Hz). In the visual feature extraction process, a bimodal convolutional neural network framework is used, while an LSTM-RNN is used to generate a visual modality from the audio modality. Both are then integrated with a CNN. The results showed a 27% reduction in CER compared to using audio only. When the visual modality is not available, the system uses visual feature creation techniques and is still able to reduce the CER by 18.52%. Other RNN models can also handle text sequences, such as in [16].

The application of neural networks then extends from the initial stage of feature extraction to higher levels, such as information combination and speech modeling. In [17], Rahmani et al. designed an auto-encoder to generate efficient bimodal features from audio and visual inputs. The proposed basic structure is transformed in three steps to make better use of the visual information. Compared to unimodal and bimodal baselines in phoneme recognition under noisy audio conditions, the developed model improves the performance: using a DNN-HMM classifier, the phoneme error rate (PER) is relatively reduced by 36.9%, and in a clean environment the PER is relatively reduced by 19.2%.

So far, a common problem in speech recognition is a noisy environment. For this reason, Sharmila et al. in [18] designed speech recognition in a noisy environment using the discrete wavelet transform (DWT) feature of the lip image, which is integrated with Bark Frequency Cepstral Coefficient (BFCC) audio features. The Pseudo Hue and Color Intensity approaches are used to detect the location of the lips. As a result, the use of Pseudo Hue, DWT, and an HMM as the classifier achieves an accuracy of 80%, while the use of Color Intensity, DWT, and HMM results in an accuracy of 78%. Focusing on the same problem, Aralikatti et al. developed an RNN-HMM model to combine information from initially separate audio and visual modalities [19]. It allows separate acoustic and visual model training. Modality amalgamation occurs in the decoding process using shallow fusion. The best results were obtained on the GRID corpus, namely a WER of 1.07% for audio-only, 12.92% for visual-only, and 1.48% for the combination of both.

In a recent study [20], Paraskevopoulos et al. used the Transformer architecture in the development of audiovisual automatic speech recognition (AV-ASR). The focus is on the context of the scene provided by the visual information. Audio feature extraction is carried out in the Transformer encoder layers, and video features are incorporated through a multi-head attention layer. In addition, there is also multitask training for multi-resolution ASR, where the model is trained to produce character- and subword-level transcriptions. The experimental results on the How2 dataset show that multi-resolution training can accelerate convergence by about 50% and relatively improve the WER by 18% compared to the subword prediction model. The combined visual information also improves accuracy by 3.76% over the audio-only model.

III. PROPOSED MODEL

The developed INAVSR system is illustrated in Fig. 1. First, the video dataset is split into two subsets of features: audio and visual. These two significantly different feature sets are used to construct (train) two classification models separately, namely the audio classifier and the visual classifier. Next, both classifiers are combined to produce a new classifier, called the classifier fusion, to obtain the final recognition. In the testing phase, the classifier fusion is used to recognize the given video input.

Fig. 1. Developed INAVSR system.
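As a rough, non-authoritative illustration of the pipeline in Fig. 1, the sketch below wires the stages together; extract_audio_features, extract_visual_frames, the two model objects, and fuse_and_decode are hypothetical placeholders for the MFCC front end, the lip frames, the MDS/KST and LipNet classifiers, and the fusion rule described in the following subsections.

```python
def recognize_video(video_path, audio_model, visual_model,
                    extract_audio_features, extract_visual_frames,
                    fuse_and_decode):
    # 1. Split the input video into its two streams.
    audio_input = extract_audio_features(video_path)   # e.g. MFCC frames
    visual_input = extract_visual_frames(video_path)   # e.g. lip-region frames

    # 2. Run the two classifiers separately; each is assumed to return a
    #    dict mapping candidate syllables to confidence scores.
    audio_conf = audio_model(audio_input)
    visual_conf = visual_model(visual_input)

    # 3. Combine both sets of scores into the final recognition result
    #    (the confidence-based fusion itself is detailed in Section III-E).
    return fuse_and_decode(audio_conf, visual_conf)
```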
A. Video Dataset

The dataset used in this research consists of both text and video. The text corpus is the mother sentence set containing 10 million sentences described in [21], which is used for building a language model [22] as well as for constructing the transcription texts in the recording process. To build the language model, the main text corpus is trained into a 4-gram model using a tool called KenLM. Meanwhile, to construct the transcription texts, the corpus of 10 million sentences is reduced to only 4,030 sentences, which contain the same syllable content as the primary sentence set, using a greedy algorithm [23] (a sketch of this selection idea is given at the end of this subsection). Furthermore, the video corpus is constructed by recording 500 speakers, with variations of dialect, gender, and age, reading the transcription texts aloud. The details are as follows:

• The text corpus contains 4,030 sentences divided into 50 subsets, of which 20 subsets contain 80 sentences and the rest contain 81 sentences;
• The number of speaker respondents is 500 people, where each respondent reads 80 or 81 sentences according to the transcription in the subsets;

• Five dialects: Javanese (East Java, Central Java, and Yogyakarta), Sundanese (West Java), Malay (Sumatra), Bugis (Sulawesi), and Banjar (Kalimantan). Thus, there are 100 speakers for each dialect;

• Two genders: male and female, 50 people each for each of the dialects;

• Five age categories, based on the Indonesian Department of Health:
  - Early Adolescence: 12 to 16 years (10 people for each dialect and gender)
  - Late Adolescence: 17 to 25 years (10 people for each dialect and gender)
  - Early Adult: 26 to 35 years (10 people for each dialect and gender)
  - Late Adult: 36 to 45 years (10 people for each dialect and gender)
  - Senior citizens: over 46 years old (10 people for each dialect and gender).

Hence, the video corpus contains 40,300 videos with balanced variations of dialect, gender, and age. The video file format is MP4 with Full HD resolution and 30 frames per second (fps). Furthermore, each video is annotated (segmented and labeled) based on its syllable content. Since the manual annotation process is challenging and time-consuming, only 50 videos are selected for the training and testing sets.
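The greedy extraction of the 4,030-sentence transcription set mentioned above is described in [21], [23]. As an illustration of the general idea only, and not of the authors' exact algorithm, the sketch below repeatedly picks the sentence that covers the largest number of still-missing syllables; the syllabify function is a hypothetical Indonesian syllabifier.

```python
def greedy_select(corpus_sentences, syllabify, target_syllables):
    """Pick a small sentence subset that still covers the target syllables.

    corpus_sentences : list of sentences (e.g. the 10-million-sentence corpus)
    syllabify        : hypothetical function mapping a sentence to its syllables
    target_syllables : set of syllables that must be covered
    """
    uncovered = set(target_syllables)
    selected = []
    while uncovered:
        # Choose the sentence that covers the most still-missing syllables.
        best = max(corpus_sentences,
                   key=lambda s: len(uncovered & set(syllabify(s))))
        gain = uncovered & set(syllabify(best))
        if not gain:          # nothing left to gain: stop early
            break
        selected.append(best)
        uncovered -= gain
    return selected
```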
B. Data Preprocessing

The dataset is preprocessed using the MFCC. Firstly, the speech input is divided into blocks of 20 milliseconds. Secondly, windowing is applied to each block. Using a fast Fourier transform (FFT), the time-domain waveforms are then transformed into the frequency domain. A warping process over the FFT spectrum then generates the Mel frequency scale. Eventually, the Mel-frequency representation is transformed back into the time (cepstral) domain.
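The steps above follow the standard MFCC pipeline. As a minimal sketch using the librosa library (the 20 ms block length comes from the text above, while the sampling rate, hop length, and number of coefficients are assumptions):

```python
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    # Load the speech signal; sr, n_mfcc, and the 50% hop are assumed values.
    signal, sr = librosa.load(wav_path, sr=sr)
    frame_len = int(0.020 * sr)      # 20 ms blocks, as described above
    hop_len = frame_len // 2         # overlap between adjacent blocks
    # Windowing, FFT, Mel-filterbank warping, and the transform back to the
    # cepstral domain are all performed internally by librosa.feature.mfcc.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=hop_len)
    return mfcc.T                    # shape: (num_frames, n_mfcc)
```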
C. Learning Process

The INAVSR learning process is carried out using the same techniques and methods as the INASR and INVSR learning systems. A number of experiments are then carried out to build multiple INASR base models using MDS [5] and multiple INVSR base models using LipNet [24]. Furthermore, an ensemble learning-based model is then built from several base models.

D. Automatic Syllable Segmentation

Syllable segmentation can be applied to either speech [25] or video [26]. The challenge in automatic syllable segmentation is that the time a person needs to pronounce one syllable varies. For example, in videos with a frame rate of 30 fps, some people may need only five frames while others need up to eight frames or more, whereas the input to deep learning has a fixed size. Therefore, before arriving at the output, several comparisons and experiments are carried out.

Since a video consists of one sentence, segmentations of five, six, seven, and eight frames are carried out for each syllable. These frame lengths are obtained from analyzing the average number of frames a person needs to pronounce a syllable. The predictions from the four segment lengths are then voted on, so that the most dominant syllable among the four segment lengths is selected. After that, a threshold ratio is applied: if the confidence level of the prediction is below the threshold, the syllable is removed from the prediction results.
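A minimal sketch of this multi-length segmentation with voting and a confidence threshold is given below, under the assumption that a hypothetical predict_syllable function (a stand-in for the LipNet-based classifier) returns a (syllable, confidence) pair for a window of frames.

```python
from collections import Counter

def segment_and_vote(frames, predict_syllable, start, threshold=0.8,
                     window_lengths=(5, 6, 7, 8)):
    """Predict the syllable starting at `start` using four window lengths.

    frames           : list of video frames (30 fps)
    predict_syllable : hypothetical classifier, returns (syllable, confidence)
    threshold        : minimum confidence to keep the prediction
    """
    votes = []
    confidences = {}
    for length in window_lengths:
        window = frames[start:start + length]
        syllable, conf = predict_syllable(window)
        votes.append(syllable)
        # Keep the highest confidence seen for each candidate syllable.
        confidences[syllable] = max(conf, confidences.get(syllable, 0.0))

    # Majority vote over the four window lengths.
    winner, _ = Counter(votes).most_common(1)[0]

    # Discard the prediction if its confidence falls below the threshold.
    if confidences[winner] < threshold:
        return None
    return winner
```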
E. Classifier Fusion

The classifier fusion is simply implemented using confidence-based decision-level fusion with the sum of confidence scores, as used in [27]. The confidence scores are separately generated by the two individual classifiers, one using speech only and one using visuals only. The fusion confidence score of a given syllable s is formulated as

f(s) = \sum_i c_i(s),    (1)

where c_i(s) is the confidence score of the syllable s in classifier i.

The candidate syllable with the highest fusion confidence score f(s) is then selected as the recognized syllable r, which is mathematically written as

r = \arg\max_s f(s).    (2)
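Equations (1) and (2) translate directly into a few lines of code. The sketch below assumes each classifier returns a dict of per-syllable confidence scores, matching the fuse_and_decode placeholder used in the Section III sketch.

```python
def fuse_and_decode(*classifier_scores):
    """Confidence-based decision-level fusion, following Eqs. (1) and (2).

    Each argument is a dict {syllable: confidence} produced by one classifier
    (here: the speech-only and the visual-only classifier).
    """
    fused = {}
    for scores in classifier_scores:             # sum over classifiers i, Eq. (1)
        for syllable, confidence in scores.items():
            fused[syllable] = fused.get(syllable, 0.0) + confidence
    # Pick the candidate syllable with the highest fused score, Eq. (2).
    return max(fused, key=fused.get)

# Example: fuse_and_decode({"ma": 0.7, "na": 0.2}, {"ma": 0.5, "ba": 0.4})
# returns "ma", since its fused score 1.2 is the highest.
```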
F. Evaluation

The learned INAVSR is finally evaluated using the word error rate (WER), which is formulated as

\mathrm{WER} = E / N,    (3)

where E is the number of errors at the word level, and N is the total number of words appearing in the testing set.
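A minimal computation of Eq. (3) is sketched below, under the common assumption that the word-level errors E are counted as the edit distance (substitutions, deletions, insertions) between the reference and the hypothesis; the paper itself only states that E is the number of word-level errors.

```python
def word_error_rate(reference, hypothesis):
    """WER = E / N, with E the word-level edit distance and N = len(reference)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```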
IV. RESULTS AND DISCUSSION

Firstly, the INASR is evaluated using the audio-only dataset. In this research, we gradually tune the MDS parameters: the number of hidden neurons, the learning rate, the number of epochs, and the dropout rate. A too-large learning rate makes the training and validation costs converge too quickly at the beginning of the iterations and get stuck in a local optimum. In contrast, a too-small learning rate reduces the costs only slowly from beginning to end.

Meanwhile, too few epochs produce an underfit MDS with a high WER, while too many epochs give an overfit MDS with a WER close to 100%. The best MDS parameters obtained in this research are 512 hidden neurons, a learning rate of 0.00005, 150 epochs, and a dropout rate of 0.2275, which give the lowest WER of 23.10%. Meanwhile, five structures of the KST are generated. Table I shows that a small KST structure of six encoders and six decoders gives low performance, producing a high WER of 30%. The optimum structure is ten encoders and ten decoders, which gives the lowest WER of 22.0%.
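For orientation, the MDS tuning described above roughly corresponds to a Mozilla DeepSpeech training run such as the sketch below; the flag names follow the DeepSpeech training script as commonly documented and should be verified against the installed version, and the CSV paths are placeholders for the corpus splits.

```python
import subprocess

# Best MDS hyperparameters found in this research (WER 23.10% on audio only).
# Flag names may differ between DeepSpeech versions; paths are placeholders.
subprocess.run([
    "python", "DeepSpeech.py",
    "--train_files", "train.csv",
    "--dev_files", "dev.csv",
    "--test_files", "test.csv",
    "--n_hidden", "512",
    "--learning_rate", "0.00005",
    "--epochs", "150",
    "--dropout_rate", "0.2275",
], check=True)
```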

TABLE I. WER PRODUCED BY THE AUDIO-ONLY ASR

Structure of KST            WER
6 encoders, 6 decoders      30.0%
8 encoders, 8 decoders      28.0%
10 encoders, 10 decoders    22.0%
5 encoders, 10 decoders     26.0%
10 encoders, 5 decoders     24.0%

Secondly, the INVSR is evaluated using the visual-only dataset. In this research, three thresholds are used to see their impact on the syllable segmentation. Table II shows that the optimum threshold is 0.8, which gives the lowest WER of 30.8%. Finally, the INAVSR is investigated. Table III shows that the INAVSR obtains the lowest WER: it gives an absolute reduction of up to 2%, reaching 20.0%.

TABLE II. IMPACT OF THE SEGMENTATION THRESHOLD ON THE WER PRODUCED BY THE VISUAL-ONLY ASR

Syllable segmentation threshold    WER
0.7                                34.6%
0.8                                30.8%
0.9                                42.3%

TABLE III. WER PRODUCED BY AUDIO-ONLY, VISUAL-ONLY, AND AUDIOVISUAL ASR

ASR method          WER
Audio-only ASR      22.0%
Visual-only ASR     30.8%
Audiovisual ASR     20.0%

V. CONCLUSION

The developed deep speech-based INASR produces a relatively low WER of 22%. Meanwhile, the developed LipNet-based INVSR gives a slightly higher WER of 30.8%. The proposed automatic syllable segmentation in the visual data is able to tackle the problem of OOV words. Finally, a deep learning-based INAVSR system is successfully developed. It is capable of recognizing audiovisual speech in a sentence-level video. Compared to the INASR system, it gives an absolute WER reduction of up to 2.0%. In the future, the INAVSR system can be designed using feature fusion instead of classifier fusion, and a bigger dataset should be used for a more comprehensive evaluation.

ACKNOWLEDGMENT

This research is fully funded by the Ministry of Research and Technology/National Research and Innovation Agency (Kementerian Riset dan Teknologi/Badan Riset dan Inovasi Nasional, KemenRistek/BRIN) under the Master Thesis Research scheme (Penelitian Tesis Magister), contract number 005/SP2H/AMD/LT-MONO/LL4/2020.

REFERENCES

[1] A. Thanda and S. M. Venkatesan, "Audio Visual Speech Recognition using Deep Recurrent Neural Networks," CoRR, vol. abs/1611.0, 2016.
[2] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata, "Audio-visual speech recognition using deep learning," Appl. Intell., vol. 42, no. 4, pp. 722–737, 2015.
[3] R. Janakiraman, J. C. Kumar, and H. A. Murthy, "Robust syllable segmentation and its application to syllable-centric continuous speech recognition," in National Conference on Communications (NCC), 2010, pp. 1–5.
[4] A. A. Nugraha, A. Arifianto, and Suyanto, "Generating Image Description on Indonesian Language using Convolutional Neural Network and Gated Recurrent Unit," in ICoICT, 2019, pp. 1–6.
[5] D. Amodei et al., "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin," arXiv, 2015. http://arxiv.org/abs/1512.02595.
[6] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "Deep Reinforcement Learning: A Brief Survey," IEEE Signal Process. Mag., vol. 34, no. 6, pp. 26–38, 2017.
[7] C.-C. Chiu et al., "State-of-the-art Speech Recognition With Sequence-to-Sequence Models," arXiv, 2017. http://arxiv.org/abs/1712.01769.
[8] J. Chorowski and N. Jaitly, "Towards better decoding and language model integration in sequence to sequence models," arXiv, 2016.
[9] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964.
[10] W. Chan and Y. Zhang, "Latent Sequence Decompositions," arXiv, pp. 1–12, 2017.
[11] I. Rebai, Y. Benayed, W. Mahdi, and J. P. Lorré, "Improving speech recognition using data augmentation and acoustic model fusion," Procedia Comput. Sci., vol. 112, pp. 316–322, 2017.
[12] N. Zeghidour, N. Usunier, G. Synnaeve, R. Collobert, and E. Dupoux, "End-to-End Speech Recognition From the Raw Waveform," arXiv, 2018. http://arxiv.org/abs/1806.07098.
[13] Y. Mroueh, E. Marcheret, and V. Goel, "Deep multimodal learning for Audio-Visual Speech Recognition," in 2015 IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2015, pp. 2130–2134.
[14] S. Tamura et al., "Investigation of DNN-Based Audio-Visual Speech Recognition," IEICE Trans. Inf. Syst., vol. E99-D, pp. 2444–2451, 2016.
[15] R. Su, L. Wang, and X. Liu, "Multimodal learning using 3D audio-visual data for audio-visual speech recognition," in 2017 Int. Conf. Asian Lang. Process. (IALP), 2017, pp. 40–43.
[16] R. Adelia, S. Suyanto, and U. N. Wisesty, "Indonesian Abstractive Text Summarization Using Bidirectional Gated Recurrent Unit," Procedia Comput. Sci., vol. 157, pp. 581–588, 2019.
[17] M. H. Rahmani, F. Almasganj, and S. A. Seyyedsalehi, "Audio-visual feature fusion via deep neural networks for automatic speech recognition," Digit. Signal Process., vol. 82, pp. 54–63, 2018.
[18] Sharmila, M. A. N., A. Neeta, V. Vineeta, and M. Sarika, "Hindi speech audio visual feature recognition," vol. 29, no. 5, pp. 1734–1743, 2020.
[19] R. Aralikatti et al., "Audio-Visual Decision Fusion for WFST-based and seq2seq Models," arXiv, vol. abs/2001.1, 2020.
[20] G. Paraskevopoulos, S. Parthasarathy, A. Khare, and S. Sundaram, "Multiresolution and Multimodal Speech Recognition with Transformers," arXiv, vol. abs/2004.1, 2020.
[21] F. Alfiansyah and Suyanto, "Partial greedy algorithm to extract a minimum phonetically-and-prosodically rich sentence set," Int. J. Adv. Comput. Sci. Appl., vol. 9, no. 12, pp. 530–534, 2018.
[22] S. N. Hidayatullah and Suyanto, "Developing an Adaptive Language Model for Bahasa Indonesia," Int. J. Adv. Comput. Sci. Appl., vol. 10, no. 1, pp. 488–492, 2019.
[23] B. N. Budi, Nurtomo, and S. Suyanto, "Greedy Algorithms to Optimize a Sentence Set Near-Uniformly Distributed on Syllable Units and Punctuation Marks," Int. J. Adv. Comput. Sci. Appl., vol. 9, no. 10, pp. 291–296, 2018.
[24] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, "LipNet: End-to-End Sentence-level Lipreading," 2017.
[25] S. Suyanto and A. E. Putra, "Automatic Segmentation of Indonesian Speech into Syllables using Fuzzy Smoothed Energy Contour with Local Normalization, Splitting, and Assimilation," J. ICT Res. Appl., vol. 8, no. 2, pp. 97–112, 2014.
[26] A. Kurniawan and S. Suyanto, "Syllable-Based Indonesian Lip Reading Model," in 2020 8th International Conference on Information and Communication Technology (ICoICT), Jun. 2020, pp. 1–6.
[27] Z. Yao, Z. Wang, W. Liu, Y. Liu, and J. Pan, "Speech emotion recognition using fusion of three multi-task learning-based classifiers: HSF-DNN, MS-CNN and LLD-RNN," Speech Commun., vol. 120, pp. 11–19, 2020.

