
Robustness of Speech Recognition System of Isolated Speech in Macedonian

Daniel Spasovski1, Goran Peshanski1, Gjorgji Madjarov2, and Dejan Gjorgjevikj2


1 Netcetera, Skopje, Macedonia
spasovski.daniel@gmail.com, pesanski_goran@yahoo.com
2 Faculty of Computer Science and Engineering, Skopje, Macedonia
{gjorgji.madjarov,dejan.gjorgjevikj}@finki.ukim.mk

Abstract. For over five decades, scientists have attempted to design machines that accurately transcribe spoken words. Even though satisfactory accuracy has been achieved, machines still cannot recognize every voice, in any environment, from any speaker. In this paper we tackle the problem of robustness of automatic speech recognition for isolated Macedonian speech in noisy environments. The goal is to overcome the problem of changing background noise types. Five different types of noise were artificially added to the audio recordings, and the models were trained and evaluated for each one. The worst-case scenario for the speech recognition system turned out to be babble noise, which at the higher noise levels reaches an 81.10% error rate. It is shown that as the noise increases the error rate increases as well, and that the model trained on clean speech gives considerably better results at lower noise levels.

Keywords: speech recognition, robustness, isolated speech, signal-to-noise ratio, background noise.

1 Introduction

In recent years, reliance on automatic speech recognition systems has increased. They are becoming a standard part of our daily lives, and people are gradually getting used to them. The crucial reason is the voice-oriented interface, which makes man-machine interaction (MMI) very natural and straightforward for users. Automatic speech recognition systems are widely used in medicine, in the military, by people with disabilities, in remote control systems, dictation services, telephone operators, the automotive industry, etc.
Automatic speech recognition is the process of converting an audio signal into a sequence of text symbols. Generally, a speech recognition system consists of audio signal processing, signal decoding and adaptation. The speaker produces a sequence of words that is transmitted through the communication channel, whereby the waveform of the audio signal is generated. The waveform is then forwarded to the speech recognition component, where a parameterized acoustic signal is obtained. Using stochastic methods, the speech decoding component transforms the parameterized acoustic signal into a sequence of symbols.

© Springer International Publishing Switzerland 2015
A. Madevska Bogdanova and D. Gjorgjevikj (eds.), ICT Innovations 2014,
Advances in Intelligent Systems and Computing 311, DOI: 10.1007/978-3-319-09879-1_20

Most speech recognition systems are constructed to be speaker independent, i.e. the system can be used by various users without adaptation to a specific user. The other portion of commercially used speech recognition systems require the speaker to assist in training the model in order to achieve better accuracy for that speaker. The system is thus adapted to the voice characteristics of the speaker and the background environment. Many systems reach an accuracy of 98-99% under ideal conditions. Ideal conditions are met when the speaker's voice characteristics are similar to the training data, the speaker can be adapted to the system, and the system operates in a silent environment (an environment without noise).
Research in this field, where noisy environments can be compared and evaluated and the influence of different noises on the speech recognition process can be quantified, is significant. It is important to determine the critical noises that we encounter every day. Here, the main representatives are the natural noises (babble, rain and wind noise), the industrial noises (traffic noise) and, from a scientific perspective, the synthetic noises (white noise). The experiments were made with an audio database of isolated speech in Macedonian. The database includes the 20 most frequently occurring names in Macedonia, with 2000 recordings for training the acoustic model and 845 recordings for testing it. Much related work has shown that the recognition of isolated speech is independent of the subject (topic) of the audio samples.
It is interesting to determine the signal-to-noise ratio (SNR) level at which degradation of the speech recognition system occurs. Since SNR can be easily measured, the range and the boundary within which the system achieves satisfactory results can be determined. The experiments also examine cross-noise environments, where the model is trained on one level of noise and tested on different levels. This indicates how to train the model for greater robustness when the noisy environment and the level of noise are known.
Sometimes it is useful to compare the error rate of speech recognition systems relative to the noise level in different environments. It is shown that different environments cause different changes in the error rate when the noise level is changed.
Here, only the impact of noise on speech recognition systems is presented. For a complete picture of the system's behavior, additional research is required in terms of channel modification and, more generally, the change of the channel.
In Section 2 a review of the related work is given. Section 3 describes the datasets and the experimental setup. The experimental results are presented and discussed in Section 4. Finally, the conclusions are given in Section 5.

2 Related Work

The first evaluation of Speech in Noisy Environments (SPINE1) was conducted by the Naval Research Labs (NRL) in August 2000. The purpose of the evaluation was to test existing core speech recognition technologies on speech in the presence of varying types and levels of noise.
A lot has been done on the robustness of speech recognition systems before, especially in the early '90s. P. Schwarz, in his doctoral thesis [1], presented some of the properties of speech recognition systems under different conditions. Results of many studies have demonstrated that automatic speech recognition systems can perform very poorly when tested with a different type of microphone or acoustical environment from the one with which they were trained [2].
Most of the proposed techniques for acoustical robustness are based on signal processing procedures that enable the speech recognition system to maintain a high level of recognition accuracy over a wide variety of acoustical environments [4], [7, 8, 9, 10]. Furthermore, more accurate models were proposed with empirically-based and model-based methods, using a variety of optimal estimation procedures [11].
Some speech systems have been proposed for the Macedonian language. One noticeable example is the speaker-dependent digit recognition system that recognizes isolated words, designed by Kraljevski et al. [5, 6]. They proposed a hybrid HMM/ANN architecture and achieved an accuracy of around 85%.

3 Datasets and Experimental Setup

It was interesting to verify the behavior of the best-performing acoustic model in a noisy environment. Thus, the existing test data samples were processed, and modified test data samples were produced by adding different types and levels of noise (signal-to-noise ratio). The experiments were performed with Sphinx [12, 13].
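The noise-addition step can be sketched as follows. This is a minimal NumPy illustration of mixing a noise recording into a clean recording at a target SNR; the paper does not specify the exact tool used, so the function name and the tiling of short noise samples are our assumptions.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Mix `noise` into `clean` at a target signal-to-noise ratio in dB.

    Both arguments are 1-D float arrays of signal amplitudes; the noise
    is tiled and truncated to the length of the clean recording.
    """
    # Repeat the noise if it is shorter than the clean recording.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]

    # Average power (mean squared amplitude) of each signal.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)

    # Scale the noise so that 10 * log10(p_clean / p_noise_scaled) == snr_db.
    target_p_noise = p_clean / (10.0 ** (snr_db / 10.0))
    noise = noise * np.sqrt(target_p_noise / p_noise)

    return clean + noise
```

Lower snr_db values produce noisier test samples; at SNR 0 the noise carries as much power as the speech.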

3.1 Datasets

The audio database consists of audio recordings collected with an online web application made especially for this purpose. The experiments are performed on recordings of isolated speech containing 20 different proper names, the most frequently used names in Macedonia, from more than 40 speakers. It is composed of 2845 audio samples of the proper names, with male and female samples for each name. 2000 of the samples make up the training set and the other 845 make up the test set, which contains around 50 samples from different speakers for each name. The training dataset is used for training the acoustic model, while the performance of the acoustic model is evaluated on the corresponding test set.
The fact that the samples were collected with a web application, recorded with many different microphones in different conditions, indicates that the dataset is a relevant representative of the environment in which such a name recognition system would be used. The training and test audio recordings have the following quality: 8 kHz, 16-bit, mono.

3.2 Experimental Setup

The next part of the process was to create a language model for the recognizer. The language model consists of the words for the most frequent proper names in Macedonia, a phonetic dictionary and fillers, which represent "empty" speech or silence. The phonetic dictionary contains the phonemes the words are built from.
For the creation of the HMM-based name recognition system in Macedonian we used the Sphinx framework. It consists of a part that provides feature extraction, modeling and training of an acoustic model, modeling of a language model, and a decoding part that enables testing the performance of the system.
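To illustrate the shape of these resources: a Sphinx phonetic dictionary maps each word to its phoneme sequence, and a separate filler dictionary maps silence markers to a silence unit. The entries below are hypothetical examples for two of the names (the paper does not give the phoneme set used for Macedonian); the last three lines show typical filler-dictionary entries representing "empty" speech.

```
ANA      A N A
MARIJA   M A R I J A

<s>      SIL
</s>     SIL
<sil>    SIL
```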
The feature extraction part separates the speech signal into overlapping frames and produces feature vector sequences, each containing 39 MFCC features (coefficients). The 39-dimensional MFCC vector is created from the first 13 MFCC coefficients, 13 features that represent the speed of the signal (the first derivative of the first 13 MFCC coefficients) and 13 features that represent the acceleration of the signal (the second derivative of the first 13 MFCC coefficients). This 39-dimensional feature vector, which is the most widely used in speech recognition, is used as the basic feature vector for our speech recognition system in Macedonian.
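The construction of the 39-dimensional vector from the static coefficients can be sketched as below. This is an illustrative NumPy implementation of the standard delta (regression) formula; the window width and the edge padding are assumptions, since the paper does not describe Sphinx's internal settings.

```python
import numpy as np

def deltas(feats, width=2):
    """First-order regression ('delta') coefficients over a window.

    feats: (num_frames, 13) array of static MFCCs. Implements the
    standard formula d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2)
    for k = 1..width, with edge frames padded by repetition.
    """
    n = feats.shape[0]
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, width + 1))
    d = np.zeros_like(feats, dtype=float)
    for k in range(1, width + 1):
        d += k * (padded[width + k : width + k + n]
                  - padded[width - k : width - k + n])
    return d / denom

def full_feature_vectors(mfcc):
    """Stack static, delta (speed) and delta-delta (acceleration)
    coefficients into 39-dimensional frame vectors."""
    d1 = deltas(mfcc)
    d2 = deltas(d1)
    return np.hstack([mfcc, d1, d2])
```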
We used the Baum-Welch algorithm (integrated in Sphinx) for training the acoustic models. This algorithm estimates, i.e. iteratively adjusts, the unknown parameters of an HMM. The performance of all acoustic models is evaluated by measuring the Word Error Rate (WER, equation 1).
WER = (#substitutions + #deletions + #insertions) / #reference words × 100%   (1)
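The word error rate is the minimum number of substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. It can be computed via the Levenshtein distance on word sequences, as in this sketch (the example names are hypothetical):

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: edit distance between word sequences divided by
    the number of words in the reference transcription."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                    # deleting all of ref[:i]
    for j in range(len(hyp) + 1):
        dp[0][j] = j                    # inserting all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word out of two reference words gives a WER of 50%.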

We consider this evaluation relevant because of the versatile nature of the test set, which, in some way, tells us how the acoustic models would behave in a natural environment.

4 Results

4.1 Results in Noisy Environments

The performance of the model was evaluated on 4 different signal-to-noise ratio (SNR) levels and 5 different types of noise. The natural noises are represented by rain, wind and babble noise; the main representative of the industrial noises is traffic noise, and the synthetic noises are represented by white noise. The results are shown in Fig. 1.
The obtained results show an almost linear relation between the error rate and the noise level. The error rate increases proportionally with the noise level, which indicates that the acoustic model degrades with the same intensity as the noise level added to the test set, without any dependence on the type of the noise. This means that no model is capable of learning the patterns of one noise better than the others. The degradation of WER for the different models is almost constant.

[Figure: word error rate (WER, %) plotted against signal-to-noise ratio (SNR, 0-15) for babble, rain, traffic, white and wind noise]

Fig. 1. Robustness of HMM based model for speech recognition with model trained on clean
speech

The worst-case scenario for the speech recognition system turned out to be babble noise, which at the higher noise levels reaches an 81.10% error rate (Table 1). It is obvious that the accuracy for nearly every model decreases almost linearly as the noise increases. Rain noise does not affect the accuracy at the lower noise levels. The other noises show similar tendencies. Quantization of information is a useful technique for robustness improvement.

Table 1. Robustness of HMM model in speech recognition compared to the noise with model
trained on clean speech, in noisy test environments with babble, rain, traffic, wind and white
noise

SNR (dB)  babble  rain   traffic  white  wind
0         81.10   43.00  52.40    49.30  59.60
5         67.20   34.80  40.50    38.10  46.40
10        51.40   27.50  32.80    30.80  38.30
15        37.90   24.60  28.00    27.20  29.80

4.2 Results in Cross-Noise Environments

In the previous experiments the explored models were built under one training condition and tested under many different test conditions. The training and testing conditions were invariant to the type of noise. The following experiments are based on babble and traffic noise, which are among the most common noises. The models are trained on five levels of noise and evaluated on each of them. Table 2 and Table 3 present the absolute WER. The obtained results show that, to ensure a particular WER, it is generally better to train the system with a greater noise level. The system is then capable of recognizing less distorted patterns with a satisfactory WER.

Table 2. Robustness of the HMM-based system with a model trained on babble noise, for different training and test noise levels. The training conditions are in rows and the testing conditions are in columns; matched training and test conditions lie on the main diagonal.

Train\Test  SNR0  SNR5  SNR10  SNR15  Clean
SNR0        87.3  71.6  64.4   58.8   66.6
SNR5        66.6  57.9  51.1   48.3   58.6
SNR10       61.5  49.6  43.8   42.8   53.3
SNR15       66.0  53.8  45.3   42.4   53.4
Clean       81.1  67.2  51.4   37.9   26.0

However, this does not always work, especially when the target condition is clean speech. The patterns the classifier observes become totally different. In clean speech, in the unvoiced parts of the speech, the logarithm of the energy may be close to −∞.

Table 3. Robustness of the HMM-based system with a model trained on traffic noise, for different training and test noise levels. The training conditions are in rows and the testing conditions are in columns; matched training and test conditions lie on the main diagonal.

Train\Test  SNR0  SNR5  SNR10  SNR15  Clean
SNR0        47.9  41.8  45.6   49.0   64.6
SNR5        41.5  37.9  34.3   35.3   47.6
SNR10       59.3  48.8  42.8   39.6   41.9
SNR15       78.0  64.0  55.1   47.7   48.2
Clean       52.4  40.5  32.8   28.0   26.0

From Table 2, relative WER values (babble noise) can be calculated with respect to the matched noise level: the WER at the current noise level is subtracted from the WER at the matched training noise level. A value greater than zero means the model trained on a particular noise level (rows) can be applied successfully to the actual testing noise level (columns). The values above the main diagonal are cases where the training noise level is greater than the testing noise level. The error rate for the models trained with SNR 10 and SNR 15 is significantly improved when tests are performed over samples with SNR 0, 5, 10 and 15. The worst results are again shown by the models trained with SNR 0 and SNR 5.
A model trained with a higher level of noise can thus be used in lower-noise conditions, with a particular tradeoff in WER.
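This reading of Table 2 can be reproduced numerically. The sketch below encodes the babble-noise WER matrix from Table 2 and, for each test condition, finds the training condition with the lowest WER; the sign convention for the relative values (matched-condition WER minus cross-condition WER) is our interpretation of the description above.

```python
import numpy as np

# Absolute WER from Table 2 (babble noise); rows = training condition,
# columns = test condition, in the order SNR0, SNR5, SNR10, SNR15, clean.
conds = ["SNR0", "SNR5", "SNR10", "SNR15", "clean"]
wer = np.array([
    [87.3, 71.6, 64.4, 58.8, 66.6],
    [66.6, 57.9, 51.1, 48.3, 58.6],
    [61.5, 49.6, 43.8, 42.8, 53.3],
    [66.0, 53.8, 45.3, 42.4, 53.4],
    [81.1, 67.2, 51.4, 37.9, 26.0],
])

# Relative WER: matched-condition WER (main diagonal) minus each entry
# in the same column. A positive value marks a training condition that
# beats the matched model on that test condition.
relative = np.diag(wer)[np.newaxis, :] - wer

# Best training condition for each test condition (column-wise minimum).
best = {conds[j]: conds[int(np.argmin(wer[:, j]))] for j in range(len(conds))}
```

Running this confirms the observation above: for the noisy test conditions SNR 0, 5 and 10, the model trained at SNR 10 gives the lowest WER, while for clean test speech the clean-trained model remains best.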

The WER of 26% on clean speech in a normal environment is not entirely satisfactory. However, for isolated speech it is acceptable, considering the conditions in which the training set samples were collected.
The performance of the model trained on clean speech is very interesting as well. The same experiments applied with traffic noise show similar behavior, but the results obtained are considerably better (Table 3). The behavior is almost the same as that of the model trained with babble noise, which indicates that the models behave in a similar fashion regardless of the type, or the source, of the noise.

5 Conclusion

The discussion presented here supports the thesis that it is better to train models on a lower SNR to ensure a given WER at a better SNR. However, to ensure full robustness it is necessary to examine the behavior of the model over changes in the transmission channel or, more generally, the change of the channel.
It is also shown that as the noise increases the error rate increases as well, and that the model trained on clean speech gives considerably better results at lower noise levels. That is why this type of model is the one mainly used by speech recognition systems.
In our future work we will evaluate different transmission channels and collect more training and evaluation data. It would be interesting to experiment with other types of background noise and with continuous speech. Our final goal is to create different recognizers that will be robust at a particular noise level, independent of the type of noise.

References
1. Schwarz, P.: Phoneme recognition based on long temporal context. PhD Thesis, Faculty of
Information Technology. Department of Computer Graphics and Multimedia, Brno
University of Technology 47–60 (2008)
2. Acero, A.: Acoustical and Environmental Robustness in Automatic Speech Recognition.
Phd. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon
University, Pittsburgh, Pennsylvania (1990)
3. Sethy, A., Narayanan, S., Parthasarthy, S.: A split lexicon approach for improved recognition of spoken names. Integrated Media Systems Center, Department of Electrical Engineering-Systems, University of Southern California, Los Angeles; AT&T Labs-Research, Florham Park. Speech Communication 48(9) (2006)
4. Liu, F.-H., Stern, R.M., Huang, X., Acero, A.: Efficient Cepstral Normalization for Robust Speech Recognition. Department of Electrical and Computer Engineering, Carnegie Mellon University (1992)
5. Kraljevski, I., Mihajlov, D., Gjorgjevik, D.: Hybrid HMM/ANN speech recognition system in Macedonian. Faculty of Electrical Engineering, Ss. Cyril and Methodius University, Skopje; Veterinary Institute, Skopje (2000)

6. Gerazov, B., Ivanovski, Z., Labroska, V.: Modeling of the intonation structure of the Macedonian language at the level of intonation phrases. Faculty of Electrical Engineering and Information Technologies, Ss. Cyril and Methodius University, Skopje; Institute of Macedonian Language Krste Misirkov, Skopje (2012)
7. Gerazov, B., Ivanovski, Z., Labroska, V.: Modeling of the intonation structure of the Macedonian language at the level of intonation phrases. Institute of Electronics, Faculty of Electrical Engineering and Information Technologies, Ss. Cyril and Methodius University, Skopje; Institute of Macedonian Language Krste Misirkov, Skopje (2012)
8. Kumar, N., Andreou, A.G.: Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Communication 26, 283–297 (1998)
9. Van Compernolle, D.: Noise Adaptation in a Hidden Markov Model Speech Recognition System. Computer Speech and Language (1989)
10. Hermansky, H., Sharma, S.: Temporal patterns (TRAPS) in ASR of noisy speech. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP), Phoenix, Arizona, USA (1999)
11. Jyh-Shing, R.J.: Audio Signal Processing and Recognition. CS Dept., Tsing Hua University, Taiwan (1996), http://neural.cs.nthu.edu.tw/jang/books/audiosignalprocessing/index.asp (accessed 2013)
12. Moreno, P.J.: Speech Recognition in Noisy Environments. Department of Electrical and
Computer Engineering. Carnegie Mellon University (1996)
13. Sphinx-4: Speech recognizer written in Java™, http://cmusphinx.sourceforge.net/sphinx4/ (accessed 2013)
14. CMUSphinx Wiki: Documentation for the CMU Sphinx speech recognition engines, http://cmusphinx.sourceforge.net/wiki/ (accessed 2013)
