Abstract. For over five decades, scientists have attempted to design machines that
accurately transcribe spoken words. Even though satisfactory accuracy has been
achieved, machines still cannot recognize every voice, in every environment, from
every speaker. In this paper we tackle the problem of robustness of Automatic
Speech Recognition for isolated Macedonian speech in noisy environments.
The goal is to overcome the problem of changing background noise types. Five
different types of noise were artificially added to the audio recordings, and
models were trained and evaluated for each one. The worst-case scenario for the
speech recognition system turned out to be babble noise, which at the
higher noise levels reaches an 81.10% error rate. It is shown that the error rate
increases as the noise increases, and that the model trained on clean
speech gives considerably better results at lower noise levels.
1 Introduction
During the last few years, reliance on and use of automatic speech
recognition systems has increased. They are becoming a standard part of daily life,
and people are gradually getting used to them. The crucial reason is the voice-oriented
interface, which makes man-machine interaction (MMI) very natural and straightforward for users.
Automatic speech recognition systems are widely used in medicine, in the
military, by people with disabilities, in remote control systems, dictation services,
telephone operators, the automotive industry, etc.
Automatic speech recognition is the process of converting an audio signal into a
sequence of text symbols. Generally, a speech recognition system consists of
audio signal processing, signal decoding, and adaptation. The speaker produces a
sequence of words which is transmitted through the communication channel,
whereby the waveform of the audio signal is generated. The waveform is then
forwarded to the speech recognition component, where a parameterized acoustic signal
is obtained. Using stochastic methods, the speech decoding component transforms
the parameterized acoustic signal into a sequence of symbols.
2 Related Work
The first evaluation for Speech in Noisy Environments (SPINE1) was conducted by
the Naval Research Labs (NRL) in August, 2000. The purpose of the evaluation was
Robustness of Speech Recognition System of Isolated Speech in Macedonian 199
to test existing core speech recognition technologies for speech in the presence of
varying types and levels of noise.
In terms of robustness of speech recognition systems, a lot has been done
before, especially in the early 1990s. P. Schwarz, in his doctoral thesis [1], presented
some of the properties of speech recognition systems under different conditions.
The results of many studies have demonstrated that automatic speech recognition systems
can perform very poorly when they are tested with a different type of microphone or
acoustical environment from the one with which they were trained [2].
Most of the proposed techniques for acoustical robustness are based on signal
processing procedures that enable the speech recognition system to maintain a high
level of recognition accuracy over a wide variety of acoustical environments [4, 7, 8,
9, 10]. Further, more accurate models were proposed with empirically based methods
and model-based methods, using a variety of optimal estimation procedures [11].
Some speech systems have been proposed for the Macedonian language. One
notable example is the speaker-dependent digit recognition system for
isolated words designed by Kraljevski et al. [5, 6]. They proposed a hybrid
HMM/ANN architecture and achieved an accuracy of around 85%.
It was interesting to verify the behavior of the best-performing acoustic model
in a noisy environment. Thus, the existing test data samples were processed, and
modified test data samples were produced by adding different types and levels of noise
(signal-to-noise ratio). The experiments were performed with Sphinx [12, 13].
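The noise-mixing step can be sketched as follows. This is a minimal illustration, not the exact procedure used in the experiments: it assumes NumPy, single-channel float signals, and a hypothetical helper `add_noise_at_snr` that scales the noise recording so that the mixture reaches a target signal-to-noise ratio in dB.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix `noise` into `speech` so the result has the target SNR (in dB)."""
    # Repeat and trim the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR(dB) = 10 * log10(P_speech / P_scaled_noise); solve for the scale.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Lower `snr_db` values produce noisier mixtures; SNR 0 means speech and added noise carry equal average power.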
3.1 Datasets
The audio database consists of audio recordings collected with an online web
application made especially for this purpose. The experiments are performed on
recordings of isolated speech containing 20 different proper names, the most
frequently used names in Macedonia, spoken by more than 40 speakers. The database
is composed of 2845 audio samples of these proper names, with 50 male and 50
female samples for each name. 2000 of the samples make up the training set and the
other 845 make up the test set, which draws around 50 different samples from
different speakers for each name. The training dataset is used for training the
acoustic model, while the performance of the acoustic model is evaluated on the
corresponding test set.
The fact that the samples were collected with a web application, recorded with many
different microphones in different conditions, indicates that the dataset is a relevant
representation of the environment where such a name recognition system would be used.
The training and test audio recordings have the following quality: 8 kHz sampling
rate, 16 bits per sample, one (mono) channel.
200 D. Spasovski et al.
The next part of the process was to create a language model for the recognizer. The
language model consists of the words for the most frequent proper names in Macedonia,
a phonetic dictionary, and fillers, which represent "empty" speech or silence. The phonetic
dictionary contains the phonemes the words are built from.
For the creation of the HMM-based name recognition system in Macedonian we
used the Sphinx framework. It consists of a part that provides feature extraction,
modeling and training of an acoustic model, modeling of a language model, and a
decoding part that enables testing the performance of the system.
The feature extraction part separates the speech signal into overlapping frames and
produces feature vector sequences, each containing 39 MFCC features (coefficients).
The 39-dimensional MFCC vector is created from the first 13 MFCC coefficients, 13
features that represent the speed of the signal (the first derivative of the first 13 MFCC
coefficients), and 13 features that represent the acceleration of the signal (the second
derivative of the first 13 MFCC coefficients). This 39-dimensional feature vector,
which is the most widely used in speech recognition, is used as the basic feature
vector for our speech recognition system in Macedonian.
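The construction of the 39-dimensional vector can be sketched as follows, assuming a NumPy array of shape (frames, 13) holding the base MFCC coefficients. The simple `np.gradient` differences used here stand in for the windowed regression formula Sphinx actually applies; the dimensionality logic is the same.

```python
import numpy as np

def add_deltas(mfcc):
    """Extend a (frames, 13) MFCC matrix to (frames, 39) by appending
    delta (speed) and delta-delta (acceleration) features, computed as
    finite differences along the time axis."""
    delta = np.gradient(mfcc, axis=0)    # first derivative over frames
    delta2 = np.gradient(delta, axis=0)  # second derivative over frames
    return np.hstack([mfcc, delta, delta2])
```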
We used the Baum-Welch algorithm (integrated in Sphinx) for training the
acoustic models. This algorithm estimates, i.e. adjusts, the unknown parameters of an HMM.
The performance of all acoustic models is evaluated by measuring the Word Error
Rate - WER (equation 1).
WER = (#substitutions + #deletions + #insertions) / #words × 100%   (1)
We consider this evaluation relevant because of the versatile nature of the test set,
which, in a way, tells us how the acoustic models would behave in a natural
environment.
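Equation (1) is typically computed by aligning the recognizer output against the reference transcript with word-level Levenshtein distance, so that substitutions, deletions, and insertions are counted jointly. A minimal sketch (the function name `word_error_rate` is our own, not part of Sphinx):

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: minimum (substitutions + deletions + insertions)
    needed to turn the reference into the hypothesis, divided by the
    number of reference words, computed by dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to transform ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```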
4 Results
[Figure: word error rate (WER, %) on the vertical axis (20-90) versus signal-to-noise ratio (SNR) on the horizontal axis (0-15), with one curve per noise type: babble, rain, traffic, white, and wind.]
Fig. 1. Robustness of the HMM-based speech recognition model trained on clean
speech
The worst-case scenario for the speech recognition system turned out to be
babble noise, which at the higher noise levels reaches an 81.10% error rate (Table 1).
It is obvious that the accuracy of nearly every model decreases almost linearly as the
noise increases. Rain noise does not affect the accuracy at the lower noise levels.
The other noises show similar tendencies. Quantization of information is a useful
technique for improving robustness.
Table 1. Robustness of the HMM model in speech recognition as a function of noise, with the
model trained on clean speech, in noisy test environments with babble, rain, traffic, wind, and
white noise
In the previous experiments the explored models were built under one training condition
and tested under many different test conditions. The training and testing conditions were
invariant with respect to the type of noise. The following experiments are based on babble
and traffic noise, which are among the most common noises. The models are trained on five
levels of noise, and are evaluated on each of them. Table 2 and Table 3 present
the absolute WER. The obtained results show that, to ensure a particular WER, it is
always better to train the system with a greater noise level. The system is then capable
of recognizing less distorted patterns with a satisfactory WER.
Table 2. Robustness of the HMM-based system with a model trained on babble noise, for
different training and test noise levels. The training conditions are in rows, and the testing
conditions are in columns. Matched training and test conditions are marked.
However, this does not always work, especially when the target conditions are clean
speech. The patterns which the classifier observes become totally different. In
clean speech, and in the unvoiced parts of the speech, the logarithm of the energy may be
close to −∞.
Table 3. Robustness of the HMM-based system with a model trained on traffic noise, for
different training and test noise levels. The training conditions are in rows, and the testing
conditions are in columns. Matched training and test conditions are marked.
From Table 2, relative values of WER (babble noise) can be calculated where the
noise levels match: the WER at the current noise level is subtracted from the WER of the
matched training noise level. An obtained value greater than zero means the model trained on a
particular noise level (rows) can be applied successfully to the actual testing noise
level (columns). The values above the main diagonal are cases where the training
noise level is greater than the testing noise level. The error rate for the models trained
with SNR 10 and SNR 15 is significantly improved when tests are performed over
samples with SNR 0, 5, 10, and 15. The worst results are again shown by the models trained
with SNR 0 and SNR 5.
A model trained with a higher level of noise can be used in lower-noise
conditions, with a certain tradeoff in WER.
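The relative comparison described above can be sketched as follows. The WER matrix here is purely illustrative (these are not the values from Table 2); rows are training noise levels and columns are testing noise levels.

```python
import numpy as np

# Hypothetical absolute WER values (%); rows = training SNR {0, 5, 10, 15},
# columns = testing SNR {0, 5, 10, 15}. Illustrative numbers only.
wer = np.array([
    [30.0, 28.0, 27.0, 26.0],   # trained at SNR 0
    [35.0, 25.0, 22.0, 21.0],   # trained at SNR 5
    [45.0, 30.0, 20.0, 18.0],   # trained at SNR 10
    [55.0, 35.0, 22.0, 15.0],   # trained at SNR 15
])

# Matched-condition WER: the diagonal, where training SNR == testing SNR.
matched = np.diag(wer)

# Relative WER: matched (training-level) WER minus WER at each testing level.
# A value > 0 in row i, column j means the model trained at level i performs
# better at test level j than at its own matched level.
relative = matched[:, None] - wer
```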
5 Conclusion
The discussion presented here supports the thesis that it is better to train models at
lower SNR in order to ensure the target WER at better SNR. However, to ensure full robustness it is
necessary to examine the behavior of the model under changes in the transmission
channel, or more generally, changes of the channel.
It is also shown that the error rate increases as the noise increases, and that
the model trained on clean speech gives considerably better results at lower noise
levels. That is the reason why this type of trained model is mainly used by speech
recognition systems.
In our future work we will try to evaluate different transmission channels. We
will collect more training and evaluation data. It would be interesting to experiment with
other types of background noise, and with continuous speech. Our final goal is to
create different recognizers that will be robust at a particular noise level, independent
of the type of noise.
References
1. Schwarz, P.: Phoneme recognition based on long temporal context. PhD Thesis, Faculty of
Information Technology. Department of Computer Graphics and Multimedia, Brno
University of Technology 47–60 (2008)
2. Acero, A.: Acoustical and Environmental Robustness in Automatic Speech Recognition.
PhD Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon
University, Pittsburgh, Pennsylvania (1990)
3. Sethy, S.N., Parthasarthy, S.: A split lexicon approach for improved recognition of spoken
names. Integrated Media Systems Center, Department of Electrical Engineering-Systems,
University of Southern California, Los Angeles, United States. AT&T Labs-Research,
Florham Park. Speech Communications 48(9) (2006)
4. Liu, F.H., Stern, R.M., Huang, X., Acero, A.: Efficient Cepstral Normalization for Robust
Speech Recognition. Department of Electrical and Computer Engineering, Carnegie
Mellon University (1992)
5. Kraljevski, I., Mihajlov, D., Gjorgjevik, D.: Hybrid HMM/ANN speech recognition
system in Macedonian. Faculty of Electrical Engineering, Ss. Cyril and Methodius
University, Skopje. Veterinary Institute, Skopje (2000) (in Macedonian)
6. Gerazov, B., Ivanovski, Z., Labroska, V.: Modeling of the intonation structure of the
Macedonian language at the level of intonation phrases. Faculty of Electrical Engineering and
Information Technologies, Ss. Cyril and Methodius University, Skopje. Institute of
Macedonian Language Krste Misirkov, Skopje (2012)
7. Gerazov, B., Ivanovski, Z., Labroska, V.: Modeling of the intonation structure of the
Macedonian language at the level of intonation phrases. Institute of Electronics,
Faculty of Electrical Engineering and Information Technologies, Ss. Cyril and
Methodius University, Skopje. Institute of Macedonian Language Krste Misirkov, Skopje (2012) (in Macedonian)
8. Kumar, N., Andreou, A.G.: Heteroscedastic discriminant analysis and reduced rank hmms
for improved speech recognition. Speech Communication 26, 283–297 (1998)
9. Compernolle, V.D.: Noise Adaptation in a Hidden Markov Model Speech Recognition
System. Computer Speech and Language (1989)
10. Hermansky, H., Sharma, S.: Temporal patterns (TRAPS) in ASR of noisy speech. In: Proc.
IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP), Phoenix, Arizona, USA
(1999)
11. Jyh-Shing, R.J.: Audio Signal Processing and Recognition. CS Dept., Tsing Hua
University, Taiwan (1996), http://neural.cs.nthu.edu.tw/jang/books/audiosignalprocessing/index.asp (accessed 2013)
12. Moreno, P.J.: Speech Recognition in Noisy Environments. Department of Electrical and
Computer Engineering. Carnegie Mellon University (1996)
13. Sphinx-4: A speech recognizer written in Java™,
http://cmusphinx.sourceforge.net/sphinx4/ (accessed 2013)
14. CMUSphinx Wiki. Document for the CMU Sphinx speech recognition engines,
http://cmusphinx.sourceforge.net/wiki/ (accessed 2013)