Abstract—Due to the profound differences between the acoustic characteristics of neutral and whispered speech, the performance of traditional automatic speech recognition (ASR) systems trained on neutral speech degrades significantly when whisper is applied. In order to analyze this mismatched train/test situation in depth and to develop an efficient way to recognize whisper, this study first analyzes the acoustic characteristics of whispered speech, addresses the problems of whispered speech recognition in mismatched conditions, and then proposes new robust cepstral features and a preprocessing approach based on a deep denoising autoencoder (DDAE) that enhance whisper recognition. The experimental results confirm that Teager-energy-based cepstral features, especially TECCs, are more robust and better whisper descriptors than traditional Mel-frequency cepstral coefficients (MFCC). Further detailed analysis of cepstral distances, distributions of cepstral coefficients, confusion matrices, and experiments with inverse filtering proves that voicing in speech stimuli is the main cause of word misclassification in mismatched train/test scenarios. The new framework based on DDAE and TECC features significantly improves whisper recognition accuracy and outperforms the traditional MFCC and GMM-HMM (Gaussian mixture density—Hidden Markov model) baseline, resulting in an absolute 31% improvement in whisper recognition accuracy. The achieved word recognition rate in the neutral/whisper scenario is 92.81%.

Index Terms—Automatic speech recognition, deep denoising autoencoder, inverse filtering, Teager-energy operator, whispered speech recognition.

Manuscript received October 31, 2016; revised May 21, 2017; accepted June 17, 2017. Date of current version November 27, 2017. This work was supported by the Serbian Ministry of Education, Science and Technological Development under Grants TR 32032 and OI 178027. The guest editor coordinating the review of this manuscript and approving it for publication was Prof. Tanja Schultz. (Corresponding author: Đorđe T. Grozdić.) Đ. T. Grozdić is with the School of Electrical Engineering, University of Belgrade, Belgrade 11120, Serbia (e-mail: djordjegrozdic@gmail.com). S. T. Jovičić is with the School of Electrical Engineering, University of Belgrade, Belgrade 11120, Serbia, and also with the Life Activities Advancement Center, Laboratory for Forensic Acoustics and Phonetics, Belgrade 11000, Serbia (e-mail: jovicic@etf.rs). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASLP.2017.2738559

I. INTRODUCTION

MODERN-DAY automatic speech recognition (ASR) systems have shown good performance and seen evident commercial use, while at the same time displaying a considerable number of weaknesses, flaws, and problems in practical deployment. In order to achieve more efficient and higher-quality human-computer interaction, these problems have yet to be solved. Despite much research on improving ASR systems over recent years, in some situations these systems still remain both significantly sensitive and unreliable. ASR performance may be affected by various factors, including speech type and quality and individual speaker characteristics such as speech rate and style, dialect, vocal tract anatomy, the speaker's psycho-physical state, etc. Furthermore, other influences may also occur, particularly those originating from the surrounding environment: ambient noise, reverberation, loudness, etc. Apart from the neutral mode, speech also occurs in different modalities, such as emotional speech, Lombard-effect speech, and whispered speech. All of these atypical speech modes present challenging problems for ASR.

Whisper is a specific form of verbal communication that is frequently used in different situations. Firstly, it is employed to create a discreet and intimate atmosphere in conversation; secondly, it is used to protect confidential and private information from uninvolved parties. Speakers also whisper when they do not want to disturb other people, for example in a library or during a business meeting, but whisper appears in criminal activities as well, e.g., when criminals try to disguise their identity. Besides such conscious production of whisper, it may also occur due to health problems such as chronic disease of the larynx structures [1].

Whisper has become a research topic of interest, essentially important for speech technologies, mainly because of its substantial difference from normally phonated (neutral) speech, primarily due to the absence of glottal vibrations, its noisy structure, and its lower SNR (Signal-to-Noise Ratio) [2]. Most current neutral-speech-oriented ASR interfaces are not capable of handling such an acoustic mismatch. Therefore, the performance of neutral-trained ASR systems degrades significantly when whisper is applied. Nonetheless, research on automatic whisper recognition is still in its beginnings, and there have been only a few studies so far. The literature shows several approaches that have been proposed to alleviate this acoustic mismatch through model adaptation [3]–[7], feature transformations [5], or the use of alternative sensing technologies such as a throat microphone [8]. There is also an audio-visual approach to isolated word recognition under whispered speech conditions [9]. The vast majority of studies on whisper recognition use a GMM-HMM (Gaussian mixture density—Hidden Markov model) system and traditional MFCC (Mel-Frequency Cepstral Coefficients) or PLP (Perceptual Linear Prediction) features.

Our contribution to the literature is to demonstrate that by using state-of-the-art acoustic modeling techniques such as
2314 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 25, NO. 12, DECEMBER 2017
the deep denoising autoencoder (DDAE), a significant increase in whisper recognition performance can be attained without the need for model adaptation. The DDAE is applied to transform cepstral feature vectors of whispered speech into clean cepstral feature vectors of neutral speech. More specifically, feature extraction by a DDAE is achieved by training the deep neural network to predict original neutral speech features from pseudo-whisper features that are artificially generated by inverse filtering of neutral speech data. The acquired cepstral features are then concatenated with the original cepstral features and processed with a conventional GMM-HMM recognizer to conduct an isolated word recognition task. The advantage of this feature mechanism is that whisper-robust features are acquired by a rather simple procedure. Regarding the input feature vectors, three different types of cepstral coefficients were tested: traditional MFCC (Mel-Frequency Cepstral Coefficients) and two more recent types, TECC (Teager Energy Cepstral Coefficients) and TEMFCC (Teager Energy based Mel-Frequency Cepstral Coefficients) [10]. The latter two features use the nonlinear TEO (Teager-Energy Operator) [9], [10], which serves for Teager energy calculation. So far, nonlinear TEO-based features have yielded very promising results in ASR of quiet and unvoiced murmured speech, as well as in speech classification under stress and noisy conditions [9]–[12]. On the other hand, the characteristics of whispered speech can be considered similar to those of non-audible murmur, noise-corrupted speech, and speech under stress. Because of these similarities, it was expected that TECC and TEMFCC could be good descriptors of whispered speech. The experiments on whispered speech recognition were conducted on a special database created for this study, entitled Whi-Spe [13], which contains 10000 audio recordings of parallel neutral and whispered speech.

Another important task of this study was to analyze the confusion in mismatched train/test conditions. The situation in which a speaker in front of an ASR system switches from neutral speech to whisper makes the neutral/whisper scenario particularly interesting, since it corresponds to a real situation. In order to better understand these mismatched scenarios and to develop a system that provides better whisper recognition under such conditions, confusion in word recognition was analyzed in depth. Analyses of acoustic characteristics, distributions of cepstral coefficients, cepstral distances, and confusion matrices were carried out. Based on these results, a new speech signal pre-processing approach was proposed. This approach includes TECC features, inverse filtering, and the DDAE, which reduces the difference between neutral speech and whisper and improves word recognition rates in mismatched scenarios.

The remainder of this study is organized as follows. In Section II, we briefly review related work on whisper recognition. Section III introduces the speech corpus that was specially constructed for this study and describes the nature of whispered speech; here, the analyses of whisper characteristics and of differences from neutral speech are presented. In Section IV, we introduce the general framework of the proposed ASR system and describe feature extraction, the DDAE, and the GMM-HMM recognizer. In the next section, we conduct isolated word recognition experiments to evaluate different cepstral features, inverse filtering, and the application of the deep denoising autoencoder in mismatched train/test conditions. Finally, conclusions are presented in Section VI.

II. RELATED WORK

Automatic recognition of whispered speech is an ongoing and active field of research that is hindered by the lack of suitable and systematically collected corpora. Currently, there are only several existing databases of parallel neutral and whispered speech, collected for English [13], [14], [15], Japanese [3], Serbian [13], and Mandarin [17]. Most of them have a small or medium-sized vocabulary, while only a few are transcribed and phonetically balanced.

One of the first experiments with automatic whisper recognition was started by researchers at the University of Nagoya [3]. They intended to develop a speech recognizer specifically capable of handling whisper on cell phones in noisy conditions. Using the HMM technique and MFCC features, they analyzed different mismatched train/test scenarios with three speech modes: whisper, low-voice speech, and neutral speech. The results of these experiments show severe degradation in ASR when using mismatched data. The outstanding result, however, is that the whispered speech model (an ASR model trained on whisper) performs equally well on either type of speech. This phenomenon, among other things, will be explained and experimentally proved in our study. Next, the authors [3] confirm that covering the mouth and the cell phone with a hand can increase the SNR in a noisy environment and may improve whisper and low-voice speech recognition to some extent. It was also shown that ASR systems trained on neutral speech can be adapted to whisper recognition using a small amount of whispered speech data, which can improve whisper recognition. Utilizing such a speaking-style-independent model yields a whisper recognition accuracy of about 66%.

Later studies of whisper recognition attempt to reduce this acoustic mismatch in different ways, for example through model adaptation [4]–[6], [17] and feature transformations [5]. There are also studies focused on front-end feature extraction strategies [14], [18] and front-end filter bank redistribution based on subband relevance [15]. The efficiency of Vocal Tract Length Normalization (VTLN) [20] and the shift transform [21] for whisper recognition was also investigated. Several studies deal with ASR model adaptation to whisper recognition when small amounts of whisper are available [19], [21]. In [20], a Vector Taylor Series (VTS) based approach to generating pseudo-whisper adaptation samples was investigated. All of the above-mentioned papers use HMM as the ASR system. There are only two studies that investigate a neural network-based approach to whisper recognition [16], [22]. There is also an audio-visual approach for isolated digit recognition under whispered and neutral speech [9]. In terms of feature extraction, only basic MFCC and PLP features have been tested in whisper recognition.

Although some improvement is achieved in each study, successful and usable recognition of whisper is still an unsolved problem and an open research topic.
GROZDIĆ AND JOVIČIĆ: WHISPERED SPEECH RECOGNITION USING DEEP DENOISING AUTOENCODER AND INVERSE FILTERING 2315
III. WHISPER
A. Corpus of Neutral/Whispered Speech
Given the fact that there is a lack of extensive, appropriate
and publicly available databases in this area, the special speech
corpus for Serbian language, entitled “Whi-Spe” (abbreviated
from Whispered Speech), was developed for the purpose of this
study. The corpus consists of two parts: 5000 audio recordings of spoken isolated words in neutral speech and 5000 recordings of the same words in whisper. The corpus contains 50 different words that were taken from the GEES speech database [24] and preserve its balance of linguistic features. Both whispered and neutral speech were collected from 5 male and 5 female speakers, with typical pronunciation of speech and whisper and normal hearing. Each speaker read all 50 words ten times in both speech modes, so the Whi-Spe corpus contains 10000 recorded words. The recording process was carried out under quiet laboratory conditions in a sound booth with
a high-quality omnidirectional microphone. Speech data were digitized at a sampling frequency of 22050 Hz with 16 bits per sample. More information about the Whi-Spe corpus can be found in [13].

Fig. 1. Waveforms, spectrograms, and spectra of the word "pijatsa" ("market" in English) in neutral (left) and whispered (right) speech.
distortions caused by characteristics of communication channels or recording devices. However, CMS is only partially effective in reducing the effects of additive environmental noise.

The other two cepstral coefficient types are relatively new and more robust to noise interference; moreover, they have not yet been tested on whispered speech recognition. Their basic extraction algorithm is similar to that of the MFCC feature, but includes one significant difference in terms of energy estimation. For TEMFCC and TECC calculation, the nonlinear Teager-energy operator (TEO) [29] is used to estimate the Teager energy in place of the standard energy calculation (Standard-Energy Operator, SEO: x(t)²). The main idea behind using the TEO is nonlinear modeling of speech production. Traditional acoustic theory assumes that airflow from the vocal tract propagates as a plane wave. However, this assumption may not hold, since the flow is actually separate and concomitant vortices are distributed throughout the vocal tract. The true source of speech production is actually the vortex-flow interactions, which are nonlinear [30]. Since whispered speech has nonlinear and extremely turbulent airflow, the TEO provides an efficient way to process the signal. Contrary to the traditional signal energy approximation, which only takes into account the kinetic energy of the signal's source, the TEO computes the "true" total source energy of a resonance signal, incorporating both amplitude and frequency information [10]. In other words, the total source energy, i.e., the sum of the potential and kinetic energy, is proportional to the product of amplitude squared and frequency squared, which is nothing but the true energy required to generate the speech signal [31]. This supplementary information can improve the time-frequency description of rapid energy changes within a glottal cycle [29] and improve the representation of formant information in the feature vectors. For continuous-time signals, the TEO is defined as [32]:

Ψ(x(t)) = ẋ(t)² − x(t)ẍ(t), (2)

where ẋ(t) and ẍ(t) are the first and second time derivatives of the speech signal x(t). A further advantage of the TEO is that it is relatively easy and fast to compute, which could benefit real-time continuous speech recognition. For discrete-time signals, the TEO is nearly instantaneous and requires only three adjacent signal samples at each time instance [32]:

Ψ(x[n]) = x[n]² − x[n − 1]x[n + 1]. (3)

In addition to the TEO, the TECC feature is distinguished from the MFCC and TEMFCC features by one more characteristic: the use of a Gammatone filterbank [10] instead of the Mel filterbank. In this research, the Gammatone filterbank contains the same number of filters (L = 30) as the Mel filterbank used in MFCC and TEMFCC feature extraction. The Gammatone filters are asymmetric, of non-constant-Q type, and smoother and broader than the Mel filters, which are symmetric, triangular, and of constant-Q type. The Gammatone filterbank is an auditory-motivated filterbank created to simulate the behavior of the cochlea and basilar membrane, and thereby human sound perception. It is also superior to the Mel filterbank, providing higher robustness when the signal is degraded by additive noise and other harmful interferences [10]. Recent studies have shown that human physiology dictates auditory filter bandwidths given by the ERB (Equivalent Rectangular Bandwidth) function [10]:

ERB(fc) = 6.23(fc/1000)² + 93.39(fc/1000) + 28.52, (4)

where fc is the central frequency in Hz. The impulse response of the asymmetrical Gammatone filter is then given by:

g(t) = A t^(η−1) e^(−2πb·ERB(fc)·t) cos(2πfc·t), (5)

where A, b, and η (the filter order) are the Gammatone design parameters, and fc is the filter central frequency.

Finally, in order to describe speech signal dynamics more thoroughly, the MFCC, TECC, and TEMFCC feature vectors were concatenated with their first (delta) and second (delta-delta) time derivatives. In this way, every frame was represented by a 36-dimensional vector (12 static features + 12 delta + 12 delta-delta).

B. Denoising Autoencoder (DAE)

The deep autoencoder is a special type of DNN (Deep Neural Network) whose output is a reconstruction of the input, commonly utilized for dimensionality compression and feature extraction [33]. Typically, an autoencoder has an input layer representing the original feature vectors, one or more hidden layers representing the transformed features, and an output layer matching the input layer for reconstruction. The dimension of the hidden layers can be either smaller than the input dimension, when the goal is feature compression, or larger, when the goal is mapping the features to a higher-dimensional space. An autoencoder tries to find a deterministic mapping between input units and hidden nodes by means of a nonlinear function fθ(x):

y = fθ(x) = f1(Wx + b), (6)

where W is a d′ × d weight matrix, b is a bias vector, and f1(·) is a nonlinear function such as the sigmoid or tanh. This mapping is called the encoder. The latent representation y is then mapped back to reconstruct the input signal with:

z = fθ′(y) = f2(W′y + b′), (7)

where W′ is a d × d′ weight matrix, b′ is a bias vector, and f2(·) is either a nonlinear function, such as the sigmoid or tanh, or a linear function. This mapping is called the decoder. The goal of training is to minimize the squared error function:

L(x, z) = ‖x − z‖². (8)

To prevent the autoencoder from learning the trivial identity mapping, some constraints are usually applied during training, for example adding Gaussian noise to the input signal (as applied in our study) or using the "dropout" trick of randomly forcing certain values of the input data to zero. Such an autoencoder is known as a denoising autoencoder (DAE) [33]. A DAE shares the same structure as an autoencoder, but its input data is a deteriorated version of the output data. In other words, a DAE uses feature mapping to convert corrupted input data (x̂, the input signal) into clean output (x, the teacher signal). In our
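The discrete-time operator in (3) is straightforward to prototype. The sketch below is an illustration, not the authors' code; it applies the operator to a pure tone, for which the Teager energy is analytically the constant A²sin²(ω) ≈ A²ω², showing how the operator encodes both amplitude and frequency rather than amplitude alone:

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager-energy operator: Psi(x[n]) = x[n]^2 - x[n-1]*x[n+1].

    Returns len(x) - 2 values, one per interior sample.
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone x[n] = A*cos(omega*n + phi), the operator yields the
# constant A^2 * sin(omega)^2 ~ A^2 * omega^2 for small omega, so both
# amplitude and (digital) frequency contribute to the "energy".
A, omega = 2.0, 0.1
n = np.arange(1000)
tone = A * np.cos(omega * n + 0.3)
psi = teager_energy(tone)
print(psi.mean(), (A * np.sin(omega)) ** 2)  # both ~0.0399, constant over n
```

In a full TECC/TEMFCC front end this energy estimate would replace the squared-sample energy inside the filterbank stage; only the energy-estimation step is shown here.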
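A minimal denoising autoencoder in the sense of (6)–(8) can be sketched with NumPy alone. This is an assumption-laden toy (one sigmoid hidden layer, linear decoder, plain gradient descent, random toy vectors in place of cepstral features), not the network configuration used in the paper; it only demonstrates the corrupt-input/clean-teacher training scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy "clean" data: 200 points on a 1-D manifold embedded in 4-D.
t = rng.uniform(-1, 1, size=(200, 1))
X = np.hstack([t, 0.5 * t, -t, 0.25 * t])            # (200, 4)

d, d_h, lr = 4, 8, 0.1
W1 = rng.normal(0, 0.1, (d, d_h)); b1 = np.zeros(d_h)  # encoder, eq. (6)
W2 = rng.normal(0, 0.1, (d_h, d)); b2 = np.zeros(d)    # decoder, eq. (7)

for epoch in range(2000):
    X_noisy = X + rng.normal(0, 0.1, X.shape)  # Gaussian input corruption
    y = sigmoid(X_noisy @ W1 + b1)             # encoder: y = f1(Wx + b)
    z = y @ W2 + b2                            # linear decoder
    err = z - X                                # teacher signal is the CLEAN x
    # Gradients of L = ||x - z||^2, eq. (8) (up to a constant factor)
    gW2 = y.T @ err; gb2 = err.sum(0)
    dy = err @ W2.T * y * (1 - y)
    gW1 = X_noisy.T @ dy; gb1 = dy.sum(0)
    for param, g in ((W2, gW2), (b2, gb2), (W1, gW1), (b1, gb1)):
        param -= lr * g / len(X)

# Reconstruction error of the trained DAE on clean inputs:
recon = sigmoid(X @ W1 + b1) @ W2 + b2
print(np.mean((recon - X) ** 2))  # small after training
```

In the paper's pipeline the corrupted input would be (pseudo-)whisper cepstra and the teacher the parallel neutral cepstra; the random vectors above merely stand in for both.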
TABLE II
WORD RECOGNITION RATE (%) IN MISMATCHED TRAIN/TEST SCENARIOS ACHIEVED BY THE GMM-HMM SYSTEM USING DIFFERENT CEPSTRAL FEATURES (p < 0.05*; p < 0.01**; p < 0.006***; confidence interval = 95%)

out that these confusions occur in most cases when two words start with the same fricative and contain several consonants, such as /s/, /z/, /ts/, that are very similar in the neutral and whisper modes. For example, the fricative /z/ is the voiced pair of /s/, and in whisper, due to the lack of sonority, the two are very often confused because of their similar articulation and spectral structure. However, these confusions are not noticeable in the whisper/neutral scenario. The following question arises: why do they not occur in the whisper/neutral scenario? A possible answer is that in the whisper/neutral scenario the ASR system is trained on whisper, which has completely unvoiced sounds, while the input is speech with both voiced and unvoiced sounds. In this situation, it is sufficient for the ASR system to match the patterns of the unvoiced sounds, while the voiced sounds are surplus. On the other hand, in the neutral/whisper scenario the ASR system expects to receive both voiced and unvoiced sounds, but instead receives only the unvoiced sounds of whisper. Consequently, the whisper/neutral scenario displays a higher recognition score than the neutral/whisper scenario (shown in Fig. 5), and correspondingly lower recognition confusion, as indicated by Fig. 6.

However, the confusion matrices in the case of TECC features contain a smaller number of misclassified words. This fact indicates that the TECC feature is better for whisper recognition than the MFCC feature. The reason for this lower confusion level is the different kind of information extracted from the words in speech and whisper as a consequence of using the TEO and the Gammatone filterbank.

D. Reduction of Confusion Using Inverse Filtering

The phenomenon of better recognition results in the whisper/neutral mismatch condition than in the neutral/whisper mismatch condition has also been noted in two studies with HMM systems [3], [18], but a deeper analysis of it has not been performed. While looking for an explanation, the following hypothesis was put forward: the ASR system trained on whisper achieves better recognition in the whisper/neutral scenario than the ASR system trained on neutral speech achieves in the neutral/whisper scenario, because most whisper features are contained in neutral speech, while the opposite does not hold. Correspondingly, the ASR system trained on neutral speech was trained with speech features largely based on voicing, which does not exist in whisper, and therefore whisper is recognized less well.

The way to test this hypothesis is to reduce the influence of voicing in neutral speech, making it more similar to whisper in terms of acoustic characteristics, and then to use the modified speech for ASR training. According to Fig. 1, the whispered speech spectrum is very flat. In neutral speech, on the other hand, voicing is dominant in the domain of the first four formants of voiced phonemes, primarily vowels, i.e., at lower frequencies below 5 kHz. In order to make the speech spectrum more similar to whisper, it is necessary to reduce this spectral tilt. For this purpose, in the pre-processing stage we added an inverse filter given by:

IF(z) = 1/H(z) = 1 − Σᵢ₌₁ᵖ aᵢ z⁻ⁱ, (9)

where H(z) is the transfer function of the vocal tract, aᵢ are the LPC coefficients of an utterance, and p = 10 is the order of the LPC filter. It is known that the inverse filter is the reciprocal of the all-pole filter H(z) [37]. Hence, the frequency response of the inverse filter is the reciprocal of the LPC spectral envelope of an utterance, as illustrated in Fig. 7. In this way, the inverse filter performs spectrum flattening for each word of the Whi-Spe database.

Fig. 7. Example of inverse filtering on a speech sample in neutral speech: (a) FFT spectrum of the speech. (b) LPC spectral envelope. (c) FFT spectrum after inverse filtering. (d) Frequency response of the inverse filter IF(z).

The final result of inverse filtering can be perceived from the long-term average spectra (LTAS) of the Whi-Spe database in Fig. 8.

Fig. 8. LTAS of the Whi-Spe database, before and after inverse filtering.

Comparing the shapes of the LTAS before and after inverse filtering, we can see that after the filtering, also known as spectrum "whitening" [38], the spectra of speech and whisper are flattened and more similar. To prove this achieved spectral similarity, the CDs between speech and whisper stimuli were measured once again. The average CD over all words before inverse filtering was 6.19 dB, while the new value is noticeably lower, 3.45 dB.

The changes in energy and spectral slope can be evaluated through the c0 and c1 cepstral distributions presented in Fig. 9. As we can see, the energy level (c0 values) in both speech modes drops off slightly because of the suppression of spectral components during inverse filtering (compare with Fig. 3). Nevertheless, the c0 distributions are centered at the same position one above the
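The flattening step in (9) amounts to computing p = 10 LPC coefficients per utterance and filtering with the FIR polynomial 1 − Σ aᵢz⁻ⁱ. The NumPy/SciPy sketch below is an illustration under assumed parameters, not the authors' exact pipeline; it whitens a synthetic voiced-like frame and quantifies the effect with a spectral-flatness measure:

```python
import numpy as np
from scipy.signal import lfilter

def lpc(x, p=10):
    """LPC via the autocorrelation method (Levinson-Durbin recursion)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + p]
    a = np.zeros(p)
    e = r[0]
    for i in range(p):
        k = (r[i + 1] - a[:i] @ r[i:0:-1]) / e
        a[:i + 1] = np.r_[a[:i] - k * a[:i][::-1], k]
        e *= 1.0 - k * k
    return a  # predictor: x[n] ~ sum_i a[i-1] * x[n-i]

def spectral_flatness(x):
    """Geometric mean / arithmetic mean of the power spectrum (1 = white)."""
    s = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2 + 1e-12
    return float(np.exp(np.mean(np.log(s))) / np.mean(s))

fs = 22050
n = np.arange(2048)
# Crude voiced-like frame: strong low-frequency components plus weak noise.
x = (np.sin(2 * np.pi * 500 * n / fs)
     + 0.6 * np.sin(2 * np.pi * 1500 * n / fs)
     + 0.05 * np.random.default_rng(1).normal(size=n.size))

a = lpc(x, p=10)
residual = lfilter(np.r_[1.0, -a], [1.0], x)  # IF(z) = 1 - sum a_i z^-i
print(spectral_flatness(x), spectral_flatness(residual))  # flatness increases
```

The residual is the spectrally "whitened" signal; on real utterances the same operation reduces the spectral tilt that separates neutral speech from whisper.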
TABLE IV
WORD RECOGNITION RATE (%) ACHIEVED BY THE PROPOSED ASR SYSTEM WITH DDAE-BASED CEPSTRAL FEATURES (p < 0.05*; p < 0.01**; p < 0.006***; confidence interval = 95%)

speech and reconstruct neutral cepstral features, leading to a significant performance gain in whisper recognition. The efficiency of this approach was demonstrated through a comparative study with the conventional GMM-HMM speech recognizer and three types of cepstral features: MFCC, TECC, and TEMFCC.

Experimental results confirmed that the proposed model has several advantageous characteristics: (i) a significantly lower word error rate in mismatched train/test conditions together with high performance in matched train/test scenarios, (ii) easily acquirable whisper-robust features, and (iii) no need for real whisper data in the training process or for model adaptation. Furthermore, the experimental results demonstrate that Teager-based cepstral features outperform traditional MFCC features in whisper recognition accuracy by nearly 10%. Combining TECC features with the DDAE approach gives the best results and shows that the proposed framework can considerably reduce recognition errors, improving whisper recognition performance by 31% over the traditional HTK-MFCC baseline and thereby achieving a word recognition accuracy of 92.81%.

REFERENCES

[1] S. T. Jovičić and Z. Šarić, "Acoustic analysis of consonants in whispered speech," J. Voice, vol. 22, pp. 263–274, 2008.
[2] R. Morris, "Enhancement and recognition of whispered speech," Ph.D. dissertation, School Elect. Comput. Eng., Georgia Inst. Technol., Atlanta, GA, USA, 2003.
[3] T. Ito, K. Takeda, and F. Itakura, "Analysis and recognition of whispered speech," Speech Commun., vol. 45, pp. 139–152, 2005.
[4] B. P. Lim, "Computational differences between whispered and non-whispered speech," Ph.D. dissertation, Dept. Elect. Comput. Eng., Univ. Illinois at Urbana-Champaign, Champaign, IL, USA, 2011.
[5] C. Y. Yang, G. Brown, L. Lu, J. Yamagishi, and S. King, "Noise-robust whispered speech recognition using a non-audible-murmur microphone with VTS compensation," in Proc. 8th Int. Symp. Chinese Spoken Lang. Process., Hong Kong, China, Dec. 2012, pp. 220–223.
[6] A. Mathur, S. M. Reddy, and R. M. Hegde, "Significance of parametric spectral ratio methods in detection and recognition of whispered speech," EURASIP J. Adv. Signal Process., vol. 2012, p. 157, 2012.
[7] S. Ghaffarzadegan, H. Boril, and J. H. L. Hansen, "Generative modeling of pseudo-target domain adaptation samples for whispered speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Brisbane, Australia, Apr. 2015.
[8] S. Jou, T. Schultz, and A. Waibel, "Adaptation for soft whisper recognition using a throat microphone," in Proc. INTERSPEECH, Jeju Island, South Korea, Oct. 2004, pp. 5–8.
[9] F. Tao and C. Busso, "Lipreading approach for isolated digits recognition under whisper and neutral speech," in Proc. INTERSPEECH, Singapore, Sep. 2014, pp. 1154–1158.
[10] D. Dimitriadis, P. Maragos, and A. Potamianos, "Auditory Teager energy cepstrum coefficients for robust speech recognition," in Proc. EUSIPCO, Nice, France, Jul. 2005, pp. 3013–3016.
[11] P. Heracleous, "Using Teager energy cepstrum and HMM distances," Int. J. Inf. Commun. Eng., vol. 5, no. 1, pp. 31–37, 2009.
[12] G. Zhou, J. Hansen, and J. Kaiser, "Classification of speech under stress based on features derived from the nonlinear Teager energy operator," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Seattle, WA, USA, May 1998, pp. 549–552.
[13] B. Marković, S. T. Jovičić, J. Galić, and Đ. Grozdić, "Whispered speech database: Design, processing and application," in Proc. 16th Int. Conf. Text, Speech, Dialogue, Pilsen, Czech Republic, Sep. 2013, pp. 591–598.
[14] C. Zhang and J. H. L. Hansen, "Whisper-island detection based on unsupervised segmentation with entropy-based speech feature processing," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 4, pp. 883–894, May 2011.
[15] S. Ghaffarzadegan, H. Boril, and J. H. L. Hansen, "UT-Vocal Effort II: Analysis and constrained-lexicon recognition of whispered speech," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Florence, Italy, May 2014, pp. 2544–2548.
[16] T. Tran, S. Mariooryad, and C. Busso, "Audiovisual corpus to analyze whisper speech," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Vancouver, BC, Canada, May 2013, pp. 8101–8105.
[17] P. X. Lee et al., "A whispered Mandarin corpus for speech technology applications," in Proc. INTERSPEECH, Singapore, Sep. 2014, pp. 1598–1602.
[18] J. Galić, S. T. Jovičić, Đ. Grozdić, and B. Marković, "HTK-based recognition of whispered speech," in Proc. Int. Conf. Speech Comput., Novi Sad, Serbia, Oct. 2014, pp. 251–258.
[19] C. Zhang and J. H. L. Hansen, "Advancements in whisper-island detection using the linear predictive residual," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Dallas, TX, USA, Mar. 2010, pp. 5170–5173.
[20] S. Ghaffarzadegan, H. Boril, and J. H. L. Hansen, "Model and feature based compensation for whispered speech recognition," in Proc. INTERSPEECH, Singapore, Sep. 2014, pp. 2420–2424.
[21] H. Bořil and J. H. L. Hansen, "Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environments," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 6, pp. 1379–1393, Aug. 2010.
[22] S. Ghaffarzadegan, H. Boril, and J. H. L. Hansen, "Generative modeling of pseudo-target domain adaptation samples for whispered speech recognition," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 10, pp. 1705–1720, Oct. 2016.
[23] Đ. T. Grozdić, B. Marković, J. Galić, and S. T. Jovičić, "Application of neural networks in whispered speech recognition," in Proc. 20th Telecommun. Forum, Belgrade, Serbia, Nov. 2012, pp. 728–731.
[24] S. T. Jovičić, Z. Kašić, M. Đorđević, and M. Rajković, "Serbian emotional speech database: Design, processing and evaluation," in Proc. 9th Conf. Speech Comput., St. Petersburg, Russia, Sep. 2004, pp. 77–81.
[25] S. T. Jovičić, "Formant feature differences between whispered and voiced sustained vowels," Acta Acust. United With Acust., vol. 84, pp. 739–743, 1998.
[26] X. Fan and J. H. L. Hansen, "Speaker identification within whispered speech audio streams," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 5, pp. 1408–1421, Jul. 2011.
[27] C. Zhang and J. H. L. Hansen, "Analysis and classification of speech mode: Whispered through shouted," in Proc. INTERSPEECH, Antwerp, Belgium, Aug. 2007, pp. 2396–2399.
[28] K. J. Kallail and F. W. Emanuel, "Formant-frequency differences between isolated whispered and phonated vowel samples produced by adult female subjects," J. Speech Hear. Res., vol. 27, pp. 245–251, 1984.
[29] J. F. Kaiser, "Some useful properties of Teager's energy operators," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Minneapolis, MN, USA, Apr. 1993, pp. 149–152.
[30] H. M. Teager, "Some observations on oral air flow during phonation," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 5, pp. 599–601, Oct. 1980.
[31] P. Maragos, J. F. Kaiser, and T. F. Quatieri, "Energy separation in signal modulations with application to speech analysis," IEEE Trans. Signal Process., vol. 41, no. 10, pp. 3024–3051, Oct. 1993.
[32] E. Kvedalen, "Signal processing using the Teager energy operator and other nonlinear operators," Cand. Scient. thesis, Univ. Oslo, Oslo, Norway, 2003.
[33] P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proc. 25th Int. Conf. Mach. Learn., Helsinki, Finland, Jul. 2008, pp. 1096–1103.
[34] Đ. T. Grozdić, S. T. Jovičić, J. Galić, and B. Marković, "Application of inverse filtering in enhancement of whisper recognition," in Proc. IEEE Neural Netw. Appl. Elect. Eng., Belgrade, Serbia, Nov. 2014, pp. 157–162.
[35] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Comput., vol. 14, pp. 1771–1800, 2002.
[36] S. Young et al., The HTK Book (for HTK Version 3.2). Cambridge, U.K.: Cambridge Univ. Eng. Dept., 2002.
[37] C. J. Leggetter and P. C. Woodland, "Flexible speaker adaptation using maximum likelihood linear regression," in Proc. ARPA Spoken Lang. Technol. Workshop, Austin, TX, USA, Jan. 1995, pp. 110–115.
[38] D. Havelock, S. Kuwano, and M. Vorländer, Handbook of Signal Processing in Acoustics. New York, NY, USA: Springer, 2008.

Authors' photographs and biographies not available at the time of publication.