
Whispered Speech Recognition Using Deep Denoising Autoencoder and Inverse Filtering
Đorđe T. Grozdić and Slobodan T. Jovičić

Abstract—Due to the profound differences between the acoustic characteristics of neutral and whispered speech, the performance of traditional automatic speech recognition (ASR) systems trained on neutral speech degrades significantly when whisper is applied. In order to analyze this mismatched train/test situation in depth and to develop an efficient way to recognize whisper, this study first analyzes the acoustic characteristics of whispered speech, addresses the problems of whispered speech recognition in mismatched conditions, and then proposes new robust cepstral features and a preprocessing approach based on a deep denoising autoencoder (DDAE) that enhance whisper recognition. The experimental results confirm that Teager-energy-based cepstral features, especially TECCs, are more robust and better whisper descriptors than traditional Mel-frequency cepstral coefficients (MFCC). Further detailed analysis of cepstral distances, distributions of cepstral coefficients, confusion matrices, and experiments with inverse filtering prove that voicing in the speech stimuli is the main cause of word misclassification in mismatched train/test scenarios. The new framework based on the DDAE and the TECC feature significantly improves whisper recognition accuracy and outperforms the traditional MFCC and GMM-HMM (Gaussian mixture density—hidden Markov model) baseline, resulting in an absolute 31% improvement of whisper recognition accuracy. The achieved word recognition rate in the neutral/whisper scenario is 92.81%.

Index Terms—Automatic speech recognition, deep denoising autoencoder, inverse filtering, Teager-energy operator, whispered speech recognition.

Manuscript received October 31, 2016; revised May 21, 2017; accepted June 17, 2017. Date of current version November 27, 2017. This work was supported by the Serbian Ministry of Education, Science and Technological Development, under Grants TR 32032 and OI 178027. The guest editor coordinating the review of this manuscript and approving it for publication was Prof. Tanja Schultz. (Corresponding author: Đorđe T. Grozdić.) Đ. T. Grozdić is with the School of Electrical Engineering, University of Belgrade, Belgrade 11120, Serbia (e-mail: djordjegrozdic@gmail.com). S. T. Jovičić is with the School of Electrical Engineering, University of Belgrade, Belgrade 11120, Serbia, and also with the Life Activities Advancement Center, Laboratory for Forensic Acoustics and Phonetics, Belgrade 11000, Serbia (e-mail: jovicic@etf.rs). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASLP.2017.2738559

I. INTRODUCTION

MODERN-DAY automatic speech recognition (ASR) systems have shown good performance and evident commercial utilization, while at the same time displaying a considerable number of weaknesses, flaws, and problems when practically utilized. In order to accomplish more efficient and better-quality human-computer interaction, these problems have yet to be solved. Despite much research concerned with the improvement of ASR systems over the last years, in some situations these systems still remain both significantly sensitive and unreliable. ASR performance may be affected by various factors, including speech type and quality; individual speaker characteristics such as speech rate and style, dialect, vocal tract anatomy, and the speaker's psycho-physical state; and influences originating from the surrounding environment, such as ambient noise, reverberation, and loudness. Apart from the neutral mode, speech also occurs in different modalities, such as emotional speech, Lombard-effect speech, and whispered speech. All these atypical speech modes present challenging problems in ASR.

Whisper is a specific form of verbal communication that is frequently utilized in different situations. Firstly, it is employed to create a discreet and intimate atmosphere in conversation, and, secondly, it is used to protect confidential and private information from uninvolved parties. Besides, speakers whisper when they do not want to disturb other people, for example in the library or during a business meeting, but also in criminal activities, e.g., when criminals try to disguise their identity. Nevertheless, in spite of its usually conscious production, whisper may also occur due to health problems such as chronic disease of the larynx structures [1].

Whisper has become a research topic of interest, essentially important for speech technologies, mainly because of its substantial difference compared to normally phonated (neutral) speech, primarily due to the absence of glottal vibrations, its noisy structure, and its lower SNR (signal-to-noise ratio) [2]. Most current neutral-speech-oriented interfaces of ASR systems are not capable of handling such an acoustic mismatch. Therefore, the performance of neutral-trained ASR systems degrades significantly when whisper is applied. Nonetheless, research on automatic whisper recognition is still at the beginning, and there have been only a few studies so far. The literature shows several approaches that have been proposed to alleviate this acoustic mismatch through model adaptation [3]–[7], feature transformations [5], or the use of alternative sensing technologies such as a throat microphone [8]. Also, there is an audio-visual approach to isolated word recognition under the whisper speech condition [9]. A vast majority of studies on whisper recognition use a GMM-HMM (Gaussian mixture density—hidden Markov model) system and traditional MFCC (Mel-frequency cepstral coefficients) or PLP (perceptual linear prediction) features.

Our contribution to the literature is to demonstrate that by using state-of-the-art acoustic modeling techniques such as
deep denoising autoencoder (DDAE), a significant increase in whisper recognition performance can be attained without the need for model adaptation. The DDAE is applied to transform cepstral feature vectors of whispered speech into clean cepstral feature vectors of neutral speech. More specifically, feature extraction by a DDAE is achieved by training the deep neural network to predict original neutral speech features from pseudo-whisper features that are artificially generated by inverse filtering of neutral speech data. The acquired cepstral features are then concatenated with the original cepstral features and processed with a conventional GMM-HMM recognizer to conduct an isolated word recognition task. The advantage of our feature mechanism is that whisper-robust features are easily acquired by a rather simple mechanism. Regarding the input feature vectors, three different types of cepstral coefficients were tested: traditional MFCC (Mel-frequency cepstral coefficients) and two more recent cepstral coefficients, the TECC (Teager energy cepstral coefficients) and the TEMFCC (Teager-energy-based Mel-frequency cepstral coefficients) [10]. The latter two features use the nonlinear TEO (Teager-energy operator) [9], [10], which serves for Teager energy calculation. So far, nonlinear TEO-based features have produced very promising results in ASR of quiet and unvoiced murmured speech, as well as in speech classification under stress and noisy conditions [9]–[12]. On the other side, the characteristics of whispered speech might be considered to have similarities with non-audible murmur, noise-corrupted speech, and speech under stress. Due to these similarities, it was expected that the TECC and the TEMFCC could be good descriptors of whispered speech. The experiments on whispered speech recognition were conducted on a special database that was created particularly for this study, entitled Whi-Spe [13], which contains 10000 audio recordings of parallel neutral and whispered speech.

Another important task of this study was to analyze the confusion in mismatched train/test conditions. The situation in which a speaker is in front of the ASR system and switches from neutral speech to whisper provides particularly interesting results in the neutral/whisper scenario that corresponds to a real situation. In order to better understand these mismatched scenarios and to develop a system which can provide better whisper recognition under such conditions, confusion in word recognition was deeply analyzed. Analyses of acoustic characteristics, distributions of cepstral coefficients, cepstral distances, and confusion matrices were carried out. Based on these results, a new speech signal pre-processing approach was proposed. This approach includes TECC features, inverse filtering, and the DDAE, which reduces the difference between neutral speech and whisper and improves the word recognition rates in mismatched scenarios.

The remainder of this study is organized as follows. In Section II, we briefly review related work on whisper recognition. Section III introduces the speech corpus that was specially constructed for this study and describes the nature of whispered speech; here, the analyses of whisper characteristics and differences compared to neutral speech are presented. In Section IV, we introduce the general framework of the proposed ASR system and describe feature extraction, the DDAE, and the GMM-HMM recognizer. In the next section, we conduct isolated word recognition experiments to evaluate different cepstral features, inverse filtering, and the application of the deep denoising autoencoder in mismatched train/test conditions. Finally, conclusions are presented in Section VI.

II. RELATED WORK

Automatic recognition of whispered speech is an ongoing and active field of research which is hindered by the lack of suitable and systematically collected corpora. Currently, there are only several existing databases of parallel neutral and whispered speech, collected for English [13], [14], [15], Japanese [3], Serbian [13] and Mandarin [17]. Most of them have a small or medium-sized vocabulary, while only a few of them are transcribed and phonetically balanced.

One of the first experiments with automatic whisper recognition was started by researchers at the University of Nagoya [3]. They intended to develop a speech recognizer specifically capable of handling whisper on cell phones in noisy conditions. Using the HMM technique and MFCC features, they analyzed different mismatched train/test scenarios with three speech modes: whisper, low-voice speech, and neutral speech. The results of these experiments show severe degradation in ASR when using mismatched data. However, the outstanding result is that the whispered speech model (an ASR model trained on whisper) performs equally well on either type of speech. This phenomenon, among other things, will be explained and experimentally proved in our study. Next, the authors [3] confirm that covering the mouth and the cell phone with a hand can increase the SNR in a noisy environment and may improve whisper and low-voice speech recognition to some extent. It was also shown that ASR systems that are trained on neutral speech can be adapted to whisper recognition by using a small amount of whispered speech data, which can improve whisper recognition. Utilizing such a speaking-style-independent model yields a whisper recognition accuracy of about 66%.

Later studies of whisper recognition attempt to reduce this acoustic mismatch in different ways, for example through model adaptation [4]–[6], [17] and feature transformations [5]. There are also studies focused on front-end feature extraction strategies [14], [18] and front-end filter bank redistribution based on subband relevance [15]. The efficiency of vocal tract length normalization (VTLN) and the shift transform [21] for whisper recognition was also investigated in [20]. Several studies deal with ASR model adaptation to whisper recognition when small amounts of whisper are available [19], [21]. In [20] a vector Taylor series (VTS) based approach to pseudo-whisper adaptation sample generation was investigated. All the above-mentioned papers use an HMM as the ASR system. There are only two studies that investigate a neural network-based approach to whisper recognition [16], [22]. Also, there is an audio-visual approach for isolated digit recognition under whisper and neutral speech [9]. In terms of feature extraction, only basic MFCC and PLP features have been tested in whisper recognition.

Although some improvement is achieved in each study, successful and usable recognition of whisper is still an unsolved problem and an open research topic.

III. WHISPER
A. Corpus of Neutral/Whispered Speech
Given the fact that there is a lack of extensive, appropriate
and publicly available databases in this area, the special speech
corpus for the Serbian language, entitled "Whi-Spe" (abbreviated from Whispered Speech), was developed for the purposes of this study. The corpus consists of two parts: 5000 audio recordings of spoken isolated words in neutral speech and 5000 recordings of the same words in whisper. The corpus contains 50 different words, taken as a part of the GEES speech database [24], that achieve a balance of linguistic features. Both whispered speech and neutral speech were collected from 5 male and 5 female speakers, all with typical pronunciation of speech and whisper and with correct hearing. Each speaker read all 50 words ten times in both speech modes, so the Whi-Spe corpus contains 10000 recorded words. The recording process was car-
ried out under quiet laboratory conditions in a sound booth with
a high-quality omnidirectional microphone. Speech data were digitized using a sampling frequency of 22050 Hz with 16 bits per sample. More information concerning the Whi-Spe corpus can be found in [13].

Fig. 1. Waveforms, spectrograms and spectra of the word "pijatsa" ("market" in English) in neutral (left) and whispered speech (right).

B. Acoustic Analysis of Whispered Speech


Whisper represents an alternative speech mode that is quite
different from neutral speech with regard to its characteris-
tics, nature and generation mechanism. The main peculiarity
of whisper’s generation mechanism is the absence of vocal cord
vibration. The vocal tract shape differs from neutral speech
and implies narrowed pharyngeal cavity, different placement
of tongue, lips and mouth opening [25]. As a result of differ-
ent vocal tract shape and articulation organs’ specific behavior,
whisper has distinctive acoustic characteristics. In order to more closely describe this acoustic mismatch that occurs in ASR systems, several time-domain and spectral-domain examples are presented. Fig. 1 shows comparisons of the waveforms, spectrograms and long-term average spectra of a word in neutral and whispered speech. A clear difference in amplitude intensity can be seen in the time waveforms. Due to the lack of sonority, the amplitudes of voiced phonemes (chiefly vowels) are considerably lower in whisper, while the amplitudes of unvoiced phonemes show similar intensity in both modes [3]. Whisper has a completely noisy structure and conspicuously lower energy. The duration of whispered speech is slightly longer (more noticeably so in longer utterances), which is also one of its characteristics [25], [26].

The spectrograms reveal further differences between neutral and whispered speech. Although the vocal tract shape is different in whisper, the spectrograms preserve the most important spectral speech characteristics. Despite the noisy structure, the spectral concentrates of some phonemes can still be observed. The most important spectral changes are perceived in vowels: in whisper, the locations of the lower formants are shifted to higher frequencies [3], [24], [27]. In contrast to vowels, unvoiced consonants are not significantly changed in their spectral domain [3]. The lack of sonority in whisper is also observed in the long-term average spectra, in which whisper, in opposition to neutral speech, has a much flatter spectral slope [3], [24], [26].

In order to quantify the difference between the same words in neutral speech and whisper, the cepstral distance can be used as a measure of cepstral dissimilarity. The cepstral distance (CD) is defined as:

$$CD = \frac{10}{\ln(10)} \sqrt{2 \sum_{i=1}^{N} \left( c_i^{(n)} - c_i^{(w)} \right)^2} \;\; \text{[dB]}, \qquad (1)$$

where $c_i^{(n)}$ and $c_i^{(w)}$ are the cepstral coefficients (MFCC) for the neutral and the whispered speech, respectively, and $N$ is the fixed number of cepstral coefficients per word. The average CD for each word from the Whi-Spe corpus is calculated and presented in Fig. 2, showing the correlation with the ratio of unvoiced consonants per word (unvoiced/phoneme count).
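As a concrete illustration (our sketch, not the authors' code), Eq. (1) maps directly onto a few lines of NumPy, assuming the per-word MFCC sequences have already been flattened to the same fixed length N:

```python
import numpy as np

def cepstral_distance(c_neutral, c_whisper):
    """Cepstral distance of Eq. (1), in dB.

    c_neutral, c_whisper: equal-length arrays holding the N MFCC
    coefficients of the same word in neutral and whispered speech.
    """
    diff = np.asarray(c_neutral) - np.asarray(c_whisper)
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2))
```

Averaging this value over all recordings of a word gives the per-word CD plotted in Fig. 2.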

Fig. 2. The relationship between cepstral distance (between words in neutral and whispered speech) and the ratio of unvoiced consonants.

The straight line of best fit shows a strong negative correlation (Pearson's coefficient −0.834), meaning that words with a lower ratio of unvoiced consonants have a higher CD, and vice versa. For example, a word with 3 unvoiced consonants out of 4 phonemes in total (ratio 0.75) has the lowest CD, 4.7 dB. Conversely, the words without unvoiced consonants (ratio 0) have the highest CD, or dissimilarity, 6.7 dB on average. This indicates that the characteristics of vowels and voiced consonants change more significantly than those of unvoiced consonants [3].

Cepstral coefficients can be further analyzed and compared between neutral and whispered speech. For example, the distributions of the first two cepstral coefficients, c0 and c1, can be used to verify the energy and spectral slope in the Whi-Spe database. To be specific, c0 incorporates information about signal energy, while c1 is related to the spectral slope [15]. Fig. 3 shows the c0 and c1 distributions in neutral and whispered speech, as well as the distributions of their unvoiced and voiced parts.

Fig. 3. Normalized distributions of c0 and c1 coefficients in neutral and whispered speech.

The ratio of unvoiced to voiced speech segments is 37.65/62.35% in the Whi-Spe neutral speech recordings and 99.2/0.8% in the whispered recordings. The left part of Fig. 3 presents the c0 distributions, which reflect the energy levels. The distributions of neutral speech show that the dominant voiced segments tend to reach higher energy (higher c0 values) than the unvoiced ones. In whisper, unvoiced segments are dominant and the overall distributions are shifted to the left, indicating lower energies compared to neutral speech.

The right part of Fig. 3 shows the distributions of c1, which reflect the spectral slope. In neutral speech, the voiced distribution occupies the highest c1 values, i.e., the steepest slopes. Following the intuition, unvoiced segments are centered together at lower c1 values (flatter slopes). In whisper, all distributions are shifted to the left (lower c1 values). This reflects that the overall spectral slope is flatter compared to neutral speech. In contrast to the distributions of c0, here there is a noticeable difference in the distributions' positions.

When compared with neutral speech, all these specific differences render whisper a particular challenge for recognition by means of traditional ASR systems that have been primarily designed for normally phonated speech. Some of these differences can be reduced, e.g., the differences in energy level: this problem can be solved by a closer microphone position while whispering, or by simple signal amplification under good SNR conditions. However, spectral differences still remain, e.g., differences in spectral slopes, and require proper modification in order to be adjusted for ASR.

IV. MODEL

Fig. 4. Architecture of the proposed ASR system. The system is composed of: (a) Feature extractor. (b) Deep denoising autoencoder (DDAE). (c) GMM-HMM.

A schematic diagram of the proposed ASR system is shown in Fig. 4. The proposed framework consists of two feature extractors to process the audio signals and one GMM-HMM. The first extractor serves for MFCC, TECC, and TEMFCC cepstral feature extraction. The second extractor is in the form of a deep denoising autoencoder (DDAE), which filters out the effects of whispered speech and reconstructs clean neutral speech features. Accordingly, the DDAE outputs are extracted to serve as robust cepstral features for mismatched whispered speech. Finally, a GMM-HMM recognizes isolated words by combining the acquired cepstral feature sequences from both extractors.

A. Feature Extraction

Three types of cepstral features are tested in this study: Mel-frequency cepstral coefficients (MFCC), Teager energy cepstral coefficients (TECC) and Teager-energy-based Mel-frequency cepstral coefficients (TEMFCC). These features are extracted from 25 ms time frames with a step size of 10 ms. The MFCCs are traditional cepstral features used in ASR, and their extraction is performed in the following steps: (1) the discrete Fourier transform (DFT) of the speech signal frame is computed, (2) the power spectrum is calculated, (3) the Mel-filterbank consisting of 30 triangular shaped filters is applied, (4) the log energy of each filter output is estimated, (5) the discrete cosine transform (DCT) is taken and the cepstral coefficients are obtained, and (6) only the first 12 MFCCs are extracted. Finally, the feature vectors are channel-normalized using cepstral mean subtraction (CMS) to remove convolutional distortions caused by the characteristics of communication channels or recording devices. However, CMS is only partially effective in reducing the effects of additive environmental noise.
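The following sketch (ours, not the authors' implementation) mirrors steps (1)–(6) plus CMS in NumPy; the 512-point FFT, the Hamming window, and keeping coefficients c1–c12 of an unnormalized DCT-II are our assumptions, since the paper does not state them:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=30, n_fft=512, fs=22050):
    # 30 triangular filters equally spaced on the Mel scale (step 3).
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        for k in range(l, c):
            fb[j - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[j - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(frames, fs=22050, n_fft=512, n_filters=30, n_ceps=12):
    # (1) DFT of each windowed 25 ms frame, (2) power spectrum
    spec = np.abs(np.fft.rfft(frames * np.hamming(frames.shape[1]), n_fft)) ** 2
    # (3) Mel filterbank, (4) log energy of each filter output
    logmel = np.log(spec @ mel_filterbank(n_filters, n_fft, fs).T + 1e-10)
    # (5) unnormalized DCT-II, (6) keep the first 12 coefficients (c1..c12)
    n = logmel.shape[1]
    basis = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :]
                   * np.arange(1, n_ceps + 1)[:, None])
    ceps = logmel @ basis.T
    # CMS: subtract the per-utterance mean to remove convolutional effects
    return ceps - ceps.mean(axis=0, keepdims=True)
```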
The other two cepstral coefficient types are relatively new and more robust to noise interference, and they had not previously been tested on whispered speech recognition. Their basic extraction algorithm is similar to that of the MFCC feature, but it includes one significant difference in terms of energy estimation. For TEMFCC and TECC calculation, the nonlinear Teager-energy operator (TEO) is used [29], which serves for estimating the Teager energy in place of the standard energy calculation method (standard-energy operator, SEO, $x(t)^2$). The main idea behind using the TEO is nonlinear modeling of speech production. Namely, traditional acoustic theory assumes that the airflow from the vocal tract propagates as a plane wave. However, this assumption may not hold, since the flow actually separates and concomitant vortices are distributed throughout the vocal tract. The true source of speech production is actually the vortex-flow interactions, which are nonlinear [30]. Since whispered speech has nonlinear and extremely turbulent airflow, the TEO provides an efficient way of processing such signals. Contrary to the traditional signal energy approximation, which only takes into account the kinetic energy of the signal's source, the TEO computes the "true" total source energy of a resonance signal and incorporates both amplitude and frequency information [10]. In other words, the total source energy, i.e., the sum of the potential and kinetic energy, is proportional to the product of the amplitude and the frequency squared, which is nothing but the true energy required to generate the speech signal [31]. This supplementary information can improve the time-frequency description of rapid energy changes within a glottal cycle [29] and improve the representation of formant information in the feature vectors. For continuous-time signals, the TEO is defined as [32]:

$$\Psi(x(t)) = \dot{x}(t)^2 - x(t)\,\ddot{x}(t), \qquad (2)$$

where $\dot{x}(t)$ and $\ddot{x}(t)$ are the first and second time derivatives of the speech signal $x(t)$. Another good property of the TEO is that it is relatively easy and fast to compute, which could be of benefit to real-time continuous speech recognition. For a discrete-time signal it is nearly instantaneous and requires only three adjacent signal samples at each time instance [32]:

$$\Psi(x[n]) = x[n]^2 - x[n-1]\,x[n+1]. \qquad (3)$$
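Eq. (3) is essentially a one-liner; the sketch below (our illustration, with the two boundary samples of a frame simply dropped) shows how the Teager energy of a frame could replace the squared-magnitude energy in step (4) above:

```python
import numpy as np

def teager_energy(x):
    """Discrete TEO of Eq. (3): Psi(x[n]) = x[n]^2 - x[n-1] * x[n+1].

    Uses only three adjacent samples per output value, so it is
    nearly instantaneous; boundary samples are dropped here.
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```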
In addition to the TEO utilization, the TECC feature is distinguished by one more characteristic in comparison to the MFCC and TEMFCC features: the use of a Gammatone filterbank [10] instead of the Mel-filterbank. In this research, the Gammatone filterbank contains the same number of filters (L = 30) as the Mel-filterbank used in the MFCC and TEMFCC feature extraction. The Gammatone filters are asymmetric, of non-constant-Q type, and smoother and broader compared to the Mel filters, which are symmetric, triangular and of constant-Q type. The Gammatone filterbank is an auditorily motivated filterbank created with the aim of simulating the nature of the cochlea and basilar membrane, and thereby human sound perception. It is also superior to the Mel-filterbank and provides higher robustness in situations when the signal is degraded by additive noise and other harmful interferences [10]. Recent studies have shown that human physiology dictates that the auditory filter bandwidths are given by the ERB (equivalent rectangular bandwidth) function [10]:

$$ERB(f_c) = 6.23\,(f_c/1000)^2 + 93.39\,(f_c/1000) + 28.52, \qquad (4)$$

where $f_c$ is the central frequency in Hz. Then, the impulse response of the asymmetrical Gammatone filter is given by:

$$g(t) = A\,t^{\eta-1}\, e^{-2\pi b\,ERB(f_c)\,t} \cos(2\pi f_c t), \qquad (5)$$

where $A$, $b$ and $\eta$ (the filter order) are the Gammatone design parameters, and $f_c$ is the filter central frequency.
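A sketch of Eqs. (4) and (5) in NumPy follows; the bandwidth scale b = 1.019 and filter order η = 4 are common auditory-modeling defaults and our assumption, as the paper does not specify them:

```python
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth of Eq. (4); fc in Hz."""
    f = fc / 1000.0
    return 6.23 * f ** 2 + 93.39 * f + 28.52

def gammatone_ir(fc, fs=22050, dur=0.025, A=1.0, b=1.019, eta=4):
    """Impulse response of the Gammatone filter of Eq. (5)."""
    t = np.arange(int(dur * fs)) / fs
    return (A * t ** (eta - 1)
            * np.exp(-2.0 * np.pi * b * erb(fc) * t)
            * np.cos(2.0 * np.pi * fc * t))
```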
At the end, in order to describe the speech signal dynamics more thoroughly, the MFCC, TECC, and TEMFCC feature vectors were concatenated with their first (delta) and second (delta-delta) time derivatives. In this way, every frame was represented by a 36-dimensional vector (12 static features + 12 delta + 12 delta-delta).
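The dynamic features can be obtained with the standard regression formula; a sketch assuming a window of ±2 frames (an HTK-style default, not stated in the paper):

```python
import numpy as np

def deltas(c, theta=2):
    """d_t = sum_k k*(c[t+k] - c[t-k]) / (2 * sum_k k^2), with edge padding."""
    c = np.asarray(c, dtype=float)           # (frames, 12) static features
    pad = np.pad(c, ((theta, theta), (0, 0)), mode="edge")
    num = sum(k * (pad[theta + k:theta + k + len(c)] -
                   pad[theta - k:theta - k + len(c)])
              for k in range(1, theta + 1))
    return num / (2.0 * sum(k * k for k in range(1, theta + 1)))

def with_dynamics(c):
    d = deltas(c)                            # 12 static + 12 delta + 12 delta-delta
    return np.hstack([c, d, deltas(d)])      # -> 36-dimensional frames
```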
B. Denoising Autoencoder (DAE)

The deep autoencoder is a special type of DNN (deep neural network) whose output is a reconstruction of the input; it is commonly utilized for dimensionality compression and feature extraction [33]. Typically, an autoencoder has an input layer which represents the original feature vectors, one or more hidden layers that represent the transformed features, and an output layer which matches the input layer for reconstruction. The dimension of the hidden layers can be either smaller than the input dimension, when the goal is feature compression, or larger, when the goal is mapping the features to a higher-dimensional space. An autoencoder tries to find a deterministic mapping between the input units and the hidden nodes by means of a nonlinear function $f_\theta(x)$:

$$y = f_\theta(x) = f_1(Wx + b), \qquad (6)$$

where $W$ is a $d' \times d$ weight matrix, $b$ is a bias vector, and $f_1(\cdot)$ is a nonlinear function such as the sigmoid or tanh. This mapping is called the encoder. The latent representation $y$ is then mapped back to reconstruct the input signal with:

$$z = f_{\theta'}(y) = f_2(W'y + b'), \qquad (7)$$

where $W'$ is a $d \times d'$ weight matrix, $b'$ is a bias vector, and $f_2(\cdot)$ is either a nonlinear function, such as the sigmoid or tanh, or a linear function. This mapping is called the decoder. The goal of training is to minimize the squared error function:

$$L(x, z) = \|x - z\|^2. \qquad (8)$$

To prevent the autoencoder from learning the trivial identity mapping, some constraints are usually applied during training, for example adding Gaussian noise to the input signal (which is applied in our study) or using the "dropout" trick of randomly forcing certain values of the input data to zero. Such an autoencoder is known as a denoising autoencoder (DAE) [33]. A DAE shares the same structure as an autoencoder, but its input data is a deteriorated version of the output data. In other words, a DAE uses feature mapping to convert corrupted input data (x̂, the input signal) into a clean output (x, the teacher signal).
In our case, we use a deep DAE to transform cepstral feature vectors of whispered speech into clean cepstral feature vectors of neutral speech. The proposed architecture of the DDAE is illustrated in Fig. 4 and consists of two encoder layers with sigmoid functions and one decoder layer with a linear function. The input layer has 11 × 36 nodes, while the output has 36 nodes. This means that we use 11 contiguous frames with 36 cepstral coefficients as x̂ to encode, and use only the corresponding middle frame with 36 clean cepstral features as x to fine-tune. The output 36-dimensional feature vector is joined with the original 36-dimensional feature vector from the feature extractor, and together they are fed to the GMM-HMM, as shown in Fig. 4.
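In PyTorch-style pseudocode, the described topology looks roughly as follows (our sketch; the hidden-layer width of 500 is an assumption, since the paper does not report it, and plain backpropagation here stands in for the RBM pre-training described below):

```python
import torch
import torch.nn as nn

class DDAE(nn.Module):
    """11 context frames x 36 coefficients in, clean middle frame out."""
    def __init__(self, context=11, n_feats=36, hidden=500):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(context * n_feats, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
        )
        self.decoder = nn.Linear(hidden, n_feats)   # linear output layer

    def forward(self, x):                            # x: (batch, 11 * 36)
        return self.decoder(self.encoder(x))

model = DDAE()
loss_fn = nn.MSELoss()                               # squared error of Eq. (8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x_noisy, x_clean):
    # pseudo-whisper context window -> clean neutral middle frame
    opt.zero_grad()
    loss = loss_fn(model(x_noisy), x_clean)
    loss.backward()
    opt.step()
    return loss.item()
```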
In terms of training, with parallel deteriorated and clean speech data, a DDAE can be pre-trained on pseudo-whisper cepstral features and fine-tuned on neutral speech cepstral features. To be precise, in our experiments pseudo-whisper samples are obtained by inverse filtering [34] of neutral speech samples and adding random Gaussian noise at 10 dB SNR. Such pseudo-whisper data is used in the pre-training process, while neutral speech data is applied for fine-tuning of the DDAE. This is a standard way to train a DDAE. There are several reasons for adding random noise. First, by adding Gaussian noise, the inverse-filtered signal becomes more similar to whisper in terms of its noisy nature, and a model learned that way will be robust to the same kind of distortions in the test data. Second, since the noise is added randomly, the DDAE avoids learning the trivial identity solution. Third, each distorted input sample is different, which greatly increases the training set size and thus can additionally alleviate the overfitting problem.
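The noise-addition step can be sketched as follows (our illustration): the Gaussian noise is scaled so that the resulting signal-to-noise ratio is 10 dB:

```python
import numpy as np

def add_noise_at_snr(x, snr_db=10.0, rng=None):
    """Add white Gaussian noise so that 10*log10(P_signal/P_noise) = snr_db."""
    rng = rng or np.random.default_rng()
    p_signal = np.mean(np.asarray(x, dtype=float) ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    return x + rng.normal(0.0, np.sqrt(p_noise), size=np.shape(x))
```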
In this way, the rich nonlinear structure in the DDAE can be used to learn an efficient transfer function which suppresses whisper characteristics in speech while keeping enough phonetically discriminative information to generate well-reconstructed neutral speech features. This can be of great importance for large-vocabulary continuous speech recognition systems, where ground truth labels are hard to obtain (as in the whisper case).

Pre-training consists of learning a stack of restricted Boltzmann machines (RBMs), each having only one layer of feature detectors. After learning one RBM, the states of the learned hidden units given the training data can be used as feature vectors for the second RBM layer. The contrastive divergence method [35] can be used to learn the second RBM in the same fashion. Then, the states of the hidden units of the second RBM can be used as the feature vectors for the third RBM, etc. This layer-by-layer learning can be repeated many times. After the pre-training, the RBMs are unrolled to create a deep autoencoder, which is then fine-tuned using backpropagation of error derivatives. Backpropagation modifies the weights of the network to reduce the error between the teacher signal and the output value when a pair of signals (the input signal and the ideal teacher signal) is given.
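For completeness, one contrastive-divergence (CD-1) update for a single binary RBM layer can be sketched as below (our illustration only; real-valued cepstral inputs would normally use a Gaussian visible layer, which we omit for brevity):

```python
import numpy as np

def cd1_update(v0, W, b_vis, b_hid, lr=0.01, rng=np.random.default_rng()):
    """One CD-1 step for a Bernoulli RBM [35]; v0: (batch, n_vis)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    # positive phase: hidden activations given the data
    ph0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # negative phase: one Gibbs step (reconstruct visibles, re-infer hiddens)
    pv1 = sigmoid(h0 @ W.T + b_vis)
    ph1 = sigmoid(pv1 @ W + b_hid)
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    b_vis += lr * (v0 - pv1).mean(axis=0)
    b_hid += lr * (ph0 - ph1).mean(axis=0)
    return W, b_vis, b_hid
```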
C. GMM-HMM

We used a standard GMM-HMM system trained and tested with the Hidden Markov Model Toolkit (HTK) [36]. The acoustic model contains 5 states in total (3 of which are emitting) with a strictly left-right structure and 16 mixture components. The number of training cycles in embedded re-estimation was restricted to 5, and the variance floor for the Gaussian probability density functions was set to 1%. The initial model parameters were estimated using the flat-start method, in which the models were initialized with the global mean and variance. In the testing phase, the Viterbi algorithm was applied to determine the most likely model that best matched each test utterance.

TABLE I
WORD RECOGNITION RATE (%) IN MATCHED TRAIN/TEST SCENARIOS ACHIEVED BY THE GMM-HMM SYSTEM USING DIFFERENT CEPSTRAL FEATURES

Feature           | Neutral/neutral | Whisper/whisper
MFCC              | 98.52           | 97.80
TECC              | 99.41           | 99.01*
TEMFCC            | 99.27           | 98.13*
MFCC + Δ          | 99.65           | 99.24
TECC + Δ          | 99.92           | 99.85*
TEMFCC + Δ        | 99.86           | 99.28*
MFCC + Δ + ΔΔ     | 99.77           | 99.53
TECC + Δ + ΔΔ     | 99.95           | 99.94**
TEMFCC + Δ + ΔΔ   | 99.91           | 99.60*

(p < 0.05 *; p < 0.01 **; p < 0.006 ***; confidence interval = 95%)

V. RESULTS

The following subsections present the results of isolated word recognition obtained through 10-fold cross-validation.

A. ASR Performance Evaluation—Matched Train/Test Scenarios

This section presents the results of speaker-dependent word recognition in the matched train/test scenarios (neutral/neutral and whisper/whisper), performed on the whole Whi-Spe database using the GMM-HMM system without denoising autoencoder features. The word recognition results achieved by the GMM-HMM system are presented in Table I.

As expected for matched data, the recognition efficiency in both speech modes is very high. Although the "ceiling effect" has been reached, the recognition accuracies show two slight tendencies. Firstly, the MFCC-based features show a slightly lower recognition rate in both speech modes compared to the other two features. Secondly, in whisper recognition, the TECC-based features give the best results (TECC: 99.01%; TECC+Δ: 99.85%; TECC+Δ+ΔΔ: 99.94%), which suggests that the TEO and the Gammatone filterbank better describe whisper characteristics. In order to prove these statements, a statistical test was needed. For this purpose, the two-tailed Wilcoxon signed-rank test was used to analyze the statistical significance of these small differences in achieved average word recognition accuracies between the MFCC feature and the other cepstral features. P-values for the Wilcoxon test are indicated with asterisks in Table I. The obtained results confirm for all speakers that the Teager-based features are statistically significantly better (p < 0.05) in whisper recognition. The TECC+Δ+ΔΔ feature shows the highest statistical significance of all the features (p = 0.007). However, in the neutral/neutral scenario the choice of features does not show any significance.

TABLE II
WORD RECOGNITION RATE (%) IN MISMATCHED TRAIN/TEST SCENARIOS ACHIEVED BY THE GMM-HMM SYSTEM USING DIFFERENT CEPSTRAL FEATURES

Feature           | Neutral/whisper | Whisper/neutral
MFCC              | 45.19           | 62.13
TECC              | 54.42***        | 68.51
TEMFCC            | 47.85*          | 63.57
MFCC + Δ          | 59.67           | 67.22
TECC + Δ          | 65.29***        | 73.25
TEMFCC + Δ        | 59.94*          | 69.43
MFCC + Δ + ΔΔ     | 61.81           | 73.24
TECC + Δ + ΔΔ     | 71.33**         | 77.35
TEMFCC + Δ + ΔΔ   | 63.71           | 75.04

(p < 0.05 *; p < 0.01 **; p < 0.006 ***; confidence interval = 95%)

B. ASR Performance Evaluation—Mismatched Train/Test Scenarios

The usual problem of ASR systems occurs when the speaker switches from neutral speech to whisper, or vice versa. Such a problem is also noticeable and related to mismatched train/test scenarios in the application of GMM-HMM systems. This experiment investigates exactly that situation. The analysis was carried out for whispered speech recognition with an ASR system that was trained to recognize neutral speech, and, vice versa, for neutral speech recognition with an ASR system that was trained to recognize whispered speech. The averaged results achieved by the GMM-HMM without denoising autoencoder features for all speakers are presented in Table II.

In contrast to the matched train/test scenarios, the mismatched scenarios show significantly lower word recognition accuracies. However, there are three important observations. Firstly, adding dynamic features to the static features results in a significant improvement of word recognition in both speech modes for all three speech features: MFCC, TEMFCC and TECC. Secondly, in the whisper/neutral scenario the TECC-based features show a much higher word recognition rate than the other two features. This is even more noticeable in the neutral/whisper scenario. For example, in the case of using static features alone, TECC achieves 54.42% word recognition accuracy compared to 47.85% and 45.19% when TEMFCC and MFCC are applied, respectively. When delta features are added, the TECC+Δ feature again shows better performance with 65.29% correct word recognition, in contrast to 59.94% (TEMFCC+Δ) and 59.67% (MFCC+Δ). The best performance is achieved when both delta and delta-delta features are used, and in that case the TECC+Δ+ΔΔ feature once again demonstrates the best result, with a word recognition accuracy of 71.33% compared to 63.71% (TEMFCC) and 61.81% (MFCC). These observations are statistically confirmed with the two-tailed Wilcoxon test (p-values are marked with asterisks in Table II), which highlights once again that the TECC feature is the one that best characterizes whisper. Thirdly, the TECC features show the smallest difference between these two scenarios, at the level of 6.02%, especially when the delta and delta-delta features are added, in comparison to MFCC with 11.43% and TEMFCC with 11.33%. Better insight into the relations between all train/test scenarios is given in Fig. 5.

Fig. 5. Word recognition rate (%) for different train/test scenarios in the case of the expanded set of features (feature+Δ+ΔΔ).

The relative relations of our results (particularly the better recognition results in the whisper/neutral mismatch condition over the neutral/whisper mismatch condition) are in agreement with the results of the other two studies performed with HMM systems [3], [18]. Namely, in Ito's work [3], the HMM system also gives a higher word recognition rate in the whisper/neutral scenario (53%) compared to the neutral/whisper scenario (27%). The same tendency is observed in [18], where the word recognition rate in the whisper/neutral scenario is 48.36% compared to 36.24% in the neutral/whisper scenario. This phenomenon will be discussed in the following sections.
Fig. 6. Confusion matrices of the word recognition in two train/test scenarios (neutral/whisper on the left and whisper/neutral on the right) using the MFCC feature. The color bar scale to the right indicates probability, ranging from 1 down to 0.

C. Confusion Analysis

This section describes the analysis of confusion in word recognition in mismatched train/test scenarios. As an illustration, Fig. 6 presents the word recognition confusion matrices when the MFCC feature was utilized.

The vertical axis of the matrices marks the input stimuli words, while the horizontal axis denotes the output recognized words. In the graphical representations of the confusion matrices, the gray scale on their right side indicates the accuracy of word recognition: a darker shade denotes a higher probability in word recognition, where black indicates maximum probability (p = 1) and white indicates minimal probability (p = 0). The dark diagonals are clearly highlighted, and they represent correctly classified words. In addition to the diagonals, slightly brighter vertical lines and dots are observed, particularly in the neutral/whisper scenario, which indicate distinct confusions between words in whisper and neutral speech.

The analysis of confusion in the neutral/whisper scenario revealed several highly misclassified stimuli words, and some of them are marked with arrows in the confusion matrix.
It was found that these confusions occur in most cases when two words start with the same fricative and contain several consonants, such as /s/, /z/, /ts/, that are very similar in the neutral and whisper modes. For example, the fricative /z/ is the voiced pair of /s/, and in whisper, due to the lack of sonority, their confusion is very frequent because of their similar articulation and spectral structure. However, the mentioned confusions are not noticeable in the whisper/neutral scenario. The following question arises: why do they not occur in the whisper/neutral scenario? The possible answer is that in the whisper/neutral scenario the ASR system is trained with whisper, which has completely unvoiced sounds, while the input is speech with both voiced and unvoiced sounds. In this situation, it is sufficient for the ASR system to match the patterns of the unvoiced sounds, while the voiced sounds are surplus. On the other hand, in the neutral/whisper scenario the ASR system expects to get both voiced and unvoiced sounds, but instead gets only the unvoiced sounds of whisper. In line with this, the whisper/neutral scenario displays a higher recognition score than the neutral/whisper scenario (shown in Fig. 5), and consequently lower recognition confusion, as indicated by Fig. 6.

However, the confusion matrices in the case of TECC feature usage have a smaller number of misclassified words. This fact points out that the TECC feature is better for whisper recognition than the MFCC feature. The reason for this lower confusion level is the different kind of information extracted from the words in speech and whisper as a consequence of using the TEO and the Gammatone filterbank.
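Matrices like those in Fig. 6 can be reproduced from recognizer outputs with a short sketch (ours, not the authors' tooling):

```python
import numpy as np

def confusion_matrix(stimuli, recognized, vocab):
    """Rows: input stimuli words; columns: recognized words.

    Each row is normalized to a probability distribution, so the
    diagonal holds the per-word recognition accuracy as in Fig. 6.
    """
    idx = {w: i for i, w in enumerate(vocab)}
    cm = np.zeros((len(vocab), len(vocab)))
    for s, r in zip(stimuli, recognized):
        cm[idx[s], idx[r]] += 1.0
    return cm / np.maximum(cm.sum(axis=1, keepdims=True), 1.0)
```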
D. Reduction of Confusion Using Inverse Filtering
The phenomenon of better recognition results in the whisper/neutral mismatch condition over the neutral/whisper mismatch condition has also been noted in two studies with HMM systems [3], [18], but its deeper analysis has not been performed. While looking for an explanation, the following hypothesis was formed: the ASR system that was trained on whisper has better recognition in the whisper/neutral scenario than the ASR system that was trained on neutral speech has in the neutral/whisper scenario, because most of the whisper features are contained in neutral speech, while the opposite is not the case. Correspondingly, the ASR system that was trained on neutral speech was trained with speech features whose basis is composed of voicing that does not exist in whisper, and therefore whisper is recognized less well.

The way to test this hypothesis is to reduce the influence of voicing in neutral speech, making it more similar to whisper in terms of acoustic characteristics, and after such a modification to apply it in ASR training. According to Fig. 1, the whispered speech spectrum is very flat. On the other hand, in neutral speech, voicing is dominant in the domain of the first four formants of voiced phonemes, primarily vowels, i.e., at lower frequencies below 5 kHz. In order to make the speech spectrum more similar to whisper, it is necessary to reduce this spectral tilt. For this purpose, in the pre-processing stage we added an inverse filter given by:

$$IF(z) = 1/H(z) = 1 - \sum_{i=1}^{p} a_i z^{-i}, \qquad (9)$$

where $H(z)$ is the transfer function of the vocal tract, $a_i$ are the LPC coefficients of an utterance, and $p = 10$ is the order of the LPC filter. It is known that the inverse filter is the reciprocal of the all-pole filter $H(z)$ [37]. Hence, the frequency response of the inverse filter is reciprocal to the LPC spectral envelope of an utterance, as illustrated in Fig. 7. In this way, the inverse filter performs spectrum flattening for each word from the Whi-Spe database.
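A per-utterance sketch of Eq. (9) follows (our illustration, using the autocorrelation method for the LPC coefficients; the paper does not detail the estimation procedure):

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_coefficients(x, p=10):
    """LPC coefficients a_1..a_p from the Yule-Walker equations."""
    x = np.asarray(x, dtype=float)
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + p]
    return solve_toeplitz((r[:p], r[:p]), r[1:p + 1])

def inverse_filter(x, p=10):
    """Apply IF(z) = 1 - sum_i a_i z^-i (Eq. 9) to flatten the spectrum."""
    a = lpc_coefficients(x, p)
    return lfilter(np.concatenate(([1.0], -a)), [1.0], x)
```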
Fig. 7. Example of inverse filtering on a speech sample in neutral speech: (a) FFT spectrum of the speech. (b) LPC spectral envelope. (c) FFT spectrum after inverse filtering. (d) Frequency response of the inverse filter IF(z).

The final result of inverse filtering can be perceived from the long-term average spectra (LTAS) of the Whi-Spe database in Fig. 8.

Fig. 8. LTAS of the Whi-Spe database, before and after inverse filtering.

Comparing the shapes of the LTAS before and after inverse filtering, we can see that after the filtering, also known as spectrum "whitening" [38], the spectra of speech and whisper are flattened and more similar. To prove this achieved spectral similarity, the CDs between speech and whisper stimuli were measured once again. The average CD for all words before inverse filtering was 6.19 dB, while the new one has a noticeably lower value, 3.45 dB.

The changes of energy and spectral slope can be evaluated by the c0 and c1 cepstral distributions presented in Fig. 9. As can be seen, the energy level (c0 values) in both speech modes drops off slightly because of the suppression of spectral components during inverse filtering (compare with Fig. 3). Nevertheless, the c0 distributions are centered at the same position, one above the other.

Fig. 9. Normalized distributions of c0 and c1 coefficients in neutral and whispered speech after inverse filtering.

On the right side, the c1 distributions are also centered at the same position. This means that the spectral slope in neutral speech is now flattened and matched with the one in whispered speech. The shift of the c1 distribution in neutral speech is obvious when comparing Figs. 3 and 9. However, these spectral changes did not harm the formant structure in terms of formant locations and their visibility. With the stimuli spectra corrected in such a way, the experiments in word recognition were repeated.

The new recognition results in the matched scenarios achieved by the GMM-HMM after inverse filtering are very similar to those in Table I. For example, in the neutral/neutral scenario the average recognition accuracy is 99.82%, while in whisper/whisper it is 99.76% (when TECC+Δ+ΔΔ is used). The new results for the mismatched scenarios are presented in Table III.

TABLE III
WORD RECOGNITION RATE (%) IN MISMATCHED TRAIN/TEST SCENARIOS BEFORE AND AFTER INVERSE FILTERING, ACHIEVED BY THE GMM-HMM

                  | Neutral/whisper  | Whisper/neutral  | Difference in scenarios
Features          | Before | After   | Before | After   | Before | After
MFCC + Δ + ΔΔ     | 62.81  | 72.45***| 73.24  | 75.25   | 10.43  | 2.80
TECC + Δ + ΔΔ     | 71.33  | 76.57***| 77.35  | 79.21   | 6.02   | 2.64
TEMFCC + Δ + ΔΔ   | 63.71  | 71.42***| 75.04  | 75.16   | 12.33  | 3.74

(p < 0.05 *; p < 0.01 **; p < 0.006 ***; confidence interval = 95%)
Three facts are important to notice in the mismatched scenarios. Firstly, all the word recognition results after inverse filtering are better. According to the p-values from the Wilcoxon test, this improvement is statistically significant only in the neutral/whisper scenario, which is a consequence of the suppression of voicing effects during inverse filtering. Secondly, the TECC feature once again shows the best accuracy, with a maximum of 76.57%. Thirdly, the difference in word recognition rates between the mismatched scenarios is much smaller. This difference is minimal for the TECC feature, at 2.64%. These results prove the hypothesis that the voicing in the speech stimuli is the main cause of the difference in word recognition accuracies in mismatched scenarios.

TABLE IV
WORD RECOGNITION RATE (%) ACHIEVED BY THE PROPOSED ASR SYSTEM AND THE APPLICATION OF DDAE-BASED CEPSTRAL FEATURES

Feature                  | Neutral/neutral | Neutral/whisper
MFCC + Δ + ΔΔ + DDAE     | 99.83           | 85.97**
TECC + Δ + ΔΔ + DDAE     | 99.94           | 92.81**
TEMFCC + Δ + ΔΔ + DDAE   | 99.87           | 87.73***
DDAE (MFCC)              | 94.17           | 84.89**
DDAE (TECC)              | 95.83           | 92.74**
DDAE (TEMFCC)            | 95.47           | 85.97**

(p < 0.05 *; p < 0.01 **; p < 0.006 ***; confidence interval = 95%)

E. Application of Denoising Autoencoder

As explained in the previous sections, inverse filtering can easily be applied for fast generation of pseudo-whisper samples from neutral speech samples and used in the training of the DDAE. This section presents the results of the newly proposed ASR framework (illustrated in Fig. 4), which enables better whisper recognition in mismatched scenarios. The results are presented in Table IV. For simplicity, only the results of the neutral-trained recognizer models are presented, which are the most important from the standpoint of a real ASR application.

The results demonstrate three things. Firstly, the fusion of standard cepstral features and DDAE-based cepstral features considerably improves whisper recognition in the mismatched condition while keeping high performance in the matched neutral/neutral scenario. The obtained word recognition rates in the neutral/whisper scenario are 85.97%, 92.81% and 87.73% for the MFCC, TECC and TEMFCC features, respectively. This improvement in word recognition accuracy in comparison with the results from Table II is statistically confirmed with the Wilcoxon signed-rank test. Secondly, using only the DDAE features proves that the improvement in the mismatched scenario comes from the DDAE features themselves and not from the larger number of parameters. However, using only the DDAE features shows slightly worse results in the neutral/neutral scenario. Thirdly, the TECC feature once again demonstrates the highest score, proving to be the best feature for whisper recognition.

VI. CONCLUSION

Lately, applications of deep learning, with its strong modeling power, have helped to quickly spread the success of the DDAE in various industrial and machine learning tasks, such as image recognition, optical character recognition, computer vision, natural language and text processing, information retrieval, automatic speech recognition, etc. This paper extends this trend and demonstrates how a DDAE can be applied to solve a complex problem in ASR such as whispered speech recognition. We propose a new and more advanced approach based on the DDAE for isolated word recognition, which can also be applied in continuous speech recognition. Using a database of neutral and whispered speech (Whi-Spe), this article evaluates whispered speech recognition performance in mismatched train/test conditions and demonstrates how the DDAE, with the help of inverse filtering, can effectively filter out the effects of whispered
speech and reconstruct neutral cepstral features, leading to a significant performance gain in whisper recognition. The efficiency of this approach was demonstrated through a comparative study with the conventional GMM-HMM speech recognizer and three types of cepstral features: MFCC, TECC and TEMFCC.

The experimental results confirmed that the proposed model presents several advantageous characteristics, such as (i) a significantly lowered word error rate in mismatched train/test conditions together with high performance in matched train/test scenarios, (ii) easily acquirable whisper-robust features, and (iii) no need for real whisper data in the training process or for model adaptation. Furthermore, the experimental results demonstrate that the usage of Teager-based cepstral features outperforms traditional MFCC features in whisper recognition accuracy by nearly 10%. Thus, combining TECC features with the DDAE approach gives the best results and shows that the proposed framework can considerably reduce recognition errors and improve whisper recognition performance by 31% over the traditional HTK-MFCC baseline, thereby achieving a word recognition accuracy of 92.81%.

REFERENCES

[1] S. T. Jovičić and Z. Šarić, "Acoustic analysis of consonants in whispered speech," J. Voice, vol. 22, pp. 263–274, 2008.
[2] R. Morris, "Enhancement and recognition of whispered speech," Ph.D. dissertation, School Elect. Comput. Eng., Georgia Inst. Technol., Atlanta, GA, USA, 2003.
[3] T. Ito, K. Takeda, and F. Itakura, "Analysis and recognition of whispered speech," Speech Commun., vol. 45, pp. 139–152, 2005.
[4] B. P. Lim, "Computational differences between whispered and non-whispered speech," Ph.D. dissertation, Dept. Elect. Comput. Eng., Univ. Illinois at Urbana-Champaign, Champaign, IL, USA, 2011.
[5] C. Y. Yang, G. Brown, L. Lu, J. Yamagishi, and S. King, "Noise-robust whispered speech recognition using a non-audible-murmur microphone with VTS compensation," in Proc. 8th Int. Symp. Chinese Spoken Lang. Process., Hong Kong, China, Dec. 2012, pp. 220–223.
[6] A. Mathur, S. M. Reddy, and R. M. Hegde, "Significance of parametric spectral ratio methods in detection and recognition of whispered speech," EURASIP J. Adv. Signal Process., vol. 2012, p. 157, 2012.
[7] S. Ghaffarzadegan, H. Bořil, and J. H. L. Hansen, "Generative modeling of pseudo-target domain adaptation samples for whispered speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Brisbane, Australia, Apr. 2015, pp. 5024–5024.
[8] S. Jou, T. Schultz, and A. Waibel, "Adaptation for soft whisper recognition using a throat microphone," in Proc. INTERSPEECH, Jeju Island, South Korea, Oct. 2004, pp. 5–8.
[9] F. Tao and C. Busso, "Lipreading approach for isolated digits recognition under whisper and neutral speech," in Proc. INTERSPEECH, Singapore, Sep. 2014, pp. 1154–1158.
[10] D. Dimitriadis, P. Maragos, and A. Potamianos, "Auditory Teager energy cepstrum coefficients for robust speech recognition," in Proc. EUSIPCO, Nice, France, Jul. 2005, pp. 3013–3016.
[11] P. Heracleous, "Using Teager energy cepstrum and HMM distances," Int. J. Inf. Commun. Eng., vol. 5, no. 1, pp. 31–37, 2009.
[12] G. Zhou, J. Hansen, and J. Kaiser, "Classification of speech under stress based on features derived from the nonlinear Teager energy operator," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Seattle, WA, USA, May 1998, pp. 549–552.
[13] B. Marković, S. T. Jovičić, J. Galić, and Đ. Grozdić, "Whispered speech database: Design, processing and application," in Proc. 16th Int. Conf. Text, Speech, Dialogue, Pilsen, Czech Republic, Sep. 2013, pp. 591–598.
[14] C. Zhang and J. H. L. Hansen, "Whisper-island detection based on unsupervised segmentation with entropy-based speech feature processing," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 4, pp. 883–894, May 2011.
[15] S. Ghaffarzadegan, H. Bořil, and J. H. L. Hansen, "UT-Vocal Effort II: Analysis and constrained-lexicon recognition of whispered speech," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Florence, Italy, May 2014, pp. 2544–2548.
[16] T. Tran, S. Mariooryad, and C. Busso, "Audiovisual corpus to analyze whisper speech," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Vancouver, BC, Canada, May 2013, pp. 8101–8105.
[17] P. X. Lee et al., "A whispered Mandarin corpus for speech technology applications," in Proc. INTERSPEECH, Singapore, Sep. 2014, pp. 1598–1602.
[18] J. Galić, S. T. Jovičić, Đ. Grozdić, and B. Marković, "HTK-based recognition of whispered speech," in Proc. Int. Conf. Speech Comput., Novi Sad, Serbia, Oct. 2014, pp. 251–258.
[19] C. Zhang and J. H. L. Hansen, "Advancements in whisper-island detection using the linear predictive residual," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Dallas, TX, USA, Mar. 2010, pp. 5170–5173.
[20] S. Ghaffarzadegan, H. Bořil, and J. H. L. Hansen, "Model and feature based compensation for whispered speech recognition," in Proc. INTERSPEECH, Singapore, Sep. 2014, pp. 2420–2424.
[21] H. Bořil and J. H. L. Hansen, "Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environments," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 6, pp. 1379–1393, Aug. 2010.
[22] S. Ghaffarzadegan, H. Bořil, and J. H. L. Hansen, "Generative modeling of pseudo-target domain adaptation samples for whispered speech recognition," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 10, pp. 1705–1720, Oct. 2016.
[23] Đ. T. Grozdić, B. Marković, J. Galić, and S. T. Jovičić, "Application of neural networks in whispered speech recognition," in Proc. 20th Telecommun. Forum, Belgrade, Serbia, Nov. 2012, pp. 728–731.
[24] S. T. Jovičić, Z. Kašić, M. Đorđević, and M. Rajković, "Serbian emotional speech database: Design, processing and evaluation," in Proc. 9th Conf. Speech Comput., St. Petersburg, Russia, Sep. 2004, pp. 77–81.
[25] S. T. Jovičić, "Formant feature differences between whispered and voiced sustained vowels," Acta Acust. united Acust., vol. 84, pp. 739–743, 1998.
[26] X. Fan and J. H. L. Hansen, "Speaker identification within whispered speech audio streams," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 5, pp. 1408–1421, Jul. 2011.
[27] C. Zhang and J. H. L. Hansen, "Analysis and classification of speech mode: Whispered through shouted," in Proc. INTERSPEECH, Antwerp, Belgium, Aug. 2007, pp. 2396–2399.
[28] K. J. Kallail and F. W. Emanuel, "Formant-frequency differences between isolated whispered and phonated vowel samples produced by adult female subjects," J. Speech Hear. Res., vol. 27, pp. 245–251, 1984.
[29] J. F. Kaiser, "Some useful properties of Teager's energy operators," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Minneapolis, MN, USA, Apr. 1993, pp. 149–152.
[30] H. M. Teager, "Some observations on oral air flow during phonation," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 5, pp. 599–601, Oct. 1980.
[31] P. Maragos, J. F. Kaiser, and T. F. Quatieri, "Energy separation in signal modulations with application to speech analysis," IEEE Trans. Signal Process., vol. 41, no. 10, pp. 3024–3051, Oct. 1993.
[32] E. Kvedalen, "Signal processing using the Teager energy operator and other nonlinear operators," Cand. Scient. thesis, Univ. Oslo, Oslo, Norway, 2003.
[33] P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proc. 25th Int. Conf. Mach. Learn., Helsinki, Finland, Jul. 2008, pp. 1096–1103.
[34] Đ. T. Grozdić, S. T. Jovičić, J. Galić, and B. Marković, "Application of inverse filtering in enhancement of whisper recognition," in Proc. IEEE Neural Netw. Appl. Elect. Eng., Belgrade, Serbia, Nov. 2014, pp. 157–162.
[35] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Comput., vol. 14, pp. 1771–1800, 2002.
[36] S. Young et al., The HTK Book (for HTK Version 3.2). Cambridge, U.K.: Cambridge Univ. Eng. Dept., 2002.
[37] C. J. Leggetter and P. C. Woodland, "Flexible speaker adaptation using maximum likelihood linear regression," in Proc. ARPA Spoken Lang. Technol. Workshop, Austin, TX, USA, Jan. 1995, pp. 110–115.
[38] D. Havelock, S. Kuwano, and M. Vorländer, Handbook of Signal Processing in Acoustics. New York, NY, USA: Springer, 2008.

Authors' photographs and biographies not available at the time of publication.
