
2022 National Conference on Communications (NCC)

WHISPER TO NEUTRAL MAPPING USING I-VECTOR SPACE LIKELIHOOD AND A COSINE SIMILARITY BASED ITERATIVE OPTIMIZATION FOR WHISPERED SPEAKER VERIFICATION

Abinay Reddy Naini¹, Achuth Rao MV², and Prasanta Kumar Ghosh²
¹Electrical and Computer Engineering, The University of Texas at Dallas, Richardson, TX, USA
²Electrical Engineering, Indian Institute of Science, Bangalore, India

978-1-6654-5136-9/22/$31.00 ©2022 IEEE | DOI: 10.1109/NCC55593.2022.9806732

ABSTRACT

In this work, we propose an iterative optimization algorithm to learn a feature mapping (FM) from whispered to neutral speech features. Such an FM can be used to improve the performance of speaker verification (SV) systems when presented with whispered speech. In one of our previous works, the equal error rate (EER) in an SV task was shown to improve by ∼24% when the FM network was trained using a cosine similarity based loss function rather than a mean squared error based objective function. As the mapped whispered features obtained in this manner may not lie in the trained i-vector space, in this work we iteratively optimize the i-vector space likelihood (by updating the T-matrix) and a cosine similarity based loss function for learning the parameters of the FM network. The proposed iterative optimization improves the EER by ∼26% compared to learning the FM network parameters with only the cosine similarity based loss function and no T-matrix update, which is a special case of the proposed iterative optimization.

Index Terms— Speaker verification, whispered speech, feature mapping, cosine similarity.

1. INTRODUCTION

Just like neutral speech, whispered speech is one of the natural ways of communication; it is primarily characterized by the absence of glottal vibration [1, 2]. Whispered speech can occur in private conversations as well as in pathological conditions such as laryngectomy [3, 4, 5, 6]. As recent speech-enabled virtual assistant devices become robust and natural in the neutral speech setup, the demand for making these devices robust to whispered speech is increasing [7, 8]. The whispered speech spectrum is significantly different from the corresponding neutral speech spectrum. The key features of whispered speech include lack of pitch [9], hyper-articulation to preserve intelligibility [10], a flatter spectral slope [11, 3], and a significant shift in the lower-order formants [2] compared to those in neutral speech. These differences between neutral and whispered speech make a typical speaker verification (SV) system, where a speaker is enrolled using neutral speech, perform poorly when tested using whispered speech [12, 13, 14].

A typical speaker verification system verifies whether a given test speech sample belongs to an enrolled speaker or not [15, 16]. Extensive research has been done on neutral SV, where only neutral speech is used for enrolling the speaker as well as for testing. In the context of neutral SV, front-end factor analysis [17] and deep neural network (DNN) embedding [18, 19, 20] based methods are considered the state-of-the-art. Although DNN embedding based methods deliver promising results for neutral speech, extending these methods to whispered SV, where a speaker is enrolled using neutral and/or whispered speech and tested using only whispered speech, is difficult. This is mainly due to the absence of large whispered speech corpora. The lack of large parallel whispered and neutral speech corpora also makes it challenging to obtain a reliable feature mapping (FM) from whispered to neutral speech features.

Despite limited research on whispered SV, several methods have been proposed in the literature to reduce the gap between whispered and neutral SV performance. These methods can be broadly classified into two categories: 1) use of robust features to improve whispered SV performance [21, 22, 23, 24, 13]; 2) development of various FM schemes [11, 25, 12], where the mapped whispered features are used during the test phase to improve whispered SV performance. A detailed survey of different whispered SV methods can be found in [26]. Among these, Sarria et al. [25] explored a DNN based FM from whispered to neutral features, trained by minimizing the mean squared error (MSE) between dynamic time-warped (DTW) neutral and whispered features. In our previous work [12], it was shown that an MSE based objective function for FM is not appropriate for the task of SV. Hence, we proposed a cosine similarity based objective function in the i-vector space, which resulted in a better FM from whispered to neutral speech features. Such FM network training uses the total variability space, i.e., the T-matrix, in its objective function, and the T-matrix is trained using neutral and whispered speech. As we progress through


[Fig. 1 appears here: block diagram of the training, enrollment, and testing pipelines.]

Fig. 1. Block diagram of the whispered speaker verification system with the proposed feature mapping (FM). The feature extraction step is shown in red, the iterative training step in blue, the FM in green, and the objective function in magenta.

the FM training, which results in mapped whispered features, the super vectors computed using these features may not lie in the trained total variability space, affecting the training process and preventing a reliable FM. Further, even if the T-matrix is retrained including the mapped whispered features, the performance is still affected, because the retrained T-matrix is not used in the objective function while training the FM network.

To overcome this limitation, in this paper we propose an iterative optimization scheme using the i-vector space (total variability space) likelihood and a cosine similarity based objective function to train an FM from whispered to neutral speech features [12]. We use this mapping with the front-end i-vector setup to perform whispered SV. We have experimented with 1882 speakers comprising 186708 neutral and 26892 whispered recordings for the whispered SV task. These experiments reveal that the equal error rate (EER) using the proposed method is lower than that using the best baseline by ∼26% (relative).

2. PROPOSED WHISPERED SV SYSTEM

The block diagram in Fig. 1 summarizes the steps of the proposed whispered SV system. Detailed explanations of all the sub-blocks are provided below.

2.1. Feature extraction

Given a speech signal, which is pre-emphasized with a filter coefficient of α, a 13-dimensional MFCC feature vector is computed using a window length Tw with a shift of Ts. To add temporal dynamics to the feature vector, velocity and acceleration coefficients are concatenated, resulting in a 39-dimensional feature vector [23], which is used in all experiments in this work.

2.2. Iterative Training

In the proposed iterative training process, we find the optimum FM and the total variability space, i.e., the T-matrix. This training process involves three main steps: (i) i-vector training, where we extract an initial estimate of the i-vector space (T-matrix column space) using the training data [17]; (ii) FM training, where we obtain the FM from whispered to neutral features using a cosine similarity based objective function [12]; (iii) an iterative optimization, where we iteratively optimize the i-vector space likelihood and the FM until convergence. These three steps are explained in detail below.

(i) i-vector training: In the i-vector training step [17], a Gaussian mixture model (GMM) based universal background model (UBM) is trained using feature vectors of dimension F, obtained from the training speakers' speech. The mean vectors of the speaker-dependent GMM are concatenated to form a super vector (ms) of dimension CF, where C is the number of mixtures in the UBM. A speaker-dependent super vector is then modeled as ms = m + Tw, where m is a speaker- and channel-independent super vector, T is a tall and low-rank matrix of dimension CF × d, and w is referred to as the i-vector (with dimension d), which is sampled from the standard normal distribution. This i-vector training step is used as an initialization for the iterative training.

(ii) FM training: In the FM training stage, an FM is learnt from whispered features to neutral features. At test time, such a transform is used to map whispered features to improve the whispered SV performance. We explored an affine transform (y = Ax + b) and a DNN based transform for the FM. We also explored a long short-term memory (LSTM) network to include the temporal context in learning the FM. We used a cosine similarity based objective function [12] to train the FM. The goal of this objective function is to train an FM from whispered features to neutral features such that the i-vector computed from the mapped features is close, in terms of cosine similarity, to the i-vector computed using the neutral features. As shown in Fig. 1, a mapping fΘ (with parameter set Θ = {θi, 1 ≤ i ≤ N}) is learned to map whispered features yw to ym. An i-vector is computed using the mapped features (ym), which is denoted as wm. Similarly, an i-vector is computed using the corresponding neutral features, denoted as wn. The cosine similarity between wn and wm is given by φ(wn, wm) = ⟨wn, wm⟩ / (‖wn‖ ‖wm‖), where ⟨a, b⟩ indicates the inner product between the vectors a and b, and ‖a‖ indicates the norm of a vector a. The derivatives of the loss function (1 − φ(wn, wm)) with respect to the FM parameters are computed using the chain rule as follows:

    dφ(wn, wm)/dθi = (dφ(wn, wm)/dwm)ᵀ (dwm/dym) (dym/dθi),  1 ≤ i ≤ N    (1)

(iii) Proposed iterative optimization: Given a sequence of whispered MFCC features yw = {y1w, y2w, ..., yLw} and neutral MFCC features yn = {y1n, y2n, ..., yMn}, the total likelihood of the model (ms = m + Tw) is P([yn, yw]|w). The training process uses the expectation-maximization (EM) algorithm [27] to obtain the optimum T, Σ (the diagonal covariance matrix of the noise in the ms model), and w. In the expectation step, for a given estimate of T and Σ, the expectations E[w] (the i-vector) and E[wwᵗ] are computed. In the maximization step, the T-matrix and Σ are updated using [17, 28]

    [T, Σ] = argmax over T̂, Σ̂ of E[log(P([yn, yw]|w))]    (2)

In the FM training step, as mentioned above, for a given [yn, yw], T-matrix, and Σ, the FM parameters (Θ) are updated using eq. 1. This step is indicated as ψopt(T, Σ, [yn, yw]). After the FM training, the given whispered MFCCs (yw) are passed through the FM to obtain the mapped features (ym). Finally, we repeat the expectation step, the maximization step (eq. 2), and the FM training step (ψopt(T, Σ, [yn, yw])) iteratively until convergence. The algorithm for the training is described below, where vi indicates the value of parameter v in the i-th iteration.

Algorithm 1 Proposed Iterative Training Algorithm
Input: training MFCC features ([yn, yw])
Description:
1: Obtain the UBM, T0 matrix, and Σ0 using [yn, yw].
2: Θ0 = ψopt(T0, Σ0, [yn, yw]);
3: ym = FM_Θ0(yw);
4: for i = 1 : k
5:     Obtain the expectations (E[w], E[wwᵗ]) using [yn, ym];
6:     [Ti, Σi] = argmax over T̂, Σ̂ of E[log(P([yn, ym]|w))];
7:     Θi = ψopt(Ti, Σi, [yn, yw]);
8:     ym = FM_Θi(yw);
9: return Θk, Tk

2.3. i-vector enrollment/testing

In the enrollment stage, we obtained a reference i-vector for each enrolled speaker by taking the mean of all the i-vectors from that speaker. We also obtained an optimal subspace, in which the enrolled speakers are linearly discriminated, using linear discriminant analysis (LDA) [29]. During testing, we first computed the i-vector from the test speech utterance, then reduced the dimension of both the reference and test i-vectors using the learned LDA. Finally, we used the cosine similarity between the resulting vectors, with thresholding, to make an SV decision.

3. EXPERIMENTS AND RESULTS

3.1. Database

In our experiments, five different datasets were used: (1) the wTIMIT corpus [30], (2) the CHAINS corpus [31], (3) VoxCeleb1 [32], (4) the TIMIT corpus [33], and (5) wSPIRE. Among these, wSPIRE is an in-house recorded dataset containing nine speakers. Each speaker speaks 460 MOCHA-TIMIT sentences [34] in both neutral and whispered modes, recorded at 16kHz. Details of this dataset can be found in [35]. The total number of male/female speakers considered from all the datasets is provided in Table 1. We resample recordings from all corpora to a common sampling frequency of 16kHz.

Table 1. Number of male/female speakers and recordings per speaker for all five databases considered in this work. * indicates that the number can be different depending on the experimental condition. tr indicates training.

| Data split         | VoxCeleb1 | wTIMIT | TIMIT | CHAINS | wSPIRE | Total Ne | Total Wh |
|--------------------|-----------|--------|-------|--------|--------|----------|----------|
| Num. of Speakers   | 1251      | 48     | 630   | 36     | 9      | 178k     | 15.4k    |
| # of FEMALE        | 563       | 24     | 192   | 16     | 3      | -        | -        |
| # of MALE          | 688       | 24     | 438   | 20     | 6      | -        | -        |
| UBM tr             | 1251      | 0      | 462   | 0      | 0      | 157k     | 0        |
| T-matrix tr        | 0         | 24     | 462   | 0      | 9      | 19.4k    | 14.8k    |
| FM tr              | 0         | 14     | 0     | 0      | 9      | 14.8k    | 14.8k    |
| Enrollment/LDA tr  | 0         | 24     | 100   | 36     | 0      | 1280     | 480*     |
| Testing            | 0         | 24     | 100*  | 36     | 0      | 320*     | 120*     |

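To make the total variability model ms = m + Tw of Sec. 2.2(i) concrete, the toy sketch below recovers w from a noise-free supervector by least squares. This is a simplified point estimate with illustrative dimensions of my choosing; the actual system computes the posterior mean E[w] under the probabilistic EM formulation of [17], with CF = 512 × 39 and d = 400.

```python
import numpy as np

rng = np.random.default_rng(0)
CF, d = 60, 5  # toy supervector and i-vector dimensions (real system: 512*39 and 400)

m = rng.normal(size=CF)        # speaker- and channel-independent supervector
T = rng.normal(size=(CF, d))   # tall, low-rank total variability matrix
w_true = rng.normal(size=d)    # i-vector, drawn from N(0, I)
m_s = m + T @ w_true           # speaker-dependent supervector

# Least-squares point estimate of the i-vector from the supervector.
w_hat, *_ = np.linalg.lstsq(T, m_s - m, rcond=None)
print(np.allclose(w_hat, w_true))  # True in this noise-free toy case
```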
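The loss 1 − φ(wn, wm) that drives the FM training can be sketched as follows. This is plain NumPy for illustration, not the TensorFlow/Keras implementation the authors used; in the paper this loss is minimized over the FM parameters via the chain rule of eq. 1.

```python
import numpy as np

def cosine_loss(w_n, w_m):
    """1 - phi(w_n, w_m): zero when the two i-vectors are perfectly aligned."""
    phi = np.dot(w_n, w_m) / (np.linalg.norm(w_n) * np.linalg.norm(w_m))
    return 1.0 - phi

w_n = np.array([1.0, 2.0, 3.0])
print(cosine_loss(w_n, 2.0 * w_n))  # ~0: scaling does not change the loss
print(cosine_loss(w_n, -w_n))       # ~2: anti-aligned i-vectors give the maximum loss
```

Because the loss depends only on the angle between wn and wm, it rewards i-vectors pointing in the same direction regardless of magnitude, which matches the cosine-distance scoring used at test time (Sec. 2.3).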

Table 2. Performance comparison of the proposed and the baseline methods for different experimental conditions with Nwe = 0. EERne and EERwh indicate the EER for only neutral and only whispered test utterances, respectively.

| Method   | Mapping type | T-matrix data | EERne | EERwh |
|----------|--------------|---------------|-------|-------|
| WFM      | -            | Ne            | 4.12  | 23.54 |
| WFM      | -            | Ne+Wh         | 5.31  | 22.7  |
| FM_MSE   | DNN          | Ne+Wh         | 6.64  | 20.08 |
| FM_MSE   | Af           | Ne+Wh         | 6.64  | 20.2  |
| FM_MSE   | LSTM         | Ne+Wh         | 6.64  | 19.42 |
| FM_cs^0  | DNN          | Ne+Wh         | 5.31  | 15.06 |
| FM_cs^0  | Af           | Ne+Wh         | 5.31  | 15.12 |
| FM_cs^0  | LSTM         | Ne+Wh         | 5.31  | 14.96 |
| FM_cs^k  | DNN          | Ne+mW         | 5.73  | 12.28 |
| FM_cs^k  | Af           | Ne+mW         | 5.52  | 12.64 |
| FM_cs^k  | LSTM         | Ne+mW         | 5.48  | 11.72 |

3.2. Experimental setup

In the experimental stage, we divided the recordings from all five datasets into training and testing speakers, as shown in Table 1. In the enrollment/LDA training and testing phases, only ten utterances each of neutral and whispered speech are considered. Among these ten utterances, eight are used for enrollment/LDA training, and the remaining two neutral/whisper pairs from wTIMIT and CHAINS are used for testing. However, in the i-vector training phase, all the available utterances are considered, as mentioned in Table 1. To evaluate the proposed SV method, we considered two experiments. In the first experiment, each speaker is enrolled using only eight neutral utterances (Nne = 8), without including any whispered utterances (Nwe = 0). In the second experiment, along with the eight neutral utterances, we varied the number of whispered utterances (Nwe = 1, 2, 4, 6, 8) used for the enrollment/LDA training.

We considered two baseline schemes for the experiments. In the first baseline (WFM), no FM is performed, and for both whispered and neutral speech, i-vectors are computed directly using the MFCCs of the test utterance. In the second baseline (FM_MSE), we mapped whispered MFCCs using a DNN based FM, which is trained by optimizing the MSE between the DTW-aligned whispered and neutral MFCC features [25]. We compared results from these baseline schemes with the proposed approach, referred to as FM_cs^k, where k is the number of iterations performed. We also considered the special case of the proposed approach with k = 0 (FM_cs^0) as a baseline, which is identical to the scheme proposed by Abinay et al. [12]. This corresponds to the first step of the proposed iterative optimization.

To understand the effect of the type of data used for T-matrix training, we considered two types of data. The first type (Ne) uses features computed using only neutral data, and the second type (Ne+Wh) uses features computed using both neutral and whispered data. The data used for T-matrix training is the same in the cases of FM_cs^0 and FM_cs^k. However, for a given iteration of FM_cs^k, along with the neutral speech (Ne), we used mapped whispered speech (mW) features, which are obtained using the FM learned with the T-matrix from the previous iteration.

MFCC features are extracted using α = 0.97, Tw = 25ms, and Ts = 10ms. In the i-vector extraction we used C = 512, F = 39, and d = 400. In the DNN based FM, we used two DNN layers of 39 units each, with tanh and linear activations respectively, along with a dropout factor of 0.1 [36]. For the LSTM based FM, we used a 39-unit LSTM layer followed by a 39-unit DNN layer with linear activation.

We used the equal error rate (EER) as the evaluation metric for SV; it is the error rate of the SV system at the operating point where the false acceptance rate of impostors and the false rejection rate of the enrolled speakers are equal [37]. We implemented the feature mapping in TensorFlow [38] and Keras [39]. We used 20% of the FM training data as the validation set. We optimized the objective function using the gradients shown in eq. 1 and the Adam optimizer [40], until the validation error increased.

3.3. Results & discussion

Table 2 shows the comparison of EER for the proposed FM_cs^k and the FM_cs^0 [12] methods, along with the two baseline schemes (WFM, FM_MSE [25]), in different experimental conditions for the whispered SV task. We observe from Table 2 that, in all experimental conditions, EERne is significantly lower than EERwh. This shows that the i-vectors computed using the whispered MFCCs of a speaker deviate largely, in terms of cosine similarity, from the i-vector computed using the corresponding speaker's neutral MFCCs. Among the proposed FM_cs^k mappings, the LSTM based FM performed better than both the DNN and affine transform (Af) based FMs. This could be due to the LSTM's ability to capture the temporal context while learning the mapping parameters. We can also observe that the LSTM based FM using the cosine similarity based objective function (FM_cs^0) showed an improvement of ∼24% (relative) in EERwh compared to the best baseline condition (FM_MSE with LSTM based FM). The LSTM based FM using iterative training (FM_cs^k) showed a further improvement over FM_cs^0 of ∼21% (relative) in EERwh.

Table 3. Comparison of EER for different values of Nwe between the proposed and WFM methods.

| Nwe             | 0     | 1     | 2     | 4    | 6    | 8    |
|-----------------|-------|-------|-------|------|------|------|
| WFM             | 22.7  | 11.31 | 9.71  | 7.13 | 6.46 | 6.44 |
| FM_cs^0 (LSTM)  | 15.06 | 11.43 | 10.68 | 7.54 | 7.18 | 7.12 |
| FM_cs^k (LSTM)  | 11.42 | 10.64 | 8.93  | 7.34 | 6.85 | 6.82 |
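The EER reported in the tables can be computed from genuine (enrolled-speaker) and impostor score sets by sweeping a decision threshold. The sketch below uses a simple grid over the observed scores rather than an ROC-based interpolation, so it is an approximation on small score sets; the toy score arrays are invented for illustration.

```python
import numpy as np

def eer(genuine, impostor):
    """Return the EER (%) at the threshold where FAR and FRR are closest."""
    best = (1.0, None)
    for t in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)  # impostors accepted at threshold t
        frr = np.mean(genuine < t)    # enrolled speakers rejected at threshold t
        if abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return 100.0 * best[1]

genuine = np.array([0.9, 0.8, 0.75, 0.6, 0.55])   # toy cosine scores, same speaker
impostor = np.array([0.5, 0.45, 0.4, 0.62, 0.3])  # toy cosine scores, different speaker
print(round(eer(genuine, impostor), 1))  # 20.0: FAR = FRR = 20% at the crossing point
```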


[Fig. 2 appears here: panels (a) and (b) show the learned matrix A on a magnitude colour scale; panel (c) shows the reconstructed spectrum over 0–4000 Hz.]

Fig. 2. (a, b): Learned matrix A for FM_cs^0 and FM_cs^k, respectively; (c): spectrum reconstructed from the static MFCC bias vector (FM_cs^k, FM_cs^0).

Table 2 shows, in the WFM case, that using whispered features along with neutral features in the T-matrix training can improve the whispered SV performance, with a small drop in the neutral SV performance. However, the drop in neutral SV performance is smaller for the proposed FM methods than for the baseline FM method (FM_MSE).

Table 3 shows the EER comparison for whispered SV using the proposed FM (FM_cs^k) against the best FM types among the baseline schemes (WFM, FM_cs^0 with LSTM) for different values of Nwe. As WFM with Nwe > 0 uses whispered MFCCs without any FM in the enrollment, it acts as a lower bound in terms of EER for high values of Nwe. Between the FM_cs^k and FM_cs^0 FM types, FM_cs^k performed better at all values of Nwe, and its EER gap to WFM is smaller than that of FM_cs^0 at high values of Nwe. This shows that FM_cs^k preserved the speech spectral characteristics better than FM_cs^0 in the mapping. WFM requires at least one whispered utterance in the enrollment to perform comparably with the best performing proposed FM type without any whispered data in the enrollment. We also evaluated FM_cs^k when the raw whispered MFCCs are added, along with the mapped whispered and neutral MFCCs, to the T-matrix training in each iteration. This further improved the performance by ∼8% (relative) over the best performing case (FM_cs^k (LSTM)).

Figs. 2(a) and 2(b) show the trained matrices (A) of the Af based FM for FM_cs^0 and FM_cs^k, respectively. It is clear from the figures that the learned matrices are close to diagonal. However, the mapping learned using FM_cs^k captures information from the first and second derivatives of the features, unlike that of FM_cs^0. The diagonal entries show that the 11th to 13th static MFCCs are suppressed in the Af based FM for both FM_cs^k and FM_cs^0, which is also true for the corresponding derivative coefficients. Fig. 2(c) shows the spectrum reconstructed from the bias vector using only the first 13 coefficients. This is similar to the spectrum of the glottal flow. We observed that the bias vector did not change much between the FM_cs^k and FM_cs^0 methods.

4. CONCLUSION

In this work, we showed that an iterative optimization of the i-vector space likelihood and a cosine similarity based objective function for an FM results in a more reliable mapping from whispered features to neutral features for the whispered SV task. We experimented with different FM models: an affine transform, a DNN, and an LSTM. Among these, the LSTM model trained using the proposed iterative optimization showed state-of-the-art results for whispered SV in terms of EER. The proposed FM methods do not require any DTW between whispered and neutral features, which avoids possible errors in the frame alignment. As these FM methods can be extended to end-to-end SV models, given the availability of a large whispered speech corpus, we want to explore this as part of our future work.

5. REFERENCES

[1] C. Zhang and J. H. Hansen, "Analysis and classification of speech mode: whispered through shouted," in Eighth Annual Conference of the International Speech Communication Association, 2007, pp. 2289–2292.

[2] S. T. Jovičić, "Formant feature differences between whispered and voiced sustained vowels," Acta Acustica united with Acustica, vol. 84, no. 4, pp. 739–743, 1998.

[3] X. Fan and J. H. Hansen, "Speaker identification within whispered speech audio streams," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 5, pp. 1408–1421, 2011.

[4] G. N. Meenakshi and P. K. Ghosh, "Robust whisper activity detection using long-term log energy variation of sub-band signal," IEEE Signal Processing Letters, vol. 22, no. 11, pp. 1859–1863, 2015.

[5] S. Adler, "Speech after laryngectomy," The American Journal of Nursing, vol. 69, no. 10, pp. 2138–2141, 1969. [Online]. Available: http://www.jstor.org/stable/3454024

[6] A. N. Reddy, A. M. Rao, G. N. Meenakshi, and P. K. Ghosh, "Reconstructing neutral speech from tracheoesophageal speech," Proc. Interspeech 2018, pp. 1541–1545, 2018.

[7] M. Cotescu, T. Drugman, G. Huybrechts, J. Lorenzo-Trueba, and A. Moinet, "Voice conversion for whispered speech synthesis," IEEE Signal Processing Letters, 2019.

[8] A. R. Naini, M. Satyapriya, and P. K. Ghosh, "Whisper activity detection using CNN-LSTM based attention pooling network trained for a speaker identification task," Proc. Interspeech 2020, pp. 2922–2926, 2020.

[9] V. C. Tartter, "What's in a whisper?" The Journal of the Acoustical Society of America, vol. 86, no. 5, pp. 1678–1683, 1989.

[10] M. J. Osfar, "Articulation of whispered alveolar consonants," Ph.D. dissertation, Urbana, Illinois, 2011.

[11] X. Fan and J. H. Hansen, "Speaker identification for whispered speech based on frequency warping and score competition," in Ninth Annual Conference of the International Speech Communication Association, 2008, pp. 1313–1316.

[12] A. R. Naini, M. Achuth Rao, and P. K. Ghosh, "Whisper to neutral mapping using cosine similarity maximization in i-vector space for speaker verification," Proc. Interspeech 2019, pp. 4340–4344, 2019.

[13] A. R. Naini, A. Rao MV, and P. K. Ghosh, "Formant-gaps features for speaker verification using whispered speech," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6231–6235.

[14] R. K. Das and H. Li, "On the importance of vocal tract constriction for speaker characterization: The whispered speech study," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7119–7123.

[15] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19–41, 2000.

[16] B. S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," The Journal of the Acoustical Society of America, vol. 55, no. 6, pp. 1304–1312, 1974.

[17] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.

[18] E. Variani, X. Lei, E. McDermott, I. Lopez-Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in ICASSP, 2014, pp. 4052–4056.

[19] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 5329–5333.

[20] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," arXiv preprint arXiv:1806.05622, 2018.

[21] X. Fan and J. H. Hansen, "Speaker identification for whispered speech using modified temporal patterns and MFCCs," in Tenth Annual Conference of the International Speech Communication Association, 2009, pp. 896–899.

[22] M. O. Sarria-Paja and T. H. Falk, "Strategies to enhance whispered speech speaker verification: A comparative analysis," Canadian Acoustics, vol. 43, no. 4, pp. 31–45, 2015.

[23] M. Sarria-Paja and T. H. Falk, "Fusion of auditory inspired amplitude modulation spectrum and cepstral features for whispered and normal speech speaker verification," Computer Speech & Language, vol. 45, pp. 437–456, 2017.

[24] M. Sarria-Paja and T. H. Falk, "Fusion of bottleneck, spectral and modulation spectral features for improved speaker verification of neutral and whispered speech," Speech Communication, vol. 102, pp. 78–86, 2018.

[25] M. Sarria-Paja, M. Senoussaoui, D. O'Shaughnessy, and T. H. Falk, "Feature mapping, score-, and feature-level fusion for improved normal and whispered speech speaker verification," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 5480–5484.

[26] V. Vestman, D. Gowda, M. Sahidullah, P. Alku, and T. Kinnunen, "Speaker recognition from whispered speech: A tutorial survey and an application of time-varying linear prediction," Speech Communication, vol. 99, pp. 62–79, 2018.

[27] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–22, 1977.

[28] Y. Zhang, "Useful derivations for i-vector based approach to data clustering in speech recognition," 2011.

[29] S. Balakrishnama and A. Ganapathiraju, "Linear discriminant analysis: a brief tutorial," Institute for Signal and Information Processing, vol. 18, pp. 1–8, 1998.

[30] B. P. Lim, "Computational differences between whispered and non-whispered speech," Ph.D. dissertation, University of Illinois at Urbana-Champaign, 2011.

[31] F. Cummins, M. Grimaldi, T. Leonard, and J. Simko, "The CHAINS corpus: Characterizing individual speakers," in Proc. of SPECOM, vol. 6, 2006, pp. 431–435.

[32] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.

[33] J. S. Garofolo, "TIMIT acoustic phonetic continuous speech corpus," Linguistic Data Consortium, 1993.

[34] A. A. Wrench, "A multichannel articulatory database and its application for automatic speech recognition," in Proceedings of the 5th Seminar on Speech Production, 2000, pp. 305–308.

[35] B. Singhal, A. R. Naini, and P. K. Ghosh, "wSPIRE: A parallel multi-device corpus in neutral and whispered speech," in 24th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), 2021, pp. 146–151.

[36] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[37] J. P. Campbell, "Speaker recognition: A tutorial," Proceedings of the IEEE, vol. 85, no. 9, pp. 1437–1462, 1997.

[38] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.

[39] F. Chollet et al., "Keras," 2015.

[40] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
