Speech Communication
journal homepage: www.elsevier.com/locate/specom
Article info
Article history:
Accepted 6 October 2016
Available online 14 October 2016
1. Introduction

The present paper reports on an evaluation of a commercial forensic voice comparison system, Batvox version 4.1 by Agnitio, under conditions reflecting those of a real forensic voice comparison case (forensic_eval_01). The introduction to the evaluation, the rules for the evaluation, the description of the training and test data, and the explanation of the performance metrics and graphics appear in the introduction to the virtual special issue of which this paper is a part (Morrison and Enzinger, 2016). The research presented here is also part of the validation procedure for Batvox being conducted at the Netherlands Forensic Institute (NFI). The purpose of such evaluations is to validate automatic forensic voice comparison for use in casework.

2. Description and use of Batvox

Version 4.1 of Batvox was tested. The following description represents our understanding of its basic architecture, based on publicly available information, user manuals, and responses from Agnitio to an earlier draft of this description.

Acoustic information is extracted from recordings in the form of mel frequency cepstral coefficients (MFCCs; Davis and Mermelstein, 1980), deltas, and double deltas (Furui, 1986). Cepstral mean subtraction (CMS; Furui, 1981), relative spectral filtering (RASTA; Hermansky and Morgan, 1994), and feature warping (Pelecanos and Sridharan, 2001) are applied as feature-level mismatch compensation techniques. These types of features and mismatch compensation techniques are widely used in automatic speaker recognition. The system calculates scores via i-vectors (Dehak et al., 2011) and probabilistic linear discriminant analysis (PLDA; Prince and Elder, 2007; Kenny, 2010; Garcia-Romero and Espy-Wilson, 2011). These are widely used statistical modelling procedures in automatic speaker recognition. The data used to train the statistical models (the universal background model, the T matrix for generating the i-vectors, and PLDA) came from a large database containing a diverse set of speakers and recording conditions. These data are not case specific. The product shipped to customers includes models trained using these data, not the raw data themselves.

The user has the option of entering case-specific "reference population" data. These data should represent the relevant population and the conditions of the known-speaker recording from the case. We entered 105 recordings: one known-speaker-condition recording from each of the 105 speakers in the training set that was provided with the data for the experiments. Batvox provides the option of using a subset of these recordings. Two variants of the system were tested, one using all 105 recordings and one using 30 recordings selected by Batvox. Batvox selects the members of the subset based on their similarity to the known-speaker recording, using the Kullback–Leibler (KL) divergence as the similarity metric (Kullback and Leibler, 1951; Ramos-Castro et al., 2007). Use of the subset option may be more advantageous when the user enters a relatively diverse set of reference data, as opposed to the carefully selected training data provided for the evaluation. Batvox ran the selection process afresh for each known-speaker recording in the test set.

The user also has the option of entering a set of "imposter recordings", which should represent the relevant population and the conditions of the questioned-speaker recording from the case. We tested two variants, one without imposter recordings and another using one questioned-speaker-condition recording from each of the 105 speakers in the training set. Since the use of imposters is recommended by Agnitio when dealing with mismatched material, as is the case here, we expected that using the imposter set would result in better performance from Batvox.

☆ This paper is part of the Virtual Special Issue entitled: Multi-laboratory evaluation of forensic voice comparison systems under conditions reflecting those of a real forensic case (forensic_eval_01), http://www.sciencedirect.com/science/journal/01676393/vsi, Guest Edited by G. S. Morrison and E. Enzinger.
E-mail addresses: david@holmes.nl, d.van.der.vloed@nfi.minvenj.nl
http://dx.doi.org/10.1016/j.specom.2016.10.001
0167-6393/© 2016 Elsevier B.V. All rights reserved.
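Batvox's subset selection, as described above, ranks candidate reference recordings by KL divergence from the known-speaker recording. Agnitio's internal computation is not publicly documented; as a hedged illustration only, the sketch below uses the standard closed-form KL divergence between two diagonal-covariance Gaussians (a common simplification when comparing speaker models) to rank two hypothetical reference recordings:

```python
import numpy as np

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for two diagonal-covariance Gaussians.

    Illustrative only: this is the textbook closed-form expression,
    not Batvox's actual (undocumented) similarity computation.
    """
    mu_p, var_p = np.asarray(mu_p, float), np.asarray(var_p, float)
    mu_q, var_q = np.asarray(mu_q, float), np.asarray(var_q, float)
    return 0.5 * np.sum(
        np.log(var_q / var_p)          # log-determinant ratio
        + var_p / var_q                # trace term
        + (mu_p - mu_q) ** 2 / var_q   # Mahalanobis (mean-shift) term
        - 1.0
    )

# Hypothetical known-speaker model and two candidate reference
# recordings (names and values invented for illustration):
known = (np.array([0.0, 1.0]), np.array([1.0, 1.0]))
refs = {
    "rec_a": (np.array([0.1, 1.1]), np.array([1.0, 1.0])),
    "rec_b": (np.array([3.0, -2.0]), np.array([0.5, 2.0])),
}
# Smaller divergence = more similar to the known-speaker recording.
ranked = sorted(refs, key=lambda r: kl_diag_gaussians(*refs[r], *known))
```

Under this sketch, `rec_a` (whose model is close to the known-speaker model) would be preferred over `rec_b` for inclusion in the subset.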
128 D. van der Vloed / Speech Communication 85 (2016) 127–130
Table 1
Exact values of the accuracy and precision metrics shown in Fig. 1. [Table body not recoverable from this extraction.]
[Figs. 2 and 3 (Tippett plots): cumulative proportion versus log10 likelihood ratio for the four system variants: all reference data + no impostor data; all reference data + impostor data; selected reference data + no impostor data; selected reference data + impostor data.]
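The curves in a Tippett plot are empirical cumulative proportions of the same-speaker and different-speaker log10 likelihood ratios. As a minimal sketch under one common plotting convention (synthetic scores, not the evaluation's actual data):

```python
import numpy as np

def tippett_curves(ss_llr, ds_llr, grid=None):
    """Empirical Tippett-plot curves (one common convention):
    proportion of same-speaker log10 LRs >= x, and proportion of
    different-speaker log10 LRs <= x, evaluated on a grid of x.
    Synthetic illustration; conventions vary between authors."""
    ss = np.sort(np.asarray(ss_llr, float))
    ds = np.sort(np.asarray(ds_llr, float))
    if grid is None:
        grid = np.linspace(-4, 4, 81)  # matches the plots' x-range
    # searchsorted counts elements below/at-or-below each grid point.
    ss_curve = 1.0 - np.searchsorted(ss, grid, side="left") / ss.size
    ds_curve = np.searchsorted(ds, grid, side="right") / ds.size
    return grid, ss_curve, ds_curve

# Synthetic, well-separated score distributions for illustration:
rng = np.random.default_rng(0)
grid, ss_c, ds_c = tippett_curves(rng.normal(1.0, 1.0, 500),
                                  rng.normal(-1.0, 1.0, 500))
```

The bias towards lower-valued log likelihood ratios discussed below would appear in such curves as both curves shifting left relative to log10 LR = 0.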
3. Results

Accuracy and precision metrics, log likelihood ratio costs and credible intervals (Cllr mean with 95% CI, and Cllr pooled), are represented graphically in Fig. 1, and their numeric values are provided in Table 1. Tippett plots, a Detection Error Tradeoff (DET) plot, and Empirical Cross-Entropy (ECE) plots are provided in Figs. 2 through 5.

[Fig. 4 (DET plot): miss rate versus false alarm rate (in %), one curve per system variant.]

4. Discussion and conclusion

The performance of Batvox 4.1 was evaluated under conditions reflecting those of a real forensic case.

Better performance was achieved when recordings from all 105 speakers in the training set were used as "reference population" data than when a subset of 30 of these was selected by Batvox. The recordings in the training data had previously been selected to be a sample representative of the relevant population for the case. When this is done, it is probably not appropriate to have Batvox further select and reduce the size of the sample; the option may have been designed with a more diverse set of input data in mind. A bias towards lower-valued log likelihood ratios is apparent in all the Tippett plots, but it is much larger when the subset of 30 recordings was used.
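The Cllr values reported in Table 1 are log-likelihood-ratio costs. The standard formula is well established; the sketch below applies it to hypothetical likelihood ratios (not the values underlying Table 1):

```python
import numpy as np

def cllr(lr_same, lr_diff):
    """Log-likelihood-ratio cost:
    Cllr = 0.5 * (mean(log2(1 + 1/LR_ss)) + mean(log2(1 + LR_ds))).
    A well-calibrated, discriminating system drives Cllr towards 0;
    an uninformative system (LR = 1 always) gives Cllr = 1."""
    lr_same = np.asarray(lr_same, float)
    lr_diff = np.asarray(lr_diff, float)
    return 0.5 * (np.mean(np.log2(1.0 + 1.0 / lr_same))
                  + np.mean(np.log2(1.0 + lr_diff)))

# Hypothetical scores for illustration:
print(cllr([1.0, 1.0], [1.0, 1.0]))       # uninformative -> 1.0
print(cllr([100.0, 50.0], [0.01, 0.02]))  # good system -> near 0
```

Miscalibrated systems can exceed 1, which is why the metric penalises both poor discrimination and poor calibration.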
Better performance was also achieved when an "imposter" set of recordings from all 105 speakers in the training set was used than when no "imposter" set was used.

The best performance was achieved by the system variant which included recordings from all speakers in the "reference population" data and recordings from all speakers in the "imposter" set. Although one should be cautious about generalising from the performance observed under the conditions of one case to the conditions of other cases, the results suggest that if one has a sample of recordings of speakers representative of the relevant population for the case, and reflecting the known-speaker and questioned-speaker speaking styles and recording conditions for the case, then best practice for using Batvox 4.1 would be to use known-speaker-condition recordings from all speakers in the sample as "reference population" data and to use questioned-speaker-condition recordings from all speakers in the sample as "imposter" data.

[Fig. 5 (ECE plots): empirical cross-entropy versus log10 prior odds for the four system variants; each panel shows the LR values, after-PAV, and LR = 1 always curves.]

1. In previous versions of Batvox, to compensate for the fact that there is no between-session or recording-condition mismatch between the parts of the recording used to calculate the same-speaker scores, a within-source degradation procedure was applied (González-Rodríguez et al., 2006). This is not used in version 4.1; it is implied that in the new version this problem is addressed by the Bayesian approach employed.

References

Auckenthaler, R., Carey, M., Lloyd-Thomas, H., 2000. Score normalization for text-independent speaker verification systems. Digital Signal Process. 10, 42–54. http://dx.doi.org/10.1006/dspr.1999.0360.
Brümmer, N., Swart, A., 2014. Bayesian calibration for forensic evidence reporting. In: Proc. Fifteenth Annual Conference of the International Speech Communication Association (Interspeech). Singapore, pp. 388–392.
Davis, S., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Proc. 28, 357–366. http://dx.doi.org/10.1109/TASSP.1980.1163420.
Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P., Ouellet, P., 2011. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Proc. 19, 788–798. http://dx.doi.org/10.1109/TASL.2010.2064307.
Furui, S., 1981. Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoust. Speech Signal Proc. 29, 254–272. http://dx.doi.org/10.1109/TASSP.1981.1163530.
Furui, S., 1986. Speaker-independent isolated word recognition using dynamic features of speech spectrum. IEEE Trans. Acoust. Speech Signal Proc. 34, 52–59. http://dx.doi.org/10.1109/TASSP.1986.1164788.
Garcia-Romero, D., Espy-Wilson, C., 2011. Analysis of i-vector length normalization in speaker recognition systems. In: Proc. Interspeech. Florence, Italy, pp. 249–252.
González-Rodríguez, J., Drygajlo, A., Ramos-Castro, D., Garcia-Gomar, M., Ortega-García, J., 2006. Robust estimation, interpretation and assessment of likelihood ratios in forensic speaker recognition. Comput. Speech Lang. 20, 331–355. http://dx.doi.org/10.1016/j.csl.2005.08.005.
Hermansky, H., Morgan, N., 1994. RASTA processing of speech. IEEE Trans. Speech Audio Proc. 2, 578–589. http://dx.doi.org/10.1109/89.326616.
Kenny, P., 2010. Bayesian speaker verification with heavy tailed priors. In: Proc. Odyssey 2010: The Speaker and Language Recognition Workshop. Brno, Czech Republic.
Kullback, S., Leibler, R.A., 1951. On information and sufficiency. Ann. Math. Stat. 22 (1), 79–86. http://dx.doi.org/10.1214/aoms/1177729694.
Morrison, G.S., Enzinger, E., 2016. Multi-laboratory evaluation of forensic voice comparison systems under conditions reflecting those of a real forensic case (forensic_eval_01) – Introduction. Speech Communication. http://dx.doi.org/10.1016/j.specom.2016.07.006.
Pelecanos, J., Sridharan, S., 2001. Feature warping for robust speaker verification. In: Proc. Odyssey 2001: The Speaker Recognition Workshop. Crete, Greece, pp. 213–218.
Prince, S.J.D., Elder, J.H., 2007. Probabilistic linear discriminant analysis for inferences about identity. In: Proc. IEEE 11th International Conference on Computer Vision (ICCV), pp. 1–8. http://dx.doi.org/10.1109/ICCV.2007.4409052.
Ramos-Castro, D., Fierrez-Aguilar, J., González-Rodríguez, J., Ortega-García, J., 2007. Speaker verification using speaker- and test-dependent fast score normalization. Pattern Recognit. Lett. 28, 90–98. http://dx.doi.org/10.1016/j.patrec.2006.06.008.