
Speech Communication 85 (2016) 127–130


Evaluation of Batvox 4.1 under conditions reflecting those of a real forensic voice comparison case (forensic_eval_01)

David van der Vloed

Netherlands Forensic Institute, Laan van Ypenburg 6, 2497 GB Den Haag, Netherlands

Article history:
Accepted 6 October 2016
Available online 14 October 2016

1. Introduction

The present paper reports on an evaluation of a commercial forensic voice comparison system, Batvox version 4.1 by Agnitio, under conditions reflecting those of a real forensic voice comparison case (forensic_eval_01). The introduction to the evaluation, rules for the evaluation, description of the training and test data, and explanation of the performance metrics and graphics appear in the introduction to the virtual special issue of which this paper is a part (Morrison and Enzinger, 2016). The research presented here is also part of the validation procedure of Batvox that is being conducted at the Netherlands Forensic Institute (NFI). The purpose of such evaluations is validation of automatic forensic voice comparison for use in casework.

2. Description and use of Batvox

Version 4.1 of Batvox was tested. The following description represents our understanding of its basic architecture based on publicly available information, user manuals, and responses from Agnitio with respect to an earlier draft of this description.

Acoustic information is extracted from recordings in the form of mel frequency cepstral coefficients (MFCCs; Davis and Mermelstein, 1980), deltas, and double deltas (Furui, 1986). Cepstral mean subtraction (CMS; Furui, 1981), relative spectral filtering (RASTA; Hermansky and Morgan, 1994), and feature warping (Pelecanos and Sridharan, 2001) are applied as feature-level mismatch compensation techniques. These types of features and mismatch compensation techniques are widely used in automatic speaker recognition. The system calculates scores via i-vectors (Dehak et al., 2011) and probabilistic linear discriminant analysis (PLDA; Prince and Elder, 2007; Kenny, 2010; Garcia-Romero and Espy-Wilson, 2011). These are widely used statistical modelling procedures in automatic speaker recognition. The data used to train the statistical models (the universal background model and the T matrix for generating the i-vectors, and PLDA) came from a large database containing a diverse set of speakers and recording conditions. These data are not case specific. The product shipped to customers includes models trained using these data, not the raw data themselves.

The user has the option of entering case-specific "reference population" data. These data should represent the relevant population and the conditions of the known-speaker recording from the case. We entered 105 recordings, one known-speaker-condition recording from each of the 105 speakers in the training set that was provided with the data for the experiments. Batvox provides the option of using a subset of these recordings. Two variants of the system were tested, one using all 105 recordings and one using 30 recordings selected by Batvox. Batvox selects the members of the subset based on their similarity with the known-speaker recording (the Kullback–Leibler, KL, divergence is used as a similarity metric; Kullback and Leibler, 1951; Ramos-Castro et al., 2007). Use of the subset option may be more advantageous if the user enters a relatively diverse set of reference data, as opposed to the carefully selected training data provided for the evaluation. Batvox ran the selection process afresh for each known-speaker recording in the test set.

The user also has the option of entering a set of "imposter recordings", which should represent the relevant population and the conditions of the questioned-speaker recording from the case. We tested two variants, one without imposter recordings and another using one questioned-speaker-condition recording from each of the 105 speakers in the training set. Since the use of imposters is recommended by Agnitio when dealing with mismatched material, as is the case here, we expected that using the imposter set would result in better performance from Batvox.

This paper is part of the Virtual Special Issue entitled: Multi-laboratory evaluation of forensic voice comparison systems under conditions reflecting those of a real forensic case (forensic_eval_01), http://www.sciencedirect.com/science/journal/01676393/vsi, Guest Edited by G. S. Morrison and E. Enzinger.
E-mail addresses: david@holmes.nl, d.van.der.vloed@nfi.minvenj.nl

http://dx.doi.org/10.1016/j.specom.2016.10.001
0167-6393/© 2016 Elsevier B.V. All rights reserved.
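To make the feature-level processing described in Section 2 concrete, the following is a minimal NumPy sketch of cepstral mean subtraction and delta/double-delta computation. It illustrates the generic techniques the paper cites, not Batvox's implementation: the 13-coefficient MFCC matrix, the delta window width, and the processing order are assumptions, and RASTA filtering and feature warping are omitted.

```python
import numpy as np

def deltas(feats, width=2):
    # Regression-based delta coefficients over +/- `width` frames
    # (in the spirit of Furui, 1986); frames are rows.
    n_frames = len(feats)
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    num = sum(
        k * (padded[width + k : width + k + n_frames]
             - padded[width - k : width - k + n_frames])
        for k in range(1, width + 1)
    )
    return num / (2 * sum(k * k for k in range(1, width + 1)))

def cms(feats):
    # Cepstral mean subtraction: removing the per-coefficient mean over
    # the recording cancels stationary convolutional channel effects.
    return feats - feats.mean(axis=0, keepdims=True)

# Toy "MFCC matrix": 100 frames x 13 coefficients, with a constant
# channel offset that CMS removes.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(100, 13)) + 5.0
normed = cms(mfcc)
# Stack statics, deltas, and double deltas: 13 + 13 + 13 = 39 features.
full = np.hstack([normed, deltas(normed), deltas(deltas(normed))])
print(full.shape)  # (100, 39)
```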

Table 1
Exact values of the accuracy and precision metrics shown in Fig. 1.

Reference data   Imposters   Cllr pooled   Cllr mean   95% CI
Selected         None        0.646         0.604       1.382
All              None        0.456         0.391       1.477
Selected         All         0.431         0.377       1.148
All              All         0.365         0.304       1.156
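As background for the Cllr columns above, the log-likelihood-ratio cost can be computed from pooled same-speaker and different-speaker likelihood ratios. A minimal sketch follows; the toy LR values are invented for illustration and are not data from this evaluation.

```python
import numpy as np

def cllr(lr_same, lr_diff):
    # Log-likelihood-ratio cost (Cllr): mean information loss, in bits,
    # penalising same-speaker comparisons with low LRs and
    # different-speaker comparisons with high LRs.
    lr_same = np.asarray(lr_same, dtype=float)
    lr_diff = np.asarray(lr_diff, dtype=float)
    return 0.5 * (np.mean(np.log2(1.0 + 1.0 / lr_same))
                  + np.mean(np.log2(1.0 + lr_diff)))

# A system that always outputs LR = 1 conveys no information: Cllr = 1.
print(cllr([1.0, 1.0, 1.0], [1.0, 1.0, 1.0]))  # 1.0
# Strong, well-oriented LRs drive Cllr towards 0.
print(cllr([100.0, 50.0, 200.0], [0.01, 0.02, 0.005]))
```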

The factorial combination of full reference set versus subset, and no imposter set versus imposter set, resulted in four system variants.

Batvox compares each of the recordings in the reference set/subset with the questioned-speaker recording to generate a set of different-speaker scores. When there is a single known-speaker recording (as in the conditions being tested here), Batvox cuts the known-speaker recording into four non-overlapping parts (quarters), and compares these to generate same-speaker scores. Three of the quarters are concatenated and used as a known-speaker recording, and the other quarter is used as a questioned-speaker recording. Batvox cycles through using the first through fourth quarter of the original recording as the questioned-speaker recording, and ends up with four same-speaker scores (see González-Rodríguez et al., 2006).

For mismatch compensation in the score domain, a T-norm transformation is calculated using the mean and standard deviation of the different-speaker scores, and is applied to all scores. A new T-norm transformation is calculated for each questioned-speaker recording in the test set. If an imposter set is provided, the known-speaker recording is compared with each imposter recording, the mean and standard deviation of the resulting scores are calculated, and a Z-norm transformation is applied to all scores. A new Z-norm transformation is calculated for each known-speaker recording in the test set. The Z-norm transformation is calculated on and applied to scores which have already been T-norm transformed. A unique T-norm Z-norm pair is therefore applied to each pair of test recordings. See Auckenthaler et al. (2000) for details of T-norm and Z-norm.

A Bayesian approach (Brümmer and Swart, 2014) is used to calculate posterior probability densities consisting of a Gaussian distribution for different-speaker scores and a Gaussian distribution for same-speaker scores. The prior probability distributions are such that when the amount of training data is small (e.g., 4 same-speaker scores) the resulting posterior probability distribution is substantially wider and has a substantially lower mean value than would a distribution calculated solely on the data.¹

The score for the comparison of the known-speaker recording and questioned-speaker recording is calculated. The T-norm and Z-norm transformations are applied. The relative likelihood of the same-speaker versus the different-speaker posterior distributions at the value of the transformed known-speaker versus questioned-speaker score is then calculated. The resulting likelihood ratio is reported as the output of the system.

3. Results

Accuracy and precision metrics, log likelihood ratio costs and credible intervals (Cllr mean versus 95% CI, and Cllr pooled), are represented graphically in Fig. 1, and their numeric values are provided in Table 1. Tippett plots, a Detection Error Tradeoff (DET) plot, and Empirical Cross-Entropy (ECE) plots are provided in Figs. 2 through 5.

Fig. 1. Plot showing Cllr mean versus 95% CI (left panel) and Cllr pooled (right panel).

Fig. 2. Tippett plots (no precision). [Four panels: all reference data + no impostor data; all reference data + impostor data; selected reference data + no impostor data; selected reference data + impostor data. x-axis: log10 likelihood ratio; y-axis: cumulative proportion.]

Fig. 3. Tippett plots (with precision). [Same four panels and axes as Fig. 2.]

Fig. 4. DET plot. [x-axis: false alarm rate (in %); y-axis: miss rate (in %); one curve per system variant.]

Fig. 5. ECE plots. [Four panels, one per system variant; x-axis: log10 prior odds; y-axis: empirical cross-entropy; curves: LR values, after PAV, LR = 1 always.]

4. Discussion and conclusion

The performance of Batvox 4.1 was evaluated under conditions reflecting those of a real forensic case.

Better performance was achieved when recordings from all 105 speakers in the training set were used as "reference population" data than when a subset of 30 of these were selected by Batvox. The recordings in the training data had previously been selected to be a sample representative of the relevant population for the case. When this is done it is probably not appropriate to have Batvox further select and reduce the size of the sample; the option may have been designed with a more diverse set of input data in mind. A bias towards lower valued log likelihood ratios is apparent in all the Tippett plots, but it is much larger when the subset of 30 recordings was used.

Better performance was also achieved when an "imposter" set of recordings from all 105 speakers in the training set was used than when no "imposter" set was used.

The best performance was achieved by the system variant which included recordings from all speakers in the "reference population" data and recordings from all speakers in the "imposter" set. Although one should be cautious about generalising from the performance observed under the conditions of one case to the conditions of other cases, the results suggest that if one has a sample of recordings of speakers representative of the relevant population for the case and reflecting the known-speaker and questioned-speaker speaking styles and recording conditions for the case, then best practice for using Batvox 4.1 would be to use known-speaker-condition recordings from all speakers in the sample as "reference population" data and to use questioned-speaker-condition recordings from all speakers in the sample as "imposter" data.

¹ In previous versions of Batvox, to compensate for the fact that there is no between-session or recording-condition mismatch between the parts of the recording used to calculate the same-speaker scores, a within-source degradation procedure was applied (González-Rodríguez et al., 2006). This is not used in version 4.1. It is implied that for the new version this problem is addressed by the Bayesian approach employed.

References

Auckenthaler, R., Carey, M., Lloyd-Thomas, H., 2000. Score normalization for text-independent speaker verification systems. Digital Signal Process. 10, 42–54. http://dx.doi.org/10.1006/dspr.1999.0360.
Brümmer, N., Swart, A., 2014. Bayesian calibration for forensic evidence reporting. In: Proc. Fifteenth Annual Conference of the International Speech Communication Association (Interspeech). Singapore, pp. 388–392.
Davis, S., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Proc. 28, 357–366. http://dx.doi.org/10.1109/TASSP.1980.1163420.
Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P., Ouellet, P., 2011. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Proc. 19, 788–798. http://dx.doi.org/10.1109/TASL.2010.2064307.
Furui, S., 1981. Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoust. Speech Signal Proc. 29, 254–272. http://dx.doi.org/10.1109/TASSP.1981.1163530.
Furui, S., 1986. Speaker-independent isolated word recognition using dynamic features of speech spectrum. IEEE Trans. Acoust. Speech Sig. Proc. 34, 52–59. http://dx.doi.org/10.1109/TASSP.1986.1164788.
Garcia-Romero, D., Espy-Wilson, C., 2011. Analysis of i-vector length normalization in speaker recognition systems. In: Proc. Interspeech. Florence, Italy, pp. 249–252.
González-Rodríguez, J., Drygajlo, A., Ramos-Castro, D., Garcia-Gomar, M., Ortega-García, J., 2006. Robust estimation, interpretation and assessment of likelihood ratios in forensic speaker recognition. Comput. Speech Lang. 20, 331–355. http://dx.doi.org/10.1016/j.csl.2005.08.005.
Hermansky, H., Morgan, N., 1994. RASTA processing of speech. IEEE Trans. Speech Audio Proc. 2, 578–589. http://dx.doi.org/10.1109/89.326616.
Kenny, P., 2010. Bayesian speaker verification with heavy tailed priors. In: Proc. Odyssey 2010 – The Speaker and Language Recognition Workshop. Brno, Czech Republic.
Kullback, S., Leibler, R.A., 1951. On information and sufficiency. Ann. Math. Stat. 22 (1), 79–86. http://dx.doi.org/10.1214/aoms/1177729694.
Morrison, G.S., Enzinger, E., 2016. Multi-laboratory evaluation of forensic voice comparison systems under conditions reflecting those of a real forensic case (forensic_eval_01) – Introduction. Speech Communication. http://dx.doi.org/10.1016/j.specom.2016.07.006.
Pelecanos, J., Sridharan, S., 2001. Feature warping for robust speaker verification. In: Proceedings of Odyssey 2001: The Speaker Recognition Workshop. Crete, Greece, pp. 213–218.
Prince, S.J.D., Elder, J.H., 2007. Probabilistic linear discriminant analysis for inferences about identity. In: IEEE 11th International Conference on Computer Vision (ICCV), pp. 1–8. http://dx.doi.org/10.1109/ICCV.2007.4409052.
Ramos-Castro, D., Fierrez-Aguilar, J., González-Rodríguez, J., Ortega-García, J., 2007. Speaker verification using speaker- and test-dependent fast score normalization. Pattern Recognit. Lett. 28, 90–98. http://dx.doi.org/10.1016/j.patrec.2006.06.008.
