3 authors:
Victor Sreeram
University of Western Australia
School of Electrical, Electronic and Computer Engineering, The University of Western Australia
(UWA), Australia
yuanjun.zhao@research.uwa.edu.au, (roberto.togneri, victor.sreeram)@uwa.edu.au
10.21437/Interspeech.2018-1042
2. Adaptive weighting framework and clustering approach

This section describes the motivation for applying the weighting framework. The discussion starts with a treatment of the original CQCCs and SCCs, followed by the proposed adaptive weighting scheme and the clustering approach.

2.1. CQCC and SCC complementary features

In [10], the authors analyzed the effectiveness of the time-frequency representation and the constant Q transform for speech and music signals. Compared with conventional cepstral analysis, a conversion from geometric series space to linear space was adopted to match the equal-tempered scale corresponding to a bin spacing. The orthogonality of the discrete cosine transform (DCT) basis was preserved by linearising the frequency scale of the constant Q transform. The extraction of the CQCC applied a polyphase anti-aliasing filter and a spline interpolation to resample each signal at a uniform sampling rate. In effect, CQCCs are extracted in the traditional manner of MFCCs but with a new uniform resampling process. Owing to this adjustment of the frequency resampling, the CQCC achieves an impressive performance on the ASVspoof 2015 database.

In [11], the scattering spectrum [12], which has the same nature as cochlear filters, was used to create the SCC. In addition, its close relation to the modulation spectra also contributed to enhancing the spoofing detection ability. With a hierarchical structure based on wavelet filter-banks and modulus operators, raw speech signals were transformed into scalograms of different depths. The scattering coefficients at each level can then be estimated by windowing and computing the average value. Logarithms of the scattering coefficients from several levels were concatenated to build a feature vector. The SCC was finally obtained by taking a DCT of this vector.

The detection results on the evaluation set in [10] [11] show that the CQCC reaches a lower EER on the subset S10 than the SCC which, conversely, takes the lead on the other nine subsets. Specifically, the CQCC captures more information about artifacts hidden in unit boundaries than the SCC, and this leads to a significantly lower EER in detecting unit-selection based spoofing, S10. For S10, the EER is 1.07% with the CQCC compared to 3.94% with the SCC [11]. By contrast, an analysis of the F-ratios for the first three levels of the SCC reveals that the forged components in both the low frequency and the high frequency regions are discriminative. When the first several levels are used in conjunction, the accuracy of the detection increases greatly. For the subsets S1 to S9, the EER is 0.02% with the SCC compared to 0.17% with the CQCC [11].

These results confirm the complementary performance of the CQCC and the SCC and raise the question of whether the two can be successfully fused. We propose an approach to do just this, as described in the next section.

2.2. Proposed framework

In this work, the dataset used is the ASVspoof 2015 Corpus [3]. The detailed information of this database is shown in Table 1. Three sets are included, consisting of genuine (human) speech and spoofed speech. No channel or background noise effects are added. There are 10 types of spoofing attack, namely S1 to S10. Across the three speech sets, S1-S5 are the known attacks and S6-S10 are the unknown attacks. Note that the spoofed speech in the training and development sets consists only of the five known attacks, while the spoofed speech in the evaluation set is made up of unseen known attack speech data and five unknown attack speech data. The database protocol motivates the design of a clustering method for the task of assigning weighting factors. The training and the development sets are used for training and tuning the detector, while the evaluation set is for testing.

Table 1: Statistics of the ASVspoof 2015 database

Subset        #Speakers        #Utterances
              Male   Female    Genuine   Spoofed
Training      10     15        3750      12625
Development   15     20        3497      49875
Evaluation    20     26        9404      ≈ 200000

Figure 1 illustrates the scheme of the proposed adaptive weighting framework. The CQCCs and the SCCs of the raw speech signals from all the sets in the ASVspoof 2015 corpus are extracted with the same settings as in [10] [11]. Note that voice activity detection (VAD) is discarded here so that the silent regions are taken into consideration. In the weighting factor selection step, the training and development sets are used to tune the weighting factors $\alpha_i$ ($i = 1, 2, \ldots, 5$) for the five spoofing categories and the factor $\alpha_0$ for the human subset provided in the development protocol, which gives six weighting factors in total. The weighting formula is:

$$\mathrm{score}_{\mathrm{final}} = \alpha_i \cdot \mathrm{score}_{\mathrm{CQCC}} + (1 - \alpha_i) \cdot \mathrm{score}_{\mathrm{SCC}} \qquad (1)$$

where the scores are represented by the log-likelihood ratio (LLR). For each subset in the development set, we search over all the available values of the weighting factor $\alpha_i$ ranging from 0 to 1. When the EER is minimized by a certain weighted score, the associated weighting factor is fixed for that subset.

Figure 1: Scheme of the proposed adaptive weighting framework.

To assign a weighting factor to every utterance in the evaluation set without any prior knowledge, a new clustering method is proposed in this paper. The flowchart of the clustering process is shown in Figure 2. After the extraction of the CQCC, cepstral mean and variance normalization (CMVN) is applied. Then two universal background models (UBM) are trained, one for human speech and one for spoofed speech. Following this, the GMM mean supervectors are estimated by the EM algorithm. To reduce the computational complexity and obtain a more compact representation, locality preserving projections (LPP) [13] are employed. The newly generated short vectors are named
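The CMVN step applied to the CQCC features before UBM training can be illustrated with a short sketch. It assumes per-utterance normalization of each cepstral dimension, which the text does not specify, and the function name `cmvn` is hypothetical.

```python
from statistics import fmean, pstdev


def cmvn(frames, eps=1e-10):
    """Normalize each cepstral dimension to zero mean and unit variance.

    frames: list of equal-length feature vectors, one per frame of a single
    utterance (per-utterance statistics are an assumption of this sketch).
    """
    dims = list(zip(*frames))              # one tuple per feature dimension
    means = [fmean(d) for d in dims]
    stds = [pstdev(d) for d in dims]       # population standard deviation
    return [
        [(x - m) / (s + eps) for x, m, s in zip(frame, means, stds)]
        for frame in frames
    ]


# Three made-up frames with two coefficients each.
normalized = cmvn([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
for row in normalized:
    print([round(v, 4) for v in row])
```

The small `eps` only guards against division by zero for a constant dimension; it does not change the result noticeably otherwise.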
3. Experimental results

3.1. Experimental settings

3.1.1. Feature extraction

The original CQCC features are extracted with an open-source MATLAB toolkit (http://audio.eurecom.fr/content/software). The parameters are the same as the configuration in [10]. The maximum and the minimum frequency in the constant Q transform are set as $F_{max} = F_{sample}/2$ and $F_{min} = F_{max}/2^{9}$ respectively. The sampling frequency of the database is $F_{sample} = 16$ kHz. The number of octaves is 9 and the number of bins per octave $B$ is set to 96, which results in a time shift of 8 ms. The parameter $\gamma$ is set to $\gamma = \Gamma = 228.7 \cdot (2^{1/B} - 2^{-1/B})$, and $d = 16$ is the re-sampling period. The dimension of the CQCC static coefficients is set to 19, and appending C0 gives a length of 20. Acceleration coefficients, namely delta-delta (∆∆), are calculated and used in isolation. Later experiments are performed with only the CQCC acceleration coefficients (CQCC-A).

The scattering coefficients are extracted with the publicly available toolbox Scattering (http://www.di.ens.fr/data/scattering/). The first three levels of coefficients are concatenated together. A DCT is performed and the first 60 coefficients are retained as the final scattering cepstral coefficients. The voice activity detector applied in [11] is discarded in this work.

Figure 3: Cluster distribution of the evaluation set

Comparing Figure 3 with Figure 4 shows a clear correlation between the five unknown attacks and the known attacks. It is also noted that S3 and S4 overlap in most parts because they are generated by the same spoofing technique with different amounts of data. In addition, the subset S10 appears to overlap the vast majority of the human subset. The spoofed speech of subset S10 is fabricated by unit-selection synthesis, and the sub-word units come from databases of natural speech. This is the main reason that most published spoofing detection systems perform poorly in differentiating S10 from the human subset. The result of the mapping in Figure 4 confirms our assumption in Section 2. That is, the spoofed speech in the evaluation set, across both known and unknown attacks, contains similar information to the known attacks in the training and development sets. Given this condition, it can be re-clustered into the existing labeled clusters (e.g. known S2 and unknown S7 map to known S2).
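The constant Q transform settings quoted in Section 3.1.1 can be checked with a few lines of arithmetic. The variable names are illustrative, and the total bin count is simply derived here as octaves times bins per octave; it is not a figure stated in the text.

```python
# Values quoted in Section 3.1.1.
F_SAMPLE = 16000.0        # sampling frequency in Hz
N_OCTAVES = 9             # number of octaves
BINS_PER_OCTAVE = 96      # B

f_max = F_SAMPLE / 2                       # highest analysed frequency
f_min = f_max / 2 ** N_OCTAVES             # lowest analysed frequency
gamma = 228.7 * (2 ** (1 / BINS_PER_OCTAVE) - 2 ** (-1 / BINS_PER_OCTAVE))
n_bins = N_OCTAVES * BINS_PER_OCTAVE       # derived: total number of CQT bins

print(f"F_max  = {f_max:.1f} Hz")
print(f"F_min  = {f_min:.3f} Hz")
print(f"gamma  = {gamma:.4f}")
print(f"n_bins = {n_bins}")
```

With these settings the lowest analysed frequency is 15.625 Hz and $\gamma$ evaluates to roughly 3.3.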
Table 2: The EERs (%) of the development set and the evaluation set of the ASVspoof 2015 corpus.
[4] Z. Wu, P. L. D. Leon, C. Demiroglu, A. Khodabakhsh, S. King, Z. H. Ling, D. Saito, B. Stewart, T. Toda, M. Wester, and J. Yamagishi, “Anti-spoofing for text-independent speaker verification: An initial database, comparison of countermeasures, and human performance,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 4, pp. 768–783, April 2016.
[5] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, “Spoofing and countermeasures for speaker verification: A survey,” Speech Communication, vol. 66, pp. 130–153, 2015.
[6] D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[7] X. Tian, Z. Wu, X. Xiao, E. S. Chng, and H. Li, “Spoofing detection from a feature representation perspective,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 2119–2123.
[8] Y. Zhao, R. Togneri, and V. Sreeram, “Compressed high dimensional features for speaker spoofing detection,” in 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Dec 2017, pp. 569–572.
[9] T. B. Patel and H. A. Patil, “Combining evidences from mel cepstral, cochlear filter cepstral and instantaneous frequency features for detection of natural vs. spoofed speech,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[10] M. Todisco, H. Delgado, and N. Evans, “Constant Q cepstral coefficients: A spoofing countermeasure for automatic speaker verification,” Computer Speech and Language, vol. 45, pp. 516–535, Sep 2017.
[11] K. Sriskandaraja, V. Sethu, E. Ambikairajah, and H. Li, “Front-end for antispoofing countermeasures in speaker verification: Scattering spectral decomposition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 632–643, June 2017.
[12] J. Andén and S. Mallat, “Deep scattering spectrum,” IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4114–4128, Aug 2014.
[13] Y. Tang and R. Rose, “A study of using locality preserving projections for feature extraction in speech recognition,” in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, March 2008, pp. 1569–1572.
[14] S. O. Sadjadi, M. Slaney, and L. Heck, “MSR Identity Toolbox v1.0: A MATLAB toolbox for speaker-recognition research,” Speech and Language Processing Technical Committee Newsletter, vol. 1, no. 4, pp. 1–32, 2013.