Recognition
Jiajun Deng1∗ , Xurong Xie2∗ , Tianzi Wang1 , Mingyu Cui1 , Boyang Xue1 , Zengrui Jin1 ,
Mengzhe Geng1 , Guinan Li1 , Xunying Liu1 , Helen Meng1
1 The Chinese University of Hong Kong, Hong Kong SAR, China
2 Institute of Software, Chinese Academy of Sciences, China
{jjdeng,tzwang,mycui,byxue,zrjin,mzgeng,gnli,xyliu,hmmeng}@se.cuhk.edu.hk, xurong@iscas.ac.cn
Abstract
A key challenge for automatic speech recognition (ASR) systems is to model the speaker level variability. In this paper,
maximum likelihood linear regression [9], and vocal tract length normalization [10,11], aim to produce speaker-independent (SI) features by applying feature transform in the acoustic front-ends. 3) Model-based adaptation that uses separate SD model
arXiv:2206.12045v1 [eess.AS] 24 Jun 2022
[Figure 1 diagram. Recoverable block labels: SpecAug, Convolution Subsampling, Linear, Encoder Block, Layer Norm, CTC Module, Decoder Module, Confidence Estimation Module, Predictor, Data selection, E2E ASR; speaker-dependent LHUC transform with Bayesian sampling or deterministic parameters and a 2Sigmoid activation.]
Figure 1: An example Conformer E2E architecture is shown in the grey box at the bottom centre. Its encoder internal components are
also shown in the green coloured box on the top right. The standard Transformer encoder without the additional Convolutional module
is shown in the light yellow coloured box, top left. Bayesian estimation or fixed value, deterministic LHUC speaker adaptation are
shown on the left and right of the red coloured box in the bottom left corner.
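The two LHUC variants named in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the common LHUC formulation in which each hidden dimension is scaled by 2·sigmoid(r) for a speaker-dependent parameter vector r, and it approximates the Bayesian variant by averaging over samples drawn from a Gaussian with mean `mu` and standard deviation `sigma`. The function names, the toy dimensions and the sample count are all hypothetical, and this is not the authors' implementation.

```python
import math
import random


def lhuc_scale(r):
    """Map an unconstrained LHUC parameter to a scaling factor in (0, 2)."""
    return 2.0 / (1.0 + math.exp(-r))


def apply_lhuc(hidden, r):
    """Deterministic LHUC: scale every hidden dimension by 2*sigmoid(r).

    hidden: list of frames, each a list of hidden_dim floats.
    r: list of hidden_dim speaker-dependent parameters.
    """
    scales = [lhuc_scale(x) for x in r]
    return [[h * s for h, s in zip(frame, scales)] for frame in hidden]


def apply_bayesian_lhuc(hidden, mu, sigma, rng, num_samples=1):
    """Bayesian LHUC (assumed Gaussian posterior): average the adapted
    outputs over parameters sampled from N(mu, sigma^2)."""
    total = [[0.0] * len(frame) for frame in hidden]
    for _ in range(num_samples):
        r = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
        sampled_out = apply_lhuc(hidden, r)
        for t, frame in enumerate(sampled_out):
            for d, value in enumerate(frame):
                total[t][d] += value / num_samples
    return total


# Toy input: 3 frames with 4 hidden units each.
hidden = [[1.0] * 4 for _ in range(3)]

# r = 0 gives a scale of exactly 1.0, so the hidden outputs are unchanged.
adapted = apply_lhuc(hidden, [0.0] * 4)

# Sampled variant with a small, assumed posterior standard deviation.
rng = random.Random(0)
sampled = apply_bayesian_lhuc(hidden, [0.0] * 4, [0.1] * 4, rng, num_samples=8)
```

With `sigma` set to zero the sampled variant reduces to the deterministic one, which is one way to sanity-check such a sketch.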
[Figure 2 plots. Three panels (Hub5'00, RT02, RT03); x-axis: the amounts (%) of adaptation utterances, 20 to 100; y-axis: WER (%).]
Figure 2: Performance (WER%) of adapted LHUC-SAT Conformer systems using various amounts of adaptation data selected using
different methods (top down over legends): ground truth WER “Ground Truth WER”; original Conformer decoder attention Softmax
output probabilities “Att. Prob”, or their combination with CTC “Att+CTC Prob”; the CEM smoothed attention scores “CEM Att.
Prob”. “+Bayesian” denotes Bayesian LHUC test adaptation.
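The speaker level, percentile based selection evaluated in Figure 2 can be sketched as follows. The sketch assumes each utterance already carries a confidence score (for example, a CEM smoothed attention probability) and keeps the top-ranked share of every speaker's utterances. The function name, the tuple layout, the ceiling rounding and the rule of keeping at least one utterance per speaker are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import defaultdict


def select_top_confident(utterances, keep_percent):
    """Keep the top `keep_percent` of each speaker's utterances by confidence.

    utterances: iterable of (speaker_id, utterance_id, confidence) tuples.
    Returns a dict mapping speaker_id to utterance ids, most confident first.
    """
    by_speaker = defaultdict(list)
    for speaker, utt, confidence in utterances:
        by_speaker[speaker].append((confidence, utt))

    selected = {}
    for speaker, items in by_speaker.items():
        # Rank this speaker's utterances from most to least confident.
        items.sort(key=lambda item: item[0], reverse=True)
        # Assumed rounding: ceiling, and never fewer than one utterance.
        n_keep = max(1, math.ceil(len(items) * keep_percent / 100.0))
        selected[speaker] = [utt for _, utt in items[:n_keep]]
    return selected


# Toy data with made-up confidence scores.
toy = [
    ("spkA", "a1", 0.9), ("spkA", "a2", 0.4), ("spkA", "a3", 0.8),
    ("spkA", "a4", 0.6), ("spkA", "a5", 0.7),
    ("spkB", "b1", 0.3), ("spkB", "b2", 0.5),
]
subset = select_top_confident(toy, keep_percent=80)
# subset["spkA"] == ["a1", "a3", "a5", "a4"]; subset["spkB"] == ["b2", "b1"]
```

Ranking within each speaker, rather than over the whole test set, keeps some adaptation data for every speaker even when that speaker's scores are low overall.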
Table 2: Speaker adapted performance (WER%) of LHUC-SAT Conformer systems with/without confidence score-based data selection
and Bayesian LHUC adaptation on the Hub5’00, RT02 and RT03 data sets, before and after external Transformer plus LSTM LM
rescoring. † denotes a statistically significant (MAPSSWE, α=0.05) WER difference over the baseline SI systems (sys. 1, 6).
Sys.  Speaker     Adaptation     Data       Language Model                          Hub5'00              RT02                        RT03
      Adaptation  Parameters     Selection                                          CHE    SWBD   O.V.   SWBD1  SWBD2  SWBD3  O.V.   FSH    SWBD   O.V.
1     ✗           ✗              ✗          ✗                                       15.0   7.3    11.1   8.7    12.9   15.6   12.6   10.4   16.4   13.5
2     i-vector²   ✗              ✗          ✗                                       15.8   7.8    11.8   9.3    13.6   16.3   13.3   10.9   17.3   14.2
3     LHUC-SAT    Deterministic  ✗          ✗                                       13.8†  7.1    10.5†  8.6    12.3†  14.7†  12.0†  10.0†  15.7†  12.9†
4     LHUC-SAT    Deterministic  ✓          ✗                                       13.7†  6.9†   10.3†  8.3†   12.0†  14.5†  11.8†  9.8†   15.4†  12.7†
5     LHUC-SAT    Bayesian       ✓          ✗                                       13.4†  6.8†   10.1†  8.3†   11.9†  14.1†  11.6†  9.5†   14.8†  12.3†
6     ✗           ✗              ✗          Transformer + BiLSTM (cross-utterance)  14.2   6.8    10.5   8.5    12.3   14.8   12.1   9.6    15.3   12.6
7     LHUC-SAT    Bayesian       ✓          Transformer + BiLSTM (cross-utterance)  13.0†  6.3†   9.6†   8.1†   11.3†  13.4†  11.1†  8.7†   13.8†  11.3†
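As a worked reading of Table 2, the overall (O.V.) WERs of sys. 6 and sys. 7 can be turned into absolute and relative reductions. The short sketch below only restates numbers from the table; the helper name `wer_reduction` is hypothetical, and the calculation says nothing about statistical significance, which the table reports separately through MAPSSWE.

```python
def wer_reduction(baseline_wer, adapted_wer):
    """Return (absolute, relative %) WER reduction of adapted over baseline."""
    absolute = baseline_wer - adapted_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


# Overall (O.V.) WERs from Table 2: (sys. 6 baseline, sys. 7 adapted).
table2_overall = {
    "Hub5'00": (10.5, 9.6),
    "RT02": (12.1, 11.1),
    "RT03": (12.6, 11.3),
}

for test_set, (baseline, adapted) in table2_overall.items():
    abs_red, rel_red = wer_reduction(baseline, adapted)
    print(f"{test_set}: {abs_red:.1f} absolute, {rel_red:.1f}% relative")
# Hub5'00: 0.9 absolute, 8.6% relative
# RT02: 1.0 absolute, 8.3% relative
# RT03: 1.3 absolute, 10.3% relative
```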
SI baseline (sys. 3 vs. sys. 1). The LHUC-SAT Conformer system (sys. 5) produced the lowest WER and was used in the following experiments on confidence score-based adaptation.
4.3. Performance of Confidence Based Adaptation
The efficacy of the proposed confidence score-based speaker level data selection is first assessed via an ablation study on the three test sets, as shown in Figure 2, where the WERs of adapted LHUC-SAT Conformer systems using various amounts of adaptation data selected using different methods are shown. The horizontal coordinates represent speaker level selected subsets of utterances ranked by their confidence scores (e.g., top 80 percentile) of varying sizes used for adaptation.
A general trend can be found that across all three test sets over various percentiles based subset selection, the proposed CEM smoothed Conformer attention Softmax output probabilities (orange lines) consistently produced better LHUC test
plus LSTM LM rescoring (sys. 7 vs. sys. 6). Finally, the performance of the best confidence LHUC adapted LHUC-SAT Conformer system (sys. 7, Table 2) is further contrasted in Table 3 with those of the state-of-the-art performance obtained on the same task using the most recent hybrid and end-to-end systems reported in the literature to demonstrate its competitiveness.
Table 3: Performance (WER%) contrast between the best confidence score-based Bayesian LHUC-SAT conformer system and other state-of-the-art systems on the Hub5’00 and RT03 data.
ID  System                            # Param.  Hub5'00                 RT03
                                                CHE     SWBD    O.V.    FSH     SWBD    O.V.
1   RWTH-2019 Hybrid [50]             -         13.5    6.7     10.2    -       -       -
2   CUHK-2021 BLHUC-Hybrid [14]       15.2M     12.7    6.7     9.7     7.9     13.4    10.7
3   Google-2019 LAS [48]              -         14.1    6.8     (10.5)  -       -       -
4                                     29M       14.6    7.4     (11.0)  -       -       -
5   IBM-2020 AED [51]                 75M       13.4    6.8     (10.1)  -       -       -
6                                     280M      12.5    6.4     9.5     8.4     14.8    (11.7)
7   IBM-2021 CFM-AED [25]             68M       11.2    5.5     (8.4)   7.0     12.6    (9.9)
8   Salesforce-2020 [52]              -         13.3    6.3     (9.8)   -       -       11.4
9   Baseline {Table 2-Sys. 6} (ours)  45M       14.2    6.8     10.5    9.6     15.3    12.6
10  +Adapt. {Table 2-Sys. 7} (ours)   + 0.74M   13.0    6.3     9.6     8.7     13.8    11.3