arXiv:2210.17189
The whole process of 2-speaker DiaCorrect is shown in Fig. 2. It has two encoders: a "SAS encoder" that models the statistics of the initial diarization results, and a "speech encoder" that captures the original speaker-interaction acoustics. The "SAS encoder" accepts two types of initial diarization output: either the binary (0/1) speaker activity sequence SAS_r obtained from a Rich Transcription Time Marked (RTTM) file, or the posterior probability sequence SAS_p derived directly from the pre-trained SA-EEND model. The input of the "speech encoder" is the same log-Mel feature as used by SA-EEND.

After the two encoders transform the initial speaker activities and acoustic features into a high-dimensional embedding space, a decoder learns the alignment, or dependence, between these concatenated high-level feature embeddings and the ground-truth speaker label sequence. If necessary, the DiaCorrect process can be performed iteratively during speaker activity inference.

The detailed structure of the "SAS encoder" is shown in Fig. 2(b): a temporal convolutional network (TCN) [26, 27] models each speaker's activity along time. For the speech encoder, shown in Fig. 2(c), we choose a 2-D convolutional network to attend to both temporal and spectral dynamics in the spectrogram. Finally, a 2-layer Transformer-based decoder [21], shown in Fig. 2(d), predicts the refined speaker activity detection results.

2.2.2. Math definition

Given a T-length, F-dimensional input acoustic feature sequence X = (x_t ∈ R^F | t = 1, ..., T) and its initial speaker activity sequence Z = (z_t | t = 1, ..., T), DiaCorrect aims to align the initial [...]

3.3. Results and discussion

3.3.1. Baseline

As the diarization network's output is a probability sequence of speaker activity for each speaker, a threshold is required to determine whether the current probability value denotes active speech or not. The choice of this decision threshold can therefore greatly affect the overall diarization performance.

Table 2. Performance of SA-EEND at different decision thresholds (all values in %). FA: false alarm, MISS: missed speech, SC: speaker confusion, DER: diarization error rate, JER: Jaccard error rate.

Threshold     FA    MISS    SC     DER    JER
  0.3       22.69   1.38   0.38   24.46  24.12
  0.4       17.57   2.10   0.51   20.17  22.16
  0.5       12.83   3.07   0.67   16.57  20.57
  0.6        8.20   4.58   0.82   13.61  19.56
  0.7        4.28   7.02   1.00   12.31  19.97
  0.8        1.54  11.64   0.99   14.18  22.92
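As a toy illustration of the thresholding step (this is not the paper's code; the helper names and the miniature posterior sequence are invented), the following sketch binarizes per-frame posteriors at a given threshold and counts frame-level false-alarm and miss rates against a reference labeling:

```python
# Toy illustration with invented data: binarize per-frame speaker-activity
# posteriors at a decision threshold, then count false-alarm / miss frames.

def binarize(posteriors, threshold):
    """Map each posterior probability to 0/1 speaker activity."""
    return [1 if p >= threshold else 0 for p in posteriors]

def fa_miss_rates(hyp, ref):
    """Frame-level false-alarm and miss rates against a reference labeling."""
    fa = sum(1 for h, r in zip(hyp, ref) if h == 1 and r == 0)
    miss = sum(1 for h, r in zip(hyp, ref) if h == 0 and r == 1)
    n = len(ref)
    return fa / n, miss / n

# Invented example: 10 frames of posteriors and reference activity.
post = [0.9, 0.8, 0.6, 0.45, 0.4, 0.35, 0.2, 0.7, 0.55, 0.1]
ref  = [1,   1,   1,   1,    0,   0,    0,   1,   1,    0]

for thr in (0.3, 0.5, 0.7):
    hyp = binarize(post, thr)
    fa, miss = fa_miss_rates(hyp, ref)
    print(thr, fa, miss)  # → 0.3 0.2 0.0 / 0.5 0.0 0.1 / 0.7 0.0 0.3
```

Even on this toy sequence, raising the threshold trades false alarms for misses, mirroring the trend in Table 2.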
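The binary SAS_r input described earlier is a frame-level 0/1 sequence derived from an RTTM file's time-marked segments. A minimal sketch of that conversion (the helper name, frame rate, and segments below are invented for illustration; an RTTM entry stores each segment's speaker, onset, and duration in seconds):

```python
# Hypothetical helper: convert RTTM-style (speaker, onset, duration) segments,
# in seconds, into a per-speaker 0/1 speaker activity sequence (SAS_r) at a
# fixed frame rate. The 10 ms frame shift is an assumption for illustration.

FRAMES_PER_SEC = 100  # assumed 10 ms frame shift

def segments_to_sas(segments, speakers, num_frames, fps=FRAMES_PER_SEC):
    """Return {speaker: [0/1] * num_frames} marking each speaker's active frames."""
    sas = {spk: [0] * num_frames for spk in speakers}
    for spk, onset, dur in segments:
        start = int(round(onset * fps))
        end = min(num_frames, int(round((onset + dur) * fps)))
        for t in range(start, end):
            sas[spk][t] = 1
    return sas

# Toy 2-speaker example: 1 second of audio with overlap from 0.45 s to 0.60 s.
segs = [("spk1", 0.00, 0.60), ("spk2", 0.45, 0.55)]
sas = segments_to_sas(segs, ["spk1", "spk2"], num_frames=100)
print(sum(sas["spk1"]), sum(sas["spk2"]))  # → 60 55
```

The resulting per-speaker 0/1 rows are what the "SAS encoder" consumes in place of SA-EEND posteriors.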