EFFECTIVENESS OF SELF-SUPERVISED PRE-TRAINING FOR SPEECH RECOGNITION

Alexei Baevski, Michael Auli, Abdelrahman Mohamed

Facebook AI Research
{abaevski, michaelauli, abdo}@fb.com
arXiv:1911.03912v3 [cs.CL] 18 May 2020

ABSTRACT

We compare self-supervised representation learning algorithms which either explicitly quantize the audio data or learn representations without quantization. We find the former to be more accurate since it builds a good vocabulary of the data through vq-wav2vec [1] to enable learning of effective representations in subsequent BERT training. Different to previous work, we directly fine-tune the pre-trained BERT models on transcribed speech using a Connectionist Temporal Classification (CTC) loss instead of feeding the representations into a task-specific model. We also propose a BERT-style model learning directly from the continuous audio data and compare pre-training on raw audio to spectral features. Fine-tuning a BERT model on 10 hours of labeled Librispeech data with a vq-wav2vec vocabulary is almost as good as the best known reported system trained on 100 hours of labeled data on test-clean, while achieving a 25% WER reduction on test-other. When using only 10 minutes of labeled data, WER is 25.2 on test-other and 16.3 on test-clean. This demonstrates that self-supervision can enable speech recognition systems trained on a near-zero amount of transcribed data.

1 Introduction

Representation learning has been an active research area for more than 30 years [2], with the goal of learning high-level representations which separate different explanatory factors of the phenomena represented by the input data [3, 4]. Building Automatic Speech Recognition (ASR) systems typically requires a large volume of training data to represent different factors contributing to the creation of speech signals, e.g. background noise, recording channel, speaker identity, accent, emotional state, topic under discussion, and the language used in communication. The practical need for building ASR systems for new conditions with limited resources spurred a lot of work focused on unsupervised speech recognition and representation learning [5, 6, 7, 8, 9, 10, 11, 12], in addition to semi- and weakly-supervised learning techniques to reduce the supervised data needed in real-world scenarios [13, 14, 15, 16, 17, 18].

Recently, impressive results have been reported for representation learning that generalizes to different downstream tasks, through self-supervised learning for NLP and speech [19, 20, 11, 12, 1]. Self-supervised representation learning tasks include predicting masked parts of the input, reconstructing inputs through low bit-rate channels, or contrasting similar data points against different ones.

In this work we compare different approaches of self-supervised pre-training for speech data. We consider learning discrete units to represent the audio data through either self-supervision [1] or through clustering spectral features, followed by pre-training over these units using a bi-directional transformer (BERT) model [19]. This is compared to directly learning representations without explicit quantization over the raw audio as well as spectral features. Previous work fed the representations of the pre-trained model into a task-specific architecture for speech recognition instead of the raw waveform [12, 1]. Instead, we directly fine-tune the pre-trained BERT model on transcribed speech data using a CTC loss. Our experiments demonstrate that discrete unit discovery, followed by BERT training, achieves better results than representations learned without explicit quantization. Disentangling acoustic unit discovery from learning the sequential relationship between the units enables better representations of the data, which in turn improves downstream model accuracy.

We pre-train our models on the unlabeled 960h Librispeech [21] data and follow the Libri-light [22] limited resource supervised training sets of 10 hours, 1 hour, and 10 minutes. Our best model fine-tuned on only 1 hour of labeled data can outperform the best known result from the literature relying on 100h of labeled data [23] on the standard Librispeech test-other subset. Using only 10 minutes of labeled data, the approach achieves 16.3/25.2 WER on test-clean/other.

2 Preliminaries

2.1 BERT

Using self-supervision, BERT [19], a deep bidirectional transformer model, builds its internal language representation that generalizes to other downstream NLP tasks. Self-attention over the whole input word sequence enables BERT to jointly condition on both the left and right context of the data. For training, it uses both a masked language model loss, by randomly removing some input words for the model to predict, and a contrastive loss to distinguish the next sentence in the document from a randomly selected one.
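
For reference, the masked language model part of this objective amounts to a cross-entropy computed only at the masked positions. The sketch below is a minimal, generic illustration (not the authors' code); tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, targets, mask):
    """Cross-entropy restricted to masked positions (BERT's masked LM term).

    logits:  (batch, seq_len, vocab) scores from a bidirectional encoder
    targets: (batch, seq_len) original token ids
    mask:    (batch, seq_len) bool tensor, True where the input was masked
    """
    vocab = logits.size(-1)
    # Only masked positions contribute to the loss.
    return F.cross_entropy(logits[mask].view(-1, vocab), targets[mask].view(-1))
```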
2.2 Wav2Vec

Wav2Vec [12] learns representations of audio data by solving a self-supervised context-prediction task with the same loss function as word2vec [24, 11]. The model is based on two convolutional neural networks, where the encoder f : X → Z produces a representation z_i for each time step i at a rate of 100 Hz and the aggregator g : Z → C combines multiple encoder time steps into a new representation c_i for each time step i. Given c_i, the model is trained to distinguish a sample z_{i+k} that is k steps in the future from distractor samples z̃ drawn from a distribution p_n, by minimizing the contrastive loss for steps k = 1, ..., K:

    L_k = - \sum_{i=1}^{T-k} \left( \log \sigma\left(z_{i+k}^{\top} h_k(c_i)\right) + \lambda \, \mathbb{E}_{\tilde{z} \sim p_n}\left[ \log \sigma\left(-\tilde{z}^{\top} h_k(c_i)\right) \right] \right)    (1)

where T is the sequence length, σ(x) = 1/(1 + exp(−x)), and σ(z_{i+k}^T h_k(c_i)) is the probability of z_{i+k} being the true sample. A step-specific affine transformation h_k(c_i) = W_k c_i + b_k is applied to c_i [11]. The loss L = Σ_{k=1}^K L_k is optimized by summing (1) over the different step sizes. The learned high-level features produced by the context network c_i are shown to be better acoustic representations for speech recognition compared to standard spectral features.

2.3 vq-wav2vec

vq-wav2vec [1] learns vector quantized (VQ) representations of audio data using a future time-step prediction task. Similar to Wav2Vec, there are convolutional encoder and aggregator networks f : X → Z and g : Ẑ → C for feature extraction and aggregation. However, in between them there is a quantization module q : Z → Ẑ to build discrete representations which are input to the aggregator.

First, 30ms segments of raw speech are mapped to a dense feature representation z at a stride of 10ms using the encoder f. Next, the quantizer q turns these dense representations into discrete indices which are mapped to a reconstruction ẑ of the original representation z. The ẑ is fed into the aggregator g and the model is optimized via the same context prediction task as wav2vec (cf. §2.2). The quantization module replaces the original representation z by ẑ = e_i from a fixed-size codebook e ∈ R^{V×d} which contains V representations of size d.

3 Approach

3.1 Discrete BERT

Our work builds on the recently proposed approach of [1], where audio is quantized using a contrastive loss and features are then learned on top by a BERT model [19]. For the vq-wav2vec quantization, we use the gumbel-softmax variant with the same setup as described in [1]. This model quantizes the Librispeech dataset into 13.5k unique codes.

To understand the impact of the discrete acoustic representations of vq-wav2vec [1], as alternatives we explore quantizing the standard mel-frequency cepstral coefficients (MFCC) and log-mel filterbank coefficients (FBANK), choosing a subset small enough to fit into GPU memory and running k-means with 13.5k centroids (to match the vq-wav2vec setup) to convergence. We then assign the index of the closest centroid to represent each time-step.

We train a standard BERT model [19, 25] with only the masked language modeling task on each set of inputs in a similar way as described in [1], namely by choosing tokens for masking with probability 0.05, expanding each chosen token to a span of a length sampled from a normal distribution with mean 10 and standard deviation 10 (spans may overlap), and then computing a cross-entropy loss which attempts to maximize the likelihood of predicting the true token for each one that was masked (Figure 1a).

Following [26], we replace the fixed positional embeddings in the BERT model with a single group convolutional layer that is applied directly on the embeddings before any of the transformer blocks. The convolutional layer has a kernel size of 128 and group size of 16 to reduce the number of added parameters.

3.2 Continuous BERT

A masked language modeling task cannot be performed with continuous inputs and outputs, as there are no targets to predict in place of the masked tokens. Instead of reconstructing the input as in [27], we classify the masked positive example among a set of negatives. The inputs to the model are dense wav2vec features [12], MFCC or FBANK features representing 10ms of audio data. Some of these inputs are replaced with a mask embedding and are then fed into a transformer encoder. We then compute the dot product between the outputs corresponding to each masked input, the true input that was masked, and a set of negatives sampled from other masked inputs within the same batch. The model is optimized with the InfoNCE loss [11] where, given one positive sample z_i and N negative samples z̃, we minimize:

    L_k = \sum_{i=1}^{T} \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(\tilde{z}_j)}    (2)

where each sample z_i is computed as a dot product of the output of the model at timestep i and the true unmasked value of the positive example at timestep i or a randomly sampled negative example. To stabilize training, we add the squared sum of logits produced by the dot products to the loss, and then apply a soft clamp ŝ_i = λ tanh(s_i/λ) for each logit s_i to prevent the model's tendency to continually increase the magnitude of logits during training [28]. We use the same kind of convolutional positional layer as described in Section 3.1.
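
A minimal sketch of this contrastive objective follows (not the authors' implementation): logits are dot products between transformer outputs at masked positions and the true versus negative features, with the squared-logit penalty and soft clamp described above. The InfoNCE term is written here in its standard cross-entropy form, and the constants 0.04 and 20 are taken from the values reported later in Section 4.2.2.

```python
import torch
import torch.nn.functional as F

def continuous_bert_loss(outputs, targets, negatives, clamp=20.0, penalty=0.04):
    """Sketch of the Continuous BERT contrastive loss.

    outputs:   (M, d) transformer outputs at the M masked positions
    targets:   (M, d) the true (unmasked) features at those positions
    negatives: (M, N, d) N negative features sampled for each masked position
    """
    # Dot-product logits for the positive and the N negatives.
    pos = (outputs * targets).sum(-1, keepdim=True)           # (M, 1)
    neg = torch.einsum("md,mnd->mn", outputs, negatives)      # (M, N)
    logits = torch.cat([pos, neg], dim=1)                     # (M, 1 + N)

    # Stabilization: penalize large raw logits, then soft-clamp them.
    penalty_term = penalty * (logits ** 2).sum()
    logits = clamp * torch.tanh(logits / clamp)

    # InfoNCE: the positive (index 0) should win against the negatives.
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels) + penalty_term
```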
[Figure 1 diagram omitted: panel (a) Quantized Inputs, panel (b) Continuous Inputs; a Transformer encoder is trained over masked discrete tokens or masked dense features. See caption below.]
Fig. 1: Illustration of BERT pre-training. m_i refers to masked time-steps chosen at random during each forward pass. (a) Inputs are quantized with a vq-wav2vec quantizer or, for the MFCC and FBANK variants, by finding the closest centroids, and are then used for training a BERT model with a masked language model objective. (b) wav2vec, MFCC or FBANK inputs are masked, encoded by a transformer, and then fed into a classifier to predict which features were masked at each time-step.
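
To make the masking in Figure 1 concrete, the following minimal sketch selects the masked positions m_i as described in Sections 3.1 and 4.2: each position is chosen as a span start with probability 0.05 and span lengths are drawn from N(10, 10), rounded and clipped at zero, with overlapping spans simply merging. The function name and frame count are illustrative assumptions.

```python
import numpy as np

def sample_mask(seq_len, start_prob=0.05, mean=10, std=10, rng=np.random):
    """Sketch of the span masking used during BERT pre-training (Fig. 1)."""
    mask = np.zeros(seq_len, dtype=bool)
    # Choose span starts independently with probability `start_prob`.
    starts = np.nonzero(rng.random(seq_len) < start_prob)[0]
    for s in starts:
        # Span length ~ N(mean, std), rounded, never negative.
        length = max(0, int(np.rint(rng.normal(mean, std))))
        mask[s:s + length] = True
    return mask

# Example: mask a 1,000-frame utterance (10ms per frame -> 10 seconds of audio).
mask = sample_mask(1000)
```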

3.3 Supervised fine-tuning

The pre-trained models are fine-tuned to perform the ASR task by adding a randomly initialized linear projection, on top of the features computed by the transformer models, into V classes representing the vocabulary of the task. The vocabulary is 29 tokens for character targets plus a word boundary token. The models are optimized by minimizing the CTC loss. We apply SpecAugment [29] inspired masking to time-steps and channels during training, which delays overfitting and significantly improves the final accuracy numbers, especially on the smallest subsets.

We train a single seed for all subsets except the 10 minute one, where we train 5 seeds and choose the best one. For the 10 minute subset, some of the seeds fail by entering the overfitting regime very early. We train on a single GPU using the Adam optimizer with a tri-state learning rate schedule where the learning rate is linearly increased from 1e-7 to 2e-5 in the first stage, held at 2e-5 in the second, and finally linearly decayed to 0 in the third. For the 1h and 10min subsets we train for 1250 / 6600 / 12150 updates in each respective stage, for the 10h subset we train for 5000 / 16500 / 28500 updates, and for 100h we train for 8000 / 52800 / 91200 updates. We use a batch size of 6144 timesteps (61.44 seconds worth of audio) for the 100 hour subset and 3072 timesteps for the other subsets.

During fine-tuning, we apply a modified SpecAugment [29] policy, where we randomly choose a number of starting timesteps to mask, with each timestep having a 3.75% chance of being chosen. A span of 20 timesteps starting at each of the chosen positions is then replaced with the mask embedding used during unsupervised training (spans may overlap). We also apply channel masking, in which we choose the starting channel index from all channels with a probability of 0.4% and the length of the channel mask by sampling from a normal distribution with a mean of 64 and a standard deviation of 64. The chosen (and possibly overlapping) spans of channels are then zeroed out.

We apply a dropout of 0.1 at every layer of the transformer for the 10 minute and 1 hour setups, but we disable it for the other subsets as the masking described above appears to provide enough regularization.

4 Experiments

We implement our models in the fairseq [30] toolkit.

4.1 Data

All experiments are performed by pre-training on the 960 hours of audio-only data of the Librispeech [21] training set and fine-tuning on the Libri-light [22] limited resource supervised training sets of 10 hours (24 speakers), 1 hour (24 speakers), and 10 minutes (4 speakers). The Libri-light training sets are sampled equally from the clean and noisy portions, with a balance of male and female speakers. We also report results of models fine-tuned on 100 hours following the "train-clean-100" subset. All models are evaluated on the standard Librispeech dev and test splits.

4.2 Models

4.2.1 Quantized Inputs Training

We first train the vq-wav2vec quantization model following the gumbel-softmax setup described in [1]. After training this model on 960h of Librispeech and quantizing the training dataset, we are left with 13.5k unique codeword combinations.
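
For illustration, quantizing an utterance with a trained vq-wav2vec checkpoint can look like the usage example in the fairseq wav2vec README. This is a sketch: attribute names such as feature_extractor and vector_quantizer.forward_idx follow that example and may differ across fairseq versions, and the checkpoint path is a placeholder.

```python
import torch
from fairseq.models.wav2vec import Wav2VecModel

# Load a trained vq-wav2vec checkpoint (placeholder path).
cp = torch.load("vq-wav2vec.pt")
model = Wav2VecModel.build_model(cp["args"], task=None)
model.load_state_dict(cp["model"])
model.eval()

# 16 kHz waveform -> dense features -> discrete codebook indices for BERT.
wav = torch.randn(1, 16000)                      # one second of (random) audio
z = model.feature_extractor(wav)                 # one frame every 10ms
_, idxs = model.vector_quantizer.forward_idx(z)  # (1, T, groups) token indices
```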
For quantizing MFCC and FBANK features extracted with the Kaldi [31] toolkit, we use 8 Volta GPUs with 32GB memory to compute 13.5k k-means centroids, matching the number of unique tokens produced by the vq-wav2vec model. To fit into GPU memory, we subsample 50% of MFCC features and 25% of FBANK features from the training set before running the clustering algorithm.

The model we use for the masked language modeling task is a standard BERT model with 12 transformer layers, model dimension 768, inner dimension (FFN) 3072 and 12 attention heads [19]. The learning rate is warmed up over the first 10,000 updates to a peak value of 1e-5, and then linearly decayed over a total of 250k updates. We train on 128 GPUs with a batch size of 3072 tokens per GPU, giving a total batch size of 393k tokens [32], where each token represents 10ms of audio data.

To mask the input sequence, we follow [1] and randomly sample p = 0.05 of all tokens to be a starting index, without replacement, and mask M consecutive tokens from every sampled index; spans may overlap. M is sampled from a Gaussian distribution with µ = 10 and σ = 10, rounded to the nearest integer greater than or equal to zero.

Different from [1], we do not concatenate different utterances to form examples for training; instead, each utterance is treated as a single example, as we find that this approach produces better results after fine-tuning.

4.2.2 Continuous Inputs Training

For training on dense features, we use a model similar to a standard BERT model with the same parameterization as the one used for quantized input training, but we use the wav2vec, MFCC or FBANK inputs directly. We add 128 relative positional embeddings at every multi-head attention block [33] instead of fixed positional embeddings to make it easier to handle longer examples. We train this model on 8 GPUs with a batch size of 9,600 inputs per GPU, resulting in a total batch size of 76,800. We find that increasing the number of GPUs (which increases the effective batch size) does not lead to better results with this particular setup.

Wav2vec features are 512-dimensional, while MFCC features have 39 dimensions and FBANK features have 80. We introduce a simple linear projection from the feature dimension to the BERT dimension (768) for all models.

Similar to §4.2.1, we mask time-steps by randomly sampling, without replacement, p = 0.05 of all time-steps to be a starting index, and mask M consecutive time-steps from every sampled index; spans may overlap, where M is sampled from a Gaussian distribution with µ = 10 and σ = 10, rounded to the nearest integer greater than or equal to zero. We sample 10 negative examples from other masked time-steps within the same example, and an additional 10 negative examples from masked time-steps occurring anywhere in the batch. We compute a dot product between the original features and the output corresponding to the same time-step after they are processed by the BERT model. We add the squared sum of logits from these computations, multiplied by λ = 0.04, to the loss, and then apply a smooth clamp by recomputing each logit as ŝ_i = 20 tanh(s_i/20).

The learning rate is warmed up over the first 10,000 updates to a peak value of 1e-5, and then linearly decayed over a total of 250k updates.

4.3 Methodology

For quantized inputs, we compute token indices using the gumbel-softmax based vq-wav2vec model. For MFCC and FBANK features we take the index of the closest centroid, as measured by the minimum Euclidean distance, for each corresponding feature in the Librispeech dataset. We then train a BERT model as described in §4.2.1.

For wav2vec continuous inputs, we use features extracted by the publicly available wav2vec [12] model, which contains 6 convolutional blocks in the feature extractor and 11 convolutional blocks in the aggregator module. We use the outputs of the aggregator as features. For MFCC and FBANK, we use those features directly after applying a single linear projection to upsample them to the model dimensionality.

We fine-tune our pre-trained models on either the 100-hour Librispeech train-clean-100 subset, or the 10 hours, 1 hour, or 10 minutes of labelled data following the Libri-light [22] limited training sets. We use the standard CTC loss and train for up to 20k updates. We find that the pre-trained models converge after only around 4k updates, while the models trained from scratch tend to converge much later, around 18k updates. We fine-tune all models with a learning rate of 0.0001 that is linearly warmed up over the first 2k updates and then annealed following a cosine learning rate schedule over the last 18k updates. We set the dropout of the pre-trained BERT models to 0.1 and sweep the dropout of the BERT model outputs before the final projection layer over values between 0.0 and 0.4 in increments of 0.1. For each model, we choose the single checkpoint with the best loss on the validation set, which is a combination of the standard Librispeech dev-clean and dev-other splits.

We use the publicly available wav2letter++ [34] decoder integrated into the fairseq framework with the official Librispeech 4-gram language model. We run a sweep over the language model score, word score and silence token weights for each model, where parameters are chosen randomly and evaluated on the dev-other Librispeech set. We use the weights found by these sweeps to evaluate and report results for all other splits. The sweeps are run with a beam size of 250, while the final decoding uses a beam size of 1500.
4.4 Results

In our first experiment, we compare unit discovery followed by BERT training over the resulting discrete units (Discrete BERT) to directly learning representations from the audio inputs (Continuous BERT) in different simulated labeled data scenarios ranging from 100 hours to 10 minutes. We compare quantization with vq-wav2vec to clustered MFCC and FBANK features. The Continuous BERT variant learns directly from the audio representations without explicit quantization, and we experiment with inputting wav2vec, MFCC and FBANK features.

Model             Input features   dev-clean  dev-other  test-clean  test-other

10 mins of labeled data
Discrete BERT     vq-wav2vec         15.7       24.1       16.3        25.2
Discrete BERT     MFCC               30.3       48.8       31.5        49.1
Discrete BERT     FBANK              35.7       53.9       35.5        55.0
Continuous BERT   wav2vec            49.1       66.0       49.5        66.3
Continuous BERT   MFCC               82.2       91.7       80.4        90.4
Continuous BERT   FBANK              92.9       96.1       92.3        95.9

1 hour of labeled data
Discrete BERT     vq-wav2vec          8.5       16.4        9.0        17.6
Discrete BERT     MFCC               14.1       32.1       14.3        32.5
Discrete BERT     FBANK              14.2       30.6       14.6        31.7
Continuous BERT   wav2vec            22.1       42.0       22.4        44.0
Continuous BERT   MFCC               53.8       75.0       52.8        74.5
Continuous BERT   FBANK              56.7       76.2       55.0        76.1

10 hours of labeled data
Discrete BERT     vq-wav2vec          5.3       13.2        5.9        14.1
Discrete BERT     MFCC                9.8       26.6        9.9        27.8
Discrete BERT     FBANK               9.8       25.7       10.1        26.6
Continuous BERT   wav2vec            13.6       31.7       14.1        34.3
Continuous BERT   MFCC               27.5       54.2       27.4        55.7
Continuous BERT   FBANK              25.0       50.2       24.9        51.7

100 hours of labeled data
Discrete BERT     vq-wav2vec          4.0       10.9        4.5        12.1
Discrete BERT     MFCC                7.6       24.2        7.8        24.4
Discrete BERT     FBANK               7.2       22.8        7.8        23.3
Continuous BERT   wav2vec            11.3       26.4       11.8        28.3
Continuous BERT   MFCC               11.6       34.0       12.4        35.5
Continuous BERT   FBANK              10.1       30.9       11.0        31.8

Table 1: Librispeech WER of BERT models taking discretized or continuous input features: for Discrete BERT, quantized input units are obtained via vq-wav2vec or by k-means clustering of MFCC/FBANK features; for Continuous BERT, wav2vec features are learned following [12]. Pre-trained models are fine-tuned on various sizes of labeled data.

Table 1 compares WERs of different input features and pre-training methods on the standard Librispeech clean and other subsets. The first observation is that Discrete BERT outperforms Continuous BERT in all settings. This shows that pre-training over meaningful discrete units outperforms directly learning representations from the continuous unlabeled data. The initial unit discovery builds a vocabulary that makes the subsequent BERT pre-training more effective.

The best input features are obtained through self-supervised learning, via vq-wav2vec [1] for Discrete BERT and wav2vec [12] for Continuous BERT.

For Discrete BERT, vq-wav2vec provides about 40% relative error reduction on both test subsets compared to clustered spectral features across all training set sizes, with bigger gains on the noisy test-other subset.

Pre-training brings clear benefits: reducing the amount of labeled training data from 100h to 10h results in an increase of only 2 WER on test-other and 1.4 WER on test-clean for Discrete BERT with vq-wav2vec inputs. This shows that pre-training is effective, and particularly so when little labeled data is available. When reducing the amount of labeled data to only 10 minutes, Discrete BERT with vq-wav2vec inputs can still achieve a WER of 16.3/25.2 on test-clean/other.

Model                                Labeled data  test-clean  test-other
Wang et al. (2019) [35]              960h            2.6         5.6
Irie et al. (2019) [36] (No LM)      100h           12.9        35.5
Kahn et al. (2019) [37] (Conv LM)    100h            5.9        24.1
Panayotov et al. (2015) [21]         100h            6.6        22.5
Lüscher et al. (2019) [23]           100h            5.8        18.6
Kawakami et al. (2019) [38]          96h             9.4        26.8
vq-wav2vec + Discrete BERT (Ours)    10m            16.3        25.2
vq-wav2vec + Discrete BERT (Ours)    1h              9.0        17.6
vq-wav2vec + Discrete BERT (Ours)    10h             5.9        14.1
vq-wav2vec + Discrete BERT (Ours)    100h            4.5        12.1

Table 2: Comparison to previously published results in terms of WER on Librispeech. All results use 4-gram language models, except [36, 37].

Table 2 shows a comparison of Discrete BERT to results from the literature. Fine-tuning Discrete BERT on only 10 hours of labeled data can nearly match the best known result on 100 hours of labeled Librispeech data [23] on test-clean, while achieving a 25% relative WER reduction on test-other. Moreover, when using the same train-clean-100 subset for fine-tuning, Discrete BERT with vq-wav2vec inputs improves by 6.5 WER (35% relative WER reduction) on test-other and 1.3 WER (22% relative WER reduction) on test-clean over [23].

The closest setup to ours is [38], which learns representations using CPC [11] and then feeds these into an ASR system trained on about 96h of labeled data; it is surpassed on test-clean by our Discrete BERT model trained on vq-wav2vec representations and fine-tuned on 10 minutes and on 1 hour of labeled data.
4.5 Ablations

To better understand the impact of BERT pre-training in our representation learning approach, we remove the BERT pre-training step and only perform unit discovery through vq-wav2vec, for discrete inputs, and fine-tuning, for both discrete and continuous inputs, on the 10 hour labeled setup. The vocabulary for discrete inputs is still built on the unlabeled data. Table 3 shows that training with discrete inputs fails. This is likely because the representations of the input discrete units are random and training on the labeled data alone is not sufficient. Continuous inputs do not suffer from this issue.

Input features    dev-clean  dev-other  test-clean  test-other

Discrete input (no BERT pre-training)
vq-wav2vec          99.7       99.3       99.3        99.4
MFCC               100.0      100.0       99.9       100.0
FBANK               98.8       99.6       98.9        99.0

Continuous input (no BERT pre-training)
wav2vec             27.5       47.2       28.4        48.5
MFCC                30.6       56.7       32.2        58.2
FBANK               20.2       47.6       21.3        49.2

Table 3: WER on Librispeech with no pre-training for continuous and discrete input features on 100 hours of labeled data.

Next, we shed some light on how a two-step pre-training approach compares to a single-step approach. Specifically, we compare Continuous BERT with wav2vec input features (requiring separate learning of the wav2vec features) to just wav2vec features fine-tuned with a CTC loss on labeled data. The results (Table 4) show that Continuous BERT + wav2vec provides substantial gains. A second step of representation learning more than halved the WER, with larger gains observed on the "clean" subsets (cf. §4.4).

Pre-training         dev-clean  dev-other  test-clean  test-other
wav2vec                51.1       71.4       52.6        71.9
+ Continuous BERT      13.3       31.9       14.0        35.0

Table 4: Comparison of single-step pre-training (wav2vec) to two-step pre-training (wav2vec + Continuous BERT). wav2vec is fine-tuned on 10 hours of labeled data, and so is the Continuous BERT model with wav2vec inputs.

5 Discussion and Related Work

The success of BERT [19] and word2vec [24] for NLP tasks motivated more research on self-supervised approaches for acoustic word embedding and unsupervised acoustic feature representation [39, 40, 41, 42, 10, 43, 12, 11, 44, 1], either by predicting masked discrete or continuous input, or by contrastive prediction of neighboring or similarly sounding segments using distant supervision or proximity in the audio signal as an indication of similarity. In [45], a dynamic time warping alignment is used to discover similar segment pairs.

Our work is inspired by research efforts reducing the dependence on labeled data for building ASR systems through unsupervised unit discovery and acoustic representation learning [5, 6, 7, 8, 9], through multi- and cross-lingual transfer learning in low-resource conditions [46, 47, 48, 49, 50, 51], and through semi-supervised learning [13, 14, 15, 16].

6 Conclusion and Future Work

We presented a systematic comparison of self-supervised pre-training approaches for speech recognition. The most effective method is to first learn a discrete vocabulary of the data with vq-wav2vec, followed by standard BERT training over these discrete units. This performs much better than directly learning from the continuous audio data. Different to previous work, which relied on task-specific ASR models, we directly fine-tune the resulting BERT model on transcribed speech data to act as a speech recognition model. This approach can achieve better accuracy on test-other than the best known result with 100 hours of labeled data while relying on two orders of magnitude less labeled data. When the model is fine-tuned on only 10 minutes of data, it can still achieve a WER of 25.2 on test-other and 16.3 on test-clean.

7 References

[1] Alexei Baevski, Steffen Schneider, and Michael Auli, "vq-wav2vec: Self-supervised learning of discrete speech representations," in Proc. of ICLR, 2020.
[2] G. E. Hinton, J. L. McClelland, and D. E. Rumelhart, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, MIT Press, 1986.
[3] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, "Deep learning," Nature, 2015.
[4] Yoshua Bengio, Aaron Courville, and Pascal Vincent, "Representation learning: A review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., 2013.
[5] A. Park and J. Glass, "Unsupervised pattern discovery in speech," IEEE TASLP, vol. 16, no. 1, pp. 186–197, 2008.
[6] Aren Jansen, Kenneth Church, and Hynek Hermansky, "Towards spoken term discovery at scale with zero resources," in Proc. of Interspeech, 2010.
[7] J. Glass, "Towards unsupervised speech processing," in ISSPA, 2012.
[8] Aren Jansen, Emmanuel Dupoux, Sharon Goldwater, Mark Johnson, Sanjeev Khudanpur, Kenneth Church, Naomi Feldman, Hynek Hermansky, Florian Metze, Richard Rose, et al., "A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition," in Proc. of ICASSP, 2013.
[9] Odette Scharenborg, Laurent Besacier, Alan Black, Mark Hasegawa-Johnson, Florian Metze, Graham Neubig, Sebastian Stüker, Pierre Godard, Markus Müller, Lucas Ondel, et al., "Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the 'Speaking Rosetta' JSALT 2017 workshop," in Proc. of ICASSP, 2018.
[10] Yu-An Chung and James Glass, "Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech," in Proc. of Interspeech, 2018.
[11] Aäron van den Oord, Yazhe Li, and Oriol Vinyals, "Representation learning with contrastive predictive coding," arXiv, 2018.
[12] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli, "wav2vec: Unsupervised pre-training for speech recognition," in Proc. of Interspeech, 2020.
[13] K. Vesely, M. Hannemann, and L. Burget, "Semi-supervised training of deep neural networks," in Proc. of ASRU, 2013.
[14] Sheng Li, Xugang Lu, Shinsuke Sakai, Masato Mimura, and Tatsuya Kawahara, "Semi-supervised ensemble DNN acoustic model training," in Proc. of ICASSP, 2017.
[15] S. Krishnan Parthasarathi and N. Strom, "Lessons from building acoustic models with a million hours of speech," in Proc. of ICASSP, 2019.
[16] Bo Li, Ruoming Pang, Tara Sainath, and Zelin Wu, "Semi-supervised training for end-to-end models via weak distillation," in Proc. of ICASSP, 2019.
[17] Grzegorz Chrupała, Lieke Gelderloos, and Afra Alishahi, "Representations of language in a model of visually grounded speech signal," in Proc. of ACL, 2017.
[18] Herman Kamper, Shane Settle, Gregory Shakhnarovich, and Karen Livescu, "Visually grounded learning of keyword prediction from untranscribed speech," in Proc. of Interspeech, 2017.
[19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. of NAACL, 2019.
[20] Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli, "Cloze-driven pretraining of self-attention networks," in Proc. of EMNLP, 2019.
[21] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. of ICASSP, 2015.
[22] Jacob Kahn et al., "Libri-light: A benchmark for ASR with limited or no supervision," arXiv preprint arXiv:1912.07875, 2019.
[23] Christoph Lüscher, Eugen Beck, Kazuki Irie, Markus Kitza, Wilfried Michel, Albert Zeyer, Ralf Schlüter, and Hermann Ney, "RWTH ASR systems for Librispeech: Hybrid vs attention," in Interspeech 2019, 2019.
[24] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean, "Distributed representations of words and phrases and their compositionality," in Proc. of NIPS, 2013.
[25] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv, vol. abs/1907.11692, 2019.
[26] Abdelrahman Mohamed, Dmytro Okhonko, and Luke Zettlemoyer, "Transformers with convolutional context for ASR," arXiv, 2019.
[27] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu, "Neural discrete representation learning," arXiv, vol. abs/1711.00937, 2017.
[28] Philip Bachman, R. Devon Hjelm, and William Buchwalter, "Learning representations by maximizing mutual information across views," arXiv, vol. abs/1906.00910, 2019.
[29] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Proc. of Interspeech, ISCA, 2019.
[30] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli, "fairseq: A fast, extensible toolkit for sequence modeling," in Proc. of NAACL System Demonstrations, 2019.
[31] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, "The Kaldi speech recognition toolkit," in Proc. of ASRU, 2011.
[32] Myle Ott, Sergey Edunov, David Grangier, and Michael Auli, "Scaling neural machine translation," in Proc. of WMT, 2018.
[33] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov, "Transformer-XL: Attentive language models beyond a fixed-length context," arXiv, vol. abs/1901.02860, 2019.
[34] V. Pratap, A. Hannun, Q. Xu, J. Cai, J. Kahn, G. Synnaeve, V. Liptchinsky, and R. Collobert, "Wav2letter++: A fast open-source speech recognition system," in Proc. of ICASSP, 2019.
[35] Yongqiang Wang, Abdelrahman Mohamed, Duc Le, Chunxi Liu, Alex Xiao, Jay Mahadeokar, Hongzhao Huang, Andros Tjandra, Xiaohui Zhang, Frank Zhang, Christian Fuegen, Geoffrey Zweig, and Michael L. Seltzer, "Transformer-based acoustic modeling for hybrid speech recognition," arXiv, vol. abs/1910.09799, 2019.
[36] Kazuki Irie, Rohit Prabhavalkar, Anjuli Kannan, Antoine Bruguier, David Rybach, and Patrick Nguyen, "On the choice of modeling unit for sequence-to-sequence speech recognition," in Interspeech 2019, 2019.
[37] Jacob Kahn, Ann Lee, and Awni Hannun, "Self-training for end-to-end speech recognition," arXiv, vol. abs/1909.09116, 2019.
[38] Kazuya Kawakami, Luyu Wang, Chris Dyer, Phil Blunsom, and Aaron van den Oord, "Unsupervised learning of efficient and robust speech representations," 2019.
[39] Samy Bengio and Georg Heigold, "Word embeddings for speech recognition," in Proc. of Interspeech, 2014.
[40] Keith Levin, Katharine Henry, Aren Jansen, and K. Livescu, "Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings," in Proc. of ASRU, 2013.
[41] Yu-An Chung, Chao-Chung Wu, Chia-Hao Shen, Hung-yi Lee, and L. Lee, "Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder," in Proc. of Interspeech, 2016.
[42] Wanjia He, Weiran Wang, and Karen Livescu, "Multi-view recurrent neural acoustic word embeddings," in Proc. of ICLR, 2016.
[43] Shane Settle, Kartik Audhkhasi, Karen Livescu, and Michael Picheny, "Acoustically grounded word embeddings for improved acoustics-to-word speech recognition," in Proc. of ICASSP, 2019.
[44] Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James R. Glass, "An unsupervised autoregressive model for speech representation learning," in Proc. of Interspeech, 2019.
[45] Herman Kamper, Micha Elsner, Aren Jansen, and Sharon Goldwater, "Unsupervised neural network based feature extraction using weak top-down constraints," in Proc. of ICASSP, 2015.
[46] Haihua Xu, Hang Su, Chongjia Ni, Xiong Xiao, Hao Huang, Eng Siong Chng, and Haizhou Li, "Semi-supervised and cross-lingual knowledge transfer learnings for DNN hybrid acoustic models under low-resource conditions," in Proc. of Interspeech, 2016.
[47] Jia Cui, Brian Kingsbury, Bhuvana Ramabhadran, Abhinav Sethy, Kartik Audhkhasi, Xiaodong Cui, Ellen Kislal, Lidia Mangu, Markus Nussbaum-Thom, Michael Picheny, et al., "Multilingual representations for low resource speech recognition and keyword search," in Proc. of ASRU, 2015.
[48] Georg Heigold, Vincent Vanhoucke, Alan Senior, Patrick Nguyen, Marc'Aurelio Ranzato, Matthieu Devin, and Jeffrey Dean, "Multilingual acoustic models using distributed deep neural networks," in Proc. of ICASSP, 2013.
[49] A. Ghoshal, P. Swietojanski, and S. Renals, "Multilingual training of deep neural networks," in Proc. of ICASSP, 2013.
[50] Jui-Ting Huang, Jinyu Li, Dong Yu, Li Deng, and Yifan Gong, "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers," in Proc. of ICASSP, 2013.
[51] Ngoc Thang Vu et al., "Multilingual deep neural network based acoustic modeling for rapid language adaptation," in Proc. of ICASSP, 2014.
