Do self-supervised speech models develop human-like perception biases?
Juliette Millet (1,2,3), Ewan Dunbar (1,4)
1 CoML, ENS/CNRS/EHESS/INRIA/PSL, Paris, France
2 LLF, University of Paris, CNRS, Paris, France
3 CRI, FAN, IIFR, University of Paris, Paris, France
4 University of Toronto, Toronto, Canada
juliette.millet@cri-paris.org, ewan.dunbar@utoronto.ca
Figure 1: Log-likelihood values (top: shorter/higher bars are better) and Spearman correlation (bottom: taller bars are better) for French (left) and English (right) participants. Stars indicate that the pairwise difference is significant. The supervised reference is in white, to distinguish it from the native self-supervised models in light grey and the baselines in darker grey (neutral acoustic features and models trained on acoustic scenes).

Figure 2: Average of French listeners' results (higher: better discrimination) against average δ from (left) the supervised reference trained on phonemic transcriptions and (right) wav2vec trained on non-speech recordings. Each point is a contrast. Measures are normalised by dividing by the standard deviation over the entire data set, so the two scales are comparable. Black circles are non-native contrasts, white ones are native (French).
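The δ measure plotted in Figure 2 is, in the Perceptimatic line of work, a distance-based discriminability score: for each triplet, a probe is compared to a matching target and to a mismatched distractor in the model's representation space. The sketch below is a minimal illustration under that assumption, not the authors' exact implementation; the DTW variant, the cosine frame distance, and the path-length normalisation are common choices rather than confirmed details.

```python
# Hedged sketch of an ABX-style delta: representations are
# (n_frames, n_dims) arrays; delta > 0 means the probe is closer to the
# matching target than to the distractor (correct discrimination).
import numpy as np
from scipy.spatial.distance import cdist


def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Mean frame distance along the best DTW alignment of a and b."""
    d = cdist(a, b, metric="cosine")  # frame-by-frame cosine distances
    n, m = d.shape
    acc = np.full((n, m), np.inf)
    acc[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = []
            if i > 0:
                prev.append(acc[i - 1, j])
            if j > 0:
                prev.append(acc[i, j - 1])
            if i > 0 and j > 0:
                prev.append(acc[i - 1, j - 1])
            acc[i, j] = d[i, j] + min(prev)
    return acc[-1, -1] / (n + m)  # crude normalisation by path length


def delta(target: np.ndarray, other: np.ndarray, probe: np.ndarray) -> float:
    """Positive when the probe is closer to the target than to the distractor."""
    return dtw_distance(other, probe) - dtw_distance(target, probe)
```

As the caption notes, the per-contrast averages of δ would then be divided by the standard deviation of δ over the whole data set before being plotted against the (likewise normalised) human results.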
across language-training conditions, on the one hand, and the differences in accuracy for the two listener groups, on the other. As before, DeepSpeech performs well; unlike in some of the previous results at the phone contrast level, CPC is competitive with DeepSpeech at predicting differences between groups. HuBERT and wav2vec 2.0, on the other hand, appear to be learning a language-neutral speech perception space.

Figure 3 (right) shows the same analysis, but only on the WorldVowels dataset. The stimuli in this dataset are constructed specifically to induce different discrimination behaviour in the two language groups. Here, DeepSpeech shows a much better ability to predict native language effects, both in absolute terms and relative to the other models. As this analysis is done at the level of phone contrasts, not individual stimuli, the fact that our supervised reference model is trained to produce phonemic transcriptions probably gives it a head start at predicting differences in discrimination behaviour driven by phone categories.

Figure 3: Native language effect for each model (left: all data; right: the WorldVowels dataset): the bigger the bar, the better the model captures language specificities in the discrimination behaviour of the two groups. Stars indicate that the pairwise difference is significant. The supervised reference is in white, to distinguish it from the self-supervised models in light grey.

Discussion

We showed that the self-supervised models we tested seem to learn representational spaces relevant for predicting human phone discrimination. However, while human speech perception is known to be influenced by the set of phones in the native phone inventory (which renders many unfamiliar phones systematically difficult to distinguish), the self-supervised models we test do not capture systematic effects of contrasts between specific pairs of phones. Unlike our supervised reference, their similarity to human perceptual spaces is limited to capturing the discriminability of specific individual stimuli. The models tested were similar in this respect, but wav2vec 2.0 showed a slight advantage for predicting this kind of behaviour.

We have also shown that training on speech data is essential to obtaining a human-like perceptual space: for all of our metrics, training on speech leads to better results than training on acoustic scenes. This strongly suggests that the benefits of self-supervised speech models come from learning characteristics of human speech, not simply from being better general audio features. We speculate that this is important not only to their ability to predict human speech perception and to discriminate phones, but also to their (related) utility for downstream tasks such as ASR.

What these models learn about speech, however, is not typically language-specific: at least, not in the same way that human perception is. Wav2vec 2.0 and HuBERT do not model language-specific differences in human speech perception, and can be seen as modelling a language-neutral, or universal, speech perception space. We note that the idea of self-supervised models learning universal speech features is consistent with the fact that models trained on one language, or multilingually, have proven useful for representing speech in unseen languages (Riviere et al. 2020).
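The native language effect summarised in Figure 3 relates the two differences just described: per phone contrast, the gap in δ between a model's French-trained and English-trained versions, and the gap in accuracy between French and English listeners. A hedged sketch of one way to compute such a statistic follows; the column names and the per-contrast mean aggregation are assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch of a native-language-effect score: correlate, across
# phone contrasts, the model's cross-language delta difference with the
# listeners' cross-language accuracy difference.
import pandas as pd
from scipy.stats import spearmanr


def native_language_effect(df: pd.DataFrame) -> float:
    """df: one row per stimulus, with (assumed) columns 'contrast',
    'delta_french_model', 'delta_english_model',
    'acc_french_listeners', 'acc_english_listeners'."""
    per_contrast = df.groupby("contrast").mean(numeric_only=True)
    model_diff = (per_contrast["delta_french_model"]
                  - per_contrast["delta_english_model"])
    human_diff = (per_contrast["acc_french_listeners"]
                  - per_contrast["acc_english_listeners"])
    rho, _ = spearmanr(model_diff, human_diff)  # rank correlation
    return rho
```

On a score of this kind, a model that encodes its training language's phone inventory should show a large positive correlation, while a language-neutral model should show none.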
CPC does capture effects of native language on perception, but to a far lesser extent than our supervised reference. Our CPC model differs from the other models tested in its small size, in its causal architecture (wav2vec and HuBERT use transformers), and in that it does not use masking during training.

One possible explanation for the limitations we observe is insufficient training data: the models in question have generally shown good performance on downstream tasks when pre-trained on large amounts of data. We tested this using available pretrained wav2vec and HuBERT models trained on much larger amounts of data; the detailed results can be found in the appendix. These models show a slight improvement but, looking at the ρ statistic at the phone contrast level, they are still worse than MFCCs.

Contrary to previous results (Millet and Dunbar 2020a,b), our supervised reference system is quite good at predicting human discrimination behaviour, and clearly predicts a native language effect. The main differences in our experiment are the type of model (DeepSpeech), the type of training objective (phone recognition rather than prediction of orthographic text), and the size of the training corpora (we use less data). Predicting phones rather than orthography seems to be critical, as we demonstrate in the appendix.

Given the advantage supervised phone recognizers show, a different approach to developing more human-like representational spaces in self-supervised models might be the inclusion of tasks or constraints that push them to take longer time scales into account, encouraging them to construct longer, more phone-like units.
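For concreteness, the MFCC baseline and the contrast-level ρ statistic mentioned above could be computed along the following lines. This is a sketch: 13-coefficient MFCCs with first and second derivatives are a standard configuration, and the librosa settings and helper names are assumptions rather than the paper's exact setup.

```python
# Hedged sketch: MFCC baseline features and a phone-contrast-level
# Spearman rho between model deltas and human discrimination accuracy.
import librosa
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def mfcc_features(path: str) -> np.ndarray:
    """(n_frames, 39) array: 13 MFCCs plus first and second derivatives."""
    wav, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T  # one row per frame


def contrast_level_rho(deltas, accuracies, contrasts) -> float:
    """Spearman rho between per-contrast mean model delta and mean human
    accuracy; all arguments have one entry per stimulus triplet."""
    df = pd.DataFrame({"contrast": contrasts,
                       "delta": deltas,
                       "acc": accuracies})
    means = df.groupby("contrast").mean(numeric_only=True)
    rho, _ = spearmanr(means["delta"], means["acc"])
    return rho
```

Under this statistic, δ values computed over mfcc_features outputs stand in for a learned representation; the claim above is that even the large pretrained models do not reliably beat this baseline at the contrast level.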
Acknowledgements

This research was supported by the École Doctorale Frontières du Vivant (FdV) – Programme Bettencourt, by the Connaught Fund and the Arts and Science Tri-Council Bridging Fund, University of Toronto, and by French Agence Nationale de la Recherche grants ANR-17-CE28-0009 (GEOMPHON), ANR-11-IDFI-023 (IIFR), ANR-11-IDEX-0005 (USPC), ANR-10-LABX-0083 (EFL), ANR-17-EURE-0017 Frontcog, ANR-10-IDEX-0001-02 PSL*, ANR-19-P3IA-0001 PRAIRIE 3IA Institute. This work was performed using HPC resources from GENCI-IDRIS (Grant 20XX-AD011012415).

References
Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G.; et al. 2016. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, 173–182. PMLR.

Ardila, R.; Branson, M.; Davis, K.; Henretty, M.; Kohler, M.; Meyer, J.; Morais, R.; Saunders, L.; Tyers, F. M.; and Weber, G. 2019. Common Voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670.

Baevski, A.; Zhou, H.; Mohamed, A.; and Auli, M. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.

Cooke, M.; and Scharenborg, O. 2008. The Interspeech 2008 Consonant Challenge.

Dunbar, E.; Bernard, M.; Hamilakis, N.; Nguyen, T.; de Seyssel, M.; Rozé, P.; Rivière, M.; Kharitonov, E.; and Dupoux, E. 2021. The Zero Resource Speech Challenge 2021: Spoken language modelling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–1.

Feather, J.; Durango, A.; Gonzalez, R.; and McDermott, J. 2019. Metamers of neural networks reveal divergence from human perceptual systems. In NeurIPS, 10078–10089.

Gemmeke, J. F.; Ellis, D. P.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R. C.; Plakal, M.; and Ritter, M. 2017. Audio Set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 776–780. IEEE.

Hsu, W.-N.; Bolte, B.; Tsai, Y.-H. H.; Lakhotia, K.; Salakhutdinov, R.; and Mohamed, A. 2021a. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. arXiv preprint arXiv:2106.07447.

Hsu, W.-N.; Tsai, Y.-H. H.; Bolte, B.; Salakhutdinov, R.; and Mohamed, A. 2021b. HuBERT: How much can a bad teacher benefit ASR pre-training? In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6533–6537. IEEE.

Lakhotia, K.; Kharitonov, E.; Hsu, W.-N.; Adi, Y.; Polyak, A.; Bolte, B.; Nguyen, T.-A.; Copet, J.; Baevski, A.; Mohamed, A.; et al. 2021. Generative spoken language modeling from raw audio. arXiv preprint arXiv:2102.01192.

Levy, E. S. 2009. On the assimilation-discrimination relationship in American English adults' French vowel learning. The Journal of the Acoustical Society of America, 126(5): 2670–2682.

Matusevych, Y.; Schatz, T.; Kamper, H.; Feldman, N. H.; and Goldwater, S. 2020. Evaluating computational models of infant phonetic learning across languages. arXiv preprint arXiv:2008.02888.

Meyer, B. T.; Jürgens, T.; Wesker, T.; Brand, T.; and Kollmeier, B. 2010. Human phoneme recognition depending on speech-intrinsic variability. The Journal of the Acoustical Society of America, 128(5): 3126–3141.

Millet, J.; Chitoran, I.; and Dunbar, E. 2021. Predicting non-native speech perception using the Perceptual Assimilation Model and state-of-the-art acoustic models. In CoNLL 2021 Proceedings, 25th Conference on Computational Natural Language Learning.

Millet, J.; and Dunbar, E. 2020a. Perceptimatic: A human speech perception benchmark for unsupervised subword modelling. In 2020 Interspeech Conference Proceedings.

Millet, J.; and Dunbar, E. 2020b. The Perceptimatic English Benchmark for speech perception models. In 2020 CogSci Conference Proceedings.

Millet, J.; Jurov, N.; and Dunbar, E. 2019. Comparing unsupervised speech learning directly to human performance in speech perception. In CogSci 2019, 41st Annual Meeting of the Cognitive Science Society.

Oord, A. v. d.; Vinyals, O.; and Kavukcuoglu, K. 2017. Neural discrete representation learning. arXiv preprint arXiv:1711.00937.

Panayotov, V.; Chen, G.; Povey, D.; and Khudanpur, S. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206–5210. IEEE.

Park, D. S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E. D.; and Le, Q. V. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.

Rivière, M.; and Dupoux, E. 2021. Towards unsupervised learning of speech features in the wild. In 2021 IEEE Spoken Language Technology Workshop (SLT), 156–163. IEEE.

Riviere, M.; Joulin, A.; Mazaré, P.-E.; and Dupoux, E. 2020. Unsupervised pretraining transfers well across languages. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7414–7418. IEEE.

Scharenborg, O.; Tiesmeyer, S.; Hasegawa-Johnson, M.; and Dehak, N. 2018. Visualizing phoneme category adaptation in deep neural networks. In Interspeech, 1482–1486.

Schatz, T. 2016. ABX-Discriminability Measures and Applications. Ph.D. thesis, Université Paris 6 (UPMC).

Schatz, T.; and Feldman, N. H. 2018. Neural network vs. HMM speech recognition systems as models of human cross-linguistic phonetic perception. In Proceedings of the Conference on Cognitive Computational Neuroscience.

Schatz, T.; Feldman, N. H.; Goldwater, S.; Cao, X.-N.; and Dupoux, E. 2021. Early phonetic learning without phonetic categories: Insights from large-scale simulations on realistic input. Proceedings of the National Academy of Sciences, 118(7).

Wang, C.; Riviere, M.; Lee, A.; Wu, A.; Talnikar, C.; Haziza, D.; Williamson, M.; Pino, J.; and Dupoux, E. 2021. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 993–1003. Online: Association for Computational Linguistics.

Weerts, L.; Rosen, S.; Clopath, C.; and Goodman, D. F. 2021. The psychometrics of automatic speech recognition. bioRxiv.

Xu, Q.; Baevski, A.; Likhomanenko, T.; Tomasello, P.; Conneau, A.; Collobert, R.; Synnaeve, G.; and Auli, M. 2021. Self-training and pre-training are complementary for speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3030–3034. IEEE.

Yamada, R. A.; and Tohkura, Y. 1990. Perception and production of syllable-initial English /r/ and /l/ by native speakers of Japanese. In First International Conference on Spoken Language Processing.

Zhang, Y.; Qin, J.; Park, D. S.; Han, W.; Chiu, C.-C.; Pang, R.; Le, Q. V.; and Wu, Y. 2020. Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2010.10504.