Analysis and control techniques for spoken language applications
We will soon have a PhD position in analysis and control
techniques for deep-learning spoken language applications. This
vacancy is part of
the InDeep consortium
project. Open until filled.
Project description
Speech processing is increasingly
done via end-to-end rather than modular models: this makes it hard
to understand what is driving the model's decisions in general and
specifically why it fails when it does. In the context of
Automatic Speech Recognition (ASR), end-to-end typically means
that the waveform is the input and the transcription is the
output. Other types of end-to-end speech models span an even
larger distance between input and output:
- visually grounded models start with the speech signal and
end with visual-semantic embeddings;
- spoken language translation systems start with the speech
signal in the source language and end with the text in the
target language;
- spoken command understanding models start with the speech signal
and end with a structured representation of the information
conveyed in the utterance.
Given the opacity of such end-to-end models, it is desirable
to develop and test methods for analyzing the intermediate
representations they learn and for interpreting the decisions they
make. The objective of this work package is to develop and test
methods for manipulating the intermediate representations learned
by end-to-end speech-understanding models, so that users can debug
them, control them, and explain their output.
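One common family of analysis methods in this area is diagnostic (probing) classifiers trained on a model's intermediate layer activations. As a purely illustrative sketch, not part of the project itself, the snippet below builds a toy stack of nonlinear layers standing in for an end-to-end speech model, records every layer's activations, and trains a simple logistic-regression probe per layer to see at which depth a synthetic "phonological" property remains linearly decodable. All data, layer sizes, and the model are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_with_intermediates(x, weights):
    """Run a tanh layer stack, returning every layer's activations."""
    acts = []
    h = x
    for W in weights:
        h = np.tanh(h @ W)
        acts.append(h)
    return acts

# Synthetic data: 200 "utterance" frames, 32-dim, with a binary
# property (think: a phonological feature) tied to one input dimension.
n, d = 200, 32
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)

weights = [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4)]
layer_acts = forward_with_intermediates(X, weights)

def probe_accuracy(H, y, steps=500, lr=0.5):
    """Train a logistic-regression probe on activations H; report accuracy."""
    w = np.zeros(H.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(H @ w + b)))
        g = p - y
        w -= lr * H.T @ g / len(y)
        b -= lr * g.mean()
    return ((H @ w + b > 0).astype(int) == y).mean()

for i, H in enumerate(layer_acts, start=1):
    print(f"layer {i}: probe accuracy {probe_accuracy(H, y):.2f}")
```

In practice the same probing recipe is applied to activations extracted from real pretrained speech models (as in the layer-wise analysis reference below), with probes targeting properties such as phonemes, speaker identity, or word boundaries.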
References
- Chrupała, G., Higy, B., & Alishahi, A. (2020). Analyzing
analytical methods: The case of phonology in neural models of
spoken language. In Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics
(pp. 4146-4156). http://dx.doi.org/10.18653/v1/2020.acl-main.381
- Higy, B., Gelderloos, L., Alishahi, A., & Chrupała, G. (2021).
Discrete representations in neural models of spoken
language. In Proceedings of the Fourth BlackboxNLP Workshop on
Analyzing and Interpreting Neural Networks for NLP
(pp. 163-176). http://dx.doi.org/10.18653/v1/2021.blackboxnlp-1.11
- Pasad, A., Chou, J.-C., & Livescu, K. (2021). Layer-wise
analysis of a self-supervised speech representation model. In 2021
IEEE Automatic Speech Recognition and Understanding Workshop
(ASRU) (pp. 914-921). https://par.nsf.gov/biblio/10303839
- Peng, P., & Harwath, D. F. (2022). Word discovery in
visually grounded, self-supervised speech
models. https://arxiv.org/abs/2203.15081
- Chrupała, G. (2022). Visually grounded models of spoken
language: A survey of datasets, architectures and evaluation
techniques. Journal of Artificial Intelligence Research, 73,
673-707. https://doi.org/10.1613/jair.1.12967