Analysis and control techniques for spoken language applications
We will soon have a PhD position in analysis and control
techniques for deep-learning spoken language applications. This
vacancy is part of
the InDeep consortium
project. Open until filled.
Project description
Speech processing is increasingly
done via end-to-end rather than modular models: this makes it hard
to understand what is driving the model's decisions in general and
specifically why it fails when it does. In the context of
Automatic Speech Recognition (ASR), end-to-end typically means
that the waveform is the input and the transcription is the
output. Other types of end-to-end speech models span an even
larger distance between input and output:
- visually grounded models start with the speech signal and
end with visual-semantic embeddings;
- spoken language translation systems start with the speech
signal in the source language and end with the text in the
target language;
- spoken command understanding models start with the speech signal
and end with a structured representation of the information
conveyed in the utterance.
Given the opacity of such end-to-end models, it is desirable
to develop and test methods for analyzing the intermediate
representations they learn and for interpreting the decisions they
make. The objective of this work package is to develop and test
methods for manipulating the intermediate representations learned
by end-to-end speech-understanding models, so that users can debug
them, control them, and explain their output.
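One common family of analysis methods in this area is diagnostic (probing) classifiers trained on a model's intermediate layer activations. As a purely illustrative sketch, not part of the project itself, the snippet below builds a toy stack of nonlinear layers standing in for an end-to-end speech model, records every layer's activations, and trains a simple logistic-regression probe per layer to see at which depth a synthetic "phonological" property remains linearly decodable. All data, layer sizes, and the model are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_with_intermediates(x, weights):
    """Run a tanh layer stack, returning every layer's activations."""
    acts = []
    h = x
    for W in weights:
        h = np.tanh(h @ W)
        acts.append(h)
    return acts

# Synthetic data: 200 "utterance" frames, 32-dim, with a binary
# property (think: a phonological feature) tied to one input dimension.
n, d = 200, 32
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)

weights = [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4)]
layer_acts = forward_with_intermediates(X, weights)

def probe_accuracy(H, y, steps=500, lr=0.5):
    """Train a logistic-regression probe on activations H; report accuracy."""
    w = np.zeros(H.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(H @ w + b)))
        g = p - y
        w -= lr * H.T @ g / len(y)
        b -= lr * g.mean()
    return ((H @ w + b > 0).astype(int) == y).mean()

for i, H in enumerate(layer_acts, start=1):
    print(f"layer {i}: probe accuracy {probe_accuracy(H, y):.2f}")
```

In practice the same probing recipe is applied to activations extracted from real pretrained speech models (as in the layer-wise analysis reference below), with probes targeting properties such as phonemes, speaker identity, or word boundaries.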
References
- Chrupała, G., Higy, B., & Alishahi, A. (2020). Analyzing
analytical methods: The case of phonology in neural models of
spoken language. In Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics
(pp. 4146-4156). http://dx.doi.org/10.18653/v1/2020.acl-main.381
- Higy, B., Gelderloos, L., Alishahi, A., & Chrupała, G. (2021).
Discrete representations in neural models of spoken
language. In Proceedings of the Fourth BlackboxNLP Workshop on
Analyzing and Interpreting Neural Networks for NLP
(pp. 163-176). http://dx.doi.org/10.18653/v1/2021.blackboxnlp-1.11
- Pasad, A., Chou, J.-C., & Livescu, K. (2021). Layer-wise
analysis of a self-supervised speech representation model. In 2021
IEEE Automatic Speech Recognition and Understanding Workshop
(ASRU) (pp. 914-921). https://par.nsf.gov/biblio/10303839
- Peng, P., & Harwath, D. F. (2022). Word discovery in
visually grounded, self-supervised speech
models. https://arxiv.org/abs/2203.15081
- Chrupała, G. (2022). Visually grounded models of spoken
language: A survey of datasets, architectures and evaluation
techniques. Journal of Artificial Intelligence Research, 73,
673-707. https://doi.org/10.1613/jair.1.12967