new! Summary presentation from a machine learning perspective: Dupoux, E. (2022). Textless NLP: towards language processing from raw audio. LREC 22 (video lecture, 32min) (PDF).
Summary article from a cognitive science perspective: Dupoux, E. (2018). Cognitive Science in the era of Artificial Intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition. Associated presentation: (PDF)
Somewhat older (video lecture): Dupoux, E. (2009). How Do Infants Bootstrap into Spoken Language?: Models and Challenges. ICML, McGill, June 2009.
Lavechin, M., de Seyssel, M., Métais, M., Metze, F., Mohamed, A., Bredin, H., Dupoux, E. & Cristia, A. (2024). Modeling early phonetic acquisition from child-centered audio data. Cognition, 245, 105734. [abstract] Infants learn their native language(s) at an amazing speed. Before they even talk, their perception adapts to the language(s) they hear. However, the mechanisms responsible for this perceptual attunement and the circumstances in which it takes place remain unclear. This paper presents the first attempt to study perceptual attunement using ecological child-centered audio data. We show that a simple prediction algorithm exhibits perceptual attunement when applied on unrealistic clean audio-book data, but fails to do so when applied on ecologically-valid child-centered data. In the latter scenario, perceptual attunement only emerges when the prediction mechanism is supplemented with inductive biases that force the algorithm to focus exclusively on speech segments while learning speaker-, pitch-, and room-invariant representations. We argue these biases are plausible given previous research on infants and non-human animals. More generally, we show that what our model learns and how it develops through exposure to speech depends exquisitely on the details of the input signal. By doing so, we illustrate the importance of considering ecologically valid input data when modeling language acquisition.
de Seyssel, M., Lavechin, M., Titeux, H., Thomas, A., Virlet, G., Revilla, A.S., Wisniewski, G., Ludusan, B. & Dupoux, E. (2023). ProsAudit, a prosodic benchmark for self-supervised speech models. In INTERSPEECH-2023, (pp 2963-2967) . [abstract] We present ProsAudit, a benchmark in English to assess structural prosodic knowledge in self-supervised learning (SSL) speech models. It consists of two subtasks, their corresponding metrics, and an evaluation dataset. In the protosyntax task, the model must correctly identify strong versus weak prosodic boundaries. In the lexical task, the model needs to correctly distinguish between pauses inserted between words and within words. We also provide human evaluation scores on this benchmark. We evaluated a series of SSL models and found that they were all able to perform above chance on both tasks, even when evaluated on an unseen language. However, non-native models performed significantly worse than native ones on the lexical task, highlighting the importance of lexical knowledge in this task. We also found a clear effect of size with models trained on more data performing better in the two subtasks.
de Seyssel, M., Lavechin, M. & Dupoux, E. (2023). Realistic and broad-scope learning simulations: first results and challenges. Journal of Child Language. [abstract] There is a current 'theory crisis' in language acquisition research, resulting from fragmentation both at the level of the approaches and the linguistic level studied. We identify a need for integrative approaches that go beyond these limitations, and propose to analyse the strengths and weaknesses of current theoretical approaches to language acquisition. In particular, we advocate that language learning simulations, if they integrate realistic input and multiple levels of language, have the potential to contribute significantly to our understanding of language acquisition. We then review recent results obtained through such language learning simulations. Finally, we propose some guidelines for the community to build better simulations.
Taillandier, V., Hupkes, D., Sagot, B., Dupoux, E. & Michel, P. (2023). Neural Agents Struggle to Take Turns in Bidirectional Emergent Communication. In ICLR.
Sy, Y., Havard, W.N., Lavechin, M., Dupoux, E. & Cristia, A. (2023). Measuring language development from child-centered recordings. In Proceedings of INTERSPEECH, (pp 4618-4622) . [abstract] Standard ways to measure child language development from spontaneous corpora rely on detailed linguistic descriptions of a language as well as exhaustive transcriptions of the child's speech, which today can only be done through costly human labor. We tackle both issues by proposing (1) a new language development metric (based on entropy) that does not require linguistic knowledge other than having a corpus of text in the language in question to train a language model, (2) a method to derive this metric directly from speech based on a smaller text-speech parallel corpus. Here, we present descriptive results on an open archive including data from six English-learning children as a proof of concept. We document that our entropy metric captures a gradual convergence of children's speech towards adults' speech as a function of age, and that it also correlates moderately with lexical and morphosyntactic measures derived from morphologically-parsed transcriptions.
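To make the entropy metric in the entry above concrete, here is a minimal, hypothetical sketch: train a language model on adult text and score child utterances by their per-word cross-entropy under it. The toy bigram model, variable names, and the direction of interpretation below are illustrative assumptions only; the paper uses a proper language model and a speech-based variant.

```python
import math
from collections import Counter

def train_bigram(corpus, alpha=0.1):
    """Train an add-alpha smoothed bigram model on adult text (list of token lists)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    vocab_size = len(unigrams)
    def logprob(prev, word):
        return math.log((bigrams[(prev, word)] + alpha) /
                        (unigrams[prev] + alpha * vocab_size))
    return logprob

def utterance_entropy(logprob, sent):
    """Per-word cross-entropy (nats) of a child utterance under the adult-trained model."""
    toks = ["<s>"] + sent
    return -sum(logprob(p, w) for p, w in zip(toks, toks[1:])) / len(sent)

# Hypothetical toy data: adult corpus and one child utterance.
adult_corpus = [["the", "dog", "runs"], ["the", "cat", "sleeps"]]
child_utterance = ["the", "dog", "sleeps"]
lp = train_bigram(adult_corpus)
# One plausible reading: scores drift towards adult-like values as the child ages.
print(utterance_entropy(lp, child_utterance))
```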
Poli, M., Dupoux, E. & Riad, R. (2023). Introducing Topography in Convolutional Neural Networks. In Proc. of ICASSP, (pp 1--5) IEEE. [abstract] Parts of the brain that carry sensory tasks are organized topographically: nearby neurons are responsive to the same properties of input signals. Thus, in this work, inspired by the neuroscience literature, we proposed a new topographic inductive bias in Convolutional Neural Networks (CNNs). To achieve this, we introduced a new topographic loss and an efficient implementation to topographically organize each convolutional layer of any CNN. We benchmarked our new method on 4 datasets and 3 models in vision and audio tasks and showed equivalent performance to all benchmarks. Besides, we also showcased the generalizability of our topographic loss, which can be used with different topographic organizations in CNNs. Finally, we demonstrated that adding the topographic inductive bias made CNNs more resistant to pruning. Our approach provides a new avenue to obtain models that are more memory efficient while maintaining better accuracy.
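As a rough illustration of what a topographic penalty can look like, here is a minimal numpy sketch under the assumption that each channel of a convolutional layer is assigned a position on a 2D grid and that spatially close channels are encouraged to respond similarly; the actual loss and implementation proposed in the paper may differ.

```python
import numpy as np

def topographic_penalty(activations, grid_positions, sigma=1.0):
    """Penalize dissimilar responses between channels that are close on a 2D grid.

    activations: (batch, channels) pooled activations of one convolutional layer.
    grid_positions: (channels, 2) coordinates assigned to each channel.
    """
    # Pairwise similarity between channel response profiles (cosine over the batch).
    a = activations - activations.mean(axis=0, keepdims=True)
    a = a / (np.linalg.norm(a, axis=0, keepdims=True) + 1e-8)
    sim = a.T @ a                                               # (C, C)
    # Spatial proximity weights: large for channels that are nearby on the grid.
    d2 = ((grid_positions[:, None, :] - grid_positions[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    # Nearby channels with low similarity contribute most to the penalty.
    return float((w * (1.0 - sim)).mean())

acts = np.random.randn(32, 16)                                  # 32 examples, 16 channels
grid = np.stack(np.meshgrid(np.arange(4), np.arange(4)), -1).reshape(16, 2)
print(topographic_penalty(acts, grid))   # add to the task loss as a regularizer
```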
Nguyen, T.A., Kharitonov, E., Copet, J., Adi, Y., Hsu, W.N., Elkahky, A., Tomasello, P., Algayres, R., Sagot, B., Mohamed, A. & Dupoux, E. (2023). Generative Spoken Dialogue Language Modeling. Transactions of the Association for Computational Linguistics. [abstract] We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn-taking compared to a text-based cascaded model.
Nguyen, T.A., Hsu, W.N., d'Avirro, A., Shi, B., Gat, I., Fazel-Zarani, M., Remez, T., Copet, J., Synnaeve, G., Hassid, M. et al. (2023). Expresso: A benchmark and analysis of discrete expressive speech resynthesis. In INTERSPEECH-2023, (pp 4823-4827) . [abstract] Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthesis datasets are read, severely limiting spontaneity and expressivity. Here, we introduce EXPRESSO, a high-quality expressive speech dataset for textless speech synthesis that includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. We illustrate the challenges and potentials of this dataset with an expressive resynthesis benchmark where the task is to encode the input in low-bitrate units and resynthesize it in a target voice while preserving content and style. We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders, and explore tradeoffs between quality, bitrate and invariance to speaker and style. The dataset, evaluation metrics and baseline models are all open sourced.
Lavechin, M., Sy, Y., Titeux, H., Cruz Blandón, M.A., Räsänen, O., Bredin, H., Dupoux, E. & Cristia, A. (2023). BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models. In INTERSPEECH-2023, (pp 4588--4592) .
Lavechin, M., Métais, M., Titeux, H., Boissonnet, A., Copet, J., Rivière, M., Bergelson, E., Cristia, A., Dupoux, E. & Bredin, H. (2023). Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), (pp 1--7) . [abstract] Most automatic speech processing systems register degraded performance when applied to noisy or reverberant speech. But how can one tell whether speech is noisy or reverberant? We propose Brouhaha, a neural network jointly trained to extract speech/non-speech segments, speech-to-noise ratios, and C50 room acoustics from single-channel recordings. Brouhaha is trained using a data-driven approach in which noisy and reverberant audio segments are synthesized. We first evaluate its performance and demonstrate that the proposed multi-task regime is beneficial. We then present two scenarios illustrating how Brouhaha can be used on naturally noisy and reverberant data: 1) to investigate the errors made by a speaker diarization model (pyannote.audio); and 2) to assess the reliability of an automatic speech recognition model (Whisper from OpenAI). Both our pipeline and a pretrained model are open source and shared with the speech community.
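For intuition about the multi-task setup described above, here is a hedged sketch of a joint objective combining a frame-level speech/non-speech detection term with SNR and C50 regression terms; the actual Brouhaha loss, architecture and weighting are those of the paper and its pyannote-based implementation, not this toy function.

```python
import numpy as np

def multitask_loss(vad_logits, vad_labels, snr_pred, snr_true, c50_pred, c50_true,
                   w_vad=1.0, w_snr=1.0, w_c50=1.0):
    """Toy joint objective: frame-level speech detection plus SNR and C50 regression."""
    p = 1.0 / (1.0 + np.exp(-vad_logits))                        # sigmoid
    bce = -np.mean(vad_labels * np.log(p + 1e-8) +
                   (1 - vad_labels) * np.log(1 - p + 1e-8))      # speech/non-speech term
    mse_snr = np.mean((snr_pred - snr_true) ** 2)                # speech-to-noise ratio term
    mse_c50 = np.mean((c50_pred - c50_true) ** 2)                # room acoustics (C50) term
    return w_vad * bce + w_snr * mse_snr + w_c50 * mse_c50

rng = np.random.default_rng(0)
T = 100                                                          # frames in one recording
print(multitask_loss(rng.normal(size=T), rng.integers(0, 2, T).astype(float),
                     rng.normal(size=T), rng.normal(size=T),
                     rng.normal(size=T), rng.normal(size=T)))
```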
Hassid, M., Remez, T., Nguyen, T.A., Gat, I., Conneau, A., Kreuk, F., Copet, J., Defossez, A., Synnaeve, G., Dupoux, E., Schwartz, R. & Adi, Y. (2023). Textually pretrained speech language models. In NeurIPS, 36, (pp 63483--63501) . [abstract] Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from pretrained textual language models. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model and dataset scale both play an important role in constructing better-performing SpeechLMs. Based on our observations, we present the largest (to the best of our knowledge) SpeechLM both in terms of number of parameters and training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field. We make speech samples, code and models publicly available.
Hallap, M., Dupoux, E. & Dunbar, E. (2023). Evaluating context-invariance in unsupervised speech representations. In INTERSPEECH-2023, (pp 2973--2977) . [abstract] Unsupervised speech representations have taken off with benchmarks demonstrating major progress on semi-supervised speech recognition, speech synthesis, and speech-only language modelling. Inspiration comes from the promise of discovering the phonemes of a language or a similar low-bitrate encoding. However, one of the critical properties of phoneme transcriptions is context-invariance: the phonetic context of a speech sound can have massive influence on the way it is pronounced while text remains stable. This is why tokens of the same word have the same transcriptions---key to language understanding. Current benchmarks do not measure context-stability. We develop a new version of the ZeroSpeech ABX benchmark that does, and apply it to recent self-supervised representations. We show that context-independence of representations is predictive of the stability of word-level representations. We suggest research concentrate on improving context-independence of unsupervised representations.
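For readers unfamiliar with ABX discrimination scores, the following is a minimal sketch of the idea using mean-pooled embeddings and Euclidean distance; the ZeroSpeech ABX benchmark itself uses DTW-based distances over frame sequences and carefully controls for phonetic context and speaker, which this toy version does not.

```python
import numpy as np

def embed(frames):
    """Collapse a (T, D) frame sequence into one vector by mean pooling (toy choice)."""
    return frames.mean(axis=0)

def abx_score(cat_a, cat_b):
    """Fraction of (A, B, X) triples where X (same category as A, X != A) is closer to A."""
    correct, total = 0, 0
    for i, a in enumerate(cat_a):
        for b in cat_b:
            for j, x in enumerate(cat_a):
                if i == j:                      # X must be a different token than A
                    continue
                d_ax = np.linalg.norm(embed(a) - embed(x))
                d_bx = np.linalg.norm(embed(b) - embed(x))
                correct += int(d_ax < d_bx)
                total += 1
    return correct / total

rng = np.random.default_rng(0)
cat_a = [rng.normal(0.0, 1.0, (20, 8)) for _ in range(5)]  # tokens of phone category A
cat_b = [rng.normal(2.0, 1.0, (20, 8)) for _ in range(5)]  # tokens of phone category B
print(abx_score(cat_a, cat_b))                             # well above 0.5 for separable categories
```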
Gat, I., Kreuk, F., Nguyen, T.A., Lee, A., Copet, J., Synnaeve, G., Dupoux, E. & Adi, Y. (2023). Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling. In The 20th International Conference on Spoken Language Translation, (pp 465-477) . [abstract] Generative Spoken Language Modeling research focuses on optimizing speech Language Models (LMs) using raw audio recordings without accessing any textual supervision. Such speech LMs usually operate over discrete units obtained from quantizing internal representations of self-supervised models. Although such units show impressive modeling results, their robustness capabilities have not been extensively investigated. This work focuses on improving the invariance of discrete input representations to non-spoken augmentations for generative spoken language modeling. First, we formally define how to measure the robustness of such representations to various signal variations that do not alter the spoken information (e.g., time-stretch). Next, we empirically demonstrate how current state-of-the-art representation models lack robustness to such variations. To overcome this, we propose an effective and efficient method to learn invariant discrete speech representation for generative spoken language modeling. The proposed approach is based on applying a set of signal transformations to the speech signal and optimizing the model using an iterative pseudo-labeling scheme. Our method significantly improves over the evaluated baselines when considering encoding and modeling metrics. We additionally evaluate our method on the speech-to-speech translation task, considering Spanish-English and French-English translations, and show the proposed approach outperforms the evaluated baselines.
Elkahky, A., Hsu, W.N., Tomasello, P., Nguyen, T.A., Algayres, R., Adi, Y., Copet, J., Dupoux, E. & Mohamed, A. (2023). Do Coarser Units Benefit Cluster Prediction-Based Speech Pre-Training? In Proc. of ICASSP, (pp 1--5) . [abstract] The research community has produced many successful self-supervised speech representation learning methods over the past few years. Discrete units have been utilized in various self-supervised learning frameworks, such as VQ-VAE [1], wav2vec 2.0 [2], HuBERT [3], and Wav2Seq [4]. This paper studies the impact of altering the granularity and improving the quality of these discrete acoustic units for pre-training encoder-only and encoder-decoder models. We systematically study the current proposals of using Byte-Pair Encoding (BPE) and new extensions that use cluster smoothing and Brown clustering. The quality of learned units is studied intrinsically using zero speech metrics and on the downstream speech recognition (ASR) task. Our results suggest that longer-range units are helpful for encoder-decoder pre-training; however, encoder-only masked-prediction models cannot yet benefit from self-supervised word-like targets.
Bernard, M., Poli, M., Karadayi, J. & Dupoux, E. (2023). Shennong: a Python toolbox for audio speech features extraction. Behavior Research Methods. [abstract] We introduce Shennong, a Python toolbox and command-line utility for audio speech features extraction. It implements a wide range of well-established state-of-the-art algorithms: spectro-temporal filters such as Mel-Frequency Cepstral Filterbank or Predictive Linear Filters, pre-trained neural networks, pitch estimators, speaker normalization methods, and post-processing algorithms. Shennong is an open source, reliable and extensible framework built on top of the popular Kaldi speech processing library. The Python implementation makes it easy to use by non-technical users and integrates with third-party speech modeling and machine learning tools from the Python ecosystem. This paper describes the Shennong software architecture, its core components, and implemented algorithms. Then, three applications illustrate its use. We first present a benchmark of speech features extraction algorithms available in Shennong on a phone discrimination task. We then analyze the performances of a speaker normalization model as a function of the speech duration used for training. We finally compare pitch estimation algorithms on speech under various noise conditions.
Algayres, R., Adi, Y., Nguyen, T., Copet, J., Synnaeve, G., Sagot, B. & Dupoux, E. (2023). Generative Spoken Language Model based on continuous word-sized audio tokens. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, (pp 3008--3028) . [abstract] In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard inputs of spoken LMs are 20ms or 40ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LMs, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio tokens that can generate diverse and expressive language output. This is obtained by replacing the lookup table for lexical types with a Lexical Embedding function, the cross-entropy loss by a contrastive loss, and multinomial sampling by k-NN sampling. The resulting model is the first generative language model based on word-size continuous tokens. Its performance is on par with discrete unit GSLMs regarding generation quality as measured by automatic metrics and subjective human judgements. Moreover, it is five times more memory efficient thanks to its large 200ms units. In addition, the embeddings before and after the Lexical Embedder are phonetically and semantically interpretable.
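The k-NN sampling step mentioned above can be pictured with the following sketch: given the continuous vector predicted for the next word-sized token, sample among its k nearest neighbours in an embedding table. The table, distance and temperature below are illustrative assumptions; the paper's Lexical Embedder and training objective are not reproduced here.

```python
import numpy as np

def knn_sample(predicted, token_embeddings, k=5, temperature=1.0, rng=None):
    """Sample the next token among the k nearest neighbours of a predicted continuous vector.

    predicted: (D,) vector output by the language model for the next word-sized token.
    token_embeddings: (N, D) table of stored token embeddings (hypothetical stand-in).
    """
    rng = rng or np.random.default_rng()
    dists = np.linalg.norm(token_embeddings - predicted, axis=1)
    nearest = np.argsort(dists)[:k]
    # Softmax over negative distances: closer neighbours are more likely to be sampled.
    logits = -dists[nearest] / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(nearest, p=probs))

table = np.random.randn(1000, 64)          # hypothetical table of 1000 stored token embeddings
prediction = np.random.randn(64)           # model prediction for the next token
print(knn_sample(prediction, table, k=10))
```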
de Seyssel, M., Wisniewski, G., Dupoux, E. & Ludusan, B. (2022). Investigating the usefulness of i-vectors for automatic language characterization. In Proceedings of Speech Prosody, (pp 460-464) . [abstract] Work done in recent years has shown the usefulness of using automatic methods for the study of linguistic typology. However, the majority of proposed approaches come from natural language processing and require expert knowledge to predict typological information for new languages. An alternative would be to use speech-based methods that do not need extensive linguistic annotations, but considerably less work has been done in this direction. The current study aims to reduce this gap, by investigating a promising speech representation, i-vectors, which by capturing suprasegmental features of language, can be used for the automatic characterization of languages. Employing data from 24 languages, covering several linguistic families, we computed the i-vectors corresponding to each sentence and we represented the languages by their centroid i-vector. Analyzing the distance between the language centroids and phonological, inventory and syntactic distances between the same languages, we observed a significant correlation between the i-vector distance and the syntactic distance. Then, we explored in more detail a number of syntactic features and we proposed a method for predicting the value of the most promising feature, based on the i-vector information. The obtained results, an 87% classification accuracy, are encouraging and we envision extending this method further.
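The analysis pipeline described above reduces to three steps that can be sketched as follows with made-up data (the study uses i-vectors from 24 languages and typological distances from existing databases): average each language's sentence-level i-vectors into a centroid, compute pairwise centroid distances, and correlate them with independent linguistic distances.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical data: sentence-level i-vectors for 4 languages (the study uses 24).
ivectors = {lang: rng.normal(i, 1.0, size=(50, 100))
            for i, lang in enumerate(["lang_a", "lang_b", "lang_c", "lang_d"])}

# 1. Represent each language by the centroid of its sentence i-vectors.
centroids = np.stack([ivectors[lang].mean(axis=0) for lang in sorted(ivectors)])

# 2. Pairwise distances between language centroids (condensed vector of pairs).
ivector_dist = pdist(centroids, metric="cosine")

# 3. Correlate with an independently obtained linguistic distance matrix (random here).
syntactic_dist = rng.random(ivector_dist.shape)
rho, pval = spearmanr(ivector_dist, syntactic_dist)
print(f"Spearman rho = {rho:.2f}, p = {pval:.3f}")
```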
de Seyssel, M., Wisniewski, G. & Dupoux, E. (2022). Is the Language Familiarity Effect gradual? A computational modelling approach. In J. Culbertson et al. (Eds.) Proceedings of Cognitive Science, (pp 1728-1735) . [abstract] According to the Language Familiarity Effect (LFE), people are better at discriminating between speakers of their native language. Although this cognitive effect has been widely studied in the literature, experiments have only been conducted on a limited number of language pairs and their results only show the presence of the effect without yielding a gradual measure that may vary across language pairs. In this work, we show that the computational model of LFE introduced by Thorburn, Feldman, and Schatz (2019) can address these two limitations. In a first experiment, we attest to this model's capacity to obtain a gradual measure of the LFE by replicating behavioural findings on native and accented speech. In a second experiment, we evaluate LFE on a large number of language pairs, including many which have never been tested on humans. We show that the effect is replicated across a wide array of languages, providing further evidence of its universality. Building on the gradual measure of LFE, we also show that languages belonging to the same family yield smaller scores, supporting the idea of an effect of language distance on LFE.
de Seyssel, M., Lavechin, M., Adi, Y., Dupoux, E. & Wisniewski, G. (2022). Probing phoneme, language and speaker information in unsupervised speech representations. In INTERSPEECH, (pp 1402-1406) . [abstract] Unsupervised models of representations based on Contrastive Predictive Coding (CPC) [1] are primarily used in spoken language modelling in that they encode phonetic information. In this study, we ask what other types of information are present in CPC speech representations. We focus on three categories: phone class, gender and language, and compare monolingual and bilingual models. Using qualitative and quantitative tools, we find that both gender and phone class information are present in both types of models. Language information, however, is very salient in the bilingual model only, suggesting CPC models learn to discriminate languages when trained on multiple languages. Some language information can also be retrieved from monolingual models, but it is more diffused across all features. These patterns hold when analyses are carried on the discrete units from a downstream clustering model. However, although there is no effect of the number of target clusters on phone class and language information, more gender information is encoded with more clusters. Finally, we find that there is some cost to being exposed to two languages on a downstream phoneme discrimination task.
Tomasello, P., Shrivastava, A., Lazar, D., Le, D., Sagar, A., Elkahky, A., Copet, J., Hsu, W.N., Adi, Y., Algayres, R., Nguyen, T.A., Dupoux, E., Zettlemoyer, L. & Mohamed, A. (2022). STOP: A dataset for Spoken Task Oriented Semantic Parsing. In IEEE SLT-2022, (pp 991-998) . [abstract] End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model. It promises to improve the performance of assistant systems by leveraging acoustic information lost in the intermediate textual representation and preventing cascading errors from Automatic Speech Recognition (ASR). Further, having one unified model has efficiency advantages when deploying assistant systems on-device. Unfortunately, the limited number of public audio datasets with semantic parse labels hinders the research progress in this area. In this paper, we release the Spoken task-oriented semantic parsing (STOP) dataset, the largest and most complex SLU dataset to be publicly available. Additionally, we define low-resource splits to establish a benchmark for improving SLU when limited labeled data is available. Furthermore, in addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
Rita, M., Tallec, C., Michel, P., Grill, J.B., Pietquin, O., Dupoux, E. & Strub, F. (2022). Emergent communication: generalization and overfitting in Lewis games. In NeurIPS. [abstract] Lewis signaling games are a class of simple communication games for simulating the emergence of language. In these games, two agents must agree on a communication protocol in order to solve a cooperative task. Previous work has shown that agents trained to play this game with reinforcement learning tend to develop languages that display undesirable properties from a linguistic point of view (lack of generalization, lack of compositionality, etc). In this paper, we aim to provide better understanding of this phenomenon by analytically studying the learning problem in Lewis games. As a core contribution, we demonstrate that the standard objective in Lewis games can be decomposed in two components: a co-adaptation loss and an information loss. This decomposition enables us to surface two potential sources of overfitting, which we show may undermine the emergence of a structured communication protocol. In particular, when we control for overfitting on the co-adaptation loss, we recover desired properties in the emergent languages: they are more compositional and generalize better.
Rita, M., Strub, F., Grill, J.B., Pietquin, O. & Dupoux, E. (2022). On the role of population heterogeneity in emergent communication. In ICLR. [abstract] Populations have often been perceived as a structuring component for language to emerge and evolve: the larger the population, the more structured the language. While this observation is widespread in the sociolinguistic literature, it has not been consistently reproduced in computer simulations with neural agents. In this paper, we thus aim to clarify this apparent contradiction. We explore emergent language properties by varying agent population size in the speaker-listener Lewis Game. After reproducing the experimental difference, we challenge the simulation assumption that the agent community is homogeneous. We then investigate how speaker-listener asymmetry alters language structure through the analysis of a potential diversity factor: learning speed. We then leverage this observation to control population heterogeneity without introducing confounding factors. We finally show that introducing such training speed heterogeneities naturally sorts out the initial contradiction: larger simulated communities start developing more stable and structured languages.
Riad, R. (2022). Automatic speech and language processing for precision medicine in Huntington's disease. (Unpublished doctoral dissertation) Ecole normale supérieure - ENS PARIS. [abstract] Neurodegenerative diseases are a major social issue and public health priority worldwide. Huntington Disease (HD) is a rare disease of genetic origin that causes cognitive, behavioural and motor disorders due to brain lesions, in particular in the striatum. People with the genetic mutation of HD have a pre-symptomatic phase of several decades during which they have no neurological disorder before the symptomatic phase occurs. The symptoms of this disease have many implications in the life activities of the patient, with a gradual loss of autonomy, until the death of the patient. This makes HD a potential model of neurodegenerative diseases that could lead to the development of new clinical monitoring tools. The current medical monitoring in HD is expensive and requires the patient to travel regularly to the hospital, generating a significant human and financial burden. The purpose of this thesis is to develop and validate new computational methods for automatically monitoring Huntington's Disease individuals, thanks to the analysis of their spoken language productions. Spoken language production invokes various cognitive, social and motor skills, and its realisation is influenced by the mental state of the individual. Our hypothesis is that through the inspection of the produced speech and its content we can assess these different skills and states. To date, the analysis of spoken language disorders in HD is only performed in a few clinical departments and specialised research teams, at a small scale without classic clinical validation. In addition, the potential of spoken language markers to predict the different symptoms in HD has not been explored. Therefore in this thesis, we designed a comprehensive spoken language battery, along with a complete annotation protocol that is parsable by a computer program. This battery measures different parameters to obtain a wide clinical picture of spoken language in HD, that varies the linguistic target, the cognitive load, the emotional content, the topics and the materials of the discourse. To speed up the annotation protocol, we designed and developed open-source software to manage linguistic annotation campaigns. This allowed us to collect what is, to the best of our knowledge, the largest database of fine-grained annotated spoken language productions in HD, with 125 annotated interviews of 3 groups of individuals: healthy controls, premanifest individuals carrying the gene that causes HD and manifest HD at different stages. Besides, we also formalized and implemented the tracks of communication introduced by H. Clark, which allow analyzing the use of spoken language in spontaneous exchanges for HD individuals. Then, to speed up and automate the annotation process, we developed and validated machine learning methods to recognise turn-takings and identify these tracks of communication directly from speech. Finally, thanks to this new database, we assessed the capabilities of spoken language markers to predict the different symptoms in HD. We especially found out that rhythm and articulatory markers extracted from tasks with a cognitive load can predict accurately the global, motor, functional and cognitive components of the disease.
We additionally found significant correlations between silence statistics and the volume of the striatum, the neuro-anatomical hallmark of disease progression. In spontaneous productions, we found that the ratio of tracks of communication was different between HD individuals and other groups. The primary track was diminished, the timing ratio of secondary presentation (filled pauses) also decreased and the timing of incidental elements (e.g., vocal noises, audible respiration) greatly increased. We also proposed new methodologies to examine the emotional speech production in HD. Finally, we found out that the manifest individuals with HD have both vocal and linguistic impairments during emotional speech production.
Riad, R., Titeux, H., Lemoine, L., Montillot, J., Sliwinski, A., Xuan-Nga, C., Bachoud-Lévi, A.C. & Dupoux, E. (2022). A comparison study on patient-psychologist voice diarization. In Ninth Workshop on Speech and Language Processing for Assistive Technologies (SLPAT-2022), (pp 30--36) . [abstract] Conversations between a clinician and a patient, in natural conditions, are valuable sources of information for medical follow-up. The automatic analysis of these dialogues could help extract new language markers and speed up the clinicians' reports. Yet, it is not clear which model is the most efficient to detect and identify the speaker turns, especially for individuals with speech disorders. Here, we proposed a split of the data that allows conducting a comparative evaluation of different diarization methods. We designed and trained end-to-end neural network architectures to directly tackle this task from the raw signal and evaluate each approach under the same metric. We also studied the effect of fine-tuning models to find the best performance. Experimental results are reported on naturalistic clinical conversations between Psychologists and Interviewees, at different stages of Huntington's disease, displaying a large panel of speech disorders. We found out that our best end-to-end model achieved 19.5 % IER on the test set, compared to 23.6% achieved by the finetuning of the X-vector architecture. Finally, we observed that we could extract clinical markers directly from the automatic systems, highlighting the clinical relevance of our methods.
Riad, R., Lunven, M., Titeux, H., Xuan-Nga, C., Hamet Bagnou, J., Lemoine, L., Montillot, J., Sliwinski, A., Youssov, K., Cleret de Langavant, L., Dupoux, E. & Bachoud-Lévi, A.C. (2022). Predicting clinical scores in Huntington's disease: a lightweight speech test. Journal of Neurology, 269, 5008--5021. [abstract] Objectives Using brief samples of speech recordings, we aimed at predicting, through machine learning, the clinical performance in Huntington's Disease (HD), an inherited Neurodegenerative disease (NDD). Methods We collected and analyzed 126 samples of audio recordings of both forward and backward counting from 103 Huntington's disease gene carriers [87 manifest and 16 premanifest; mean age 50.6 (SD 11.2), range (27--88) years] from three multicenter prospective studies in France and Belgium (MIG-HD (ClinicalTrials.gov NCT00190450); BIO-HD (ClinicalTrials.gov NCT00190450) and Repair-HD (ClinicalTrials.gov NCT00190450)). We pre-registered all of our methods before running any analyses, in order to avoid inflated results. We automatically extracted 60 speech features from blindly annotated samples. We used machine learning models to combine multiple speech features in order to make predictions at individual levels of the clinical markers. We trained machine learning models on 86% of the samples, the remaining 14% constituted the independent test set. We combined speech features with demographics variables (age, sex, CAG repeats, and burden score) to predict cognitive, motor, and functional scores of the Unified Huntington's disease rating scale. We provided correlation between speech variables and striatal volumes. Results Speech features combined with demographics allowed the prediction of the individual cognitive, motor, and functional scores with a relative error from 12.7 to 20.0%, which is better than predictions using demographics and genetic information. Both mean and standard deviation of pause durations during backward recitation and clinical scores correlated with striatal atrophy (Spearman 0.6 and 0.5--0.6, respectively). Interpretation Brief and examiner-free speech recording and analysis may become in the future an efficient method for remote evaluation of the individual condition in HD and likely in other NDDs.
Nguyen, T.A., Sagot, B. & Dupoux, E. (2022). Are discrete units necessary for Spoken Language Modeling? IEEE Journal of Selected Topics in Signal Processing, 16(6), 1415 -- 1423. [abstract] Recent work in spoken language modeling shows the possibility of learning a language unsupervisedly from raw audio without any text labels. The approach relies first on transforming the audio into a sequence of discrete units (or pseudo-text) and then training a language model directly on such pseudo-text. Is such a discrete bottleneck necessary, potentially introducing irreversible errors in the encoding of the speech signal, or could we learn a language model without discrete units at all? In this work, we show that discretization is indeed essential for good results in spoken language modeling, but that we can omit the discrete bottleneck if we use discrete target features from a higher level than the input features. We also show that an end-to-end model trained with discrete targets like HuBERT achieves similar results as the best language model trained on pseudo-text on a set of zero-shot spoken language modeling metrics from the Zero Resource Speech Challenge 2021.
Millet, J., Caucheteux, C., Orhan, P., Boubenec, Y., Gramfort, A., Pallier, C.C., Dunbar, E. & King, J.R. (2022). Toward a realistic model of speech processing in the brain with self-supervised learning. In NeurIPS 2022 - 36th Conference on Neural Information Processing Systems. [abstract] Several deep neural networks have recently been shown to generate activations similar to those of the brain in response to the same input. These algorithms, however, remain largely implausible: they require extraordinarily large amounts of data, unobtainable supervised labels, textual rather than raw sensory input, and/or implausibly large memory (e.g. thousands of contextual words). These elements highlight the need to identify algorithms that, under these limitations, would suffice to account for both behavioral and brain responses. Focusing on speech processing, we here hypothesize that self-supervised algorithms trained on the raw waveform constitute a promising candidate. Specifically, we compare a recent self-supervised model, wav2vec 2.0, to the brain activity of 412 English, French, and Mandarin individuals recorded with functional Magnetic Resonance Imaging (fMRI), while they listened to approximately one hour of audio books. First, we show that this algorithm learns brain-like representations with as little as 600 hours of unlabelled speech, a quantity comparable to what infants can be exposed to during language acquisition. Second, its functional hierarchy aligns with the cortical hierarchy of speech processing. Third, different training regimes reveal a functional specialization akin to the cortex: wav2vec 2.0 learns sound-generic, speech-specific and language-specific representations similar to those of the prefrontal and temporal cortices. Fourth, we confirm the similarity of this specialization with the behavior of 386 additional participants. These elements, resulting from the largest neuroimaging benchmark to date, show how self-supervised learning can account for a rich organization of speech processing in the brain, and thus delineate a path to identify the laws of language acquisition which shape the human brain.
Ludusan, B., Cristia, A., Mazuka, R. & Dupoux, E. (2022). How much does prosody help word segmentation? A simulation study on infant-directed speech. Cognition, 219, 104961. [abstract] Infants come to learn several hundreds of word forms by two years of age, and it is possible this involves carving these forms out from continuous speech. It has been proposed that the task is facilitated by the presence of prosodic boundaries. We revisit this claim by running computational models of word segmentation, with and without prosodic information, on a corpus of infant-directed speech. We use five cognitively-based algorithms, which vary in whether they employ a sub-lexical or a lexical segmentation strategy and whether they are simple heuristics or embody an ideal learner. Results show that providing expert-annotated prosodic breaks does not uniformly help all segmentation models. The sub-lexical algorithms, which perform more poorly, benefit most, while the lexical ones show a very small gain. Moreover, when prosodic information is derived automatically from the acoustic cues infants are known to be sensitive to, errors in the detection of the boundaries lead to smaller positive effects, and even negative ones for some algorithms. This shows that even though infants could potentially use prosodic breaks, it does not necessarily follow that they should incorporate prosody into their segmentation strategies, when confronted with realistic signals.
Lavechin, M., de Seyssel, M., Gautheron, L., Dupoux, E. & Cristia, A. (2022). Reverse engineering language acquisition with child-centered long-form recordings. Annual Review of Linguistics, 8, 389-407.
Kreuk, F., Polyak, A., Copet, J., Kharitonov, E., Nguyen, T.A., Rivière, M., Hsu, W.N., Mohamed, A., Dupoux, E. & Adi, Y. (2022). Textless Speech Emotion Conversion using Decomposed and Discrete Representations. In Proceedings of EMNLP, (pp 11200 - 11214) . [abstract] Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion. First, we modify the speech content by translating the phonetic-content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows us to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc. We demonstrate objectively and subjectively that the proposed method is vastly superior to current approaches and even beats text-based systems in terms of perceived emotion and audio quality. We rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method. Samples are available at https://speechbot.github.io/emotion/
Kharitonov, E., Lee, A., Polyak, A., Adi, Y., Copet, J., Lakhotia, K., Nguyen, T.A., Rivière, M., Mohamed, A., Dupoux, E. & Hsu, W.N. (2022). Text-Free Prosody-Aware Generative Spoken Language Modeling. In ACL, (pp 8666-8681) . [abstract] Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences. Unfortunately, despite eliminating the need of text, the units used in GSLM discard most of the prosodic information. Hence, GSLM fails to leverage prosody for better comprehension, and does not generate expressive speech. In this work, we present a prosody-aware generative spoken language model (pGSLM). It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms. We devise a series of metrics for prosody modeling and generation, and re-use metrics from GSLM for content modeling. Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt. Audio samples can be found at this https URL.
Kharitonov, E., Copet, J., Lakhotia, K., Nguyen, T.A., Tomasello, P., Lee, A., Elkahky, A., Hsu, W.N., Mohamed, A., Dupoux, E. & Adi, Y. (2022). textless-lib: a Library for Textless Spoken Language Processing. In NAACL: System Demonstrations, (pp 1-9) . [abstract] Textless spoken language processing research aims to extend the applicability of the standard NLP toolset onto spoken language and languages with few or no textual resources. In this paper, we introduce textless-lib, a PyTorch-based library aimed at facilitating research in this area. We describe the building blocks that the library provides and demonstrate its usability by discussing three different use-case examples: (i) speaker probing, (ii) speech resynthesis and compression, and (iii) speech continuation. We believe that textless-lib substantially simplifies research in the textless setting and will be handy not only for speech researchers but also for the NLP community at large. The code, documentation, and pre-trained models are available at https://github.com/facebookresearch/textlesslib/
Gallezot, C., Riad, R., Titeux, H., Lemoine, L., Montillot, J., Sliwinski, A., Bagnou Hamet, J., Cao, X.N., Youssov, K., Dupoux, E. & Bachoud-Lévi, A.C. (2022). Emotion expression through spoken language in Huntington disease. Cortex, 155, 150-161. [abstract] Patients with Huntington's disease suffer from disturbances in the perception of emotions; they do not correctly read the body, vocal and facial expressions of others. With regard to the expression of emotions, it has been shown that they are impaired in expressing emotions through the face, but up until now, little research has been conducted on their ability to express emotions through spoken language. To better understand emotion production in both voice and language in Huntington's Disease (HD), we tested 115 individuals: 68 patients (HD), 22 participants carrying the mutant HD gene without any motor symptoms (pre-manifest HD), and 25 controls in a single-centre prospective observational follow-up study. Participants were recorded in interviews in which they were asked to recall sad, angry, happy, and neutral stories. Emotion expression through voice and language was investigated by comparing the identifiability of emotions expressed by controls, preHD and HD patients in these interviews. To assess separately vocal and linguistic expression of emotions in a blind design, we used machine learning models instead of a human jury performing a forced-choice recognition test. Results from this study showed that patients with HD had difficulty expressing emotions through both voice and language compared to preHD participants and controls, who behaved similarly and above chance. In addition, we did not find any differences in expression of emotions between preHD and healthy controls. We further validated our newly proposed methodology with a human jury on the speech produced by the controls. These results are consistent with the hypothesis that emotional deficits in HD are caused by impaired sensori-motor representations of emotions, in line with embodied cognition theories. This study also shows how machine learning models can be leveraged to assess emotion expression in a blind and reproducible way.
Dunbar, E., Hamilakis, N. & Dupoux, E. (2022). Self-supervised language learning from raw audio: Lessons from the Zero Resource Speech Challenge. IEEE Journal of Selected Topics in Signal Processing, 16(6), 1211 - 1226. [abstract] Recent progress in self-supervised or unsupervised machine learning has opened the possibility of building a full speech processing system from raw audio without using any textual representations or expert labels such as phonemes, dictionaries or parse trees. The contribution of the Zero Resource Speech Challenge series since 2015 has been to break down this long-term objective into four well-defined tasks---Acoustic Unit Discovery, Spoken Term Discovery, Discrete Resynthesis, and Spoken Language Modeling---and introduce associated metrics and benchmarks enabling model comparison and cumulative progress. We present an overview of the six editions of this challenge series since 2015, discuss the lessons learned, and outline the areas which need more work or give puzzling results.
Algayres, R., Ricoul, T., Karadayi, J., Laurençon, H., Zaiem, S., Mohamed, A., Sagot, B. & Dupoux, E. (2022). DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon. Transactions of the Association for Computational Linguistics, 10, 1051--1065. [abstract] Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation (Goldwater et al., 2006, 2009) use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark.
Algayres, R., Nabli, A., Sagot, B. & Dupoux, E. (2022). Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning. In INTERSPEECH-2022, (pp 2123-2127) . [abstract] We introduce a simple neural encoder architecture that can be trained using an unsupervised contrastive learning objective which gets its positive samples from data-augmented k-Nearest Neighbors search. We show that when built on top of recent self-supervised audio representations [1, 2, 3], this method can be applied iteratively and yield competitive SSE as evaluated on two tasks: query-by-example of random sequences of speech, and spoken term discovery. On both tasks our method pushes the state-of-the-art by a significant margin across 5 different languages. Finally, we establish a benchmark on a query-by-example task on the LibriSpeech dataset to monitor future improvements in the field.
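As a rough illustration of contrastive learning with nearest-neighbour positives, here is a minimal numpy sketch in which each segment's positive is its nearest other neighbour in the batch and the remaining items serve as negatives; the paper additionally uses data augmentation and an iterative procedure, which are omitted here.

```python
import numpy as np

def knn_contrastive_loss(embeddings, temperature=0.1):
    """InfoNCE-style loss where each item's positive is its nearest (other) neighbour.

    embeddings: (N, D) L2-normalised speech sequence embeddings for one batch.
    """
    sim = embeddings @ embeddings.T / temperature          # (N, N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                         # an item cannot be its own positive
    positives = sim.argmax(axis=1)                         # nearest-neighbour index per item
    # Cross-entropy of picking the positive among all other items in the batch.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - sim[np.arange(len(sim)), positives]))

rng = np.random.default_rng(0)
batch = rng.normal(size=(16, 32))
batch /= np.linalg.norm(batch, axis=1, keepdims=True)
print(knn_contrastive_loss(batch))
```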
Wang, C., Rivière, M., Lee, A., Wu, A., Talnikar, C., Haziza, D., Williamson, M., Pino, J. & Dupoux, E. (2021). VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation. In Proceedings of ACL, (pp 993--1003) . [abstract] We introduce VoxPopuli, a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16 languages and their aligned oral interpretations into 5 other languages totaling 5.1K hours. We provide speech recognition baselines and validate the versatility of VoxPopuli unlabelled data in semi-supervised learning under challenging out-of-domain settings. We will release the corpus at https://github.com/facebookresearch/voxpopuli under an open license.
Tsuji, S., Cristia, A. & Dupoux, E. (2021). SCALa: A blueprint for computational models of language acquisition in social context. Cognition, 213, 104779. [abstract] Theories and data on language acquisition suggest a range of cues are used, ranging from information on structure found in the linguistic signal itself, to information gleaned from the environmental context or through social interaction. We propose a blueprint for computational models of the early language learner (SCALa, for Socio-Computational Architecture of Language Acquisition) that makes explicit the connection between the kinds of information available to the social learner and the computational mechanisms required to extract language-relevant information and learn from it. SCALa integrates a range of views on language acquisition, further allowing us to make precise recommendations for future large-scale empirical research.
Riochet, R., Ynocente Castro, M., Bernard, M., Lerer, A., Fergus, R., Izard, V. & Dupoux, E. (2021). IntPhys: A Framework and Benchmark for Visual Intuitive Physics Understanding. Transactions on Pattern Analysis and Machine Intelligence, 44(9), 5016 - 5025. [abstract] In order to reach human performance on complex visual tasks, artificial systems need to incorporate a significant amount of understanding of the world in terms of macroscopic objects, movements, forces, etc. Inspired by work on intuitive physics in infants, we propose an evaluation benchmark which diagnoses how much a given system understands about physics by testing whether it can tell apart well matched videos of possible versus impossible events constructed with a game engine. The test requires systems to compute a physical plausibility score over an entire video. To prevent perceptual biases, the dataset is made of pixel matched quadruplets of videos, forcing systems to focus on high level temporal dependencies between frames rather than pixel-level details. We then describe two Deep Neural Network systems aimed at learning intuitive physics in an unsupervised way, using only physically possible videos. The systems are trained with a future semantic mask prediction objective and tested on the possible versus impossible discrimination task. The analysis of their results compared to human data gives novel insights into the potential and limitations of next frame prediction architectures.
Riad, R., Karadayi, J., Bachoud-Lévi, A.C. & Dupoux, E. (2021). Learning spectro-temporal representations of complex sounds with parameterized neural networks. Journal of the Acoustical Society of America, 150(1), 353--366. [abstract] Deep Learning models have become potential candidates for auditory neuroscience research, thanks to their recent successes on a variety of auditory tasks. Yet, these models often lack interpretability to fully understand the exact computations that have been performed. Here, we proposed a parametrized neural network layer that computes specific spectro-temporal modulations based on Gabor kernels (Learnable STRFs) and that is fully interpretable. We evaluated the predictive capabilities of this layer on Speech Activity Detection, Speaker Verification, Urban Sound Classification and Zebra Finch Call Type Classification. We found out that models based on this learnable parametrized neural network are on par for all tasks with the different toplines, and obtain the best performance for Speech Activity Detection. As this layer is fully interpretable, we used quantitative measures to describe the distribution of the learned spectro-temporal modulations. The filters adapted to each task and focused mostly on low temporal and spectral modulations. The analyses show that the filters learned on human speech have similar spectro-temporal parameters as the ones measured directly in the human auditory cortex. Finally, equipped with the Sinkhorn distance to compare the learned STRF distributions, we observed that the tasks were organized in a meaningful way: the human vocalization tasks were closer to each other, and bird vocalizations were far away from human vocalizations and urban sound tasks.
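The Gabor-parameterized filters can be pictured with a short sketch that builds a generic spectro-temporal Gabor kernel (a Gaussian envelope times a plane wave over time and frequency); the exact parameterization of the learnable STRF layer in the paper may differ.

```python
import numpy as np

def gabor_strf(n_time, n_freq, temporal_mod, spectral_mod, sigma_t, sigma_f, phase=0.0):
    """Build a spectro-temporal Gabor kernel: a 2D Gaussian envelope times a plane wave.

    temporal_mod: temporal modulation rate (cycles per frame).
    spectral_mod: spectral modulation rate (cycles per frequency bin).
    """
    t = np.arange(n_time) - n_time // 2
    f = np.arange(n_freq) - n_freq // 2
    tt, ff = np.meshgrid(t, f, indexing="ij")
    envelope = np.exp(-0.5 * ((tt / sigma_t) ** 2 + (ff / sigma_f) ** 2))
    carrier = np.cos(2 * np.pi * (temporal_mod * tt + spectral_mod * ff) + phase)
    return envelope * carrier                              # (n_time, n_freq) kernel

kernel = gabor_strf(n_time=25, n_freq=40, temporal_mod=0.08,
                    spectral_mod=0.05, sigma_t=5.0, sigma_f=8.0)
print(kernel.shape)  # convolve with a (time, frequency) spectrogram to get one STRF response
```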
Polyak, A., Adi, Y., Copet, J., Kharitonov, E., Lakhotia, K., Hsu, W.N., Mohamed, A. & Dupoux, E. (2021). Speech Resynthesis from Discrete Disentangled Self-Supervised Representations. In INTERSPEECH-2021, (pp 3615--3619) . [abstract] We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows synthesizing speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings' intelligibility, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods. Audio samples can be found at https://resynthesis-ssl.github.io/
Ludusan, B., Morii, M., Minagawa, Y. & Dupoux, E. (2021). The effect of different information sources on prosodic boundary perception. JASA Express Letters, 1(11), 115203. [abstract] This study aims to quantify the effect of several information sources: acoustic, higher-level linguistic, and knowledge of the prosodic system of the language, on the perception of prosodic boundaries. An experiment with native and non-native participants investigating the identification of prosodic boundaries in Japanese was conducted. It revealed that non-native speakers as well as native speakers with access only to acoustic information can recognize boundaries better than chance level. However, knowledge of both the prosodic system and of higher-level information is required for good boundary identification, each one having similar or higher importance than that of acoustic information.
Lakhotia, K., Kharitonov, E., Hsu, W.N., Adi, Y., Polyak, A., Bolte, B., Nguyen, T.A., Copet, J., Baevski, A., Mohamed, A. & Dupoux, E. (2021). Generative Spoken Language Modeling from Raw Audio. Transactions of the Association for Computational Linguistics, 9, 1336--1354. [abstract] We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and we validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
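A heavily simplified schematic of the encoder-quantizer-language-model pipeline described above (the `encode_audio` function and the corpus are placeholders, not the released GSLM code):

```python
# Illustrative sketch of the generative spoken LM pipeline (encoder -> units -> LM);
# `encode_audio` is a hypothetical stand-in for a pretrained CPC/wav2vec/HuBERT encoder.
import numpy as np
from itertools import groupby
from sklearn.cluster import KMeans

def encode_audio(waveform):
    # Placeholder: a real encoder returns one feature vector per ~10-20 ms frame.
    return np.random.randn(len(waveform) // 160, 256)

# 1. Fit a quantizer on pooled frame features from the training corpus (toy corpus here).
train_features = np.concatenate([encode_audio(np.zeros(16000)) for _ in range(10)])
quantizer = KMeans(n_clusters=100, n_init=10).fit(train_features)

def to_pseudo_text(waveform):
    # 2. Discretize frames into unit ids and collapse consecutive repeats,
    #    yielding a "pseudo-text" sequence usable by any standard LM.
    units = quantizer.predict(encode_audio(waveform))
    return [int(u) for u, _ in groupby(units)]

pseudo_text = to_pseudo_text(np.zeros(32000))
# 3. A unit language model (e.g., an LSTM or Transformer over unit ids) would be trained
#    on such sequences, and a unit-to-waveform decoder would resynthesize speech from them.
```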
Feldman, N., Goldwater, S., Dupoux, E. & Schatz, T. (2021). Do Infants Really Learn Phonetic Categories? Open Mind, 5, 113--131.
Dunbar, E., Bernard, M., Hamilakis, N., Nguyen, T.A., de Seyssel, M., Rozé, P., Rivière, M., Kharitonov, E. & Dupoux, E. (2021). The Interspeech Zero Resource Speech Challenge 2021: Spoken language modelling. In INTERSPEECH-2021, (pp 1574--1578) . [abstract] We present the Zero Resource Speech Challenge 2021, which asks participants to learn a language model directly from audio, without any text or labels. The challenge is based on the Libri-light dataset, which provides up to 60k hours of audio from English audio books without any associated text. We provide a pipeline baseline system consisting of an encoder based on contrastive predictive coding (CPC), a quantizer (k-means) and a standard language model (BERT or LSTM). The metrics evaluate the learned representations at the acoustic (ABX discrimination), lexical (spot-the-word), syntactic (acceptability judgment) and semantic (similarity judgment) levels. We present an overview of the eight submitted systems from four groups and discuss the main results.
Chaabouni, R., Kharitonov, E., Dupoux, E. & Baroni, M. (2021). Communicating artificial neural networks develop efficient color naming systems. Proceedings of the National Academy of Sciences of the United States of America, 118(12), e2016569118. [abstract] Words categorize the semantic fields they refer to in ways that maximize communication accuracy while minimizing complexity. Focusing on the well-studied color domain, we show that artificial neural networks trained with deep learning techniques to play a discrimination game develop communication systems whose distribution on the accuracy/complexity plane closely matches that of human languages. The observed variation among emergent color-naming systems is explained by different degrees of discriminative need, of the sort that might also characterize different human communities. Like human languages, emergent systems show a preference for relatively low-complexity solutions, even at the cost of imperfect communication. We demonstrate next that the nature of the emergent systems crucially depends on communication being discrete (as is human word usage). When continuous message passing is allowed, emergent systems become more complex, and eventually less efficient. Our study suggests that efficient semantic categorization is a general property of discrete communication systems, not limited to human language. It suggests moreover that it is exactly the discrete nature of such systems that, acting as a bottleneck, pushes them towards low complexity and optimal efficiency.
de Seyssel, M. & Dupoux, E. (2020). Does bilingual input hurt? A simulation of language discrimination and clustering using i-vectors. In Proceedings of the Cognitive Science Conference, (pp 2791--2797) . [abstract] The language discrimination process in infants has been successfully modeled using i-vector based systems, with results replicating several experimental findings. Still, recent work found intriguing results regarding the difference between monolingual and mixed-language exposure on language discrimination tasks. We use two carefully designed datasets, with an additional "bilingual" condition on the i-vector model of language discrimination. Our results do not show any difference in the ability to discriminate languages between the three backgrounds, although we do replicate past observations that distant languages (English-Finnish) are easier to discriminate than close languages (English-German). We do, however, find a strong effect of background when testing for the ability of the learner to automatically sort sentences into language clusters: bilingual background being generally harder than mixed background (one speaker one language). Other analyses reveal that clustering is dominated by speaker information rather than by language.
Titeux, H., Riad, R., Cao, X.N., Hamilakis, N., Madden, K., Cristia, A., Bachoud-Lévi, A.C. & Dupoux, E. (2020). Seshat: A tool for managing and verifying annotation campaigns of audio data. In LREC, (pp 6976--6982) . [abstract] We introduce Seshat, a new, simple and open-source software to efficiently manage annotations of speech corpora. The Seshat software allows users to easily customise and manage annotations of large audio corpora while ensuring compliance with the formatting and naming conventions of the annotated output files. In addition, it includes procedures for checking that the content of annotations follows specific rules, implemented in personalised parsers. Finally, we propose a double-annotation mode, for which Seshat automatically computes an associated inter-annotator agreement with the γ measure, taking into account categorisation and segmentation discrepancies.
Schatz, T., Feldman, N., Goldwater, S., Cao, X.N. & Dupoux, E. (2020). Early phonetic learning without phonetic categories: Insights from large-scale simulations on realistic input. Proceedings of the National Academy of Sciences of the United States of America.
Rivière, M., Mazaré, P.E., Joulin, A. & Dupoux, E. (2020). Unsupervised pretraining transfers well across languages. In ICASSP-2020, (pp 7414-7418) . [abstract] Cross-lingual and multi-lingual training of Automatic Speech Recognition (ASR) has been extensively investigated in the supervised setting. This assumes the existence of a parallel corpus of speech and orthographic transcriptions. Recently, contrastive predictive coding (CPC) algorithms have been proposed to pretrain ASR systems with unlabelled data. In this work, we investigate whether unsupervised pretraining transfers well across languages. We show that a slight modification of the CPC pretraining extracts features that transfer well to other languages, being on par with or even outperforming supervised pretraining. This shows the potential of unsupervised methods for languages with few linguistic resources.
Rivière, M., Kharitonov, E., Mazaré, P.E., Douze, M. & Dupoux, E. (2020). Towards unsupervised learning of speech features in the wild. In SLT-2020. [abstract] Recent work on unsupervised contrastive learning of speech representation has shown promising results, but so far has mostly been applied to clean, curated speech datasets. Can it also be used with unprepared audio data "in the wild"? Here, we explore three potential problems in this setting: (i) presence of non-speech data, (ii) noisy or low quality speech data, and (iii) imbalance in speaker distribution. We show that on the Libri-light train set, which is itself a relatively clean speech-only dataset, these problems combined can already have a performance cost of up to 30% relative for the ABX score. We show that the first two problems can be alleviated by data filtering, with voice activity detection selecting speech segments and the perplexity of a model trained on clean data helping to discard entire files. We show that the third problem can be alleviated by learning a speaker embedding in the predictive branch of the model. We show that these techniques build more robust speech features that can be transferred to an ASR task in the low resource setting.
Rita, M., Chaabouni, R. & Dupoux, E. (2020). "LazImpa": Lazy and Impatient neural agents learn to communicate efficiently. In CONLL, (pp 335--343) . [abstract] Previous work has shown that artificial neural agents naturally develop surprisingly non-efficient codes. This is illustrated by the fact that in a referential game in which speaker and listener neural networks optimize accurate transmission over a discrete channel, the emergent messages fail to achieve an optimal length. Furthermore, frequent messages tend to be longer than infrequent ones, a pattern contrary to the Zipf Law of Abbreviation (ZLA) observed in all natural languages. Here, we show that near-optimal and ZLA-compatible messages can emerge, but only if both the speaker and the listener are modified. We hence introduce a new communication system, "LazImpa", where the speaker is made increasingly lazy, i.e., avoids long messages, and the listener impatient, i.e., seeks to guess the intended content as soon as possible.
Riad, R., Titeux, H., Lemoine, L., Montillot, J., Xuan-Nga, C., Dupoux, E. & Bachoud-Lévi, A.C. (2020). Vocal markers from sustained phonation in Huntington's Disease. In INTERSPEECH-2020, (pp 1893--1897) . [abstract] Disease-modifying treatments are currently assessed in neurodegenerative diseases. Huntington's Disease represents a unique opportunity to design automatic sub-clinical markers, even in premanifest gene carriers. We investigated phonatory impairments as potential clinical markers and propose them for both diagnosis and gene carriers' follow-up. We used two sets of features: Phonatory features and Modulation Power Spectrum Features. We found that phonation is not sufficient for the identification of sub-clinical disorders of premanifest gene carriers. According to our regression results, Phonatory features are suitable for the prediction of clinical performance in Huntington's Disease.
Riad, R., Bachoud-Lévi, A.C., Rudzicz, F. & Dupoux, E. (2020). Identification of primary and collateral tracks in stuttered speech. In LREC, (pp 1681--1688) . [abstract] Disfluent speech has been previously addressed from two main perspectives: the clinical perspective focusing on diagnostic, and the Natural Language Processing (NLP) perspective aiming at modeling these events and detecting them for downstream tasks. In addition, previous works often used different metrics depending on whether the input features are text or speech, making it difficult to compare the different contributions. Here, we introduce a new evaluation framework for disfluency detection inspired by the clinical and NLP perspective together with the theory of performance from Clark (1996) which distinguishes between primary and collateral tracks. We introduce a novel forced-aligned disfluency dataset from a corpus of semi-directed interviews, and present baseline results directly comparing the performance of text-based features (word and span information) and speech-based features (acoustic-prosodic information). Finally, we introduce new audio features inspired by the word-based span features. We show experimentally that these features outperform the baselines for speech-based predictions on the present dataset.
Nguyen, T.A., de Seyssel, M., Rozé, P., Rivière, M., Kharitonov, E., Baevski, A., Dunbar, E. & Dupoux, E. (2020). The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling. In NeurIPS Workshop on Self-Supervised Learning for Speech and Audio Processing.
Ludusan, B., Mazuka, R. & Dupoux, E. (2020). Does infant-directed speech help phonetic learning? A machine learning investigation. Cognitive Science. [abstract] A prominent hypothesis holds that by speaking to infants in infant-directed speech (IDS) as opposed to adult-directed speech (ADS), parents help them learn phonetic categories. Specifically, two characteristics of IDS have been claimed to facilitate learning: hyperarticulation, which makes the categories more separable, and variability, which makes the generalization more robust. Here, we test the separability and robustness of vowel category learning on acoustic representations of speech uttered by Japanese adults in either ADS, IDS (addressed to 18-24-month-olds) or read speech (RS). Separability is determined by means of a distance measure computed between the five short vowel categories of Japanese, while robustness is assessed by testing the ability of six different machine learning algorithms trained to classify vowels to generalize on stimuli spoken by a novel speaker in ADS. Using two different speech representations, we find that hyperarticulated speech, in the case of RS, can yield better separability, and that increased between-speaker variability in ADS can yield, for some algorithms, more robust categories. However, these conclusions do not apply to IDS, which turned out to yield neither more separable nor more robust categories compared to ADS inputs. We discuss the usefulness of machine learning algorithms run on real data to test hypotheses about the functional role of IDS.
Lavechin, M., Bousbib, R., Bredin, H., Dupoux, E. & Cristia, A. (2020). An open-source voice type classifier for child-centered daylong recordings. In INTERSPEECH-2020, (pp 3072--3076) . [abstract] Spontaneous conversations in real-world settings such as those found in child-centered recordings have been shown to be amongst the most challenging audio files to process. Nevertheless, building speech processing models handling such a wide variety of conditions would be particularly useful for language acquisition studies in which researchers are interested in the quantity and quality of the speech that children hear and produce, as well as for early diagnosis and measuring effects of remediation. In this paper, we present our approach to designing an open-source neural network to classify audio segments into vocalizations produced by the child wearing the recording device, vocalizations produced by other children, adult male speech, and adult female speech. To this end, we gathered diverse child-centered corpora which sum up to a total of 260 hours of recordings and cover 10 languages. Our model can be used as input for downstream tasks such as estimating the number of words produced by adult speakers, or the number of linguistic units produced by children. Our architecture combines SincNet filters with a stack of recurrent layers and outperforms by a large margin the state-of-the-art system, the Language ENvironment Analysis (LENA), which has been used in numerous child language studies.
Kharitonov, E., Rivière, M., Synnaeve, G., Wolf, L., Mazaré, P.E., Douze, M. & Dupoux, E. (2020). Data Augmenting Contrastive Learning of Speech Representations in the Time Domain. In SLT-2020. [abstract] Contrastive Predictive Coding (CPC), which is based on predicting future segments of speech from past segments, is emerging as a powerful algorithm for representation learning of the speech signal. However, it still under-performs other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library, and find that applying augmentation in the past is generally more efficient and yields better performance than other methods. We find that a combination of pitch modification, additive noise and reverberation substantially increases the performance of CPC (a relative improvement of 18-22%), beating the reference Libri-light results with 600 times less data. Using an out-of-domain dataset, time-domain data augmentation can push CPC to be on par with the state of the art on the Zero Speech Benchmark 2017. We also show that time-domain data augmentation consistently improves downstream limited-supervision phoneme classification tasks by 12-15% relative.
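The kind of time-domain augmentation studied here can be sketched as follows; this uses torchaudio's sox bindings rather than the WavAugment library itself, and the specific effect values are arbitrary examples:

```python
# Minimal sketch of time-domain augmentation (pitch shift, reverberation, additive noise)
# of the kind applied to the "past" window of CPC. Not the WavAugment API.
import torch
import torchaudio

def augment(waveform, sample_rate=16000):
    # waveform: (1, num_samples), mono
    effects = [
        ["pitch", "150"],          # shift pitch by 150 cents
        ["reverb", "40"],          # mild reverberation
        ["rate", str(sample_rate)],
    ]
    augmented, _ = torchaudio.sox_effects.apply_effects_tensor(
        waveform, sample_rate, effects)
    augmented = augmented[:1]                      # reverb may add a channel; keep mono
    noise = 0.005 * torch.randn_like(augmented)    # additive noise
    return augmented + noise

wav = torch.zeros(1, 16000)
augmented = augment(wav)
```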
Kahn, J., Rivière, M., Zheng, W., Kharitonov, E., Xu, Q., Mazaré, P.E., Karadayi, J., Liptchinsky, V., Collobert, R., Fuegen, C., Likhomanenko, T., Synnaeve, G., Joulin, A., Mohamed, A. & Dupoux, E. (2020). Libri-Light: A Benchmark for ASR with Limited or No Supervision. In ICASSP-2020, (pp 7669--7674) . [abstract] We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.
Jiang, B., Dunbar, E., Clayards, M., Darcy, I., Sonderegger, M. & Dupoux, E. (2020). Modelling Perceptual Effects of Phonology with ASR Systems. In Proceedings of the Cognitive Science Conference, (pp 2735--2741) . [abstract] This paper explores the minimal knowledge a listener needs to compensate for phonological assimilation, one kind of phonological process responsible for variation in speech. We used standard automatic speech recognition models to represent English and French listeners. We found that, first, some types of models show language-specific assimilation patterns comparable to those shown by human listeners. Like English listeners, when trained on English, the models compensate more for place assimilation than for voicing assimilation, and like French listeners, the models show the opposite pattern when trained on French. Second, the models which best predict the human pattern use contextually-sensitive acoustic models and language models, which capture allophony and phonotactics, but do not make use of higher-level knowledge of a lexicon or word boundaries. Finally, some models overcompensate for assimilation, showing a (super-human) ability to recover the underlying form even in the absence of the triggering phonological context, pointing to an incomplete neutralization not exploited by human listeners.
Fournier, L., Dunbar, E. & Dupoux, E. (2020). Analogies minus analogy test: measuring regularities in word embeddings. In CoNLL 2020, (pp 365--375) Association for Computational Linguistics. [abstract] Vector space models of words have long been claimed to capture linguistic regularities as simple vector translations, but problems have been raised with this claim. We decompose and empirically analyze the classic arithmetic word analogy test, to motivate two new metrics that address the issues with the standard test, and which distinguish between class-wise offset concentration (similar directions between pairs of words drawn from different broad classes, such as France--London, China--Ottawa, ...) and pairing consistency (the existence of a regular transformation between correctly-matched pairs such as France:Paris::China:Beijing). We show that, while the standard analogy test is flawed, several popular word embeddings do nevertheless encode linguistic regularities.
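On toy vectors, the two quantities can be illustrated as follows (a simplified sketch; the metrics in the paper are defined more carefully than these cosine-based proxies, and the random embeddings are placeholders):

```python
# Toy illustration of offset concentration vs. the classic analogy arithmetic.
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

emb = {w: np.random.randn(50) for w in
       ["France", "Paris", "China", "Beijing", "Japan", "Tokyo"]}

pairs = [("France", "Paris"), ("China", "Beijing"), ("Japan", "Tokyo")]
offsets = np.stack([unit(emb[b] - emb[a]) for a, b in pairs])

# Offset concentration: mean pairwise cosine similarity between normalized offsets.
cos = offsets @ offsets.T
concentration = (cos.sum() - len(pairs)) / (len(pairs) * (len(pairs) - 1))

# Classic analogy test: France:Paris :: China:?  ->  nearest word to Paris - France + China
query = unit(emb["Paris"] - emb["France"] + emb["China"])
prediction = max((w for w in emb if w not in {"Paris", "France", "China"}),
                 key=lambda w: unit(emb[w]) @ query)
print(concentration, prediction)
```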
Dunbar, E., Karadayi, J., Bernard, M., Cao, X.N., Algayres, R., Ondel, L., Besacier, L., Sakriani, S. & Dupoux, E. (2020). The Zero Resource Speech Challenge 2020: Discovering discrete subword and word units. In INTERSPEECH-2020, (pp 4831--4835) . [abstract] We present the Zero Resource Speech Challenge 2020, which aims at learning speech representations from raw audio signals without any labels. It combines the data sets and metrics from two previous benchmarks (2017 and 2019) and features two tasks which tap into two levels of speech representation. The first task is to discover low bit-rate subword representations that optimize the quality of speech synthesis; the second one is to discover word-like units from unsegmented raw speech. We present the results of the twenty submitted models and discuss the implications of the main findings for unsupervised speech learning.
Chaabouni, R., Kharitonov, E., Bouchacourt, D., Dupoux, E. & Baroni, M. (2020). Compositionality and Generalization in Emergent Languages. In ACL, (pp 4427--4442) . [abstract] Natural language allows us to refer to novel composite concepts by combining expressions denoting their parts according to systematic rules, a property known as compositionality. In this paper, we study whether the language emerging in deep multi-agent simulations possesses a similar ability to refer to novel primitive combinations, and whether it accomplishes this feat by strategies akin to human-language compositionality. Equipped with new ways to measure compositionality in emergent languages inspired by disentanglement in representation learning, we establish three main results. First, given sufficiently large input spaces, the emergent language will naturally develop the ability to refer to novel composite concepts. Second, there is no correlation between the degree of compositionality of an emergent language and its ability to generalize. Third, while compositionality is not necessary for generalization, it provides an advantage in terms of language transmission: The more compositional a language is, the more easily it will be picked up by new learners, even when the latter differ in architecture from the original agents. We conclude that compositionality does not arise from simple generalization pressure, but if an emergent language does chance upon it, it will be more likely to survive and thrive.
Algayres, R., Zaiem, S., Sagot, B. & Dupoux, E. (2020). Evaluating the reliability of acoustic speech embeddings. In INTERSPEECH-2020, (pp 4621--4625) . [abstract] Speech embeddings are fixed-size acoustic representations of variable-length speech sequences. They are increasingly used for a variety of tasks ranging from information retrieval to unsupervised term discovery and speech segmentation. However, there is currently no clear methodology to compare or optimise the quality of these embeddings in a task-neutral way. Here, we systematically compare two popular metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages across 17 embedding methods, ranging from supervised to fully unsupervised, and using different loss functions (autoencoders, correspondence autoencoders, siamese). Then we use the ABX and MAP to predict performances on a new downstream task: the unsupervised estimation of the frequencies of speech segments in a given corpus. We find that overall, ABX and MAP correlate with one another and with frequency estimation. However, substantial discrepancies appear in the fine-grained distinctions across languages and/or embedding methods. This makes it unrealistic at present to propose a task-independent silver bullet method for computing the intrinsic quality of speech embeddings. There is a need for more detailed analysis of the metrics currently used to evaluate such embeddings.
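A minimal version of the ABX discrimination logic on fixed-size embeddings might look like this (a simplified sketch; the actual benchmark controls for context and speaker and aggregates scores differently, and the toy data below are placeholders):

```python
# Minimal sketch of an ABX discrimination test on fixed-size embeddings:
# A and X share a category, B belongs to another; an error is counted when
# X is closer to B than to A.
import numpy as np
from scipy.spatial.distance import cosine

def abx_error_rate(cat1, cat2):
    # cat1, cat2: arrays of shape (n_tokens, dim), tokens of two categories
    errors, trials = 0, 0
    for i, a in enumerate(cat1):
        for x in np.delete(cat1, i, axis=0):     # X from the same category as A
            for b in cat2:
                trials += 1
                if cosine(x, b) < cosine(x, a):  # X closer to the wrong category
                    errors += 1
    return errors / trials

rng = np.random.default_rng(0)
tokens_pa = rng.normal(0.0, 1.0, (5, 64))        # toy embeddings for one category
tokens_ba = rng.normal(0.5, 1.0, (5, 64))        # toy embeddings for another
print(abx_error_rate(tokens_pa, tokens_ba))
```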
Rochereau, C., Sagot, B. & Dupoux, E. (2019). Modeling German Verb Argument Structures: LSTMs vs. Humans. In ArXiv, 1912.00239. [abstract] LSTMs have proven very successful at language modeling. However, it remains unclear to what extent they are able to capture complex morphosyntactic structures. In this paper, we examine whether LSTMs are sensitive to verb argument structures. We introduce a German grammaticality dataset in which ungrammatical sentences are constructed by manipulating case assignments (e.g., substituting nominative with accusative or dative). We find that LSTMs are better than chance in detecting incorrect argument structures and slightly worse than humans tested on the same dataset. Surprisingly, LSTMs are contaminated by heuristics not found in humans, such as a preference for nominative noun phrases. In other respects, they show human-like results, such as biases for particular orders of case assignments.
Riochet, R., Castro, M.Y., Bernard, M., Lerer, A., Fergus, R., Izard, V. & Dupoux, E. (2019). IntPhys: A Benchmark for Visual Intuitive Physics Reasoning. In ArXiv, 1803.07616. [abstract] In order to reach human performance on complex visual tasks, artificial systems need to incorporate a significant amount of understanding of the world in terms of macroscopic objects, movements, forces, etc. Inspired by work on intuitive physics in infants, we propose an evaluation framework which diagnoses how much a given system understands about physics by testing whether it can tell apart well matched videos of possible versus impossible events. The test requires systems to compute a physical plausibility score over an entire video. It is free of bias and can test a range of specific physical reasoning skills. We then describe the first release of a benchmark dataset aimed at learning intuitive physics in an unsupervised way, using videos constructed with a game engine. We describe two Deep Neural Network baseline systems trained with a future frame prediction objective and tested on the possible versus impossible discrimination task. The analysis of their results compared to human data gives novel insights into the potential and limitations of next frame prediction architectures.
Millet, J., Jurov, N. & Dunbar, E. (2019). Comparing unsupervised speech learning directly to human performance in speech perception. In Proceedings of the Cognitive Science Conference.
Millet, J. & Zeghidour, N. (2019). Learning to detect dysarthria from raw speech. In ICASSP-2019.
Mccoy, R.T., Linzen, T., Dunbar, E. & Smolensky, P. (2019). RNNs Implicitly Implement Tensor Product Representations. In International Conference on Learning Representations.
Maldonado, M., Dunbar, E. & Chemla, E. (2019). Mouse tracking as a window into decision making. Behavior Research Methods, 51(3), 1085-1101.
Kharitonov, E., Chaabouni, R., Bouchacourt, D. & Baroni, M. (2019). EGG: a toolkit for research on Emergence of lanGuage in Games. In Proceedings of the System Demonstrations of EMNLP. [abstract] There is renewed interest in simulating language emergence among deep neural agents that communicate to jointly solve a task, spurred by the practical aim to develop language-enabled interactive AIs, as well as by theoretical questions about the evolution of human language. However, optimizing deep architectures connected by a discrete communication channel (such as that in which language emerges) is technically challenging. We introduce EGG, a toolkit that greatly simplifies the implementation of emergent-language communication games. EGG's modular design provides a set of building blocks that the user can combine to create new games, easily navigating the optimization and architecture space. We hope that the tool will lower the technical barrier, and encourage researchers from various backgrounds to do original work in this exciting area.
Fourtassi, A. & Dupoux, E. (2019). Phoneme learning is influenced by the taxonomic similarity of the semantic referents. In Proceedings of the Cognitive Science Conference, (323-324), Cognitive Science Society. [abstract] Word learning relies on the ability to master the sound contrasts that are phonemic (i.e., signal meaning difference) in a given language. Though the timeline of phoneme development has been studied extensively over the past few decades, the mechanism of this development is poorly understood. Previous work has shown that human learners rely on referential information to differentiate similar sounds, but has largely ignored the problem of taxonomic ambiguity at the semantic level (two different objects may be described by one or two words depending on how abstract the meaning intended by the speaker is). In this study, we varied the taxonomic distance of pairs of objects and tested how adult learners judged the phonemic status of the sound contrast associated with each of these pairs. We found that judgments were sensitive to gradients in the taxonomic structure, suggesting that learners use probabilistic information at the semantic level to optimize the accuracy of their judgments at the phonological level. The findings provide evidence for an interaction between phonological learning and meaning generalization, raising important questions about how these two important processes of language acquisition are related.
Dunbar, E. (2019). Generative grammar, neural networks, and the implementational mapping problem: Response to Pater. Language, 95(1), e87-e98.
Dunbar, E., Algayres, R., Karadayi, J., Bernard, M., Benjumea, J., Cao, X.N., Miskic, L., Dugrain, C., Ondel, L., Black, A., Besacier, L., Sakriani, S. & Dupoux, E. (2019). The Zero Resource Speech Challenge 2019: TTS without T. In INTERSPEECH-2019. [abstract] We present the Zero Resource Speech Challenge 2019, which proposes to build a speech synthesizer without any text or phonetic labels: hence, TTS without T (text-to-speech without text). We provide raw audio for a target voice in an unknown language (the Voice dataset), but no alignment, text or labels. Participants must discover subword units in an unsupervised way (using the Unit Discovery dataset) and align them to the voice recordings in a way that works best for the purpose of synthesizing novel utterances from novel speakers, similar to the target speaker's voice. We describe the metrics used for evaluation, a baseline system consisting of unsupervised subword unit discovery plus a standard TTS system, and a topline TTS using gold phoneme transcriptions. We present an overview of the 19 submitted systems from 11 teams and discuss the main results.
Cristia, A., Dupoux, E., Bernstein Ratner, N. & Soderstrom, M. (2019). Segmentability differences between child-directed and adult-directed speech: A systematic test with an ecologically valid corpus. In Open Mind, 3, (pp 13-22) . [abstract] Previous computational modeling suggests it is much easier to segment words from child-directed (CDS) than adult-directed speech (ADS). However, this conclusion is based on data collected in the laboratory, with CDS from play sessions and ADS between a parent and an experimenter, which may not be representative of ecologically-collected CDS and ADS. Fully naturalistic ADS and CDS collected with a non-intrusive recording device as the child went about her day were analyzed with a diverse set of algorithms. The difference between registers was small compared to differences between algorithms, it reduced when corpora were matched, and it even reversed under some conditions. These results highlight the interest of studying learnability using naturalistic corpora and diverse algorithmic definitions.
Chaabouni, R., Kharitonov, E., Lazaric, A., Dupoux, E. & Baroni, M. (2019). Word-order biases in deep-agent emergent communication. In ACL 2019. [abstract] Sequence-processing neural networks led to remarkable progress on many NLP tasks. As a consequence, there has been increasing interest in understanding to what extent they process language as humans do. We aim here to uncover which biases such models display with respect to "natural" word-order constraints. We train models to communicate about paths in a simple gridworld, using miniature languages that reflect or violate various natural language trends, such as the tendency to avoid redundancy or to minimize long-distance dependencies. We study how the controlled characteristics of our miniature languages affect individual learning and their stability across multiple network generations. The results draw a mixed picture. On the one hand, neural networks show a strong tendency to avoid long-distance dependencies. On the other hand, there is no clear preference for the efficient, non-redundant encoding of information that is widely attested in natural language. We thus suggest inoculating a notion of "effort" into neural networks, as a possible way to make their linguistic behavior more human-like.
Chaabouni, R., Kharitonov, E., Dupoux, E. & Baroni, M. (2019). Anti-efficient encoding in emergent communication. In NeurIPS. [abstract] Despite renewed interest in emergent language simulations with neural networks, little is known about the basic properties of the induced code, and how they compare to human language. One fundamental characteristic of the latter, known as Zipf's Law of Abbreviation (ZLA), is that more frequent words are efficiently associated to shorter strings. We study whether the same pattern emerges when two neural networks, a "speaker" and a "listener", are trained to play a signaling game. Surprisingly, we find that networks develop an anti-efficient encoding scheme, in which the most frequent inputs are associated to the longest messages, and messages in general are skewed towards the maximum length threshold. This anti-efficient code appears easier to discriminate for the listener, and, unlike in human communication, the speaker does not impose a contrasting least-effort pressure towards brevity. Indeed, when the cost function includes a penalty for longer messages, the resulting message distribution starts respecting ZLA. Our analysis stresses the importance of studying the basic features of emergent communication in a highly controlled setup, to ensure the latter will not stray too far from human language. Moreover, we present a concrete illustration of how different functional pressures can lead to successful communication codes that lack basic properties of human language, thus highlighting the role such pressures play in the latter.
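The ZLA diagnostic underlying this line of work can be sketched in a few lines: compute, for each input, its frequency and the average length of the messages it receives, and check the sign of the correlation (toy data below; the paper's setup is a full signaling game between two networks):

```python
# Sketch of a Zipf's Law of Abbreviation check: under ZLA, more frequent inputs
# should receive shorter messages, i.e. the frequency/length correlation is negative.
from collections import Counter
from scipy.stats import spearmanr

# One emergent message (string of symbols) per game episode, keyed by the input concept.
episodes = [("dog", "aab"), ("dog", "ab"), ("cat", "abba"), ("fish", "abbba"),
            ("dog", "ab"), ("cat", "abba"), ("fish", "abbbba"), ("dog", "aab")]

freq = Counter(concept for concept, _ in episodes)
mean_len = {c: sum(len(m) for cc, m in episodes if cc == c) / freq[c] for c in freq}

rho, p = spearmanr([freq[c] for c in freq], [mean_len[c] for c in freq])
print(rho)   # negative rho = ZLA-compatible; the paper finds the opposite (anti-efficient)
```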
Bernard, M., Thiollière, R., Saksida, A., Loukatou, G., Larsen, E., Johnson, M., Fibla Reixachs, L., Dupoux, E., Daland, R., Xuan-Nga, C. & Cristia, A. (2019). WordSeg: Standardizing unsupervised word form segmentation from text. Behavior Research Methods, 52, 264--278.
Zeghidour, N., Usunier, N., Synnaeve, G., Collobert, R. & Dupoux, E. (2018). End-to-End Speech Recognition from the raw waveform. In Interspeech-2018. [abstract] State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. In this paper, we study end-to-end systems trained directly from the raw waveform, building on two alternatives for trainable replacements of mel-filterbanks that use a convolutional architecture. The first one is inspired by gammatone filterbanks (Hoshen et al., 2015; Sainath et al, 2015), and the second one by the scattering transform (Zeghidour et al., 2017). We propose two modifications to these architectures and systematically compare them to mel-filterbanks, on the Wall Street Journal dataset. The first modification is the addition of an instance normalization layer, which greatly improves on the gammatone-based trainable filterbanks and speeds up the training of the scattering-based filterbanks. The second one relates to the low-pass filter used in these approaches. These modifications consistently improve performances for both approaches, and remove the need for a careful initialization in scattering-based trainable filterbanks. In particular, we show a consistent improvement in word error rate of the trainable filterbanks relatively to comparable mel-filterbanks. It is the first time end-to-end models trained from the raw signal significantly outperform mel-filterbanks on a large vocabulary task under clean recording conditions.
Zeghidour, N., Usunier, N., Kokkinos, I., Schatz, T., Synnaeve, G. & Dupoux, E. (2018). Learning filterbanks from raw speech for phoneme recognition. In ICASSP-2018. [abstract] In this work we train a bank of complex filters that operates at the level of the raw speech signal and feeds into a convolutional neural network for phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of MFSC, and then fine-tuned jointly with the remaining convolutional network. We perform phone recognition experiments on TIMIT and show that for several architectures, models trained on TD-filterbanks consistently outperform their counterparts trained on comparable MFSC. We get our best performance by learning all front-end steps, from pre-emphasis up to averaging. Finally, we observe that the filters at convergence have an asymmetric impulse response while preserving some analyticity.
Thual, A., Dancette, C., Karadayi, J., Benjumea, J. & Dupoux, E. (2018). A K-nearest neighbours approach to unsupervised spoken term discovery. In IEEE SLT-2018. [abstract] Unsupervised spoken term discovery is the task of finding recurrent acoustic patterns in speech without any annotations. Current approaches consist of two steps: (1) discovering similar patterns in speech, and (2) partitioning those pairs of acoustic tokens using graph clustering methods. We propose a new approach for the first step. Previous systems used various approximation algorithms to make the search tractable on large amounts of data. Our approach is based on an optimized k-nearest neighbours (KNN) search coupled with a fixed word embedding algorithm. The results show that the KNN algorithm is robust across languages, consistently outperforms the DTW-based baseline, and is competitive with current state-of-the-art spoken term discovery systems.
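The first step can be sketched as a nearest-neighbour search over fixed-size fragment embeddings (a simplified illustration; `embed_fragments` is a hypothetical placeholder for the fixed word-embedding step, and the threshold is arbitrary):

```python
# Sketch of candidate-pair discovery by k-nearest-neighbour search over
# fixed-size embeddings of speech fragments.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def embed_fragments(n_fragments=1000, dim=100):
    # Placeholder: each speech fragment is mapped to a fixed-size vector.
    return np.random.randn(n_fragments, dim)

embeddings = embed_fragments()
knn = NearestNeighbors(n_neighbors=6, metric="cosine").fit(embeddings)
dist, idx = knn.kneighbors(embeddings)

# Keep only close pairs (excluding self-matches in column 0) as candidates
# for the later graph-clustering step.
threshold = 0.3
pairs = [(i, int(j)) for i, (ds, js) in enumerate(zip(dist, idx))
         for d, j in zip(ds[1:], js[1:]) if d < threshold and i < j]
```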
Schatz, T., Bach, F. & Dupoux, E. (2018). Evaluating automatic speech recognition systems as quantitative models of cross-lingual phonetic category perception. Journal of the Acoustical Society of America: Express Letters. [abstract] Existing theories of cross-linguistic phonetic category perception agree that listeners perceive foreign sounds by mapping them onto their native phonetic categories. Yet, none of the available theories specify a way to compute this mapping. As a result, they cannot provide systematic quantitative predictions and remain mainly descriptive. In this paper, Automatic Speech Recognition (ASR) systems are used to provide a fully specified mapping between foreign and native sounds. This is shown to provide a quantitative model that can account for several empirically attested effects in human cross-linguistic phonetic category perception.
Scharenborg, O., Besacier, L., Black, A., Hasegawa-Johnson, M., Metze, F., Neubig, G., Stüker, S., Godard, P., Müller, M., Ondel, L., Palaskar, S., Arthur, P., Ciannella, F., Du, M., Larsen, E., Merkx, D., Riad, R., Wang, L. & Dupoux, E. (2018). Linguistic unit discovery from multimodal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop. In ICASSP-2018. [abstract] We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding the discovery of linguistic units (subwords and words) in a language without orthography. We study the replacement of orthographic transcriptions by images and/or translated text in a well-resourced language to help unsupervised discovery from raw speech.
Riad, R., Dancette, C., Karadayi, J., Zeghidour, N., Schatz, T. & Dupoux, E. (2018). Sampling strategies in Siamese Networks for unsupervised speech representation learning. In Interspeech-2018. [abstract] Recent studies have investigated siamese network architectures for learning invariant speech representations using same-different side information at the word level. Here we investigate systematically an often ignored component of siamese networks: the sampling procedure (how pairs of same vs. different tokens are selected). We show that sampling strategies taking into account Zipf's Law, the distribution of speakers and the proportions of same and different pairs of words significantly impact the performance of the network. In particular, we show that word frequency compression improves learning across a large range of variations in number of training pairs. This effect does not apply to the same extent to the fully unsupervised setting, where the pairs of same-different words are obtained by spoken term discovery. We apply these results to pairs of words discovered using an unsupervised algorithm and show an improvement over the state of the art in unsupervised representation learning using siamese networks.
Ondel, L., Godard, P., Besacier, L., Larsen, E., Hasegawa-Johnson, M., Scharenborg, O., Dupoux, E., Burget, L., Yvon, F. & Khudanpur, S. (2018). Bayesian models for unit discovery on a very low resource language. In ICASSP-2018. [abstract] Developing speech technologies for low-resource languages has become a very active research field over the last decade. Among others, Bayesian models have shown some promising results on artificial examples but still lack in situ experiments. Our work applies state-of-the-art Bayesian models to unsupervised Acoustic Unit Discovery (AUD) in a real low-resource language scenario. We also show that Bayesian models can naturally integrate information from other resourceful languages by means of informative priors, leading to more consistent discovered units. Finally, discovered acoustic units are used, either as the 1-best sequence or as a lattice, to perform word segmentation. Word segmentation results show that this Bayesian approach clearly outperforms a Segmental-DTW baseline on the same corpus.
Holzenberger, N., Du, M., Karadayi, J., Riad, R. & Dupoux, E. (2018). Learning word embeddings: unsupervised methods for fixed-size representations of variable-length speech segments. In Interspeech-2018. [abstract] Fixed-length embeddings of words are very useful for a variety of tasks in speech and language processing. Here we systematically explore two methods of computing fixed-length embeddings for variable-length sequences. We evaluate their susceptibility to phonetic and speaker-specific variability on English, a high resource language, and Xitsonga, a low resource language, using two evaluation metrics: ABX word discrimination and ROC-AUC on same-different phoneme n-grams. We show that a simple downsampling method supplemented with length information can be competitive with the variable-length input feature representation on both evaluations. Recurrent autoencoders trained without supervision can yield even better results at the expense of increased computational complexity.
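The downsampling baseline mentioned above is simple enough to sketch directly (an illustrative version, not the paper's exact recipe):

```python
# Sketch of a simple downsampling embedding: resample a variable-length feature
# sequence to a fixed number of frames and append the original duration.
import numpy as np

def downsample_embedding(features, n_target=10):
    # features: (n_frames, n_dims), e.g. MFCC frames of one word token
    n_frames, n_dims = features.shape
    positions = np.linspace(0, n_frames - 1, n_target)
    resampled = np.stack([
        np.interp(positions, np.arange(n_frames), features[:, d])
        for d in range(n_dims)
    ], axis=1)                                                # (n_target, n_dims)
    return np.concatenate([resampled.ravel(), [n_frames]])    # add length information

token = np.random.randn(57, 13)                  # a 57-frame, 13-dim MFCC token
emb = downsample_embedding(token)                # fixed size: 10 * 13 + 1 = 131
```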
Guevara-Rukoz, A., Cristia, A., Ludusan, B., Thiollière, R., Martin, A., Mazuka, R. & Dupoux, E. (2018). Are words easier to learn from infant- than adult-directed speech? A quantitative corpus-based investigation. Cognitive Science, 42(5), 1586-1617. [abstract] We investigate whether infant-directed speech (IDS) facilitates lexical learning when compared to adult-directed speech (ADS). To study this, we compare the distinctiveness of the lexicon at two levels, acoustic and phonological, using a large database of spontaneous speech in Japanese. At the acoustic level we show that, as has been documented before for phonemes, the realizations of words are more variable and less discriminable in IDS. At the phonological level, we find that despite a slight increase in the number of phonological neighbors, the IDS lexicon contains more distinctive words (such as onomatopoeias). Combining the acoustic and phonological metrics together in a global discrimination score, the two effects cancel each other out and the IDS lexicon winds up being as discriminable as its ADS counterpart. We discuss the implication of these findings for the view of IDS as hyperspeech, i.e., a register whose purpose is to facilitate language acquisition.
Guevara Rukoz, A. (2018). Decoding perceptual epenthesis: Experiments and Modelling. (Unpublished doctoral dissertation) Ecole Normale Supérieure. [abstract] Why do people of different linguistic background sometimes perceive the same acoustic signal differently? For instance, when hearing nonnative speech that does not conform to sound structures allowed in their native language, listeners may report hearing vowels that are not acoustically present. This phenomenon, known as perceptual vowel epenthesis, has been attested in various languages such as Japanese, Brazilian Portuguese, Korean, and English. The quality of the epenthesized vowel varies between languages, but also within languages, given certain phonemic environments. How much of this process is guided by information directly accessible in the acoustic signal? What is the contribution of the native phonology? How are these two elements combined when computing the native percept? Two main families of theories have been proposed as explanations: two-step and one-step theories. The former advocate an initial parsing of the phonetic categories, followed by repairs by an abstract grammar (e.g., epenthesis), while one-step proposals posit that all acoustic, phonetic, and phonological factors are integrated simultaneously in a probabilistic manner, in order to find the optimal percept. In this dissertation, we use a combination of experimental and modelling approaches in order to evaluate whether perceptual vowel epenthesis is a two-step or one-step process. In particular, we investigate this by assessing the role of acoustic details in modulations of epenthetic vowel quality.
Dupoux, E. (2018). Cognitive Science in the era of Artificial Intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition, 173, 34-59. [abstract] Spectacular progress in the information processing sciences (machine learning, wearable sensors) promises to revolutionize the study of cognitive development. Here, we analyse the conditions under which 'reverse engineering' language development, i.e., building an effective system that mimics infant's achievements, can contribute to our scientific understanding of early language development. We argue that, on the computational side, it is important to move from toy problems to the full complexity of the learning situation, and take as input as faithful reconstructions of the sensory signals available to infants as possible. On the data side, accessible but privacy-preserving repositories of home data have to be setup. On the psycholinguistic side, specific tests have to be constructed to benchmark humans and machines at different linguistic levels. We discuss the feasibility of this approach and present an overview of current results.
Défossez, A., Zeghidour, N., Usunier, N., Bottou, L. & Bach, F. (2018). SING: Symbol-to-Instrument Neural Generator. In NIPS.
Carbajal, J. (2018). Separation and acquisition of two languages in early childhood: a multidisciplinary approach. (Unpublished doctoral dissertation) Ecole Normale Supérieure. [abstract] During the first years of life, children rapidly learn to process speech from a continuous acoustic signal, and soon become able to understand and produce the sounds, words and structure of their native language. Children growing up in a bilingual environment face an additional challenge: they must simultaneously discover and separate their bilingual input into individual (yet potentially overlapping) systems, with independent sound units, vocabularies and grammars, without knowing a priori how many languages are spoken in their environment. In spite of this, language acquisition in young bilinguals follows, to an extent, a similar time-line as in monolinguals. Understanding how children come to discover the presence of two languages in their input, and to what extent they are able to keep them apart, are to this day crucial questions to the field of childhood bilingualism. In this thesis we focus on these two questions by exploring how perceptual and environmental properties of the input can help or hinder the discovery and lexical development of two languages, and whether the phonological representations formed by young bilinguals are language-specific. In order to investigate these questions, we take a multidisciplinary approach, using both empirical and computational techniques, which can provide different insights on the task of early language separation.
Cao, X.N., Dakhlia, C., del Carmen, P., Jaouani, M.A., Ould-Arbi, M. & Dupoux, E. (2018). Baby Cloud, a technological platform for parents and researchers. In Nicoletta Calzolari (Conference chair), Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis & Takenobu Tokunaga (eds) Proceedings of LREC 2018, European Language Resources Association (ELRA). [abstract] In this paper, we present BabyCloud, a platform for capturing, storing and analyzing daylong audio recordings and photographs of children's linguistic environments, for the purpose of studying infant's cognitive and linguistic development and interactions with the environment. The proposed platform connects two communities of users: families and academics, with strong innovation potential for each type of users. For families, the platform offers a novel functionality: the ability for parents to follow the development of their child on a daily basis through language and cognitive metrics (growth curves in number of words, verbal complexity, social skills, etc). For academic research, the platform provides a novel means for studying language and cognitive development at an unprecedented scale and level of detail. They will submit algorithms to the secure server which will only output anonymized aggregate statistics. Ultimately, BabyCloud aims at creating an ecosystem of third parties (public and private research labs...) gravitating around developmental data, entirely controlled by the party whose data originate from, i.e. families.
Schatz, T., Turnbull, R., Bach, F. & Dupoux, E. (2017). A Quantitative Measure of the Impact of Coarticulation on Phone Discriminability. In INTERSPEECH-2017. [abstract] Acoustic realizations of a given phonetic segment are typically affected by coarticulation with the preceding and following phonetic context. While coarticulation has been extensively studied using descriptive phonetic measurements, little is known about the functional impact of coarticulation for speech processing. Here, we use DTW-based similarity defined on raw acoustic features and ABX scores to derive a measure of the effect of coarticulation on phonetic discriminability. This measure does not rely on defining segment-specific phonetic cues (formants, duration, etc.) and can be applied systematically and automatically to any segment in large-scale corpora. We illustrate our method using stimuli in English and Japanese. We replicate some well-known results, i.e., stronger anticipatory than perseveratory coarticulation and stronger coarticulation for lax/short vowels than for tense/long vowels. We then quantify for the first time the impact of coarticulation across different segment types (like vowels and consonants). We discuss how our metric and its possible extensions can help address current challenges in the systematic study of coarticulation.
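The DTW-based similarity underlying this measure can be sketched as follows (a textbook dynamic time warping on frame features; the paper's exact distance and ABX aggregation are not reproduced here):

```python
# Minimal dynamic time warping (DTW) distance between two feature sequences:
# align the sequences and sum frame distances along the best path.
import numpy as np

def dtw_distance(x, y):
    # x: (n, d), y: (m, d) feature sequences (e.g., MFCC frames)
    n, m = len(x), len(y)
    frame_dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = frame_dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m] / (n + m)                   # length-normalized alignment cost

a = np.random.randn(40, 13)                      # toy stand-ins for two phone tokens
b = np.random.randn(55, 13)
print(dtw_distance(a, b))
```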
Michel, P., Räsänen, O., Thiollière, R. & Dupoux, E. (2017). Blind phoneme segmentation with temporal prediction errors. In Proceedings of ACL: Student Research Workshop, 62-68. [abstract] Phonemic segmentation of speech is a critical step of speech recognition systems. We propose a novel unsupervised algorithm based on sequence prediction models such as Markov chains and recurrent neural networks. Our approach consists in analyzing the error profile of a model trained to predict speech features frame-by-frame. Specifically, we try to learn the dynamics of speech in the MFCC space and hypothesize boundaries from local maxima in the prediction error. We evaluate our system on the TIMIT dataset, with improvements over similar methods.
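The boundary-hypothesis step can be sketched as peak-picking on a frame-wise prediction error curve (an illustrative version with placeholder predictions; the minimum peak distance is an arbitrary choice):

```python
# Sketch of boundary hypothesis from prediction errors: compute a frame-by-frame
# prediction error and place candidate phoneme boundaries at its local maxima.
import numpy as np
from scipy.signal import find_peaks

def hypothesize_boundaries(predicted, observed, frame_shift=0.01, min_gap=0.03):
    # predicted, observed: (n_frames, n_dims) MFCC-like features
    error = np.linalg.norm(predicted - observed, axis=1)
    peaks, _ = find_peaks(error, distance=max(1, int(min_gap / frame_shift)))
    return peaks * frame_shift                    # boundary times in seconds

obs = np.random.randn(500, 13)
pred = obs + 0.1 * np.random.randn(500, 13)       # stand-in for a model's predictions
print(hypothesize_boundaries(pred, obs)[:10])
```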
Ludusan, B., Mazuka, R., Bernard, M., Cristia, A. & Dupoux, E. (2017). The Role of Prosody and Speech Register in Word Segmentation: A Computational Modelling Perspective. In ACL 2017, 2, (pp 178-183) . [abstract] This study explores the role of speech register and prosody for the task of word segmentation. Since these two factors are thought to play an important role in early language acquisition, we aim to quantify their contribution for this task. We study a Japanese corpus containing both infant- and adult-directed speech and we apply four different word segmentation models, with and without knowledge of prosodic boundaries. The results showed that the difference between registers is smaller than previously reported and that prosodic boundary information helps more adult- than infant-directed speech.
Le Godais, G., Linzen, T. & Dupoux, E. (2017). Comparing character-level neural language models using a lexical decision task. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, 2, (pp 125--130) . [abstract] What is the information captured by neural network models of language? We address this question in the case of character-level recurrent neural language models. These models do not have explicit word representations; do they acquire implicit ones? We assess the lexical capacity of a network using the lexical decision task common in psycholinguistics: the system is required to decide whether or not a string of characters forms a word. We explore how accuracy on this task is affected by the architecture of the network, focusing on cell type (LSTM vs. SRN), depth and width. We also compare these architectural properties to a simple count of the parameters of the network. The overall number of parameters in the network turns out to be the most important predictor of accuracy; in particular, there is little evidence that deeper networks are beneficial for this task.
Larsen, E., Cristia, A. & Dupoux, E. (2017). Relating unsupervised word segmentation to reported vocabulary acquisition. In INTERSPEECH-2017. [abstract] A range of computational approaches have been used to model the discovery of word forms from continuous speech by infants. Typically, these algorithms are evaluated with respect to the ideal 'gold standard' word segmentation and lexicon. These metrics assess how well an algorithm matches the adult state, but may not reflect the intermediate states of the child's lexical development. We set up a new evaluation method based on the correlation between word frequency counts derived from the application of an algorithm onto a corpus of child-directed speech, and the proportion of infants knowing the words according to parental reports. We evaluate a representative set of 4 algorithms, applied to transcriptions of the Brent corpus, which have been phonologized using either phonemes or syllables as basic units. Results show remarkable variation in the extent to which these 8 algorithm-unit combinations predicted infant vocabulary, with some of these predictions surpassing those derived from the adult gold standard segmentation. We argue that infant vocabulary prediction provides a useful complement to traditional evaluation; for example, the best predictor model was also one of the worst in terms of segmentation score, and there was no clear relationship between token or boundary F-score and vocabulary prediction.
Lample, G., Zeghidour, N., Usunier, N., Bordes, A., Denoyer, L. & Ranzato, M. (2017). Fader Networks: Generating Image Variations by Sliding Attribute Values. In Advances in Neural Information Processing Systems, (pp 5963--5972) .
Guevara-Rukoz, A., Parlato-Oliveira, E., Yu, S., Hirose, Y., Peperkamp, S. & Dupoux, E. (2017). Predicting epenthetic vowel quality from acoustics. In INTERSPEECH-2017. [abstract] Past research has shown that sound sequences not permitted in our native language may be distorted by our perceptual system. A well documented example is vowel epenthesis, a phenomenon in which non-existent vowels are hallucinated by listeners, in order to repair illegal consonantal sequences. As reported in previous work, this occurs in Japanese (JP) and Brazilian Portuguese (BP), languages for which the 'default' epenthetic vowels are /u/ and /i/, respectively. In a perceptual experiment, we corroborate the finding that the quality of this illusory vowel is language-dependent, but also that this default choice can be overridden by coarticulatory information present on the consonant cluster. In a second step, we analyse recordings of JP and BP speakers producing 'epenthesized' versions of stimuli from the perceptual task. Results reveal that the default vowel corresponds to the vowel with the most reduced acoustic characteristics, also the one for which formants are acoustically closest to formant transitions present in consonantal clusters. Lastly, we model behavioural responses from the perceptual experiment with an exemplar model using dynamic time warping (DTW)-based similarity measures on MFCCs.
Guevara-Rukoz, A., Lin, I., Morii, M., Minagawa, Y., Dupoux, E. & Peperkamp, S. (2017). Which epenthetic vowel? Phonetic categories versus acoustic detail in perceptual vowel epenthesis Journal of the Acoustical Society of America: Express Letters, 142(2), EL211-2017. [abstract] This study aims to quantify the relative contributions of phonetic categories and acoustic detail on phonotactically induced perceptual vowel epenthesis in Japanese listeners. A vowel identification task tested whether a vowel was perceived within illegal consonant clusters and, if so, which vowel was heard. Cross-spliced stimuli were used in which vowel coarticulation present in the cluster did not match the quality of the flanking vowel. Two clusters were used, /hp/ and /kp/, the former containing larger amounts of resonances of the preceding vowel. While both flanking vowel and coarticulation influenced vowel quality, the influence of coarticulation was larger, especially for /hp/.
Dunbar, E., Xuan-Nga, C., Benjumea, J., Karadayi, J., Bernard, M., Besacier, L., Anguera, X. & Dupoux, E. (2017). The Zero Resource Speech Challenge 2017. In ASRU-2017. [abstract] We describe a new challenge aimed at discovering subword and word units from raw speech. This challenge is the follow-up to the Zero Resource Speech Challenge 2015. It aims at constructing systems that generalize across languages and adapt to new speakers. The design features and evaluation metrics of the challenge are presented and the results of seventeen models are discussed.
Cristia, A., Dupoux, E., Gurven, M. & Stieglitz, J. (2017). Child-directed speech is infrequent in a forager-farmer population: a time allocation study. Child Development. [abstract] This article provides an estimation of how frequently, and from whom, children aged 0-11 years (Ns between 9 and 24) receive one-on-one verbal input among Tsimane forager-horticulturalists of lowland Bolivia. Analyses of systematic daytime behavioral observations reveal < 1 min per daylight hour is spent talking to children younger than 4 years of age, which is 4 times less than estimates for others present at the same time and place. Adults provide a majority of the input at 0--3 years of age but not afterward. When integrated with previous work, these results reveal large cross-cultural variation in the linguistic experiences provided to young children. Consideration of more diverse human populations is necessary to build generalizable theories of language acquisition.
Chaabouni, R., Dunbar, E., Zeghidour, N. & Dupoux, E. (2017). Learning weakly supervised multimodal phoneme embeddings. In INTERSPEECH-2017. [abstract] Recent works have explored deep architectures for learning multimodal speech representation (e.g. audio and images, articulation and audio) in a supervised way. Here we investigate the role of combining different speech modalities, i.e. audio and visual information representing the lips' movements, in a weakly-supervised way using Siamese networks and lexical same-different side information. In particular, we ask whether one modality can benefit from the other to provide a richer representation for phone recognition in a weakly supervised setting. We introduce mono-task and multi-task methods for merging speech and visual modalities for phone recognition. The mono-task learning consists in applying a Siamese network on the concatenation of the two modalities, while the multi-task learning receives several different combinations of modalities at train time. We show that multi-task learning enhances discriminability for visual and multimodal inputs while minimally impacting auditory inputs. Furthermore, we present a qualitative analysis of the obtained phone embeddings, and show that cross-modal visual input can improve the discriminability of phonetic features which are visually discernable (rounding, open/close, labial place of articulation), resulting in representations that are closer to abstract linguistic features than those based on audio only.
Zeghidour, N., Synnaeve, G., Versteegh, M. & Dupoux, E. (2016). A Deep Scattering Spectrum - Deep Siamese Network Pipeline For Unsupervised Acoustic Modeling. In ICASSP-2016, (pp 4965-4969) . [abstract] Recent work has explored deep architectures for learning acoustic features in an unsupervised or weakly supervised way for phone recognition. Here we investigate the role of the input features, and in particular we test whether standard mel-scaled filterbanks could be replaced by inherently richer representations, such as derived from an analytic scattering spectrum. We use a Siamese network using lexical side information similar to a well performing architecture used in the Zero Resource Speech Challenge (2015), and show a substantial improvement when the filterbanks are replaced by scattering features, even though these features yield similar performance when tested without training. This shows that unsupervised and weakly-supervised architectures can benefit from richer features than the traditional ones.
Zeghidour, N., Synnaeve, G., Usunier, N. & Dupoux, E. (2016). Joint Learning of Speaker and Phonetic Similarities with Siamese Networks. In INTERSPEECH-2016, (pp 1295-1299) . [abstract] Recent work has demonstrated, on small datasets, the feasibility of jointly learning specialized speaker and phone embeddings, in a weakly supervised siamese DNN architecture using word and speaker identity as side information. Here, we scale up these architectures to the 360 hours of the Librispeech corpus by implementing a sampling method to efficiently select pairs of words from the dataset and improving the loss function. We also compare the standard siamese networks fed with same (AA) or different (AB) pairs, to a 'triamese' network fed with AAB triplets. We use ABX discrimination tasks to evaluate the discriminability and invariance properties of the obtained joined embeddings, and compare these results with mono-embeddings architectures. We find that the joined embeddings architectures succeed in effectively disentangling speaker from phoneme information, with around 10% errors for the matching tasks and embeddings (speaker task on speaker embeddings, and phone task on phone embedding) and near chance for the mismatched task. Furthermore, the results carry over in out-of-domain datasets, even beating the best results obtained with similar weakly supervised techniques.
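The weakly supervised siamese setup used in this and the previous entry can be sketched as follows; the small encoder, the cosine-based contrastive loss and the margin are illustrative assumptions, not the ABnet-style architecture or the cos/cos² loss of the papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Shared encoder applied to both members of a pair of speech excerpts
    (here: fixed-size feature vectors; the real systems use frame sequences)."""
    def __init__(self, in_dim=39, emb_dim=100):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 500), nn.ReLU(),
                                 nn.Linear(500, emb_dim))
    def forward(self, x):
        return self.net(x)

def same_different_loss(e1, e2, same, margin=0.5):
    """Cosine-based contrastive loss: pull 'same word' pairs together,
    push 'different word' pairs below a margin."""
    cos = F.cosine_similarity(e1, e2)
    pos = (1.0 - cos) * same                              # same pairs: cos -> 1
    neg = torch.clamp(cos - margin, min=0.0) * (1 - same) # different pairs: cos < margin
    return (pos + neg).mean()

# toy training step on random pairs
enc = SiameseEncoder()
opt = torch.optim.Adam(enc.parameters(), lr=1e-3)
x1, x2 = torch.randn(32, 39), torch.randn(32, 39)
same = torch.randint(0, 2, (32,)).float()                 # 1 = same word, 0 = different
loss = same_different_loss(enc(x1), enc(x2), same)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```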
Versteegh, M., Anguera, X., Jansen, A. & Dupoux, E. (2016). The Zero Resource Speech Challenge 2015: Proposed Approaches and Results. In SLTU-2016 Procedia Computer Science, 81, (pp 67-72) . [abstract] This paper reports on the results of the Zero Resource Speech Challenge 2015, the first unified benchmark for zero resource speech technology, which aims at the unsupervised discovery of subword and word units from raw speech. This paper discusses the motivation for the challenge, its data sets, tasks and baseline systems. We outline the ideas behind the systems that were submitted for the two challenge tracks: unsupervised subword unit modeling and spoken term discovery, and summarize their results. The results obtained by participating teams show great promise; many systems beat the provided baselines and some even perform better than comparable supervised systems.
Synnaeve, G. & Dupoux, E. (2016). A temporal coherence loss function for learning unsupervised acoustic embeddings. In SLTU-2016 Procedia Computer Science, 81, (pp 95-100) . [abstract] We train Neural Networks of varying depth with a loss function which imposes the output representations to have a temporal profile which looks like that of phonemes. We show that a simple loss function which maximizes the dissimilarity between near frames and long distance frames helps to construct a speech embedding that improves phoneme discriminability, both within and across speakers, even though the loss function only uses within speaker information. However, with too deep an architecture, this loss function yields overfitting, suggesting the need for more data and/or regularization.
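A rough sketch of a temporal coherence objective of the kind described above: embeddings of temporally close frames are pushed together and embeddings of distant frames pushed apart. The specific offsets, margin and similarity function are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def temporal_coherence_loss(emb, near=2, far=15):
    """emb: (n_frames, dim) embeddings of consecutive speech frames.
    Encourage frames `near` steps apart to be similar and frames `far`
    steps apart to be dissimilar (a rough stand-in for the paper's loss)."""
    sim_near = F.cosine_similarity(emb[:-near], emb[near:])
    sim_far = F.cosine_similarity(emb[:-far], emb[far:])
    return (1.0 - sim_near).mean() + torch.clamp(sim_far, min=0.0).mean()

# toy usage: embeddings produced by any frame-level network
emb = torch.randn(200, 64, requires_grad=True)
loss = temporal_coherence_loss(emb)
loss.backward()
print(float(loss))
```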
Schatz, T. (2016). ABX-Discriminability Measures and Applications. (Unpublished doctoral dissertation) Ecole Normale Supérieure. [abstract] The starting point for this thesis was the problem of modeling phonetic category acquisition in infancy. Roughly speaking, phonetic category acquisition refers to the process by which infants during their first year of life come to process phones (i.e. vowels and consonants) in a manner specific to the language to which they are exposed. For example, a baby exposed to English speech retains the ability to distinguish between the /r/ and /l/ sounds of English, while a baby exposed to Japanese, a language where there is no equivalent of this phonetic contrast, quickly learns to ignore it. This phenomenon has been documented in a large number of studies but the mechanisms underlying it have been less studied. Because of the very early age at which it occurs, before the baby even speaks their first word, it has been proposed to result from some sort of statistical learning performed by the child on the basis of the speech signal reaching their senses. This invites further investigation into what specific input data and learning algorithm can plausibly account for the observed empirical results. A few proposals have been made, but the proposed models were never tested extensively nor compared quantitatively to see whether they are really able to account for a sizable portion of the available empirical observations. The systematic comparison of models of phonetic learning is very important, from a theoretical perspective, but also for generating new empirical predictions that could be put to test in infants. This is why we devoted this thesis not to the modeling of phonetic learning itself, but to the preliminary step of developing a sound method to compare these models. To this effect, we introduce in this thesis ABX-discriminability measures, which provide a systematic and flexible way of evaluating candidate models for phonetic category acquisition. We demonstrate the interest of our evaluation framework by applying it to the evaluation of models of phonetic category processing at birth and in adults. Models of phonetic category processing at birth provide an initial state for models of phonetic category acquisition. Models of phonetic category processing in adults provide a useful baseline against which to compare models of phonetic category acquisition, the difference between the two being that models of phonetic category processing in adults do not have to be based on a plausible learning mechanism, only the learning result needing to be plausible. The next step is to apply our framework to the models of phonetic category acquisition proposed in the literature, which is left for future work. ABX-discriminability measures are useful beyond the particular problem of modeling phonetic category processing in humans and we also present other applications. There are at least two ways in which the interest of ABX-discriminability measures generalizes to other situations. First, it generalizes to application domains beyond cognitive science. In particular, we discuss applications in artificial intelligence, low-resource engineering and data mining. Second, it generalizes to signals beyond speech and to category structures beyond phonetic categories. In this respect, although we only work out practical examples involving large corpora of speech recordings annotated at the word or phone level, we present the rationale for applications in a fully general way.
Ogawa, T., Mallidi, S.H., Dupoux, E., Cohen, J., Feldman, N. & Hermansky, H. (2016). A new efficient measure for accuracy prediction and its application to multistream-based unsupervised adaptation. In ICPR. [abstract] A new efficient measure for predicting estimation accuracy is proposed and successfully applied to multistream-based unsupervised adaptation of ASR systems to address data uncertainty when the ground-truth is unknown. The proposed measure is an extension of the M-measure, which predicts confidence in the output of a probability estimator by measuring the divergences of probability estimates spaced at specific time intervals. In this study, the M-measure was extended by considering the latent phoneme information, resulting in an improved reliability. Experimental comparisons carried out in a multistream-based ASR paradigm demonstrated that the extended M-measure yields a significant improvement over the original M-measure, especially under narrow-band noise conditions.
Ludusan, B., Cristia, A., Martin, A., Mazuka, R. & Dupoux, E. (2016). Learnability of prosodic boundaries: Is infant-directed speech easier? Journal of the Acoustical Society of America, 140(2), 1239-1250. [abstract] This study explores the long-standing hypothesis that the acoustic cues to prosodic boundaries in infant-directed speech (IDS) make those boundaries easier to learn than those in adult-directed speech (ADS). Three cues (pause duration, nucleus duration and pitch change) were investigated, by means of a systematic review of the literature, statistical analyses of a new corpus, and machine learning experiments. The review of previous work revealed that the effect of register on boundary cues is less well established than previously thought, and that results often vary across studies for certain cues. Statistical analyses run on a large database of mother-child and mother-interviewer interactions showed that the duration of a pause and the duration of the syllable nucleus preceding the boundary are two cues which are enhanced in IDS, while f0 change is actually degraded in IDS. Supervised and unsupervised machine learning techniques applied to these acoustic cues revealed that IDS boundaries were consistently better classified than ADS ones, regardless of the learning method used. The role of the cues examined in this study and the importance of these findings in the more general context of early linguistic structure acquisition is discussed.
Ludusan, B. & Dupoux, E. (2016). The role of prosodic boundaries in word discovery: Evidence from a computational model. Journal of the Acoustical Society of America, 140(1), EL1. [abstract] This study aims to quantify the role of prosodic boundaries in early language acquisition using a computational modeling approach. A spoken term discovery system that models early word learning was used with and without a prosodic component on speech corpora of English, Spanish, and Japanese. The results showed that prosodic information induces a consistent improvement both in the alignment of the terms to actual word boundaries and in the phonemic homogeneity of the discovered clusters of terms. This benefit was found also when automatically discovered prosodic boundaries were used, boundaries which did not perfectly match the linguistically defined ones.
Ludusan, B. & Dupoux, E. (2016). Automatic syllable segmentation using broad phonetic class information. In SLTU-2016 Procedia Computer Science, 81, (pp 101-106) . [abstract] We propose in this paper a language-independent method for syllable segmentation. The method is based on the Sonority Sequencing Principle, by which the sonority inside a syllable increases from its boundaries towards the syllabic nucleus. The sonority function employed was derived from the posterior probabilities of a broad phonetic class recognizer, trained with data coming from an open-source corpus of English stories. We tested our approach on English, Spanish and Catalan and compared the results obtained to those given by an energy-based system. The proposed method outperformed the energy-based system on all three languages, showing a good generalizability to the two unseen languages. We conclude with a discussion of the implications of this work for under-resourced languages.
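The sonority-based segmentation strategy can be sketched as below: broad-class posteriors are mapped to a frame-level sonority curve and syllable boundaries are placed at sonority minima between successive peaks. The class inventory and sonority weights are illustrative, not those of the paper.

```python
import numpy as np

# assumed sonority weights for broad phonetic classes (illustrative values,
# not the ones used in the paper)
SONORITY = {"vowel": 1.0, "approximant": 0.8, "nasal": 0.6,
            "fricative": 0.4, "stop": 0.2, "silence": 0.0}

def syllable_boundaries(posteriors, classes):
    """posteriors: (n_frames, n_classes) broad-class posterior probabilities;
    classes: class names, one per column. Boundaries are placed at local
    minima of the sonority curve lying between two sonority peaks (nuclei)."""
    weights = np.array([SONORITY[c] for c in classes])
    sonority = posteriors @ weights                 # frame-level sonority curve
    peaks = [t for t in range(1, len(sonority) - 1)
             if sonority[t] >= sonority[t - 1] and sonority[t] > sonority[t + 1]]
    boundaries = []
    for left, right in zip(peaks, peaks[1:]):       # minimum between consecutive nuclei
        boundaries.append(left + int(np.argmin(sonority[left:right])))
    return boundaries

# toy usage with random posteriors
rng = np.random.default_rng(2)
post = rng.dirichlet(np.ones(6), size=300)
print(syllable_boundaries(post, list(SONORITY)))
```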
Linzen, T., Dupoux, E. & Spector, B. (2016). Quantificational features in distributional word representations. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, (pp 1-11) .
Linzen, T., Dupoux, E. & Goldberg, Y. (2016). Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4, 521-535.
Fourtassi, A. & Dupoux, E. (2016). The role of word-word co-occurrence in word learning. In Proceedings of the 38th Annual Conference of the Cognitive Science Society, (pp 662-667) . [abstract] A growing body of research on early word learning suggests that learners gather word-object co-occurrence statistics across learning situations. Here we test a new mechanism whereby learners are also sensitive to word-word co-occurrence statistics. Indeed, we find that participants can infer the likely referent of a novel word based on its co-occurrence with other words, in a way that mimics a machine learning algorithm dubbed `zero-shot learning'. We suggest that the interaction between referential and distributional regularities can bring robustness to the process of word acquisition.
Dunbar, E. & Dupoux, E. (2016). Geometric constraints on human speech sound inventories. Frontiers in Psychology, 7(1061). [abstract] We investigate the idea that the languages of the world have developed coherent sound systems in which having one sound increases or decreases the chances of having certain other sounds, depending on shared properties of those sounds. We investigate the geometries of sound systems that are defined by the inherent properties of sounds. We document three typological tendencies in sound system geometries: economy, a tendency for the differences between sounds in a system to be definable on a relatively small number of independent dimensions; local symmetry, a tendency for sound systems to have relatively large numbers of pairs of sounds that differ only on one dimension; and global symmetry, a tendency for sound systems to be relatively balanced. The finding of economy corroborates previous results; the two symmetry properties have not been previously documented. We also investigate the relation between the typology of inventory geometries and the typology of individual sounds, showing that the frequency distribution with which individual sounds occur across languages works in favour of both local and global symmetry.
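For readers who want to see the kind of computation involved, here is a toy sketch of two inventory-geometry statistics (a crude economy ratio and a count of pairs differing on exactly one feature, in the spirit of local symmetry) over a binary feature matrix; the exact definitions and the feature coding used in the paper are more elaborate.

```python
import numpy as np
from itertools import combinations

def inventory_geometry(features):
    """features: (n_segments, n_features) binary matrix describing a sound
    inventory. Returns two rough geometry statistics: 'economy' (segments per
    contrastive feature dimension) and 'local symmetry' (number of segment
    pairs differing on exactly one feature)."""
    used = np.array([len(set(col)) > 1 for col in features.T])  # contrastive dims
    economy = features.shape[0] / max(used.sum(), 1)
    local_sym = sum(1 for a, b in combinations(features, 2)
                    if np.sum(a != b) == 1)
    return economy, local_sym

# toy inventory: /p t k b d g/ coded with [voice, labial, coronal] (illustrative)
inv = np.array([[0, 1, 0],   # p
                [0, 0, 1],   # t
                [0, 0, 0],   # k
                [1, 1, 0],   # b
                [1, 0, 1],   # d
                [1, 0, 0]])  # g
print(inventory_geometry(inv))
```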
Carbajal, J., Fér, R. & Dupoux, E. (2016). Modeling language discrimination in infants using i-vector representations. In Proceedings of the 38th Annual Conference of the Cognitive Science Society, (pp 889-896) . [abstract] Experimental research suggests that at birth infants can discriminate two languages if they belong to different rhythmic classes, and by 4 months of age they can discriminate two languages within the same class provided they have been previously exposed to at least one of them. In this paper, we present a novel application of speech technology tools to model language discrimination, which may help to understand how infants achieve this task. By combining a Gaussian Mixture Model of the acoustic space and low-dimensional representations of novel utterances with a model of a habituation paradigm, we show that brief exposure to French does not allow to discriminate between two previously unheard languages belonging to the same rhythmic class, but allows to discriminate two languages across rhythmic class. The implications of these findings are discussed.
Carbajal, J., Dawud, A., Thiollière, R. & Dupoux, E. (2016). The 'Language Filter' Hypothesis: Modeling Language Separation in Infants using I-vectors. In EPIROB 2016, (pp 195-201) . [abstract] Experimental research suggests that at birth infants can discriminate two languages if they belong to different rhythmic classes, and by 4 months of age they can discriminate two languages within the same class provided they have been previously exposed to at least one of them. In this paper, we present a novel application of speech technology tools to model language discrimination, which may help to understand how infants achieve this task. By combining a Gaussian Mixture Model of the acoustic space and low-dimensional representations of novel utterances with a model of a habituation paradigm, we show that brief exposure to French does not allow to discriminate between two previously unheard languages belonging to the same rhythmic class, but allows to discriminate two languages across rhythmic class. The implications of these findings are discussed.
Bergmann, C., Cristia, A. & Dupoux, E. (2016). Discriminability of sound contrasts in the face of speaker variation quantified. In Proceedings of the 38th Annual Conference of the Cognitive Science Society, (pp 1331-1336) . [abstract] How does a naive language learner deal with speaker variation irrelevant to distinguish word meanings? Experimental data is conflicting and incompatible models have been proposed. In this paper we examine the basic assumptions of these models regarding the signal the learner deals with: Is speaker variability a hurdle in discriminating sounds or can it easily be abstracted over? To this end we summarize existing infant data and compare them to machine-based discriminability scores of sound pairs obtained without added language knowledge. Our results show consistently that speaker variability decreases sound contrast discriminability, and that some pairs are affected more than others. Further, chance performance is a rare exception; contrasts remain discriminable in the face of speaker variation. Our data offer a way to reunite seemingly conflicting findings in the infant literature and show a path forward in testing whether and how speaker variation plays a role for language acquisition.
Versteegh, M., Thiollière, R., Schatz, T., Cao, X.N., Anguera, X., Jansen, A. & Dupoux, E. (2015). The Zero Resource Speech Challenge 2015. In INTERSPEECH-2015, (pp 3169-3173) . [abstract] The Interspeech 2015 Zero Resource Speech Challenge aims at discovering subword and word units from raw speech. The challenge provides the first unified and open source suite of evaluation metrics and data sets to compare and analyse the results of unsupervised linguistic unit discovery algorithms. It consists of two tracks. In the first, a psychophysically inspired evaluation task (minimal pair ABX discrimination) is used to assess how well speech feature representations discriminate between contrastive subword units. In the second, several metrics gauge the quality of discovered word-like patterns. Two data sets are provided, one for English, one for Xitsonga. Both data sets are provided without any annotation except for voice activity and talker identity. This paper introduces the evaluation metrics, presents the results of baseline systems and discusses some of the key issues in unsupervised unit discovery.
Thiollière, R., Dunbar, E., Synnaeve, G., Versteegh, M. & Dupoux, E. (2015). A Hybrid Dynamic Time Warping-Deep Neural Network Architecture for Unsupervised Acoustic Modeling. In INTERSPEECH-2015, (pp 3179-3183) . [abstract] We report on an architecture for the unsupervised discovery of talker-invariant subword embeddings. It is made out of two components: a dynamic-time warping based spoken term discovery (STD) system and a Siamese deep neural network (DNN). The STD system clusters word-sized repeated fragments in the acoustic streams while the DNN is trained to minimize the distance between time aligned frames of tokens of the same cluster, and maximize the distance between tokens of different clusters. We use additional side information regarding the average duration of phonemic units, as well as talker identity tags. For evaluation we use the datasets and metrics of the Zero Resource Speech Challenge. The model shows improvement over the baseline in subword unit modeling.
Synnaeve, G. & Dupoux, E. (2015). Weakly Supervised Multi-Embeddings Learning of Acoustic Models. In ICLR Workshop, (ArXiv 1412.6645 [cs.SD]) . [abstract] We trained a Siamese network with multi-task same/different information on a speech dataset, and found that it was possible to share a network for both tasks without a loss in performance. The first task was to discriminate between two same or different words, and the second was to discriminate between two same or different talkers.
Michon, E., Dupoux, E. & Cristia, A. (2015). Salient dimensions in implicit phonotactic learning. In INTERSPEECH-2015, (pp 2665-2669) . [abstract] Adults are able to learn sound co-occurrences without conscious knowledge after brief exposures. But which dimensions of sounds are most salient in this process? Using an artificial phonology paradigm, we explored potential learnability differences involving consonant-, speaker-, and tone-vowel co-occurrences. Results revealed that participants, whose native language was not tonal, implicitly encoded consonant-vowel patterns with a high level of accuracy; were above chance for tone-vowel co-occurrences; and were at chance for speaker-vowel co-occurrences. This pattern of results is exactly what would be expected if both language-specific experience and innate biases to encode potentially contrastive linguistic dimensions affect the salience of different dimensions during implicit learning of sound patterns.
Martin, A., Schatz, T., Versteegh, M., Miyazawa, K., Mazuka, R., Dupoux, E. & Cristia, A. (2015). Mothers speak less clearly to infants: A comprehensive test of the hyperarticulation hypothesis. Psychological Science, 26(3), 341-347. [abstract] Infants learn language at an incredible speed, and one of the first steps in this voyage includes learning the basic sound units of their native language. It is widely thought that caregivers facilitate this task by hyperarticulating when speaking to their infants. Utilizing state-of-the-art speech technology, we address this key theoretical question: Are sound categories clearer in infant- than in adult-directed speech? A comprehensive examination of sound contrasts in a large corpus of spontaneous Japanese demonstrates that there is a small but significant tendency for contrasts in infant-directed speech to be less clear than those in adult-directed speech, contrary to the idea that caregivers actively enhance phonetic categories in infant-directed speech. These results suggest that the ability to learn from noisy data must be a crucial component of plausible theories of infant language acquisition.
Ludusan, B., Synnaeve, G. & Dupoux, E. (2015). Prosodic boundary information helps unsupervised word segmentation. In NAACL HLT 2015, (pp 953-963) .
Ludusan, B., Seidl, A., Dupoux, E. & Cristia, A. (2015). Motif discovery in infant- and adult-directed speech. In Proceedings of CogACLL2015, (pp 93-102) . [abstract] Infant-directed speech (IDS) is thought to play a key role in determining infant language acquisition. It is thus important to describe to what extent it differs from adult-directed speech (ADS) in dimensions that could affect learnability. In this paper, we explore how an acoustic motif discovery algorithm fares when presented with spontaneous speech from both registers. Results show small but significant differences in performance, with lower recall and higher fragmentation in IDS than ADS. Such a result is inconsistent with a view of IDS where clarity and ease of lexical recognition is a primary consideration. Additionally, it predicts that learners who extract acoustic word-forms should do worse with IDS than ADS. Similarities and differences with human infants' performance on word segmentation tasks are discussed.
Ludusan, B., Origlia, A. & Dupoux, E. (2015). Rhythm-Based Syllabic Stress Learning without Labelled Data. In Proceedings of Statistical Language and Speech Processing -SLSP 2015, (pp 185-196) . [abstract] In this paper we propose a method for syllabic stress annotation which does not require manual labels for the learning process, but uses stress labels automatically generated from a multiscale model of rhythm perception. The model gives in its output a sequence of events, corresponding to the sequences of strong-weak syllables present in speech, based on which a stressed/unstressed decision is taken. We tested our approach on two languages, Catalan and Spanish, and we found that a supervised system employing the automatic labels for learning improves the performance over the baseline, for both languages. We also compared the results of this system with that of an identical learning algorithm, but which employs manual labels for stress, as well as to that of an unsupervised learning algorithm using the same features. It showed that the system using automatic labels has a similar performance to the one using manual labels, with both supervised systems outperforming the clustering algorithm.
Ludusan, B., Caranica, A., Cucu, H., Buzo, A., Burileanu, C. & Dupoux, E. (2015). Exploring multi-language resources for unsupervised spoken term discovery. In Speech Technology and Human-Computer Dialogue (SpeD), 2015 International Conference on, (pp 1-6) . [abstract] With information processing and retrieval of spoken documents becoming an important topic, there is a need for systems performing automatic segmentation of audio streams. Among such algorithms, spoken term discovery allows the extraction of word-like units (terms) directly from the continuous speech signal, in an unsupervised manner and without any knowledge of the language at hand. Since the performance of any downstream application depends on the goodness of the terms found, it is relevant to try to obtain higher quality automatic terms. In this paper we investigate whether the use of input features derived from multi-language resources helps the process of term discovery. For this, we employ an open-source phone recognizer to extract posterior probabilities and phone segment decisions, for several languages. We examine the features obtained from a single language and from combinations of languages based on the spoken term discovery results attained on two different datasets of English and Xitsonga. Furthermore, a comparison to the results obtained with standard spectral features is performed and the implications of the work discussed.
Ludusan, B. & Dupoux, E. (2015). A multilingual study on intensity as a cue for marking prosodic boundaries. In ICPhS, (pp e982) . [abstract] Speech intensity is one of the main prosodic cues, playing a role in most of the suprasegmental phenomena. Despite this, its contribution to the signalling of prosodic hierarchy is still relatively under-studied, compared to the other cues, like duration or fundamental frequency. We present here an investigation on the role of intensity in prosodic boundary detection in four different languages, by testing several intensity measures. The statistical analysis performed showed significant correlates of prosodic boundaries, for most intensity measures employed and in all languages. Our findings were further validated with a classification experiment in which the boundary/non-boundary distinction was learned in unsupervised manner, using only intensity cues. It showed that intensity range measures outperform absolute intensity measures, with the total intensity range being consistently the best feature.
Johnson, M., Pater, J., Staub, R. & Dupoux, E. (2015). Sign constraints on feature weights improve a joint model of word segmentation and phonology. In NAACL HLT 2015, (pp 303-313) . [abstract] This paper describes a joint model of word segmentation and phonological alternations, which takes unsegmented utterances as input and infers word segmentations and underlying phonological representations. The model is a Maximum Entropy or log-linear model, which can express a probabilistic version of Optimality Theory (OT; Prince and Smolensky, 2004), a standard phonological framework. The features in our model are inspired by OT's Markedness and Faithfulness constraints. Following the OT principle that such features indicate ``violations'', we require their weights to be non-positive. We apply our model to a modified version of the Buckeye corpus (Pitt et al., 2007) in which the only phonological alternations are deletions of word-final /d/ and /t/ segments. The model sets a new state-of-the-art for this corpus for word segmentation, identification of underlying forms, and identification of /d/ and /t/ deletions. We also show that the OT-inspired sign constraints on feature weights are crucial for accurate identification of deleted /d/s; without them our model posits approximately 10 times more deleted underlying /d/s than appear in the manually annotated data.
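The sign constraint at the heart of this model can be illustrated with a toy MaxEnt learner in which weights are projected onto the non-positive orthant after each gradient step, so that OT-style violation features can only penalise an analysis; this is only a stand-in for the full joint segmentation-and-phonology model of the paper.

```python
import numpy as np

def fit_sign_constrained(X, y, lr=0.1, n_iter=500):
    """Logistic-regression-style MaxEnt learner in which all feature weights
    are constrained to be non-positive, as for OT 'violation' features
    (a toy stand-in for the full segmentation model in the paper)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))       # predicted probabilities
        grad = X.T @ (p - y) / len(y)            # gradient of the log-loss
        w -= lr * grad
        w = np.minimum(w, 0.0)                   # projection: weights <= 0
    return w

# toy data: two violation-count features; more violations -> less probable
rng = np.random.default_rng(3)
X = rng.integers(0, 4, (200, 2)).astype(float)
y = (rng.random(200) < 1.0 / (1.0 + np.exp(X @ np.array([1.0, 0.5])))).astype(float)
print(fit_sign_constrained(X, y))   # both learned weights come out negative
```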
Hermansky, H., Burget, L., Cohen, J., Dupoux, E., Feldman, N., Godfrey, J., Khudanpur, S., Maciejewski, M., Mallidi, S.H., Menon, A., Ogawa, T., Peddinti, V., Rose, R., Stern, R., Wiesner, M. & Vesely, K. (2015). Towards machines that know when they do not know: Summary of work done at 2014 Frederick Jelinek memorial workshop in Prague. In ICASSP-2015 (IEEE International Conference on Acoustics Speech and Signal Processing), (pp 5009-5013) . [abstract] A group of junior and senior researchers gathered as a part of the 2014 Frederick Jelinek Memorial Workshop in Prague to address the problem of predicting the accuracy of a nonlinear Deep Neural Network probability estimator for unknown data in a different application domain from the domain in which the estimator was trained. The paper describes the problem and summarizes approaches that were taken by the group.
Fourtassi, A. (2015). Acquiring phonemes with early semantics. (Unpublished doctoral dissertation) Ecole Normale Supérieure.
Dunbar, E., Synnaeve, G. & Dupoux, E. (2015). Quantitative methods for comparing featural representations. In ICPhS, (pp paper number 1024) . [abstract] The basic representational hypothesis in phonology is that segments are coded using a universal set of discrete features. We propose a method for quantitatively measuring how well such features align with arbitrary segment representations. We assess articulatory, spectral, and phonotactic representations of English consonants. Our procedure constructs a concrete representation of a feature in terms of the pairs it distinguishes, and can be extended to any pair of representations to test the consistency of one with the individual dimensions of the other. We validate the method on our phonetic representations and then show that major natural classes are not well represented in the surface phonotactics.
Synnaeve, G., Versteegh, M. & Dupoux, E. (2014). Learning words from images and speech. In NIPS Workshop on Learning Semantics.
Synnaeve, G., Schatz, T. & Dupoux, E. (2014). Phonetics embedding learning with side information. In IEEE Spoken Language Technology Workshop, (pp 106-111) . [abstract] We show that it is possible to learn an efficient acoustic model using only a small amount of easily available word-level similarity annotations. In contrast to the detailed phonetic labeling required by classical speech recognition technologies, the only information our method requires are pairs of speech excerpts which are known to be similar (same word) and pairs of speech excerpts which are known to be different (different words). An acoustic model is obtained by training shallow and deep neural networks, using an architecture and a cost function well-adapted to the nature of the provided information. The resulting model is evaluated on an ABX minimal-pair discrimination task and is shown to perform much better (11.8% ABX error rate) than raw speech features (19.6%), not far from a fully supervised baseline (best neural network: 9.2%, HMM-GMM: 11%).
Synnaeve, G., Dautriche, I., Boerschinger, B., Johnson, M. & Dupoux, E. (2014). Unsupervised word segmentation in context. In Proceedings of 25th International Conference on Computational Linguistics (CoLing), (pp 2326-2334) . [abstract] This paper extends existing word segmentation models to take non-linguistic context into account. It improves the token F-score of well-performing segmentation models by 2.5% on a 27k utterances dataset. We posit that word segmentation is easier in-context because the learner is not trying to access irrelevant lexical items. We use topics from Latent Dirichlet Allocation as a proxy for activities context, to label the Providence corpus. We present Adaptor Grammar models that use these context labels, and we study their performance with and without context annotations at test time.
Schatz, T., Peddinti, V., Cao, X.N., Bach, F., Hermansky, H. & Dupoux, E. (2014). Evaluating speech features with the Minimal-Pair ABX task (II): Resistance to noise. In INTERSPEECH-2014, (pp 915-919) . [abstract] The Minimal-Pair ABX (MP-ABX) paradigm has been proposed as a method for evaluating speech features for zero-resource/unsupervised speech technologies. We apply it in a phoneme discrimination task on the Articulation Index corpus to evaluate the resistance to noise of various speech features. In Experiment 1, we evaluate the robustness to additive noise at different signal-to-noise ratios, using car and babble noise from the Aurora-4 database and white noise. In Experiment 2, we examine the robustness to different kinds of convolutional noise. In both experiments we consider two classes of techniques to induce noise resistance: smoothing of the time-frequency representation and short-term adaptation in the time-domain. We consider smoothing along the spectral axis (as in PLP) and along the time axis (as in FDLP). For short-term adaptation in the time-domain, we compare the use of a static compressive non-linearity followed by RASTA filtering to an adaptive compression scheme.
Ludusan, B., Versteegh, M., Jansen, A., Gravier, G., Cao, X.N., Johnson, M. & Dupoux, E. (2014). Bridging the gap between speech technology and natural language processing: an evaluation toolbox for term discovery systems. In Proceedings of LREC 2014, (pp 560-567) . [abstract] The unsupervised discovery of linguistic terms from either continuous phoneme transcriptions or from raw speech has seen an increasing interest in the past years both from a theoretical and a practical standpoint. Yet, there exists no commonly accepted evaluation method for the systems performing term discovery. Here, we propose such an evaluation toolbox, drawing ideas from both speech technology and natural language processing. We first transform the speech-based output into a symbolic representation and compute five types of evaluation metrics on this representation: the quality of acoustic matching, the quality of the clusters found, and the quality of the alignment with real words (type, token, and boundary scores). We tested our approach on two term discovery systems taking speech as input, and one using symbolic input. The latter was run using both the gold transcription and a transcription obtained from an automatic speech recognizer, in order to simulate the case when only imperfect symbolic information is available. The results obtained are analysed through the use of the proposed evaluation metrics and the implications of these metrics are discussed.
Ludusan, B., Gravier, G. & Dupoux, E. (2014). Incorporating Prosodic Boundaries in Unsupervised Term Discovery. In Proceedings of Speech Prosody, 7, (pp 939-943) . [abstract] We present a preliminary investigation on the usefulness of prosodic boundaries for unsupervised term discovery (UTD). Studies in language acquisition show that infants use prosodic boundaries to segment continuous speech into word-like units. We evaluate whether such a strategy could also help UTD algorithms. Running a previously published UTD algorithm (MODIS) on a corpus of prosodically annotated English broadcast news revealed that many discovered terms straddle prosodic boundaries. We then implemented two variants of this algorithm: one that discards straddling items and one that truncates them to the nearest boundary (either prosodic or pause marker). Both algorithms showed a better term matching Fscore compared to the baseline and higher level prosodic boundaries were found to be better than lower level boundaries or pause markers. In addition, we observed that the truncation algorithm, but not the discard algorithm, increased word boundary F-score over the baseline.
Ludusan, B. & Dupoux, E. (2014). Towards Low Resource Prosodic Boundary Detection. In Proceedings of International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU'14), (pp 231-237) . [abstract] In this study we propose a method of prosodic boundary detection based only on acoustic cues which are easily extractable from the speech signal and without any supervision. Drawing a parallel between the process of language acquisition in babies and the speech processing techniques for under-resourced languages, we take advantage of the findings of several psycholinguistic studies relative to the cues used by babies for the identification of prosodic boundaries. Several durational and pitch cues were investigated, by themselves or in a combination, and relatively good performances were achieved. The best result obtained, a combination of all the cues, compares well against a previously proposed approach, without relying on any learning method or any lexical or syntactic cues.
Johnson, M., Christophe, A., Demuth, K. & Dupoux, E. (2014). Modelling function words improves unsupervised word segmentation. In Proceedings of the 52nd Annual meeting of the ACL, (pp 282--292) . [abstract] Inspired by experimental psychological findings suggesting that function words play a special role in word learning, we make a simple modification to an Adaptor Grammar based Bayesian word segmentation model to allow it to learn sequences of monosyllabic "function words" at the beginnings and endings of collocations of (possibly multi-syllabic) words. This modification improves unsupervised word segmentation on the standard Bernstein-Ratner (1987) corpus of child-directed English by more than 4% token f-score compared to a model identical except that it does not special-case "function words", setting a new state-of-the-art of 92.4% token f-score. Our function word model assumes that function words appear at the left periphery, and while this is true of languages such as English, it is not true universally. We show that a learner can use Bayesian model selection to determine the location of function words in their language, even though the input to the model only consists of unsegmented sequences of phones. Thus our computational models support the hypothesis that function words play a special role in word learning.
Johnson, M. & Börschinger, B. (2014). Exploring the Role of Stress in Bayesian Word Segmentation using Adaptor Grammars. In Transactions of the Association for Computational Linguistics-2014, 2(Feb), (pp 93-104) . [abstract] Stress has long been established as a major cue in word segmentation for English infants. We show that enabling a current state-of-the-art Bayesian word segmentation model to take advantage of stress cues noticeably improves its performance. We find that the improvements range from 10 to 4%, depending on both the use of phonotactic cues and, to a lesser extent, the amount of evidence available to the learner. We also find that in particular early on, stress cues are much more useful for our model than phonotactic cues by themselves, consistent with the finding that children do seem to use stress cues before they use phonotactic cues. Finally, we study how the model's knowledge about stress patterns evolves over time. We not only find that our model correctly acquires the most frequent patterns relatively quickly but also that the Unique Stress Constraint that is at the heart of a previously proposed model does not need to be built in but can be acquired jointly with word segmentation.
Fourtassi, A., Schatz, T., Varadarajan, B. & Dupoux, E. (2014). Exploring the Relative Role of Bottom-up and Top-down Information in Phoneme Learning. In Proceedings of the 52nd Annual meeting of the ACL, 2, (pp 1-6) Association for Computational Linguistics. [abstract] We test both bottom-up and top-down approaches in learning the phonemic status of the sounds of English and Japanese. We used large corpora of spontaneous speech to provide the learner with an input that models both the linguistic properties and statistical regularities of each language. We found both approaches to help discriminate between allophonic and phonemic contrasts with a high degree of accuracy, although top-down cues, based on the properties of the lexicon, proved to be effective only on an interesting subset of the data. We test their performance in a task that consists in discriminating within-category contrasts from between-category contrasts. Finally, we discuss the role and scope of each approach in learning phonemes.
Fourtassi, A., Dunbar, E. & Dupoux, E. (2014). Self Consistency as an Inductive Bias in Early Language Acquisition. In Proceedings of the 36th Annual Meeting of the Cognitive Science Society, (pp 469-474) . [abstract] In this paper we introduce an inductive bias for language acquisition. It is based on a holistic approach, whereby the levels of representations are not treated in isolation, but as different interacting parts. The best representation of the sound system is the one that leads to the best lexicon, defined as the one that sustains the most coherent semantics. We quantify this coherence through an intrinsic and unsupervised measure called "Self Consistency". We found this measure to be optimal under the true phonemic inventory and the correct word segmentation in English and Japanese.
Fourtassi, A. & Dupoux, E. (2014). A Rudimentary Lexicon and Semantics Help Bootstrap Phoneme Acquisition. In Proceedings of the 18th Conference on Computational Natural Language Learning (CoNLL), (pp 191-200) Association for Computational Linguistics. [abstract] Infants spontaneously discover the relevant phonemes of their language without any direct supervision. This acquisition is puzzling because it seems to require the availability of high levels of linguistic structures (lexicon, semantics), which logically presuppose that the infants already have a set of phonemes. We show how this circularity can be broken by testing, in real-size language corpora, a scenario whereby infants would learn approximate representations at all levels, and then refine them in a mutually constraining way. We start with corpora of spontaneous speech that have been encoded in a varying number of detailed context-dependent allophones. We derive an approximate lexicon and a rudimentary semantic representation. Despite the fact that all these representations are poor approximations of the ground truth, they help reorganize the fine grained categories into phoneme-like categories with a high degree of accuracy.
Dupoux, E. (2014). Towards Quantitative Studies of Early Cognitive Development. Autonomous Mental Development Technical Committee Newsletter, 11(1), 10-11.
Synnaeve, G. & Dupoux, E. (2013). In Depth Deep Beliefs Networks for Phone Recognition. In Poster presented in NIPS-2013.
Schatz, T., Peddinti, V., Bach, F., Jansen, A., Hermansky, H. & Dupoux, E. (2013). Evaluating speech features with the Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline. In INTERSPEECH-2013, (pp 1781-1785) . [abstract] We present a new framework for the evaluation of speech representations in zero-resource settings, that extends and complements previous work by Carlin, Jansen and Hermansky [1]. In particular, we replace their Same/Different discrimination task by several Minimal-Pair ABX (MP-ABX) tasks. We explain the analytical advantages of this new framework and apply it to decompose the standard signal processing pipelines for computing PLP and MFC coefficients. This method enables us to confirm and quantify a variety of well-known and not-so-well-known results in a single framework.
Ontanon, S., Synnaeve, G., Uriarte, A., Richoux, F., Churchill, D. & Preuss, M. (2013). A Survey of Real-Time Strategy Game AI Research and Competition in StarCraft. In Computational Intelligence and AI in Games, IEEE Transactions on, 5(4), (pp 293-311) .
Martin, A., Peperkamp, S. & Dupoux, E. (2013). Learning Phonemes with a Proto-lexicon. Cognitive Science, 37, 103-124. [abstract] Before the end of the first year of life, infants begin to lose the ability to perceive distinctions between sounds that are not phonemic in their native language. It is typically assumed that this developmental change reflects the construction of language-specific phoneme categories, but how these categories are learned largely remains a mystery. Peperkamp, Le Calvez, Nadal, & Dupoux (2006) present an algorithm that can discover phonemes using the distributions of allophones as well as the phonetic properties of the allophones and their contexts. We show that a third type of information source, the occurrence of pairs of minimally-differing word forms in speech heard by the infant, is also useful for learning phonemic categories, and is in fact more reliable than purely distributional information in data containing a large number of allophones. In our model, learners build an approximation of the lexicon consisting of the high-frequency n-grams present in their speech input, allowing them to take advantage of top-down lexical information without needing to learn words. This may explain how infants have already begun to exhibit sensitivity to phonemic categories before they have a large receptive lexicon.
Jansen, A., Dupoux, E., Goldwater, S., Johnson, M., Khudanpur, S., Church, K., Feldman, N., Hermansky, H., Metze, F., Rose, R., Seltzer, M., Clark, P., McGraw, I., Varadarajan, B., Bennett, E., Boerschinger, B., Chiu, J., Dunbar, E., Fourtassi, A., Harwath, D., Lee, C.y., Levin, K., Norouzian, A., Peddinti, V., Richardson, R., Schatz, T. & Thomas, S. (2013). A summary of the 2012 JH CLSP Workshop on zero resource speech technologies and models of early language acquisition. In ICASSP-2013 (IEEE International Conference on Acoustics Speech and Signal Processing), (pp 8111-8115) . [abstract] We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding zero resource (unsupervised) speech technologies and related models of early language acquisition. Centered around the tasks of phonetic and lexical discovery, we consider unified evaluation metrics, present two new approaches for improving speaker independence in the absence of supervision, and evaluate the application of Bayesian word segmentation algorithms to automatic subword unit tokenizations. Finally, we present two strategies for integrating zero resource techniques into supervised settings, demonstrating the potential of unsupervised methods to improve mainstream technologies.
Fourtassi, A., Boerschinger, B., Johnson, M. & Dupoux, E. (2013). Why is English so easy to segment? In Proceedings of the 4th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2013), (pp 1-10) . [abstract] Cross-linguistic studies on unsupervised word segmentation have consistently shown that English is easier to segment than other languages. In this paper, we propose an explanation based on the notion of segmentation ambiguity. We show that English has a very low segmentation ambiguity compared to Japanese and that this difference correlates with the segmentation performance in a unigram model. We suggest that segmentation ambiguity is linked to a trade-off between syllable structure complexity and word length distribution.
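The notion of segmentation ambiguity can be illustrated with a small dynamic-programming count of how many ways an unsegmented string parses into known words; this proxy is for illustration only and is not the exact measure used in the paper.

```python
def count_parses(utterance, lexicon):
    """Number of distinct ways utterance can be split into lexicon items."""
    n = len(utterance)
    parses = [1] + [0] * n               # parses[i]: parses of utterance[:i]
    for end in range(1, n + 1):
        for start in range(end):
            if parses[start] and utterance[start:end] in lexicon:
                parses[end] += parses[start]
    return parses[n]

# A language where many substrings are themselves words is more ambiguous:
print(count_parses("icecream", {"i", "ice", "cream", "icecream"}))  # -> 2
```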
Fourtassi, A. & Dupoux, E. (2013). A corpus-based evaluation method for Distributional Semantic Models. In Proceedings of ACL-SRW 2013, (pp 165-171) . [abstract] Evaluation methods for Distributional Semantic Models typically rely on behaviorally derived gold standards. These methods are difficult to deploy in languages with scarce linguistic/behavioral resources. We introduce a corpus-based measure that evaluates the stability of the lexical semantic similarity space using a pseudo-synonym same-different detection task and no external resources. We show that it enables us to predict two behavior-based measures across a range of parameters in a Latent Semantic Analysis model.
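A rough sketch of the pseudo-synonym same-different idea follows. It assumes that, for each frequent word, two vectors have already been estimated from disjoint halves of its occurrences (the splitting and the model training are outside this snippet), and it scores how often the two halves of a word are more similar to each other than to a random distractor. The names and the cosine/distractor setup are illustrative, not the paper's exact protocol.

```python
import numpy as np

def pseudo_synonym_accuracy(half1_vecs, half2_vecs, distractor_vecs, seed=0):
    """Fraction of words whose two pseudo-synonym vectors (half1_vecs[i],
    half2_vecs[i]) are closer (by cosine) to each other than half1_vecs[i]
    is to a randomly drawn distractor vector."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    rng = np.random.default_rng(seed)
    hits = 0
    for i in range(len(half1_vecs)):
        distractor = distractor_vecs[rng.integers(len(distractor_vecs))]
        if cos(half1_vecs[i], half2_vecs[i]) > cos(half1_vecs[i], distractor):
            hits += 1
    return hits / len(half1_vecs)
```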
Dupoux, E., Beraud-Sudreau, G. & Sagayama, S. (2011). Templatic features for modeling phoneme acquisition. In Proceedings of the 33rd Annual Conference of the Cognitive Science Society, Boston, Mass.. [abstract] We describe a model for the coding of speech sounds into a high dimensional space. This code is obtained by computing the similarity between speech sounds and stored syllable-sized templates. We show that this code yields a better linear separation of phonemes than the standard MFCC code. Additional experiments show that the code is tuned to a particular language, and is able to use temporal cues for the purpose of phoneme recognition. Optimal templates seem to correspond to chunks of speech of around 120ms containing transitions between phonemes or syllables.
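The template-based coding can be pictured as follows, in a loose, hypothetical sketch (the paper's actual similarity function, template selection, and 120 ms chunking are not reproduced here): a segment is represented by its negated DTW distances to a bank of stored syllable-sized templates.

```python
import numpy as np

def dtw_distance(x, y):
    """Plain DTW distance between two feature sequences (frames x dims),
    with Euclidean frame-to-frame costs."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    D = np.full((len(x) + 1, len(y) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[len(x), len(y)]

def templatic_code(segment, templates):
    """High-dimensional code: one (negated) DTW distance per stored template."""
    return np.array([-dtw_distance(segment, t) for t in templates])
```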
Boruta, L. (2011). Combining Indicators of Allophony. In Proceedings ACL-SRW, (pp 88-93) .
Boruta, L., Peperkamp, S., Crabbé, B. & Dupoux, E. (2011). Testing the robustness of online word segmentation: effects of linguistic diversity and phonetic variation. In Proceedings of the 2011 Workshop on Cognitive Modeling and Computational Linguistics, ACL, 1-9, Portland, Oregon. [abstract] Models of the acquisition of word segmentation are typically evaluated using phonemically transcribed corpora. Accordingly, they implicitly assume that children know how to undo phonetic variation when they learn to extract words from speech. Moreover, whereas models of language acquisition should perform similarly across languages, evaluation is often limited to English samples. Using child-directed corpora of English, French and Japanese, we evaluate the performance of state-of-the-art statistical models given inputs where phonetic variation has not been reduced. To do so, we measure segmentation robustness across different levels of segmental variation, simulating systematic allophonic variation or errors in phoneme recognition. We show that these models do not resist an increase in such variations and do not generalize to typologically different languages. From the perspective of early language acquisition, the results strengthen the hypothesis according to which phonological knowledge is acquired in large part before the construction of a lexicon.
Varadarajan, B., Khudanpur, S. & Dupoux, E. (2008). Unsupervised Learning of Acoustic Subword Units. In Proceedings of ACL-08: HLT, (pp 165-168) . [abstract] Accurate unsupervised learning of phonemes of a language directly from speech is demonstrated via an algorithm for joint unsupervised learning of the topology and parameters of a hidden Markov model (HMM); states and short state-sequences through this HMM correspond to the learnt sub-word units. The algorithm, originally proposed for unsupervised learning of allophonic variations within a given phoneme set, has been adapted to learn without any knowledge of the phonemes. An evaluation methodology is also proposed, whereby the state-sequence that aligns to a test utterance is transduced in an automatic manner to a phoneme-sequence and compared to its manual transcription. Over 85% phoneme recognition accuracy is demonstrated for speaker-dependent learning from fluent, large-vocabulary speech.
Peperkamp, S. & Dupoux, E. (2007). Learning the mapping from surface to underlying representations in an artificial language. In J. Cole & J. Hualde (eds) Laboratory Phonology, 9, Mouton de Gruyter. [abstract] When infants acquire their native language they not only extract language-specific segmental categories and the words of their language, they also learn the underlying form of these words. This is difficult because words can have multiple phonetic realizations, according to the phonological context. In a series of artificial language-learning experiments with a phrase-picture matching task, we consider the respective contributions of word meaning and distributional information for the acquisition of underlying representations in the presence of an allophonic rule. We show that on the basis of semantic information, French adults can learn to map voiced and voiceless stops or fricatives onto the same underlying phonemes, whereas in their native language voicing is phonemic in all obstruents. They do not extend this knowledge to novel stops or fricatives, though. In the presence of distributional cues only, learning is much reduced and limited to the words subjects are trained on. We also test if phonological naturalness plays a role in this type of learning, and find that if semantic information is present, French adults can learn to map different segments onto a single underlying phoneme even if the mappings are highly unnatural. We discuss our findings in light of current statistical learning approaches to language acquisition.
Le Calvez, R., Peperkamp, S. & Dupoux, E. (2007). Bottom-up learning of phonemes: A computational study. In S. Vosniadou, D. Kayser & A. Protopapas (eds) Proceedings of the Second European Cognitive Science Conference, Taylor and Francis. (French translation in Mathematiques et Sciences Humaines 2007(4), 99-111). [abstract] We present a computational evaluation of a hypothesis according to which distributional information is sufficient to acquire allophonic rules (and hence phonemes) in a bottom-up fashion. The hypothesis was tested using a measure based on information theory that compares distributions. The test was conducted on several artificial language corpora and on two natural corpora containing transcriptions of speech directed to infants from two typologically distant languages (French and Japanese). The measure was complemented with three filters, one concerning the statistical reliability due to sample size and two concerning the following universal properties of allophonic rules: constituents of an allophonic rule should be phonetically similar, and allophonic rules should be assimilatory in nature.
Peperkamp, S., Le Calvez, R., Nadal, J.P. & Dupoux, E. (2006). The acquisition of allophonic rules: Statistical learning with linguistic constraints. Cognition, 101(3), B31-B41. [abstract] Phonological rules relate surface phonetic word forms to abstract underlying forms that are stored in the lexicon. Infants must thus acquire these rules in order to infer the abstract representation of words. We implement a statistical learning algorithm for the acquisition of one type of rule, namely allophony, which introduces context-sensitive phonetic variants of phonemes. This algorithm is based on the observation that different realizations of a single phoneme typically do not appear in the same contexts (ideally, they have complementary distributions). In particular, it measures the discrepancies in context probabilities for each pair of phonetic segments. In Experiment 1, we test the algorithm's performances on a pseudo-language and show that it is robust to statistical noise due to sampling and coding errors, and to non-systematic rule application. In Experiment 2, we show that a natural corpus of semiphonetically transcribed child-directed speech in French presents a very large number of near-complementary distributions that do not correspond to existing allophonic rules. These spurious allophonic rules can be eliminated by a linguistically motivated filtering mechanism based on a phonetic representation of segments. We discuss the role of a priori linguistic knowledge in the statistical learning of phonology.
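One plausible instantiation of the context-discrepancy measure is a smoothed, symmetrised Kullback-Leibler divergence between the context distributions of every pair of segments; pairs with near-complementary distributions get high scores and become allophone candidates. The sketch below follows that reading; it is not necessarily the paper's exact formula, and the linguistically motivated filters are omitted.

```python
import math
from collections import Counter, defaultdict

def context_divergence(corpus, smoothing=1e-6):
    """Symmetrised, smoothed KL divergence between the (previous, next)
    context distributions of every pair of segments; higher scores mean
    more nearly complementary distributions (allophone candidates)."""
    contexts = defaultdict(Counter)
    for utt in corpus:                        # utt: a list of phone symbols
        padded = ["#"] + list(utt) + ["#"]
        for i in range(1, len(padded) - 1):
            contexts[padded[i]][(padded[i - 1], padded[i + 1])] += 1
    all_ctx = sorted({c for seg in contexts.values() for c in seg})

    def dist(seg):
        total = sum(contexts[seg].values()) + smoothing * len(all_ctx)
        return [(contexts[seg][c] + smoothing) / total for c in all_ctx]

    scores, segments = {}, sorted(contexts)
    for i, s1 in enumerate(segments):
        p = dist(s1)
        for s2 in segments[i + 1:]:
            q = dist(s2)
            scores[(s1, s2)] = (
                sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
                + sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
            )
    return scores
```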
Dupoux, E. (2004). The Acquisition of Discrete Segmental Categories: Data and Model. In Proceedings of the 18th International Congress of Acoustics, Kyoto. [abstract] The way in which we parse continuous speech into discrete phonemes is highly language-dependent. Here, we first report that this phenomenon not only depends on the inventory of phonetic distinctions in the language, but also on the inventory of syllabic types. This is illustrated by studies showing that Japanese listeners perceptually insert epenthetic vowels inside illegal consonant clusters in order to make them legal. We then argue that this raises a bootstrapping problem for language acquisition, as the learning of phonetic inventories and syllabic types depend on each other. We present an acquisition model based on the storing and analysis of phonetic syllabic templates. We argue that this model has the potential of solving the bootstrapping problem as well as accounting for a range of observations regarding perceptual categorization of speech sounds.
Peperkamp, S. & Dupoux, E. (2003). Reinterpreting loanword adaptations: The role of perception. In Proceedings of the 15th International Congress of Phonetic Sciences, (pp 367-370) . [abstract] Standard phonological accounts of loanword adaptations state that the inputs to the adaptations are constituted by the surface forms of the words in the source language and that the adaptations are computed by the phonological grammar of the borrowing language. In processing terms, this means that in perception, the phonetic form of the source words is faithfully copied onto an abstract underlying form, and that adaptations are produced by the standard phonological processes in production. We argue that this is at odds with speech perception models and propose that loanword adaptations take place in perception and are defined as phonetically minimal transformations.
Peperkamp, S. & Dupoux, E. (2002). Coping with phonological variation in early lexical acquisition. In I. Lasser(ed) The Process of Language Acquisition, (pp 359-385) Berlin: Peter Lang Verlag. [abstract] Models of lexical acquisition assume that infants can somehow extract unique word forms out of the speech stream before they acquire the meaning of words (e.g. Siskind 1996). However, words often surface with different phonetic forms due to the application of postlexical phonological processes; that is, surface word forms exhibit what we call phonological variation. In this paper, we will examine if and how infants that do not have a semantic lexicon might undo phonological variation, i.e. deduce which phonological processes apply and infer unique underlying word forms that will constitute lexical entries. We will propose a learning mechanism that deduces which rule applies and infers underlying phonemes and word forms. This mechanism is based on an examination of the distribution of either surface segments or surface word forms. The distribution of segments will be shown to provide sufficient information in the case of allophonic rules, i.e. rules that involves segments that do not otherwise occur in the language; the distribution of segments that are introduced by this type of rule is complementary to that of segments that are the direct phonetic realization of certain phonemes. The distribution of word forms will be shown to be necessary in cases in which all surface segments have a phonemic status in the language. In particular, infants can make use of the fact that certain word forms - i.e. the ones that have undergone the rule - fail to occur at the left or right edge of certain phrasal constituents, where the context for application of the rule is never met. This proposal makes predictions regarding the order in which various types of phonological variations can be coped with in the infant.
Dupoux, E. & Peperkamp, S. (2002). Fossil markers of language development: phonological deafnesses in adult speech processing. In B. Laks & J. Durand (eds) Phonetics, Phonology, and Cognition, (pp 168-190) Oxford: Oxford University Press. [abstract] The sound pattern of the language(s) we have heard as infants affects the way in which we perceive linguistic sounds as adults. Typically, some foreign sounds are very difficult to perceive accurately, even after extensive training. For instance, native speakers of French have trouble distinguishing foreign words that differ only in the position of main stress, French being a language in which stress is not contrastive. In this paper, we propose to explore the perception of foreign sounds cross-linguistically in order to understand the processes that govern early language acquisition. Specifically, we propose to test the hypothesis that early language acquisition begins by using only regularities that infants can observe in the surface speech stream (Bottom-Up Bootstrapping), and compare it with the hypothesis that they use all possible sources of information, including, for instance, word boundaries (Interactive Bootstrapping). We set up a research paradigm using the stress system, since it allows us to test the various options at hand within a single test procedure. We distinguish four types of regular stress systems, the acquisition of which requires different sources of information. We show that the two hypotheses make contrastive predictions as to the pattern of stress perception of adults in these four types of languages. We conclude that cross-linguistic research on adults' speech perception, when coupled with detailed linguistic analysis, can be brought to bear on important issues of language acquisition.
Christophe, A., Guasti, T., Nespor, M., Dupoux, E. & Van Ooyen, B. (1997). Reflections on phonological bootstrapping: Its role for lexical and syntactic acquisition. Language and Cognitive Processes, 12(5-6), 585-612. [abstract] ``Phonological bootstrapping'' is the hypothesis that a purely phonological analysis of the speech signal may allow infants to start acquiring the lexicon and syntax of their native language (Morgan & Demuth, 1996a). To assess this hypothesis, a first step is to estimate how much information is provided by a phonological analysis of the speech input conducted in the absence of any prior (language-specific) knowledge in other domains such as syntax or semantics. We first review existing work on how babies may start acquiring a lexicon by relying on distributional regularities, phonotactics, typical word shape and prosodic boundary cues. Taken together, these sources of information may enable babies to learn the sound pattern of a reasonable number of the words in their native language. We then focus on syntax acquisition and discuss how babies may set one of the major structural syntactic parameters, the head direction parameter, by listening to prominence within phonological phrases and before they possess any words. Next, we discuss how babies may hope to acquire function words early, and how this knowledge would help lexical segmentation and acquisition, as well as syntactic analysis and acquisition. We then present a model of phonological bootstrapping of the lexicon and syntax that helps us to illustrate the congruence between problems. Some sources of information appear to be useful for more than one purpose; for example, phonological phrases and function words may help lexical segmentation as well as segmentation into syntactic phrases and labelling (NP, VP, etc.). Although our model derives directly from our reflection on acquisition, we argue that it may also be adequate as a model of adult speech processing. Since adults allow a greater variety of experimental paradigms, an advantage of our approach is that specific hypotheses can be tested on both populations. We illustrate this aspect in the final section of the paper, where we present the results of an adult experiment which indicates that prosodic boundaries and function words play an important role in continuous speech processing.
Christophe, A. & Dupoux, E. (1996). Bootstrapping lexical acquisition: The role of prosodic structure. Linguistic Review, 13(3-4), 383-412.
Schatz, T., Xuan-Nga, C., Kolesnikova, A., Bergvelt, T., Wright, J., & Dupoux, E. (2015). Articulation Index LSCP LDC2015S12. Web Download. Philadelphia: Linguistic Data Consortium. Download: catalog.ldc.upenn.edu/LDC2015S12.
Tsuji, S., Fikkert, P., Minagawa-Kawai, Y., Dupoux, E., Filippin, L., Versteegh, M., Hagoort, P. & Cristia, A. (2017). The more, the better? Behavioral and neural correlates of frequent and infrequent vowel exposure. Developmental Psychobiology, 59, 603-612. [abstract] A central assumption in the perceptual attunement literature holds that exposure to a speech sound contrast leads to improvement in native speech sound processing. However, whether the amount of exposure matters for this process has not been put to a direct test. We elucidated indicators of frequency-dependent perceptual attunement by comparing 5- to 8-month-old Dutch infants' discrimination of tokens containing a highly frequent [hɪt-heːt] and a highly infrequent [hʏt-høːt] native vowel contrast as well as a non-native [hɛt-hæt] vowel contrast in a behavioral visual habituation paradigm (Experiment 1). Infants discriminated both native contrasts similarly well, but did not discriminate the non-native contrast. We sought further evidence for subtle differences in the processing of the two native contrasts using near-infrared spectroscopy and a within-participant design (Experiment 2). The neuroimaging data did not provide additional evidence that responses to native contrasts are modulated by frequency of exposure. These results suggest that even large differences in exposure to a native contrast may not directly translate to behavioral and neural indicators of perceptual attunement, raising the possibility that frequency of exposure does not influence improvements in discriminating native contrasts.
Gvozdic, K., Moutier, S., Dupoux, E. & Buon, M. (2016). Priming Children's Use of Intentions in Moral Judgement with Metacognitive Training. Frontiers in Language Sciences, 7(190).
Dupoux, E. (2015). Category Learning in Songbirds: top-down effects are not unique to humans. Current Biology, 25(16), R718-R720. [abstract] Human infants use higher order patterns (words) to learn the sound category of their language. A new study using artificial patterns made up of naturally occurring vocalizations shows that a similar mechanism may also exist in songbirds.
Cristia, A., Minagawa-Kawai, Y., Vendelin, I., Cabrol, D. & Dupoux, E. (2014). Responses to vocalizations and auditory controls in the human newborn brain. Plos One, 9(12), e115162. [abstract] The functional organization of the human adult brain allows selective activation of specific regions in response to stimuli. In the adult, linguistic processing has been associated with left-dominant activations in perisylvian regions, whereas emotional vocalizations can give place to right-dominant activation in posterior temporal cortices. Near Infrared Spectroscopy was used to register the response of 40 newborns' temporal regions when stimulated with speech, human and macaque emotional vocalizations, and auditory controls where the formant structure was destroyed but the long-term spectrum was retained. Speech elicited left-dominant activation in one channel in left posterior temporal cortices, as well as in more anterior, deeper tissue with no clear lateralization. Emotional vocalizations induced large, left-dominant activations in more anterior regions. Finally, activation elicited by the control stimuli was right-dominant, and more variable across infants. Overall, these results suggest that left-dominance for speech processing in newborns may be partially modulated by the presence of formant structure, which is shared between speech and non-linguistic vocalizations. Moreover, they indicate that development plays an important role in shaping the cortical networks involved in the processing of emotional vocalizations.
Cristia, A., Minagawa-Kawai, Y., Egorova, N., Gervain, J., Filippin, L., Cabrol, D. & Dupoux, E. (2014). Neural correlates of infant dialect discrimination: A fNIRS study. Developmental Science, 17(4), 628-635. [abstract] The present study investigated the neural correlates of infant discrimination of very similar linguistic varieties (Quebecois and Parisian French) using functional Near InfraRed Spectroscopy. In line with previous behavioral and electrophysiological data, there was no evidence that 3-month-olds discriminated the two regional accents, whereas 5-month-olds did, with the locus of discrimination in left anterior perisylvian regions. These neuroimaging results suggest that a developing language network relying crucially on left perisylvian cortices sustains infants' discrimination of similar linguistic varieties within this early period of infancy.
Ngon, C., Martin, A., Dupoux, E., Cabrol, D. & Peperkamp, S. (2013). (Non)words, (non)words, (non)words: Evidence for a proto-lexicon during the first year of life. Developmental Science, 16(1), 24-34. [abstract] Previous research with artificial language learning paradigms has shown that infants are sensitive to statistical cues to word boundaries (Saffran, Aslin & Newport, 1996) and that they can use these cues to extract word-like units (Saffran, 2001). However, it is unknown whether infants use statistical information to construct a recognition lexicon when acquiring their native language. In order to investigate this issue, we rely on the fact that besides real words a statistical algorithm extracts sound sequences that are highly frequent in infant-directed speech but constitute nonwords. In two experiments, we use a preferential listening paradigm to test French-learning 11-month-old infants' recognition of highly frequent disyllabic sequences from their native language. In Experiment 1, we use nonword stimuli and find that infants listen longer to high-frequency than to low-frequency sequences. In Experiment 2, we compare high-frequency nonwords to real words in the same frequency range, and find that infants show no preference. Thus, at 11 months, French-learning infants recognize highly frequent sound sequences from their native language and fail to differentiate between words and nonwords among these sequences. These results are evidence that they have used statistical information to extract word candidates from their input and store them in a ``proto-lexicon'', containing both words and nonwords.
Minagawa-Kawai, Y., Cristia, A., Long, B., Vendelin, I., Hakuno, Y., Dutat, M., Filippin, L., Cabrol, D. & Dupoux, E. (2013). Insights on NIRS sensitivity from a cross-linguistic study on the emergence of phonological grammar. Frontiers in Language Sciences, 4(170), 10.3389/fpsyg.2013.00170. [abstract] Each language has a unique set of phonemic categories and phonotactic rules which determine permissible sound sequences in that language. Behavioral research demonstrates that one's native language shapes the perception of both sound categories and sound sequences in adults, and neuroimaging results further indicate that the processing of native phonemes and phonotactics involves a left-dominant perisylvian brain network. Recent work using a novel technique, functional Near InfraRed Spectroscopy (NIRS), has suggested that a left-dominant network becomes evident toward the end of the first year of life as infants process phonemic contrasts. The present research project attempted to assess whether the same pattern would be seen for native phonotactics. We measured brain responses in Japanese- and French-learning infants to two contrasts: Abuna vs. Abna (a phonotactic contrast that is native in French, but not in Japanese) and Abuna vs. Abuuna (a vowel length contrast that is native in Japanese, but not in French). Results did not show a significant response to either contrast in either group, unlike both previous behavioral research on phonotactic processing and NIRS work on phonemic processing. To understand these null results, we performed similar NIRS experiments with Japanese adult participants. These data suggest that the infant null results arise from an interaction of multiple factors, involving the suitability of the experimental paradigm for NIRS measurements and stimulus perceptibility. We discuss the challenges facing this novel technique, particularly focusing on the optimal stimulus presentation which could yield strong enough hemodynamic responses when using the change detection paradigm.
Ramus, F., Peperkamp, S., Christophe, A., Jacquemot, C., Kouider, S. & Dupoux, E. (2011). A psycholinguistic perspective on the acquisition of phonology. In C. Fougeron, B. Kühnert, d'Imperio M. & Vallée N. (eds) Laboratory Phonology, 10, Berlin: Mouton de Gruyter. [abstract] This paper discusses the target articles by Fikkert, Vihman, and Goldrick & Larson, which address diverse aspects of the acquisition of phonology. These topics are examined using a wide range of tasks and experimental paradigms across different ages. Various levels of processing and representation are thus involved. The main point of the present paper is that such data can be coherently interpreted only within a particular information-processing model that specifies in sufficient detail the different levels of processing and representation. In this paper, we first present the basic architecture of a model of speech perception and production, justifying it with psycholinguistic and neuropsychological data. We then use this model to interpret data from the target articles relative to the acquisition of phonology.
Minagawa-Kawai, Y., van der Lely, H., Ramus, F., Sato, Y., Mazuka, R. & Dupoux, E. (2011). Optical Brain Imaging Reveals General Auditory and Language-Specific Processing in Early Infant Development. Cerebral Cortex, 21(2), 254-261. [abstract] This study uses near-infrared spectroscopy in young infants in order to elucidate the nature of functional cerebral processing for speech. Previous imaging studies of infants' speech perception revealed left-lateralized responses to native language. However, it is unclear if these activations were due to language per se rather than to some low-level acoustic correlate of spoken language. Here we compare native (L1) and non-native (L2) languages with 3 different nonspeech conditions including emotional voices, monkey calls, and phase scrambled sounds that provide more stringent controls. Hemodynamic responses to these stimuli were measured in the temporal areas of Japanese 4-month-olds. The results show clear left-lateralized responses to speech, prominently to L1, as opposed to various activation patterns in the nonspeech conditions. Furthermore, implementing a new analysis method designed for infants, we discovered a slower hemodynamic time course in awake infants. Our results are largely explained by signal-driven auditory processing. However, stronger activations to L1 than to L2 indicate a language-specific neural factor that modulates these responses. This study is the first to discover a significantly higher sensitivity to L1 in 4-month-olds and reveals a neural precursor of the functional specialization for the higher cognitive network.
Minagawa-Kawai, Y., Cristia, A., Vendelin, I., Cabrol, D. & Dupoux, E. (2011). Assessing signal-driven mechanisms in neonates: Brain responses to temporally and spectrally different sounds. Frontiers in Language Sciences, 2(135). [abstract] Past studies have found that, in adults, the acoustic properties of sound signals (such as fast vs. slow temporal features) differentially activate the left and right hemispheres, and some have hypothesized that left-lateralization for speech processing may follow from left-lateralization to rapidly changing signals. Here, we tested whether newborns' brains show some evidence of signal-specific lateralization responses using near-infrared spectroscopy (NIRS) and auditory stimuli that elicit lateralized responses in adults, composed of segments that vary in duration and spectral diversity. We found significantly greater bilateral responses of oxygenated hemoglobin (oxy-Hb) in the temporal areas for stimuli with a minimum segment duration of 21 ms than for stimuli with a minimum segment duration of 667 ms. However, we found no evidence for hemispheric asymmetries dependent on the stimulus characteristics. We hypothesize that acoustic-based functional brain asymmetries may develop throughout early infancy, and discuss their possible relationship with brain asymmetries for language.
Minagawa-Kawai, Y., Cristià, A. & Dupoux, E. (2011). Cerebral lateralization and early speech acquisition: A developmental scenario. Developmental Cognitive Neuroscience, 1(3), 217-232. [abstract] During the past ten years, research using Near-InfraRed Spectroscopy (NIRS) to study the developing brain has provided groundbreaking evidence of brain functions in infants. We review three competing classes of hypotheses (signal-driven, domain-driven, and learning-biases hypotheses) regarding the causes of hemispheric specialization for speech processing. We assess the fit between each of these hypotheses and neuroimaging evidence in speech perception and show that none of the three hypotheses can account for the entire set of observations on its own. However, we argue that they provide a good fit when combined within a developmental perspective. According to our proposed scenario, lateralization for language emerges out of the interaction between pre-existing left-right biases in generic auditory processing (signal-driven hypothesis), and a left-hemisphere predominance of particular learning mechanisms (learning-biases hypothesis). As a result of this completed developmental process, the native language is represented in the left hemisphere predominantly. The integrated scenario enables us to link infant and adult data, and points to many empirical avenues that need to be explored more systematically.
Mazuka, R., Cao, Y., Dupoux, E. & Christophe, A. (2011). The development of a phonological illusion: A cross-linguistic study with Japanese and French infants. Developmental Science, 14(4), 693-699. [abstract] In adults, the native language phonology has strong perceptual effects. Previous work showed that Japanese speakers, unlike French speakers, break up illegal sequences of consonants with illusory vowels: they report hearing abna as abuna. To study the development of the phonological grammar, we compared Japanese and French infants in a discrimination task. In Experiment 1, we observed that 14-month-old Japanese infants, in contrast with French infants, failed to discriminate phonetically varied sets of abna-type and abuna-type stimuli. In Experiment 2, 8-month-old French and Japanese infants did not differ significantly from each other. In Experiment 3, we found that, like adults, Japanese infants can discriminate abna from abuna when phonetic variability is reduced (single item). These results show that the phonologically-induced /u/ illusion is already experienced by Japanese infants at the age of 14 months. Hence, before having acquired many words of their language, they have grasped enough of their native phonological grammar to constrain their perception of speech sound sequences.
Dupoux, E., Peperkamp, S. & Sebastian-Galles, N. (2010). Limits on bilingualism revisited: Stress "deafness" in simultaneous French-Spanish bilinguals. Cognition, 114(2), 266-275. [abstract] We probed simultaneous French-Spanish bilinguals for the perception of Spanish lexical stress using three tasks, two short-term memory encoding tasks and a speeded lexical decision. In all three tasks, the performance of the group of simultaneous bilinguals was intermediate between that of native speakers of Spanish on the one hand and French late learners of Spanish on the other hand. Using a composite stress `deafness' index measure computed over the results of the three tasks, we found that the performance of the simultaneous bilinguals is best fitted by a bimodal distribution that corresponds to a mixture of the performance distributions of the two control groups. Correlation analyses showed that the variables explaining language dominance are linked to early language exposure. These findings are discussed in light of theories of language processing in bilinguals.
Skoruppa, K., Pons, F., Christophe, A., Bosch, L., Dupoux, E., Sebastian-Galles, N., Limissuri, R.A. & Peperkamp, S. (2009). Language-specific stress perception by 9-month-old French and Spanish infants. Developmental Science, 12(6), 914-919. [abstract] During the first year of life, infants begin to have difficulties perceiving non-native vowel and consonant contrasts, thus adapting their perception to the phonetic categories of the target language. In this paper, we examine the perception of a non-segmental feature, i.e. stress. Previous research with adults has shown that speakers of French (a language with fixed stress) have great difficulties in perceiving stress contrasts (Dupoux, Pallier, Sebastian & Mehler, 1997), whereas speakers of Spanish (a language with lexically contrastive stress) perceive these contrasts as accurately as segmental contrasts. We show that language-specific differences in the perception of stress likewise arise during the first year of life. Specifically, 9-month-old Spanish infants successfully distinguish between stress-initial and stress-final pseudo-words, while French infants of this age show no sign of discrimination. In a second experiment using multiple tokens of a single pseudo-word, French infants of the same age successfully discriminate between the two stress patterns, showing that they are able to perceive the acoustic correlates of stress. Their failure to discriminate stress patterns in the first experiment thus reflects an inability to process stress at an abstract, phonological level.
Darcy, I., Ramus, F., Christophe, A., Kinzler, K.D. & Dupoux, E. (2009). Phonological knowledge in compensation for native and non-native assimilation. In F. Kügler, C. Féry & R. van de Vijver (eds) Variation and Gradience in Phonetics and Phonology, (pp 265-309) Berlin: Mouton De Gruyter. [abstract] We investigated whether compensation for phonological assimilation depends on language-universal or language-specific processes. To this end, we tested two different assimilation rules, one that exists in English and involves place of articulation, and another that exists in French and involves voicing. Both contrasts were tested on speakers of French, British English and American English. In three experiments using a word detection task, we observed that monolingual participants showed a significantly higher degree of compensation for phonological changes that correspond to rules existing in their language than to rules that do not exist in their language (even though they are phonologically possible since they exist in another language). Thus, French participants compensated more for voicing than place assimilation, while British and American English participants compensated more for place than voicing assimilation. In all three experiments, we also found that the non-native rule induced a very small but significant compensation effect, suggesting that both a language-specific and a language-universal mechanism are at play. In Experiment 4, we studied native speakers of British English who were late learners of French: they showed the British pattern of results even when listening to French stimuli, confirming that compensation for assimilation is induced by language-specific phonological processes rather than specific phonetic cues. The results are discussed in light of current models of lexical access and phonological processing.
Minagawa-Kawai, Y., Mori, K., Hebden, J.C. & Dupoux, E. (2008). Optical Imaging of infants' neurocognitive development: Recent advances and perspectives. Developmental Neurobiology, 68(6), 712-728. [abstract] Near-infrared spectroscopy (NIRS) provides a unique method of monitoring infant brain function by measuring the changes in the concentrations of oxygenated and deoxygenated hemoglobin. During the past 10 years, NIRS measurement of the developing brain has rapidly expanded. In this article, a brief discussion of the general principles of NIRS, including its technical advantages and limitations, is followed by a detailed review of the role played so far by NIRS in the study of infant perception and cognition, including language, and visual and auditory functions. Results have highlighted, in particular, the developmental changes of cerebral asymmetry associated with speech acquisition. Finally, suggestions for future studies of neurocognitive development using NIRS are presented. Although NIRS studies of the infant brain have yet to fulfill their potential, a review of the work done so far indicates that NIRS is likely to provide many unique insights in the field of developmental neuroscience.
Dupoux, E., Sebastian-Galles, N., Navarrete, E. & Peperkamp, S. (2008). Persistent stress "deafness": The case of French learners of Spanish. Cognition, 106(2), 682-706. [abstract] Previous research by Dupoux et al. [Dupoux, E., Pallier, C., Sebastian, N., & Mehler, J. (1997). A destressing ``deafness'' in French? Journal of Memory and Language, 36, 406-421; Dupoux, E., Peperkamp, S., & Sebastian-Galles (2001). A robust method to study stress `deafness'. Journal of the Acoustical Society of America, 110, 1608-1618.] found that French speakers, as opposed to Spanish ones, are impaired in discrimination tasks with stimuli that vary only in the position of stress. However, what was called stress `deafness' was only found in tasks that used high phonetic variability and memory load, not in cognitively less demanding tasks such as single token AX discrimination. This raised the possibility that instead of a perceptual problem, monolingual French speakers might simply lack a metalinguistic representation of contrastive stress, which would impair them in memory tasks. We examined a sample of 39 native speakers of French who underwent formal teaching of Spanish after age 10, and varied in degree of practice in this language. Using a sequence recall task, we observed in all our groups of late learners of Spanish the same impairment in short-term memory encoding of stress contrasts that was previously found in French monolinguals. Furthermore, using a speeded lexical decision task with word-nonword minimal pairs that differ only in the position of stress, we found that all late learners had much difficulty in the use of stress to access the lexicon. Our results show that stress `deafness' is better interpreted as a lasting processing problem resulting from the impossibility for French speakers to encode contrastive stress in their phonological representations. This affects their memory encoding as well as their lexical access in on-line tasks. The generality of such a persistent suprasegmental `deafness' is discussed in relation to current findings and models on the perception of non-native phonological contrasts.
Minagawa-Kawai, Y., Naoi, N., Nishijima, N., Kojima, S. & Dupoux, E. (2007). Developmental changes in cerebral responses to native and non-native vowels: a NIRS study. In Proceedings of the International Conference of Phonetic Sciences XVI, (pp 1877-1880) Saarbrucken. [abstract] While newborn infants discriminate speech sounds from languages that they have never heard, 6-month-olds demonstrate the beginnings of vowel classification specific to their native language. The neuronal correlates involved in such a dramatic perceptual reorganization process, however, are not well understood. Using near-infrared spectroscopy (NIRS), this study compares the neural responses of Japanese infants at 3-4 months and 7-8 months of age as well as of adults to a native ([i] vs. [ɯ]) and a non-native vowel contrast ([ɯ] vs. [u]) within pseudo-word contexts. The findings demonstrated longitudinal developmental changes of functional temporal cortex asymmetries associated with exposure to the native language.
Peperkamp, S., Skoruppa, K. & Dupoux, E. (2006). The role of phonetic naturalness in phonological rule acquisition. In D. Bamman, T. Magnitskaia & C. Zaller (eds) Proceedings of the 30th Annual Boston University Conference on Language Development, Vols 1 and 2, (pp 464-475) . [abstract] The role of naturalness constraints in phonological learning is of considerable theoretical importance for linguistically motivated models of language acquisition. However, the existence of naturalness effects still does not rest on firm empirical grounds. P&D (in press) exposed French subjects to an artificial language consisting of determiner + noun phrases which obey either a natural allophonic rule that voices a subclass of obstruents intervocalically, or an unnatural one that defines arbitrary relationships among certain obstruents intervocalically. After exposure, a phrase-picture matching task was used to assess whether subjects had learned the allophonic distributions and hence distinguished between phonemic and allophonic contrasts among obstruents for the purposes of word identification. Surprisingly, P&D (in press) found that natural assimilatory rules and unnatural arbitrary rules were learned with equal ease. In the present study, we use exactly the same exposure phase, but change the test phase: here, subjects have to produce a noun phrase upon the presentation of a picture, both for nouns that they have been trained on during the exposure phase, and for novel nouns. We find that with this more ecologically valid, but also more demanding task, a naturalness effect emerges: subjects learned the rule on old items and extended it to novel items, but only for the natural assimilatory rules, not for the unnatural arbitrary rules. We discuss these findings in relation to existing studies of the acquisition of phonological rules. We distinguish at least three constraints that characterize rule naturalness, and discuss the role of task demands and response strategies in relation to the emergence of naturalness effects in learning studies using artificial languages.
Peperkamp, S., Pettinato, M. & Dupoux, E. (2003). Allophonic variation and the acquisition of phoneme categories. In B. Beachley, A. Brown & F. Conlin (eds) BUCLD 27: Annual Boston University Conference on Language Development, Vols 1 and 2, Proceedings, (pp 650-661) .
Pallier, C., Dehaene, S., Poline, J., LeBihan, D., Argenti, A., Dupoux, E. & Mehler, J. (2003). Brain imaging of language plasticity in adopted adults: Can a second language replace the first? Cerebral Cortex, 13(2), 155-161. [abstract] Do the neural circuits that subserve language acquisition lose plasticity as they become tuned to the maternal language? We tested adult subjects born in Korea and adopted by French families in childhood; they have become fluent in their second language and report no conscious recollection of their native language. In behavioral tests assessing their memory for Korean, we found that they do not perform better than a control group of native French subjects who have never been exposed to Korean. We also used event-related functional magnetic resonance imaging to monitor cortical activations while the Korean adoptees and native French listened to sentences spoken in Korean, French and other, unknown, foreign languages. The adopted subjects did not show any specific activations to Korean stimuli relative to unknown languages. The areas activated more by French stimuli than by foreign stimuli were similar in the Korean adoptees and in the French native subjects, but with relatively larger extents of activation in the latter group. We discuss these data in light of the critical period hypothesis for language acquisition.
Jacquemot, C., Pallier, C., LeBihan, D., Dehaene, S. & Dupoux, E. (2003). Phonological grammar shapes the auditory cortex: A functional magnetic resonance imaging study. Journal of Neuroscience, 23(29), 9541-9546. [abstract] Languages differ depending on the set of basic sounds they use (the inventory of consonants and vowels) and on the way in which these sounds can be combined to make up words and phrases (phonological grammar). Previous research has shown that our inventory of consonants and vowels affects the way in which our brains decode foreign sounds (Goto, 1971; Naatanen et al., 1997; Kuhl, 2000). Here, we show that phonological grammar has an equally potent effect. We build on previous research, which shows that stimuli that are phonologically ungrammatical are assimilated to the closest grammatical form in the language (Dupoux et al., 1999). In a cross-linguistic design using French and Japanese participants and a fast event-related functional magnetic resonance imaging (fMRI) paradigm, we show that phonological grammar involves the left superior temporal and the left anterior supramarginal gyri, two regions previously associated with the processing of human vocal sounds.
The team organized two workshops on zero-resource speech technologies and computational modeling of early language acquisition.
The first one took place at the Center for Language and Speech Processing at Johns Hopkins University on July 16-27, 2012. See Zero Resource Workshop #1.
The second one took place in Paris at the Ecole Normale Supérieure on July 29-August 2, 2013. See Zero Resource Workshop #2.
The team also organized:
a symposium at the ICIS-2014 conference in Berlin (July 3-5, 2014). See the program on the conference website.
the Zero Resource Speech Challenge series, see www.zerospeech.com.