Workshop on Computational Models of Early Language Acquisition and Zero Resource Speech Technologies

Monday July 29 - Friday August 2, 2013 - ENS, 29 rue d’Ulm, 75005 Paris

The unsupervised (zero resource) discovery of linguistic structure from speech is generating a lot of interest from two largely disjoint communities. The cognitive science community (psycholinguists, linguists, neurolinguists) wants to understand the mechanisms by which infants spontaneously discover linguistic structure. The machine learning community is more and more interested in deploying language/speech technologies in a variety of languages/dialects with limited or no linguistic resources. The aim of this workshop is to bring together a team of researchers and graduate students from these two communities, to engage in mutual presentations and discussion of current and future issues.

Specifically, this workshop has two aims: 1) identifying key issues and problems to be solved of interest to both communities, and 2) setting up standardized, common resources for comparing the different approaches to solving these problems (databases, evaluation criteria, software, etc). In this workshop, we will focus mainly but not exclusively on the discovery of two levels of linguistic structure: phonetic units (speech coding) and word-like units (higher order units). We are well aware of the fact that the definition of these levels, as well as their segregation from the rest of the linguistic system is itself a matter of debate, and we welcome discussions of these issues as well.

The workshop will start with 2 days of open symposium with formal presentations (open without registration), followed by 3 days of hands-on workshop (registration required; write to

This workshop is supported by the Labex IEC program ‘Frontiers in Cognition’ and the ERC Program “Bootphon”. It is the second edition of a series that started as a mini workshop at the Center for Language and Speech Processing at Johns Hopkins University. See Zero Resource workshop #1.

Monday July 29th: Symposium, Day 1

Each presentation will last for 35 min max, followed by 20 mins of discussion.

Morning 9:00a-12:30p

  • 9:00a: welcome / breakfast

  • 9:30a: Emmanuel Dupoux (EHESS, Paris). Welcome and Overview of objectives

  • 10:30a: Hynek Hermansky (J. Hopkins). Speech coding in ASR: some new and some very old ideas

  • 11:30a Mark Johnson (Macquarie U.): Overview of Bayesian Approaches

Lunch Break

Afternoon 2:30-5:30p

  • 2:30p Thomas Schatz (ENS). Evaluating speech features.

  • 3:30p Gabriel Synnaeve (ENS). Cross modal feature learning and Deep Belief Networks.

  • 4:30p Maarten Versteegh (Radboud Univ). A ‘bag-of-events’ approach to representing speech features

Tuesday July 30th: Symposium, Day 2

Morning 9:30a-12:30p

  • 9:30a Sharon Goldwater (U. Edinburgh). A joint model for unsupervised learning of segmentation, phonetics and the lexicon

  • 10:30a Naomi Feldman (U. Maryland). Predicting listeners’ perceptual biases using low-level speech features

  • 11:30a Antoine Bordes (UTC). Machine Learning, semantics & knowledge extraction: a tutorial

Lunch Break

Afternoon 2:30-6:30p

  • 2:30p Clement Moulin-Frier (Inria/ENSTA) Self organization of early vocal development in infants and machines.

  • 3:30p Ewan Dunbar (U. Maryland). Phonological features and phonetic category systems: How, why, and whether to take Roman Jakobson to an Indian Buffet

  • 4:30p Abdellah Fourtassi (ENS). Learning semantics and phonetics at the same time.

  • 5:30p Benjamin Boerschinger (Macquarie U.). Another joint model for unsupervised learning of phonetics and the lexicon.

Wednesday, July 31-Friday, Aug 2. Hands-on workshop

The aim of the hands-on workshop is to advance some of the issues raised in the discussions and make some progress towards addressing them, either at the conceptual level, or at the level of performing pilot experiments.


The following topics were discussed.

The Ideal Child-Directed Database

Participants: Goldwater, Dupoux, Versteegh

Noting the lack of good-quality, well-annotated child-directed corpora recorded in naturalistic situations, the group discussed what an ideal database would look like.


  • parents and children would be equipped with very light and unobtrusive recording devices (audio recorders and head-mounted video).

  • the recording would take place at home continuously for several days (in order to capture a variety of learning contexts)

  • the population would be infants between 5 and 12 months.


  • the multi channel audio would be analysed through signal processing techniques in order to isolate the parent’s input from background noise.

  • the parental input would be segmented, orthographically transcribed and force-aligned.

  • optionally: a phonetic transcription

  • the video would be digitally stabilized and coded in order to extract the child’s current focus of attention and the context

  • context coding would include a description of the scene and objects

  • event coding would include a description of the actions and objects involved

  • this coding should be done WITHOUT the audio channel.

Note: the event and context coding could be done, at least in part, through verbal descriptions collected from naive subjects (through Mechanical Turk, for instance).


  • reviewing existing databases and coding of video events

  • piloting some data recording

  • deciding on a coding scheme standard (an extension of the CHILDES standard)

What is semantics good for?

Participants: Swingley, Bordes, Fourtassi, Synnaeve, Johnson

The group discussed the possibility of evaluating the role of semantics in various learning tasks before having access to an ideal database. The idea is to generate synthetic data based on what we currently know about the availability and reliability of semantic cues in the infant’s input (regarding context, objects and events) and plug these cues in a probabilistic fashion into existing unannotated child databases.

For instance, work by Swingley et al. and others suggests that in about 60% of the cases where a concrete noun is mentioned in a sentence, its referent is present in the scene and/or is the focus of the child’s attention. The group discussed the importance of consolidating this sort of data, both across learning contexts and across cultures.
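The cue-injection idea described above can be illustrated with a minimal sketch. It assumes utterances are already tokenized into word lists and that a set of concrete nouns is given; the function name `add_synthetic_context`, the parameter `p_present`, and the toy data are hypothetical, and the 0.6 default only echoes the rough figure quoted above rather than a measured constant.

```python
import random


def add_synthetic_context(utterances, concrete_nouns, p_present=0.6, seed=0):
    """Pair each utterance with a synthetic scene context.

    Each concrete noun mentioned in an utterance is placed in the scene
    with probability p_present (an assumed, tunable reliability value).
    """
    rng = random.Random(seed)
    annotated = []
    for words in utterances:
        # Sort the mentioned nouns so that results are reproducible per seed.
        mentioned = sorted(set(words) & set(concrete_nouns))
        scene = {noun for noun in mentioned if rng.random() < p_present}
        annotated.append((words, scene))
    return annotated


# Toy, made-up input standing in for an unannotated corpus.
corpus = [["look", "at", "the", "dog"], ["where", "is", "the", "ball"]]
nouns = {"dog", "ball"}
for words, scene in add_synthetic_context(corpus, nouns):
    print(" ".join(words), "| scene:", sorted(scene))
```

Varying `p_present` would then let one measure how much a learner benefits from semantic cues of a given reliability.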

Later in the workshop, the semantics group used LDA on a section of the Providence corpus in order to derive another proxy for semantic representations, and applied it to unsupervised segmentation using Adaptor Grammars.

Islands of reliability

Participants: Boerschinger, Schatz, Moulin-Frier

The group discussed the possibility that the infant might use a strategy of data selection whereby not all input is analyzed, but only a fragment of the input that the child has reason to believe is reliable. The difficulty in implementing this idea is to find a way to evaluate reliability while learning is not yet complete.

The group implemented a version of this idea in the segmentation task using the Adaptor Grammar, by relaxing the constraint that the entirety of a sentence should be parsed.

The role of phonological features

Participants: Dunbar, Feldman

The group explored the idea that discovering linguistic features simultaneously with constructing phonetic categories should actually be easier than constructing phonetic categories using clustering. The idea is to use an Indian buffet process, and the group started to evaluate this idea on synthetic data.
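For readers unfamiliar with the Indian buffet process (IBP), the following sketch samples a binary matrix from the standard IBP prior, where rows could stand for phonetic categories and columns for latent binary features. It only illustrates the prior, not the group's model or inference procedure; the function names and the `alpha` value are assumptions made for the example.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) with Knuth's method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ibp(num_items, alpha, seed=0):
    """Sample a binary item-by-feature matrix from the IBP prior."""
    rng = random.Random(seed)
    feature_counts = []  # feature_counts[k]: how many items own feature k so far
    owned_per_item = []
    for i in range(1, num_items + 1):
        # Item i reuses existing feature k with probability (owners so far) / i.
        owned = {k for k, m_k in enumerate(feature_counts) if rng.random() < m_k / i}
        for k in owned:
            feature_counts[k] += 1
        # It then opens Poisson(alpha / i) brand-new features.
        n_new = sample_poisson(alpha / i, rng)
        owned.update(range(len(feature_counts), len(feature_counts) + n_new))
        feature_counts.extend([1] * n_new)
        owned_per_item.append(owned)
    n_features = len(feature_counts)
    return [[1 if k in owned else 0 for k in range(n_features)]
            for owned in owned_per_item]


for row in sample_ibp(num_items=6, alpha=2.0, seed=1):
    print(row)
```

The number of columns is not fixed in advance: popular features tend to be shared by many rows while new ones keep appearing at a decreasing rate, which is what makes the prior attractive when the feature inventory itself is unknown.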

The role of stress

Participants: Johnson, Demuth, Dupoux, Boerschinger

This group explored the idea that stress information could be usefully incorporated into word segmentation algorithms. It discussed a current proposal to add stress information to a word segmentation algorithm based on Adaptor Grammars. The question was whether the system should try to learn the entire stress system of the language in a parametric fashion, or simply learn probabilistic word templates focussing on the word edges.

The group also discussed the ways in which continuous stress information could be plugged into the existing algorithms, as opposed to discrete dictionary citation-form annotations. The feasibility of extending the AG framework to incorporate continuous input (Kalman filters?) was also discussed.

Finally, the group discussed the problem of cross-linguistic variation in stress cues, and more generally how suprasegmental acoustic cues get mapped onto linguistic structures. Johnson, Demuth and Dupoux proposed to co-fund a PhD project on this topic.

Bridging the gap between spoken term discovery and word segmentation

Participants: Dupoux, Ludusan, Versteegh

The speech and the NLP communities have each developed their own algorithms that discover linguistic fragments in continuous speech. The former (spoken term discovery) take continuous speech features as input; the latter (word segmentation algorithms) take symbolic input. There are, however, a number of intermediate algorithms in the making (i.e., symbolic systems that incorporate phonetic variation, or continuous systems that use high-level features). It is therefore important to be able to compare these two sets of models using the same evaluation metrics.

The group started to decompose the models into separate components (fragment alignment, lexicon construction, utterance segmentation) and reviewed the different metrics used to evaluate each of them.
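As an illustration of what a shared metric for the utterance segmentation component might look like, here is a sketch of boundary precision, recall and F-score, a measure commonly used for symbolic word segmentation. It is not necessarily one the group adopted; it assumes gold and predicted segmentations of the same unsegmented utterances, and the function names are hypothetical.

```python
def boundaries(words):
    """Internal word-boundary positions (character offsets) of one utterance."""
    positions, offset = set(), 0
    for word in words[:-1]:
        offset += len(word)
        positions.add(offset)
    return positions


def boundary_prf(gold_utterances, predicted_utterances):
    """Boundary precision, recall and F-score pooled over all utterances."""
    hits = n_predicted = n_gold = 0
    for gold, predicted in zip(gold_utterances, predicted_utterances):
        gold_b, pred_b = boundaries(gold), boundaries(predicted)
        hits += len(gold_b & pred_b)
        n_predicted += len(pred_b)
        n_gold += len(gold_b)
    precision = hits / n_predicted if n_predicted else 0.0
    recall = hits / n_gold if n_gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


gold = [["the", "dog", "barks"]]
predicted = [["thedog", "barks"]]  # under-segmented toy output
print(boundary_prf(gold, predicted))
```

A spoken term discovery system could be scored in the same way once its discovered fragments are mapped onto positions in a time-aligned transcription, which is one route to comparing the two families of models.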

The group also explored the possibility of building a hybrid continuous/symbolic segmentation system.

The future

The participants of the workshop discussed the objectives for the next year, as well as the scope and organization of the next workshop.