Word Segmentation

How do infants come to identify words in the speech stream? As adults, we break up speech into words with such ease that we often think that there are audible pauses between words in the same sentence. However, unlike some written languages, speech does not have any completely reliable markers for the breaks between words (Cole and Jakimik, 1980). In fact, languages vary in how they signal the ends of words (Cutler and Carter, 1987), which makes the task even more daunting. Adults at least have a lexicon they can use to recognize familiar words, but a newborn infant has no pre-existing lexicon to consult. In spite of these challenges, by the age of six months, infants can begin to segment words out of speech (Bortfeld et al., 2005). The goal of my research is to use evidence from infant language acquisition research to build an efficient unsupervised word segmentation system.


Since the publication of our JCL article, we have discovered an error in how the phoneme trigram probabilities were calculated. Only the trigram probabilities were being multiplied together, whereas the word-initial bigram probability should also have been included to properly follow the chain rule. The updated implementation of PHOCUS is available here.
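
To make the correction concrete, here is a minimal sketch of how a word's probability might be computed under a phoneme trigram model, with and without the word-initial bigram term. It is not the PHOCUS code. The function name `word_log_prob`, the boundary symbol `#`, the dictionary layout, and the toy probabilities are all assumptions made for illustration, and the sketch assumes the bigram term is the probability of the first phoneme given the word boundary.

```python
import math

BOUNDARY = "#"  # hypothetical word-boundary symbol, not taken from PHOCUS


def word_log_prob(phonemes, bigram, trigram, include_initial_bigram=True):
    """Log probability of one word under a phoneme trigram model.

    bigram[(a, b)]     is an assumed estimate of P(b | a)
    trigram[(a, b, c)] is an assumed estimate of P(c | a, b)
    """
    seq = [BOUNDARY] + list(phonemes) + [BOUNDARY]
    log_p = 0.0
    if include_initial_bigram:
        # Chain rule: the first phoneme has only the boundary as context.
        log_p += math.log(bigram[(seq[0], seq[1])])
    # Every later symbol is conditioned on the two symbols before it.
    for i in range(2, len(seq)):
        log_p += math.log(trigram[(seq[i - 2], seq[i - 1], seq[i])])
    return log_p


# Toy probabilities, invented purely for illustration.
toy_bigram = {("#", "a"): 0.5}
toy_trigram = {("#", "a", "t"): 0.4, ("a", "t", "#"): 0.25}

with_bigram = math.exp(word_log_prob(["a", "t"], toy_bigram, toy_trigram))
trigrams_only = math.exp(
    word_log_prob(["a", "t"], toy_bigram, toy_trigram,
                  include_initial_bigram=False)
)
print(with_bigram, trigrams_only)
```

Leaving out the bigram term drops one factor from the product, so the trigram-only score is never lower than the full chain-rule score. Whenever the first phoneme is not certain given the boundary, the word's probability is overstated.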