How do infants come to identify words in the speech stream? As adults, we break up speech into words with such ease that we often think that there are audible pauses between words in the same sentence. However, unlike some written languages, speech does not have any completely reliable markers for the breaks between words (Cole and Jakimik, 1980). In fact, languages vary in how they signal the ends of words (Cutler and Carter, 1987), which makes the task even more daunting. Adults at least have a lexicon they can use to recognize familiar words, but a newborn infant has no pre-existing lexicon to consult. In spite of these challenges, by the age of six months infants can begin to segment words out of speech (Bortfeld et al., 2005). The goal of my research is to use evidence from infant language acquisition research to build an efficient unsupervised word segmentation system.
Since the publication of our JCL article, we have discovered an error in how the phoneme trigram probabilities were calculated. Essentially, only the trigram probabilities were being multiplied together, whereas the word-initial bigram probability should also have been included to properly follow the chain rule. The updated implementation of PHOCUS is available here.
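To illustrate the kind of correction involved, here is a minimal Python sketch that scores a word under a phoneme trigram model with and without the word-initial term. It is not the PHOCUS code: the boundary symbol `#`, the function names, the unsmoothed relative-frequency estimates, and the toy lexicon are all assumptions made for illustration, and the word-initial term is taken here to be the probability of the first phoneme given a preceding word boundary.

```python
from collections import Counter

BOUNDARY = "#"  # hypothetical word-boundary symbol, not taken from PHOCUS


def train(words):
    """Count phoneme bigrams and trigrams over boundary-padded words."""
    bigrams, trigrams = Counter(), Counter()
    for word in words:
        seq = [BOUNDARY] + list(word) + [BOUNDARY]
        bigrams.update(zip(seq, seq[1:]))
        trigrams.update(zip(seq, seq[1:], seq[2:]))
    return bigrams, trigrams


def word_probability(word, bigrams, trigrams, include_initial_bigram=True):
    """Score a non-empty word as a product of unsmoothed n-gram estimates.

    With include_initial_bigram=True the product starts with P(p1 | #),
    as the chain rule requires; with False only the trigram terms are
    multiplied, which mimics the omission described above.
    """
    seq = [BOUNDARY] + list(word) + [BOUNDARY]
    prob = 1.0
    if include_initial_bigram:
        # Number of training words = number of bigrams that start at a boundary.
        word_starts = sum(n for (first, _), n in bigrams.items() if first == BOUNDARY)
        if word_starts == 0:
            return 0.0
        prob *= bigrams[(seq[0], seq[1])] / word_starts
    for a, b, c in zip(seq, seq[1:], seq[2:]):
        context = bigrams[(a, b)]
        if context == 0:
            return 0.0  # unseen context; a real model would smooth instead
        prob *= trigrams[(a, b, c)] / context  # P(c | a, b)
    return prob


# Toy lexicon with one character per phoneme (illustrative data only).
lexicon = ["ab", "ac", "ba"]
bigrams, trigrams = train(lexicon)
for word in lexicon:
    print(word,
          word_probability(word, bigrams, trigrams),
          word_probability(word, bigrams, trigrams, include_initial_bigram=False))
```

On this toy lexicon the scores that include the word-initial term sum to 1 across the three words, while the trigram-only scores sum to 2, so leaving the term out does not yield a proper probability distribution over words.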
Blanchard, Daniel (2011). Unsupervised Word Segmentation: An Investigation of Sub-word Features. University of Delaware PhD Thesis Proposal.
Blanchard, Daniel, Jeffrey Heinz, and Roberta Golinkoff (2010). Modeling the contribution of phonotactic cues to the problem of word segmentation. Journal of Child Language vol. 37 (3) pp. 487–511. (Full results and segmenter code as it was at time of paper submission.)
Blanchard, Daniel and Jeffrey Heinz (2008). Improving Word Segmentation by Simultaneously Learning Phonotactics. CoNLL 2008: Proceedings of the 12th Conference on Computational Natural Language Learning, pp. 65–72.