Word Segmentation

How do infants come to identify words in the speech stream? As adults, we segment speech into words so easily that we often believe we hear pauses between the words of a sentence. However, unlike some written languages, speech has no completely reliable markers for word boundaries (Cole and Jakimik, 1980). In fact, languages vary in how they signal the ends of words (Cutler and Carter, 1987), which makes the task even more daunting. Adults at least have a lexicon they can use to recognize familiar words, but newborn infants have no pre-existing lexicon to consult. In spite of these challenges, infants can begin to segment words out of speech by the age of six months (Bortfeld et al., 2005). The goal of my research was to use evidence from infant language acquisition research to build an efficient unsupervised word segmentation system.

PHOCUS

Since the publication of our JCL article, we have discovered an error in how the phoneme trigram probabilities were calculated. Only the trigram probabilities were being multiplied together; to follow the chain rule properly, the word-initial bigram probability should also have been included in the product. The updated implementation of PHOCUS is available here.
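
To make the correction concrete, here is a minimal Python sketch of the two calculations. It is not the PHOCUS code: the function name `word_probability`, the dictionary layout, and the probabilities for the toy word "kat" are all assumptions made for illustration, and the sketch ignores word-boundary symbols and smoothing.

```python
import math


def word_probability(phonemes, initial_bigram_prob, trigram_prob,
                     include_initial_bigram=True):
    """Probability of a phoneme sequence under a toy trigram model.

    initial_bigram_prob maps (p1, p2) to the probability of a word
    starting with that phoneme pair (hypothetical layout).
    trigram_prob maps (p1, p2, p3) to P(p3 | p1, p2).
    """
    if len(phonemes) < 2:
        raise ValueError("this sketch only handles words of 2+ phonemes")

    log_prob = 0.0
    # Chain rule: the first two phonemes must be accounted for as well.
    if include_initial_bigram:
        log_prob += math.log(initial_bigram_prob[tuple(phonemes[:2])])
    # Each later phoneme is conditioned on the two phonemes before it.
    for i in range(2, len(phonemes)):
        log_prob += math.log(trigram_prob[tuple(phonemes[i - 2:i + 1])])
    return math.exp(log_prob)


# Made-up probabilities, for illustration only.
initial_bigram_prob = {("k", "a"): 0.1}
trigram_prob = {("k", "a", "t"): 0.5}

# Omitting the word-initial bigram reproduces the kind of error described.
buggy = word_probability("kat", initial_bigram_prob, trigram_prob,
                         include_initial_bigram=False)
fixed = word_probability("kat", initial_bigram_prob, trigram_prob)
```

With these toy values, `buggy` comes out at about 0.5 and `fixed` at about 0.05. Leaving out the word-initial bigram therefore overestimates the probability of every word, since the omitted factor is never greater than 1.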

Publications

Blanchard, Daniel (2011). Unsupervised Word Segmentation: An Investigation of Sub-word Features. University of Delaware PhD Thesis Proposal.
Blanchard, Daniel, Jeffrey Heinz, and Roberta Golinkoff (2010). Modeling the contribution of phonotactic cues to the problem of word segmentation. Journal of Child Language vol. 37 (3) pp. 487–511. (Full results and segmenter code as it was at time of paper submission.)
Blanchard, Daniel and Jeffrey Heinz (2008). Improving Word Segmentation by Simultaneously Learning Phonotactics. CoNLL 2008: Proceedings of the 12th Conference on Computational Natural Language Learning, pp. 65–72.

Related Work

Goldwater, Sharon, Thomas Griffiths, and Mark Johnson (2009). A Bayesian Framework for Word Segmentation: Exploring the Effects of Context. Cognition.
Fleck, Margaret (2008). Lexicalized phonotactic word segmentation. Proceedings of ACL-08: HLT, pp. 130–138.
Johnson, Mark (2008). Using adaptor grammars to identify synergies in the unsupervised acquisition of linguistic structure. Proceedings of ACL-08: HLT, pp. 398–406.
Johnson, Mark (2008). Unsupervised word segmentation for Sesotho using Adaptor Grammars. SIGMORPHON 2008: Proceedings of the Tenth Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pp. 20–27.
Swingley, Daniel (2005). Statistical clustering and the contents of the infant vocabulary. Cognitive Psychology vol. 50 (1) pp. 86–132.
Batchelder, Eleanor (2002). Bootstrapping the lexicon: a computational model of infant speech segmentation. Cognition vol. 83 (2) pp. 167–206.
Venkataraman, Anand (2001). A statistical model for word discovery in transcribed speech. Computational Linguistics vol. 27 (3) pp. 352–372.
Brent, Michael (1999). An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery. Machine Learning vol. 34 pp. 71–105.
