Making sense of speech: it depends on how you slice it
13 May 2005
The University is committed to making the results of its research as widely available as possible. As an incentive to encourage more articles, we devised a writing competition with cash prizes. This article by Dr Laurence White shared third place.
Itisnotsoeasytoreadifallthewordsareruntogetherwithoutspacescommasorfullstops. Listening to an unfamiliar language can be similarly troublesome, with sounds following one another rapidly and no clear start or end to words. In contrast, listening to speech in your native tongue is so effortless that it can seem as though successive words are separated in time, like words on a page broken up by white spaces. In fact, silences between spoken words are relatively uncommon. So – how do we find the gaps?
Fishing the individual words from the onrushing stream of speech sounds is known as ‘speech segmentation’. This is a fundamental aspect of human language processing, essential from an early age. If babies could not learn to separate words from the babble of voices they experience, they would never develop an understanding of complex speech or the knowledge necessary to produce the words for themselves.
There are various experimental tools available to speech segmentation researchers. One rather indirect, but powerful, technique exploits the fact that a written word is recognised more quickly if the reader has just heard a fragment of the same word. This effect – cross-modal priming – is a sensitive gauge of what listeners have extracted from speech. Using such techniques, psycholinguists and phoneticians have been investigating speech segmentation for decades, identifying a number of acoustic markers that guide listeners to the location of word boundaries. Word stress – which underlies the rhythm of speech – is one such segmentation cue; variations in the length and quality of speech sounds at word boundaries are also important. How the various cues are exploited when encountered in combination in real-life speech has remained something of a mystery, however.
Research in the Experimental Psychology Department, by Sven Mattys, James Melhorn and myself, now suggests that the human speech processing system takes a flexible approach to speech segmentation, according to the amount of information available at any moment. Essentially, the segmentation strategy is two-pronged, tackling the problem from the top down (from the meaning of the message) and from the bottom up (from the actual sounds of speech). Our cross-modal priming experiments show that when speech is clear and listening conditions are good, listeners use the meaning of what they have already heard to generate expectations of what is coming next.
To take an extreme example – hearing “The cat sat on the …” you are likely to have little trouble picking out the following word “mat”. This is one reason why knowledge of a language is so important – a non-speaker of English, hearing the same sequence of sounds, would struggle to identify the breaks between words that the context makes so obvious for a native speaker. Clear context allows listeners to exploit their existing ‘top-down’ knowledge of language, but not all speech is so predictable, of course. Often the context created by the preceding speech is ambiguous or the speech signal itself is muffled by the talker or by background noise. Our research simulates this variability in speech quality to examine the segmentation strategies that we fall back on when we have incomplete information.
A range of cues comes into play when top-down strategies cannot operate. Knowledge about likely sound sequences within words – phonotactics – is a valuable source of information. For example, there are no native English words beginning “shl” but plenty beginning “sl”, so listeners find it easier to hear “lake” in “dish lake”, where the “sh” must belong to the previous word, than in “kiss lake”, where “sl” could itself be the start of a word. In addition to these phonotactic cues, sounds at the start of words are often produced slightly differently and may be lengthened, patterns that listeners become familiar with at an early age.
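The phonotactic cue can be sketched in a few lines of code: if no legal English word onset spans a consonant cluster, a word boundary must fall inside it. This is only a toy illustration of the logic; the onset list here is a tiny, hypothetical sample, not a real inventory of English phonotactics.

```python
# Illustrative subset of legal English word onsets (hypothetical sample).
LEGAL_ONSETS = {"sl", "st", "pl", "br", "l", "s"}

def boundary_is_forced(cluster: str) -> bool:
    """Return True if the cluster cannot begin an English word,
    so a word boundary must fall inside it."""
    return cluster not in LEGAL_ONSETS

print(boundary_is_forced("shl"))  # True: "lake" is easy to hear in "dish lake"
print(boundary_is_forced("sl"))   # False: "sl" could start a word ("slake")
```

In “dish lake” the cluster “shl” forces a boundary, whereas in “kiss lake” the cluster “sl” leaves the parse ambiguous, which is why listeners find the first case easier.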
When the noise level is such that individual sounds cannot be reliably identified, speech rhythm becomes important. The majority of words in English begin with a stressed syllable, such as the first syllable of “spoken”. Because of this, a useful fallback strategy in difficult listening conditions is for listeners to treat all stressed syllables as the start of new words.
Always using stress to find word boundaries would lead to a lot of mistakes, however, because many words do not begin with stressed syllables. Mistakes in Chinese whispers tend to follow this pattern, so that “a chief in the pendants” might be the perplexing interpretation of “achieve independence”, with word boundaries placed before the stressed syllables. Our research demonstrates that listeners ignore stress patterns when there are more reliable segmentation cues available.
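The stress-based fallback described above can be sketched as a simple rule: place a word boundary before every stressed syllable. The syllable transcriptions below are illustrative, chosen to mirror the article's examples, not real phonetic data.

```python
def segment_by_stress(syllables):
    """Place a word boundary before each stressed syllable.
    Input is a list of (syllable, stressed) pairs."""
    words, current = [], []
    for syl, stressed in syllables:
        if stressed and current:
            words.append("".join(current))  # stressed syllable starts a new word
            current = []
        current.append(syl)
    if current:
        words.append("".join(current))
    return words

# Works for stress-initial words like "spoken":
print(segment_by_stress([("spo", True), ("ken", False)]))
# ['spoken']

# Misparses "achieve independence", whose stresses fall mid-word:
print(segment_by_stress(
    [("a", False), ("chieve", True), ("in", False),
     ("de", False), ("pen", True), ("dence", False)]))
# ['a', 'chieveinde', 'pendence']
```

The second output shows why the rule is only a fallback: boundaries land before the stressed syllables, yielding something close to the “a chief in the pendants” misparse.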
Not all languages exploit the same cues as English to guide listeners to word boundaries. If they did, we would have a big head start in language learning, being able to pick out the individual words even without knowing their meaning. We are now beginning to examine speech segmentation in other languages, looking particularly at the intriguing question of how someone’s first language affects their ability to learn another. Languages with similar segmentation cues, like English and Dutch, may promote mutual learning more than languages like Italian and Hungarian, which seem to have widely differing word boundary markers. If you struggled with French at school, this research may provide you with a belated excuse!
Laurence’s £100 prize will help support research travel abroad to broaden the scope of the next phase of the project: investigating speech segmentation in other languages. This work was supported by the Biotechnology and Biological Sciences Research Council.