
The Ideal Listener: Making Optimal Use of Acoustic-Phonetic

Cues for Word Recognition.


by
Meghan Alison Clayards

Submitted in Partial Fulfillment


of the
Requirements for the Degree
Doctor of Philosophy

Supervised by
Professor Michael K. Tanenhaus
and
Professor Richard N. Aslin
Department of Brain and Cognitive Sciences
The College
Arts and Sciences
University of Rochester
Rochester, New York
2008

Curriculum Vitae

Meghan Clayards attended the University of Victoria in Victoria, British Columbia from 1997
to 2001 and received a Bachelor of Science degree in Linguistics with Honours in 2001.
She began her graduate studies at the University of Rochester in 2002 in the Department of
Brain and Cognitive Sciences. She received a Master of Arts in Brain and Cognitive
Sciences in 2005.

Acknowledgements
There are many people who have helped and supported me in this process. First and foremost, I
would like to thank my advisers Mike Tanenhaus and Dick Aslin for many, many things including
but not limited to ideas, encouragement, patience, hours of meetings, access to any resources I
needed, and useful feedback on writing and presentations. I would also like to thank them, the
Department of Brain and Cognitive Sciences, the Center for Language Sciences and the
relevant funding agencies for providing financial support and a rich collaborative environment in
which to do this work.
I would like to thank the members of my committee. Joyce McDonough has been a secondary
adviser throughout my time in Rochester; I want to thank her for reassuring me that I can call
myself a linguist, for always saying this is important stuff even when the data were not what I
wanted, and for giving me many comments on previous drafts of this manuscript. Robbie Jacobs
has been a collaborator on several of the studies described here; I want to thank him for always
telling me this is a good paper even when it got rejected.
I would also like to thank people that I have worked with at Rochester who have been
instrumental in my work and training. First among them is Bob McMurray, who was a mentor and
collaborator when I first started graduate school, who trained me in speech synthesis, eye-tracking
methodology and analysis, and countless other technical aspects of being a speech scientist, and
with whom I have had many, many influential conversations over the years about speech and word
recognition. I'd also like to thank Vikranth Rao and Joe Toscano for the many productive
conversations I have had with them about cue combination and for collaboration on projects and
sharing insights about models and data fitting. I want to thank Dana Subik for all his work
collecting much of the data, scheduling subjects and especially reminding me when I needed to
run a subject. Katherine Crosswhite taught me how to write my first Praat script and Hal
Greenwald helped me learn to use Matlab. Austin Frank and Katie Carbary helped me do the
regression analyses and Florian Jaeger helped with data fitting and has promoted my results to
anyone who will listen. I also need to thank Michael Berger, Neil Bardhan, Heike Lehnert-LeHouillier,
and all the many people in the mtanlab and the speech group over the years for
helping me think through half-formed ideas and results that didn't work.
Finally, I'd like to thank all the many people who shared offices and houses and other parts of
my life and were supportive and patient and encouraging. And thank you to my family and friends
back home for putting up with my new American accent and for not asking too many questions about
what I'm doing with my life and why.

Abstract
This dissertation investigates the role of variability in speech perception and word recognition. It
shows that listeners make optimal use of variable acoustic-phonetic cues available to them in the
speech signal. The task of word recognition is choosing between (at least) two words given
some probabilistic acoustic-phonetic cues (evidence). This task can be optimized if the listener
knows the likelihood distributions of each acoustic-phonetic cue for each word. A maximum
likelihood model predicts that individual speech cues should be combined linearly. This model
qualitatively predicts the trading relationships observed with speech data and makes quantitative
predictions about cue weights. Some of the predictions and assumptions of the model are
tested.
Experiment 1 established that listeners use distributions of cues to make judgements about
words. The distributions of one cue, Voice Onset Time (VOT), which distinguishes words like "beach"
and "peach", were manipulated in a 2AFC task. Listeners' uncertainty (slope of the categorization
function, looks to the competitor object, and reaction time) reflected the distributions they heard
plus some additional uncertainty. Experiment 2 tested some parameters of the distributions and
suggests that shifting the means closer together and increasing the distributions' variance had
similar effects.
Experiment 3 measured multiple acoustic-phonetic cues in word-initial and word-medial labial
stops, produced by multiple speakers. Distributions in multi-dimensional space and correlations
among cues were examined to determine the appropriateness of a linear model.
Different patterns of cue use were found for the different structural positions and speakers.
Speakers who produced more overlapping distributions of the informative cues did not produce
less overlapping distributions of the less informative cues. Speakers who produced less
overlapping distributions did so by expanding the time spent articulating the relevant acoustic
information.
Experiment 4 investigated how listeners use multiple cues available in the stimuli recorded for
Experiment 3. Listeners performed a 4AFC task and were slower to respond to speakers with
more overlapping distributions of cues. A regression model predicting reaction time from the
individual acoustic measures was not successful.
The results of these experiments suggest that the phonetic knowledge listeners require to make
judgements about words includes the likelihood distributions of acoustic-phonetic cues.

Table of Contents
List of Tables ......................................................................................................................... viii
List of Figures......................................................................................................................... ix
Introduction ............................................................................................................................. 1
Chapter 1 Information in the speech signal............................................................................. 3
1.1 Variability....................................................................................................................... 3
1.1.1 Known sources of variability ................................................................................... 3
1.2 Within-category variability and fine phonetic detail ....................................................... 4
1.2.1 Redefining categories............................................................................................. 7
1.2.2 Partialing out variability .......................................................................................... 9
1.3 Multiple cues ............................................................................................................... 10
1.3.1 Sources of multiple cues ...................................................................................... 11
1.3.2 Trading relations................................................................................................... 12
1.3.3 Context effects ..................................................................................................... 13
1.3.4 Cue weighting....................................................................................................... 16
1.3.5 Cross-linguistic differences and distributional learning ........................................ 16
1.3.6 Summary .............................................................................................................. 18
1.4 Chapter summary........................................................................................................ 18
Chapter 2 A framework for variability and multiple cues ....................................................... 20
2.1 The ideal listener ......................................................................................................... 20
2.1.1 The goal of the listener ......................................................................................... 21
2.1.2 Optimizing the goal............................................................................................... 22
2.1.3 The likelihood distribution ..................................................................................... 22
2.1.4 The prior ............................................................................................................... 23
2.1.5 Competitor sets .................................................................................................... 26
2.1.6 Contrast and certainty .......................................................................................... 26
2.1.7 Phonetics.............................................................................................................. 29
2.1.8 Predictions of the ideal listener ............................................................................ 29
2.2 Multiple Cues .............................................................................................................. 30
2.2.1 The linear model................................................................................................... 30
2.2.2 Trading relations................................................................................................... 31

2.2.3 Context effects ..................................................................................................... 32
2.3 Evidence for the linear model...................................................................................... 34
2.4 Chapter summary........................................................................................................ 36
Chapter 3 Experimental manipulations of distributions ......................................................... 37
3.1 Experiment 1 ............................................................................................................... 39
3.1.1 Methods................................................................................................................ 39
3.1.2 Materials ............................................................................................................... 39
3.1.3 Procedure ............................................................................................................. 40
3.2 Results ........................................................................................................................ 41
3.2.1 Categorization behaviour ..................................................................................... 41
3.2.2 Eye movements.................................................................................................... 43
3.2.3 Reaction time ....................................................................................................... 44
3.3 Chapter summary........................................................................................................ 46
Chapter 4 Control experiments ............................................................................................. 48
4.1 Experiment 2 ............................................................................................................... 49
4.1.1 Methods................................................................................................................ 49
4.2 Results ........................................................................................................................ 49
4.2.1 Categorization functions ....................................................................................... 49
4.2.2 Eye-movements ................................................................................................... 51
4.2.3 Reaction time ....................................................................................................... 52
4.3 Discussion ................................................................................................................... 53
4.4 Chapter summary........................................................................................................ 55
Chapter 5 Acoustic-phonetic analysis ................................................................................... 56
5.1 Importance of production data .................................................................................... 56
5.2 Assumptions of the linear model ................................................................................. 57
5.2.1 Normal distributions.............................................................................................. 57
5.2.2 Cue independence ............................................................................................... 58
5.2.3 Speaker control and enhancement ...................................................................... 60
5.2.4 Individual speaker differences .............................................................................. 61
5.2.5 Cue weighting....................................................................................................... 62
5.3 Experiment 3 ............................................................................................................... 62
5.3.1 Methods................................................................................................................ 62
5.3.2 Word Lists ............................................................................................................ 63

5.3.3 Analysis ................................................................................................................ 63
5.4 Results ........................................................................................................................ 66
5.4.1 Normal distributions and overlap .......................................................................... 66
5.4.2 Cue weighting: D-prime ........................................................................................ 70
5.4.3 Conditional independence .................................................................................... 72
5.4.4 Individual differences............................................................................................ 76
5.5 Chapter summary........................................................................................................ 79
Chapter 6 Natural, multi-cue experiment .............................................................................. 81
6.1 Experiment 4 ............................................................................................................... 82
6.1.1 Participants........................................................................................................... 82
6.1.2 Stimuli................................................................................................................... 82
6.1.3 Procedure ............................................................................................................. 84
6.2 Results ........................................................................................................................ 85
6.2.1 Categorization data .............................................................................................. 85
6.2.2 Reaction time ....................................................................................................... 85
6.2.3 Speaker models ................................................................... 86
6.2.4 Acoustic model ..................................................................................................... 88
6.3 Chapter summary........................................................................................................ 92
Chapter 7 Summary and conclusions ................................................................................... 94
Bibliography .......................................................................................................................... 97
Appendix A: Visual stimuli for Experiments 1 and 2............................................................ 105
Appendix B: Visual stimuli for Experiment 4. ...................................................................... 106

List of Tables

Table  Title  Page

3.1  Number of repetitions of each VOT value in the narrow and wide variance conditions.  40
4.1  Number of repetitions of each VOT value in the narrow and control conditions.  49
5.1  Words used in the production experiment.  63
5.2  Means and SD (in brackets) for each of the cues.  68
5.3  D-prime values for each cue.  71
5.4  Correlations between cues. Word-initial voicing.  72
5.5  Correlations between cues. Word-medial voicing.  74
6.1  Number of trials on which subjects choose each picture.  85

List of Figures

Figure  Caption  Page

1.1  Hypothetical distributions of spectral centre of gravity for different categories of the fricative /s/. Between-category variance shown by grey arrows, within-category variance shown by black arrows.  8
1.2  [A] Hypothetical distributions of cues for the words "sue", "shoe", "see" and "she". Ellipses represent one standard deviation of a normal distribution. [B] Hypothetical response curves for different spectral centers of gravity of the frication noise. Solid line is for stimuli with the vowel spliced from "sue". Dotted line is for stimuli with the vowel spliced from "see". Dashed line is for the vowel spliced from "she". Effect sizes are drawn for illustrative purposes and do not reflect actual data.
2.1  Two hypothetical probability distributions of voice onset time (VOT) for the word "peach".  23
2.2  The effect of prior probability and stimulus uncertainty on the posterior probability of lexical candidates. [A] High stimulus certainty, i.e., narrow probability distributions. [B] Low stimulus certainty, i.e., wide probability distributions. Dotted lines are equal prior probability for wordA and the alternative. Solid lines are higher prior probability for wordA. Dashed lines are lower prior probability for wordA.  24
2.3  Hypothetical likelihood distributions for two words (panels A, B and C) and the resulting posterior probabilities of wordA (panels D, E and F). [A] Hypothetical distributions for wordA (dark lines) and wordB (light lines). [B] Hypothetical distributions in panel A (solid lines) compared to two distributions with greater variance (dotted lines). [C] Hypothetical distributions in panel A compared to two distributions with closer means (dotted lines). [D] Posterior probability of wordA given the likelihood distributions in A. [E] Posterior probability of wordA given the likelihood distributions in B. [F] Posterior probability of wordA given the likelihood distributions in C.  27
2.4  Hypothetical likelihood distributions of two cues and their combined posterior probabilities predicted by the linear model. [A] Hypothetical likelihood distributions of cueX for wordA (dark line) and wordB (light line). [B] Hypothetical likelihood distributions of cueY for wordA (dark line) and wordB (light line). [C] Posterior probability of wordA for all values of cueX and 5 different values of cueY. CueY values are indicated by shaded circles in panel B. [D] Posterior probability of wordA for all values of cueY and 5 different values of cueX. CueX values are indicated by shaded circles in panel A.  31
3.1  [A] Probability distributions of tokens that listeners categorized in the narrow condition (dark lines) and wide condition (light lines). [B] Optimal response curves calculated from the probability distributions using Equation (2) for the narrow condition (dark lines) and wide condition (light lines).  38
3.2  Example display screen containing the items "beach", "peach", "lace" and "race". Locations of items were randomized across trials. Actual displays were in color.  41
3.3  Fitted response curves for individual participants in [A] narrow condition and [B] wide condition. Optimal response curves (solid lines) and curves from average slope of individuals (dashed lines) for participants in [C] narrow condition and [D] wide condition.  42
3.4  Relationship between posterior probability and looks to the competitor object for each VOT. [A] Posterior probabilities of the competitor words calculated using Equation 1 for the narrow (dark lines) and wide (light lines) distributions. [B] Proportion of looks to the competitor object for the narrow group (shaded bars) and wide group (open bars) for all VOT values with sufficient trials to analyze. Error bars indicate SEM. * p < .05.  44
3.5  Reaction time by VOT for listeners categorizing stimuli from narrow distributions (dark lines) and wide distributions (light lines).  45
4.1  Categorization functions. [A] Individual participants. [B] Average fitted function (solid line) and predicted function (dashed line).  50
4.2  Looks to the less likely object. [A] Posterior probability of the less likely object. [B] Proportion of looks.  51
4.3  Comparison of proportion of looks to the less likely object in all three conditions in the two experiments.  52
4.4  Average reaction time for each condition in the two experiments. Error bars are SE.  53
4.5  Flat condition. [A] Reaction time. [B] Proportion of looks to the competitor.  54
5.1  Three hypothetical relationships between two cues (X and Y) for two categories (dark and light lines). Ellipses represent one standard deviation around the mean of two-dimensional Gaussian distributions. One-dimensional Gaussian distributions of the same categories are represented above (cue X) and to the right (cue Y) of the panels.  59
5.2  Waveforms showing criteria for acoustic measurement. [A] VOT. [B] Word-initial VOT showing pre-voicing. [C] Closure. [D] Vowel offset before a stop.  64
5.3  Waveform and spectrogram of a portion of "stable" showing the segmentation of vowel, closure and VOT.  65
5.4  Word-initial productions. Histograms of the distribution of each cue for each category for all speakers. Black lines are "b" words, grey lines are "p" words. [A] VOT. [B] Vowel duration, solid lines are voiced offsets, dashed lines are voiceless offsets. [C] Burst amplitude. [D] F0 onset, solid lines female speakers, dashed lines male speakers.  67
5.5  Word-medial productions. Histograms of the distribution of each cue for each category, all speakers. Black lines are "b" words, grey lines are "p" words. [A] VOT. [B] Burst amplitude. [C] Closure duration. [D] Voicing amplitude.  69
5.6  Word-medial productions. Histograms of the distribution of each cue for each category, all speakers. Black lines are "b" words, grey lines are "p" words. [A] F0 onset by gender, solid lines female, dashed lines male. [B] F0 offset by gender, solid lines female, dashed lines male. [C] Vowel durations. [D] Vowel duration by vowel, solid lines "a", dotted lines "ei", dashed lines "ih".  70
5.7  Scatter plots of individual tokens in two dimensions. Word-initial voicing. Filled circles are "b" words, open circles are "p" words. [A] VOT and vowel duration. [B] VOT and F0 onset.  73
5.8  Scatter plots of individual tokens in two dimensions. Word-medial voicing. Filled circles are "b" words, open circles are "p" words.  76
5.9  D-prime values of each cue for each speaker. [A] Word-initial voicing. [B] Word-medial voicing. Speakers 2, 3 and 6 participated in both experiments.  77
5.10  Histograms for the distribution of VOT values in word-initial position for two speakers, 1 and 9, who have the lowest and highest D-prime values respectively.  78
6.1  Waveforms and spectrograms (no pre-emphasis) for "Mabel". Left panel is the original, right panel has noise added (SNR 5 dB). Shaded areas represent the stop closure.  83
6.2  Distributions of voicing amplitude for the "b" tokens (dark lines) and "p" tokens (light lines). Left panel is natural stimuli, right panel is noisy stimuli.  84
6.3  Relationship between speakers' average D-prime and time spent articulating each portion of the word. Onset is time to the onset of the vowel. Relevant is time from the onset of the vowel to the onset of the burst. Offset is time from the onset of the burst to the end of the word.  88
6.4  Estimated effects of noise and vowel duration, closure duration and voice amplitude on reaction time. Left panels are "p" words, right panels are "b" words.  91

Introduction
The problem of speech perception
Every day as we interact with the world we accomplish many basic tasks such as reaching
for objects, identifying faces and understanding speech. Accomplishing these tasks may
seem trivial and effortless, as adult humans are highly proficient at each of them and many
others. Closer inspection reveals that each task involves a complex process of inference
using a combination of multiple perceptual information sources and relevant knowledge. The
difficulty of the tasks is revealed by the sheer volume of research that has struggled to
produce machine equivalents of these human behaviors. Although major advances have
been made, no artificial intelligence has yet replicated the flexibility and robustness of
humans in such tasks.
In the case of speech perception, the problem has been described as lack of invariance.
Speech is highly variable. The same word often sounds very different when produced in
different contexts, or by different people. Extreme examples of between-speaker variability
include speech produced by people with different dialects or foreign accents. Within a
single speech community large differences exist between children and adults, men and
women and individuals with differently shaped vocal tracts and articulatory organs. Even the
speech of a single individual will differ greatly between careful and casual speech. The
problem is even more pervasive, however. Even for an individual speaker and speaking
style, massive variability exists within and between different levels of linguistic analysis.
Words, syllables, and phones all differ from context to context and vary within context. All
these sources of variability contribute to the invariance problem and have plagued
researchers and speech recognition technology for the past 40 years.
The invariance problem in speech is an illustration of a larger problem in cognition. The
organism must make a guess about objects and events in the world (in the case of speech,
the message of the speaker) given some inherently probabilistic information from the
sensory systems (the speech signal). This inference process must be guided by knowledge
of the structure of the world and events in it. Given the range of environments humans find
themselves in, much of this knowledge must be learned through experience. In the case of
speech this is even more obvious as the linguistic environment is in large part determined by
the listener's community. Speech, and language in general, are a useful place to investigate
the integration of information from the environment with prior knowledge about the world,
because we have well-developed descriptions of each and because different linguistic
environments provide multiple test cases.
In this dissertation I explore the case of word recognition as an example of inference from a
highly complex and structured perceptual environment. The speech signal is rich in spectrotemporal complexity and variable in all the ways described earlier. I draw on decades of
linguistic research outlining the dimensions along which speech varies and the relevant
linguistic contrasts conveyed. I also draw on recent advances in defining the task of
perceptual inference developed in the domains of visual perception and motor control.
Together they allow me to ask two basic questions about speech perception: what
information is available to the listener, and how is it used?
To answer these questions I examine the task of the listener from an ideal observer
perspective. In other words, given the nature of the speech signal and the goal of the
listener, how does the listener accomplish this goal? The ideal observer framework allows
one to investigate both of these questions simultaneously. It provides a way to characterize
the information in the speech signal, and at the same time it makes predictions about how
listeners should use that information. When there is a good match between this prediction
and observed behaviour, one can infer that this is a good model of both the information
available and how it is used.
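To make this logic concrete before the formal treatment in Chapter 2, the computation can be sketched in a few lines of Python. This is only a rough illustration: the means, standard deviations and prior below are invented for exposition and are not values measured or used in the experiments reported later.

    # Minimal ideal-listener sketch: posterior probability of "beach" versus "peach"
    # from a single cue (VOT), assuming Gaussian likelihood distributions.
    # All parameter values are hypothetical, chosen only for illustration.
    from math import exp, pi, sqrt

    def gaussian(x, mean, sd):
        """Likelihood of cue value x under a normal distribution."""
        return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

    def posterior_beach(vot_ms, prior_beach=0.5):
        """P(beach | VOT) by Bayes' rule with two lexical candidates."""
        like_beach = gaussian(vot_ms, mean=0.0, sd=15.0)    # hypothetical /b/ distribution
        like_peach = gaussian(vot_ms, mean=50.0, sd=15.0)   # hypothetical /p/ distribution
        numerator = like_beach * prior_beach
        return numerator / (numerator + like_peach * (1.0 - prior_beach))

    for vot in (0, 15, 25, 35, 50):
        print(vot, "ms VOT ->", round(posterior_beach(vot), 3))

The slope of the categorization function produced this way falls directly out of the overlap between the two likelihood distributions, which is the property manipulated in the experiments of Chapters 3 and 4.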
In Chapter 1 I begin by characterizing the information in the speech signal. In this chapter I
examine two fundamental aspects: variability and multidimensionality. In Chapter 2 I
examine the goal of the listener and formally describe the ideal observer model. In Chapters
3, 4 and 6, I describe the results of several word recognition experiments designed to test
predictions of the ideal observer model. In Chapter 5 I describe the results of two speech
production experiments designed to investigate some details of the distributional properties
of the speech signal.

Chapter 1 Information in the speech signal

1.1 Variability
One of the classic problems for speech researchers has been the presence of variability in
the speech signal. On any dimension that speech can be measured there is variability. This
is not surprising given that speech is the output of a motor system and so requires the
control and coordination of many articulators and multiple muscle groups each of which is
subject to the neurological and physiological noise found in any biological system. The
challenge is that there is considerable variability even in the dimensions that seem to be the
most important. For example, a classic study by Peterson and Barney (1952) showed that
there is overlap in the distributions of different vowels along the primary dimensions used to
identify them (the formant frequencies). In other words, for any dimension that varies
between categories (e.g., vowels) there is also variability within categories. In general it is the
within-category variability that is considered problematic.
In this section I will describe some known sources of variability. I will then discuss the issue
of categories and why they are critical to the between versus within distinction.

1.1.1 Known sources of variability


There are a number of known sources of variability in articulations and their acoustic
consequences. First there are effects of the phonetic context. For any given articulation
(articulatory goal or target), the articulations to be executed before and after will influence its
execution and the resulting sound produced. For example, the spectral characteristics of
vowels are shifted towards the spectral characteristics of adjacent consonants (Moon &
Lindblom, 1994; Ohman, 1966). Such effects are pervasive. Some contextual factors
operate over long distances such as the beginning and end of syllables (Hawkins & Nguyen,
2004), or multiple syllables (Magen, 1997). I will refer to these sources of variability as the
phonetic context.

Higher levels of context also influence articulation. For example, articulations produced at
the beginnings of units (words, phrases or utterances) are produced with more articulatory
effort, moving farther along their trajectories, than the same articulations in the middle of
units (Fougeron & Keating, 1997). Other factors such as the number of syllables in a word,
the prominence of the syllable, and the speaking rate (Moon & Lindblom, 1994; Wouters &
Macon, 2002) all produce variability in articulation and the resulting acoustics. I will refer to
these sources of variability as prosodic context.
There are also a number of effects that seem to be due to the predictability of the speech. In
general words or segments that are higher frequency or more predictable given their context
tend to be shorter and less clearly articulated (Aylett & Turk, 2004, 2006; Bell et al., 2003;
Bybee, 2000). These effects, and possibly the effects of prosodic and phonetic context,
can be thought of as variation along the hyper-hypo articulation dimension (Lindblom, 1990).
Speakers can choose to articulate more or less clearly and do so based on a number of
linguistic factors.
Finally, individuals vary in how they articulate the same linguistic sequence (Johnson,
Ladefoged, & Lindau, 1993; Moon & Lindblom, 1994; Perkell, Zandipour, Matthies, & Lane,
2002) and thus produce different acoustic patterns (Newman, Clouse, & Burnham, 2001). I
will call these speaker effects.

1.2 Within-category variability and fine phonetic detail


Traditionally these sources of variability were considered noise that must be accommodated
by the speech perception system. This was in part the result of choosing (a particular
definition of) phonemes as the important categories to be recognized. In the traditional
account all these sources of variability were variability within the category of the phoneme
and were therefore problematic (Liberman, Harris, Hoffman, & Griffith, 1957). It was further
believed that the perceptual system stripped away within-category variability through
processes of categorical perception.
Over the past 30 years however, it has become clear that within-category variability is
extremely important to the process of speech perception. A number of studies have shown
that rather than ignoring within-category variability, listeners are sensitive to it. For example,

differences in reaction times (Pisoni & Tash, 1974), category goodness ratings (Miller &
Volaitis, 1989), degree of semantic priming (Andruski, Blumstein, & Burton, 1994), patterns
of eye movements (McMurray, Tanenhaus, & Aslin, 2002), and neural patterns of activity
(Blumstein, Myers, & Rissman, 2005) have all been documented for within-category
differences.
Furthermore, a growing body of evidence suggests that variability is in fact useful to the
listener. It has been noted by several researchers that much of the variability described here
(phonetic and prosodic context effects) is systematic and therefore might be useful. That is,
if one knows the context, one can predict some of the variability. Conversely, knowing the
variability might allow the listener to predict the context.
As an example, we know that the execution of a particular articulation depends on the
surrounding articulations (co-articulation). One instantiation is the co-articulation of vowels
with the following consonant. As shown by Moon and Lindblom (1994), the formant
frequencies of a vowel shift depending on the location of the vocal-tract closure of the
following consonant. In fact the formants exhibit transitions into a particular configuration
depending on the consonant. These formant transitions then, are a very good way to identify
the place of articulation of the following consonant. Dahan et al. (2001) created stimuli in
which the final consonants of monosyllabic words such as "cap" and "cat" were cross-spliced.
These cross-spliced stimuli had formant transitions which did not match the final
consonant. Listeners heard these words and performed a 4AFC task by clicking on objects
on a computer screen while their eye movements were monitored. Dahan et al. showed that
when listeners had heard words in which the vowel transitions were consistent with "cap" but
the final consonant was /t/, they spent time looking at the picture of the cap, and took longer
to look at a picture of the cat than when they heard words in which the vowel had been
cross-spliced from another production of "cat". Thus word recognition was disrupted by the
initially misleading information. This suggests that when listening to speech which has not
been cross-spliced, listeners will use the information in the formant transitions to anticipate
the place of articulation of the upcoming consonant.
Prosodic contextual information is used in the same way. The length of a syllable varies
depending on the number of syllables in a word (polysyllabic shortening). Thus the syllable
"cap" is much shorter when it occurs as part of "captain" than when it is a word on its own.

Salverda et al. (2003) showed that when listeners heard short versions of syllables such as
"cap", they looked more to the polysyllabic alternative ("captain"), and when they heard long
versions of the syllables, they looked more at the monosyllabic alternative ("cap"). This
effect is further modulated by the word's position in the utterance. When the word "cap"
occurs at the end of an utterance, it is longer and therefore more consistent with "cap" than
"captain", and when it occurs in the middle of an utterance it is shorter and more consistent
with "captain" than "cap". Patterns of eye movements confirm that which alternatives
listeners consider depends on both the acoustic signal (length of the syllable) and the
prosodic context (Salverda et al., 2007). Thus listeners seem to be sensitive to, and
knowledgeable of, the complex interactions of prosodic and phonetic contexts.
Similarly, effects of context between words can be used by listeners to anticipate aspects of
the upcoming word. Word-final consonants are often produced in a way that is similar to the
following word-initial consonant. One example is place assimilation between consonants in
English. The final /n/ in "teen" may be produced more like "team" when it is followed by a
labial consonant such as the /b/ in "beat". This is due to temporal overlap between the
tongue-tip gesture for "teen" and the lip-closure gesture for "beat" (Kuhnert & Hoole, 2004).
The acoustic consequence is that the final segment of "teen" has formant values that are a
mix of those expected for /n/ and /m/ (Gow, 2001). A number of studies have shown that
listeners can compensate for this co-articulation (Coenen, Zwitserlood, & Bolte, 2001;
Gaskell & Marslen-Wilson, 1996, 1998, 2001; Gow, 2001, 2002; Gow & McMurray, in press)
to correctly identify "teen" when hearing "team", in English as well as in Dutch (Mitterer &
Blomert, 2003). Furthermore, recent studies have shown that partial assimilation can help
listeners anticipate upcoming words (Gow & McMurray, in press).
A number of mechanisms have been proposed to account for this compensation ability, from
lexical competition networks (Gaskell, 2003; Gaskell & Marslen-Wilson, 1996) to
auditory processing mechanisms (Gow, 2003; Mitterer, Csepe, & Blomert, 2006; Mitterer,
Csepe, Honbolygo, & Blomert, 2006), which vary in the degree of linguistic knowledge
required to accomplish the compensation. For example, Gow and colleagues have proposed
that the features of the consonant that are most like the following labial are parsed out by an
auditory grouping mechanism (Gow, 2003) that does not depend on experience with
patterns of assimilation. Evidence in support of this hypothesis comes from studies showing
that listeners seem to compensate for assimilation even when it is not a normal process in

their native language (Gow & Im, 2004; Mitterer, Csepe, & Blomert, 2006). However, these
effects are weaker than native-language compensation, suggesting there is still a role for
language experience. The pervasiveness of subtle contextual effects also suggests that
patterns of assimilation are present in all languages and differ only in degree. Reduced
effects of listener compensation may thus reflect the reduced prevalence of a particular
assimilation pattern in the listener's native language.
In summary, there is considerable evidence that listeners are sensitive to multiple types of
variability. Listeners show prototypicality effects for within-category variability, responding
more quickly (Pisoni & Tash, 1974), with fewer looks to alternatives (McMurray et al., 2002),
and higher goodness ratings (Miller & Volaitis, 1989) for tokens which are better examples
of a category. Listeners are also sensitive to the variability produced by many types of
context effects. They seem to have no difficulty compensating for this variability, and in fact
seem to use it to help infer the relevant context. Thus the traditional invariance problem has
been redefined. Rather than seeking to understand how variability is removed from the
signal, researchers are now searching for mechanisms that explain how it is used by the
listener.

1.2.1 Redefining categories


One way that contextual variability can be reduced is to redefine the categories used to
describe speech. Instead of the original phonemic categories defined by early linguistic
theories (Chomsky & Halle, 1968), units could be defined that cover longer stretches of
speech or are contextually conditioned. We can describe, for a particular category, the
distribution of values along a particular dimension. For example, we can describe the
distribution of spectral centre of gravity (COG) for the fricative /s/. This is illustrated by the
dashed line in Figure 1.1. We could instead describe the distribution of COG for more
specific categories, like the /s/ in "sue" versus the /s/ in "she". These are illustrated by the
solid and dotted lines in Figure 1.1. Each of these distributions will have some mean and
some variance. The variance is the within-category variability and the difference between
the two means is the between-category variability. In Figure 1.1 we can see that by
redefining the categories to include one aspect of the context (the rounding of the following
vowel) we have reduced some of the within-category variability by converting it to between-category variability.

Figure 1.1: Hypothetical distributions of spectral centre of gravity for different categories of
the fricative /s/. Between-category variance shown by grey arrows, within-category variance
shown by black arrows.
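The trade-off in Figure 1.1 can also be stated numerically. The sketch below uses simulated COG values (the means and standard deviations are hypothetical, chosen only to mirror the figure): pooling /s/ tokens across vowel contexts leaves the rounding-related variability inside the category, whereas conditioning the category on the following vowel moves that portion of the variability into the difference between category means.

    # Within- versus between-category variance when /s/ categories are redefined.
    # Simulated spectral centre of gravity (COG) values; all numbers are hypothetical.
    import random
    random.seed(1)

    cog_sue = [random.gauss(4500, 300) for _ in range(200)]  # /s/ before rounded /u/
    cog_see = [random.gauss(6000, 300) for _ in range(200)]  # /s/ before unrounded /i/

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    pooled = cog_sue + cog_see
    print("within-category variance, pooled /s/:", round(variance(pooled)))
    print("within-category variance, /s/ split by vowel context:",
          round((variance(cog_sue) + variance(cog_see)) / 2))
    print("between-category difference in means (Hz):",
          round(sum(cog_see) / len(cog_see) - sum(cog_sue) / len(cog_sue)))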
This is an example of a phonetic context effect on variability. Many of the effects described
above result in similar shifts in the distributions of categories along some dimension. For
example vowel length, a cue to segmental contrasts such as voicing, varies with prominence
a prosodic context effect. Thus for more prominent words we would see a slightly different
distribution of vowel lengths than for less prominent words.
One example of such a solution is the use of diphones in many speech recognition and
synthesis applications. These units capture the coarticulatory variability between consonants
and vowels and within consonant clusters. Similarly, others have argued that an even larger
unit, the syllable, is the relevant domain to capture patterns of temporal variability
(Greenberg, Carvey, Hitchcock, & Chang, 2003).
It is important to note that changing the size of the category does not solve the invariance
problem. Even if the categories are defined in terms of large chunks of speech such as
words or phrases, thereby reducing effects of contextual variability, effects of prominence,
prosody, speaking style and individual speaker differences nevertheless persist. It would be
impossible to define a linguistic category which did not vary, since categories are by
definition abstractions. Conversely, even when all of the factors listed here have been
controlled for, by measuring a single speaker producing a single word in isolation in the
same speaking style (the most specific category possible), there is still considerable
variability in acoustic-phonetic cues (Newman et al., 2001). Furthermore, given some finite

amount of speech data, choosing either very specific categories or categories comprising
very long chunks of speech will result in many categories and few observations of each
category, limiting the amount of data from which one could learn about the categories (very
sparse distributions).
One response to the problem of choosing the right category is to say that there are no
categories, or that categories are dynamically defined. For example, Goldinger and Azuma
point out that evidence has been found for multiple sized units and further show that biases
in the listener or the speaker can increase the chances of finding evidence in favour of one
unit size over another (Goldinger & Azuma, 2003). They endorse a model of speech
perception in which the units of perception are dynamically defined Grossbergs adaptive
resonance theory (Grossberg, 2003). A similar approach is taken by Hawkins in
emphasizing an exemplar type model of fine phonetic detail with no real categories
(Hawkins, 2003). Exemplar models themselves have problems. They place a large demand
on memory by storing every aspect of every experience individually and not having any
categories to abstract over. These models further lack a clear computational definition of
how variability in the signal becomes information for the listener.
In this dissertation I attempt to computationally define the idea of a flexible category and
how variability in the signal, both within-category variability and variability due to context, becomes
information for the listener. Following Pierrehumbert (2003) I take as a starting point the fact
that variability in the speech signal must be defined in terms of categories and that all
linguistic categories have variability (are probabilistic). In the ideal observer approach,
information about a category is contained in these probabilistic distributions. It therefore
allows one to characterize the evidence for a particular category in a given speech signal,
whatever that category may be. Whether or not one type of category is more appropriate
remains an important empirical and theoretical question which I will not attempt to answer
here.

1.2.2 Partialing out variability


Unless categories are defined over large chunks of speech (bigger than the word), some
variability will always be due to the surrounding phonetic context. English place assimilation
is one example of this. Two major approaches have been taken to deal with this problem.

Gow's auditory feature parsing is an example of a low-level approach. In this case the
variability is attributed to the appropriate context (source) without any reference to
categories. It applies only to cases of co-articulation or other acoustic-feature sharing
examples and not to other kinds of variation such as durational differences. Others have
proposed that durational factors can be accounted for by low-level auditory contrast effects
(Diehl & Walsh, 1989).
A different type of account ascribes listeners' abilities to deal with (and use) contextual
variability to their knowledge of patterns of variation in the language. One example is the
theory of phonological under-specification (Lahiri & Marslen-Wilson, 1991) which assumes
that some representations are less specified than others. Others have argued that lexical
competition over the course of multiple words can account for context-conditioned behaviour
(Gow & McMurray, in press).
An account which uses patterns of distributional information to make inferences would
partial out the variability known to be correlated with the context. Cole and colleagues
recently demonstrated how statistical analyses of detailed phonetic data can accomplish this
goal (Cole, Linebaugh, Munson, & McMurray, submitted). They recorded productions of
utterances such as "wet eagle" and "wet octopus" in order to elicit vowel-to-vowel co-articulation
and measured the formant frequencies of the vowel in "wet". They then conducted
a linear regression, adding in factors such as speaker gender and the consonant context to
try to partial out contextual variability. Their analysis found that when the right contextual
factors were taken into account, they could account for most of the variability in the formant
frequencies of "wet". Importantly, the factors taken into account were those that the listener
could reasonably be expected to know. If it is statistically possible to parse out sources of
variability in speech production, then it may be possible for the listener to do the same thing
during speech perception. I will consider this issue again in Chapter 2.
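A sketch of this kind of regression analysis is given below. The data are simulated rather than taken from Cole et al., and the two predictors (speaker gender and the frontness of the following vowel) are stand-ins for the contextual factors they coded. The point is only that variability attributable to known context can be estimated and removed from a measured cue, leaving a smaller residual.

    # Partialling out contextual variability with a linear regression (simulated data).
    import numpy as np
    rng = np.random.default_rng(0)

    n = 500
    speaker_female = rng.integers(0, 2, n)      # 1 = female speaker (hypothetical coding)
    next_vowel_front = rng.integers(0, 2, n)    # 1 = following vowel is front (hypothetical)
    noise = rng.normal(0, 50, n)
    # Simulated F2 of the vowel in "wet": a baseline plus context effects plus noise.
    f2 = 1500 + 200 * speaker_female + 120 * next_vowel_front + noise

    X = np.column_stack([np.ones(n), speaker_female, next_vowel_front])
    beta, *_ = np.linalg.lstsq(X, f2, rcond=None)   # ordinary least squares fit
    residual = f2 - X @ beta                        # variability not explained by context

    print("estimated effects (intercept, gender, vowel):", beta.round(1))
    print("variance of F2 before partialling out:", round(f2.var()))
    print("residual variance after partialling out:", round(residual.var()))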

1.3 Multiple cues


The second important characteristic of the speech signal is its multidimensionality. For any
of the categories described earlier, variability can be measured in multiple dimensions. In
order to determine which dimensions are relevant, one must consider a particular contrast.
Just as there are different kinds of potential categories, there are different kinds of contrasts

and it is the contrast that makes the categories meaningful. A single category that did not
contrast with any other would carry no information. For this dissertation I will talk about
minimal phonetic contrasts, the smallest differences that change word meaning such as the
difference between "rapid" and "rabid", and will remain agnostic on whether these are
contrasts at the level of features, phones, syllables or words. Any acoustic-phonetic variable
that could potentially distinguish a contrast I will call a cue.

1.3.1 Sources of multiple cues


A single articulatory gesture can have multiple acoustic correlates. The acoustic
consequences of lowering the velum to allow air to pass through the nasal cavity are among
the most complex examples of this. When the oral cavity is also unobstructed (as is the case
with nasal vowels), the resonating characteristics of the vocal tract can be characterized as
a set of coupled resonators (Fant, 1960). The resonating characteristics of the oral cavity
produce characteristic peaks in the frequency spectrum corresponding to the first two
formants (F1 and F2). Depending on the exact anatomy of the speaker, the resonating
characteristics of the nasal cavity and sinuses also contribute peaks in the frequency
spectrum. One of these peaks is generally in the vicinity of F1 (Hattori, Yamamoto, &
Fujimura, 1958) and another around 1000 Hz (House & Stevens, 1956). The acoustic
consequences of these coupled resonators include a general damping of energy in the low
frequency range, particularly in the vicinity of F1 and a shift in the frequency of F1 (Beddor &
Hawkins, 1990). Thus there are many acoustic effects of velum lowering and many potential
cues to nasality.
Categories also vary in multiple dimensions because multiple gestures are used to mark
contrasts. As an example, take the contrast between "rapid" and "rabid". For both words, a
labial stop consonant occurs between the first and second syllables. To produce this
consonant the lips are closed, fully obstructing the flow of air in the oral cavity, and then
released as the second vowel is produced. In both words, the vocal folds are allowed to
vibrate as air passes through the glottis during the vowels. In the case of the voiced stop in
"rabid", glottal vibration can continue during the oral closure. In the case of the voiceless
stop in "rapid", glottal vibration is generally stopped during the oral closure. Thus the timing
relationship between the lips and the vocal folds is crucial to this contrast and the actions of
both articulators produce multiple acoustic consequences.


Lisker (1978) noted at least 9 potential acoustic cues to the "rapid"-"rabid" contrast (redefined
somewhat here). These were 1) the length of the vowel in the first syllable, 2) the
length of the oral closure, 3) the presence of vocal fold vibration during the closure, 4) the
VOT after the closure release, 5) the F0 frequency going into and out of the closure, 6) the
length of the F0 transition in and out of the closure, 7) the F1 frequency going into and out of
the closure, 8) the amplitude of the release burst, and 9) the relative amplitude of the second
and first syllables. He noted that some of these potential cues may be perceptually
negligible, but at least some of them are likely to be perceptually relevant. By manipulating
the presence or absence of vibration during the closure and the length of the closure, he
noted that the presence of vibration seems to be the dominant cue.
1.3.2 Trading relations
Repp (1982) summarized a number of perceptual studies showing that listeners use multiple
acoustic cues in making judgements about phonetic contrasts. These studies were
conducted primarily by researchers in the Haskins lab starting in the late 1940s. The
phenomenon studied in this body of literature is often called trading relations because, as
Repp describes, "a change in the setting of one cue (which by itself, would have led to a
change in the phonetic percept) can be offset by an opposed change in the setting of
another cue so as to maintain the original phonetic percept" (Repp, 1982, p. 87). These
studies are conducted by varying two cues, often by creating a continuum of one cue and
using only a few values of the other cue. A trading relation is identified if the value of the
second cue affects the crossover point or categorization boundary (value where listeners
report each phonetic category equally often) of the first cue.
As an example, in word-initial voicing in English, VOT is recognized to be a relevant cue in
distinguishing the contrast between words such as "beach" and "peach" (Lisker, Liberman,
Erikson, Dechovitz, & Mandler, 1977). This cue has been shown to trade off against F1
onset frequency (Lisker et al., 1977; Stevens & Klatt, 1974; Summerfield & Haggard, 1977)
such that higher F1 onset values contribute to more voiced judgements. A second example
is fricative place of articulation (again in English). Here the spectral centre of gravity (COG)
of the frication is one cue that distinguishes words such as "sue" and "shoe". As COG
increases, "s" responses increase (Mann & Repp, 1980). When the formant transitions from
the vowel in "shoe" are spliced onto the frication noise, listeners give more "shoe"
responses, and when the transitions come from "sue", listeners give more "sue" responses.
The shift in categorization boundary according to the formant transitions indicates that
listeners are using the formant transitions to make judgements about fricatives.
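The signature of a trading relation, a shift in the categorization boundary of one cue as a function of another, can be expressed with a simple two-cue categorization function. The sketch below uses a logistic function with invented weights; it is not a model fitted to any of the studies cited above, and the direction of the F1 effect simply follows the description given earlier, with higher F1 onsets yielding more voiced responses.

    # Sketch of a trading relation between VOT and F1 onset frequency.
    # Weights, bias and cue values are hypothetical, chosen only for illustration.
    from math import exp

    def p_voiceless(vot_ms, f1_onset_hz, w_vot=0.2, w_f1=-0.01, bias=-5.0):
        """Probability of a voiceless ("p") response given VOT and F1 onset."""
        return 1.0 / (1.0 + exp(-(bias + w_vot * vot_ms + w_f1 * f1_onset_hz)))

    def boundary_vot(f1_onset_hz, w_vot=0.2, w_f1=-0.01, bias=-5.0):
        """VOT value at which responses cross 50% for a given F1 onset."""
        return -(bias + w_f1 * f1_onset_hz) / w_vot

    for f1 in (200, 400, 600):
        print("F1 onset", f1, "Hz -> boundary at", boundary_vot(f1), "ms VOT;",
              "P(voiceless) at 30 ms VOT =", round(p_voiceless(30, f1), 2))

Because the two cues enter the decision additively on the logit scale, a change in one can be offset by an opposed change in the other, which is the sense in which the linear model discussed in Chapter 2 predicts trading relations.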

1.3.3 Context effects


Repp (1982) makes a distinction between trading relations (a change in the interpretation of
one cue given another) and context effects (changes in interpretation of a cue given the
context) on the grounds that they are two different types of processes. In the description of
cues that I am outlining here, this distinction depends on how the categories (contrasts) are
defined. As an example, I return to the case of "sue", "see", "shoe" and "she" discussed
previously. These four words represent two contrasts, an /s/-/sh/ contrast and an /u/-/i/
contrast. The /s/-/sh/ contrast is distinguished by (at least) two cues, spectral COG and the
onset frequency of the second formant transition, both of which are determined by the size
of the resonant cavity in front of the constriction point. The /u/-/i/ contrast is distinguished by
the steady-state formant frequency, also determined by the size of the front resonating
cavity. Both the position of the tongue and the amount of rounding of the lips determine the
size of this front resonating cavity. While the lips are not normally rounded during the
production of /sh/ or /s/, they are rounded for the production of /u/. When the two are
produced in sequence, as in the words "shoe" or "sue", some coarticulatory rounding occurs
during the fricative (Bell-Berti & Harris, 1979), lowering the spectral COG and the onset
frequency of the F2 transitions. This relationship is illustrated in Figure 1.2A.


Figure 1.2: [A] Hypothetical distributions of cues for the words "sue", "shoe", "see" and "she".
Ellipses represent one standard deviation of a normal distribution. [B] Hypothetical response
curves for different spectral centers of gravity of the frication noise.
When making a decision between /sh/ and /s/ (i.e., between "shoe" and "sue" or between
"she" and "see"), both spectral COG and F2 onset are cues to the distinction and operate in
a trading relationship (as described by Repp). The particular value of COG that is most
ambiguous (labelled /s/ on 50% of trials) differs depending on the F2 onset. If the formants
from "see" are spliced onto the noise, there are more "s" responses (Figure 1.2B, dotted line)
than if the formants from "she" are spliced onto the noise (Figure 1.2B, dashed line). This
effect is independent of the relationship between vowel rounding and COG (it works for
unrounded vowels and for rounded vowels). However, vowel rounding has a similar effect,
such that if the formant transitions from "see" are spliced onto the noise (Figure 1.2B, dotted
line), there are fewer "s" responses than when the transitions from "sue" are spliced onto the
noise (Figure 1.2B, solid line). Repp calls the effect of vowel rounding a context effect
because it is the result of the phonetic context (the vowel), while F2 onset frequency is a
secondary cue because it is a factor in how the fricative is produced. If the listener were
asked to respond according to the vowel instead of the fricative (see versus sue), then
the two effects would be labelled with the opposite terms. Effects of COG on vowel
labelling would be considered a context effect while the F2 of the vowel itself would be the
cue to the contrast. Thus what is considered a cue to one contrast can be a context effect
for another contrast.
Note that the issue of categories applies here as well. If the distributions of cues are
described for all /s/ and /sh/ words together, as indicated by the dashed lines in Figure 1.2A
and in Figure 1.1 for a single dimension, then the effect of the vowel is a context effect on
the /s/-/sh/ contrast. If the distributions of cues are described for each of the words
separately, as indicated by the solid and dotted lines in Figure 1.2A and 1.1, the context
effect disappears and is instead the result of the categories having different means. In both
cases, the two cues (COG and F2 onset frequency) and the vowel information are all
necessary to correctly identify the fricative.
The difference between context effects and multiple cues is subtle and crucially dependent
on context and has been controversial in the literature. For example, the effect of vowel
length on voicing judgements can be either a context effect or a cue. In word-initial voicing in
English, VOT is the main cue but the length of the following vowel also affects voicing
judgements with longer vowels leading to more voiced judgements (Summerfield, 1981).
Preceding vowel length operates in the same way for word-medial and word-final voicing
contrasts. This could be a second cue, because vowels are longer
when adjacent to a voiced segment than a voiceless one (Summerfield, 1975), or it could be
a context effect, as both vowel length and VOT (or closure duration in the case of word-medial voicing) increase with speaking rate (Summerfield, 1975). Either of these cases
would affect judgements about voicing in the same way. If vowel length varies directly with
voicing, it is a cue that should be taken into consideration and will influence the voicing
judgement directly (as the formant transitions for fricative place were in the previous
example). Alternatively, if VOT varies with speaking rate, then VOT should be interpreted
differently depending on the rate of speech. A number of studies have shown that while
listeners may use longer stretches of speech to determine speaking rate, more local cues
such as syllable length have the greatest effects (e.g., Allen & Miller, 1999; Kessinger &
Blumstein, 1997). Thus vowel length is one cue to speaking rate, and provides a context to
interpret VOT in the same way that vowel rounding provided a context to interpret COG in
the fricative case. A large body of literature has argued about whether vowel length is really
a cue to voicing or whether it is a context effect (e.g., Boucher, 2002), which is impossible to
summarize here. Through a number of production and perception studies, Miller and
colleagues have provided convincing evidence that vowel length has an effect on voicing
both as a rate cue, and as an independent cue to voicing (Allen & Miller, 1999; Wayland &
Miller, 1994).
In summary, multiple cues are available to distinguish any contrast and there is evidence
that listeners use multiple cues. Many of these cues must be interpreted with respect to the
surrounding context. For a particular (definition of a) contrast, cues to the contrast can be
distinguished from context effects per se (cues to a different contrast).

1.3.4 Cue weighting


The above examples are just a few of the many that illustrate the role of multiple cues to
phonetic contrasts. It is in fact generally acknowledged that every phonetic contrast has
multiple phonetic cues. It is also generally acknowledged that some cues have more of an
effect on labelling behaviour than others. It was noted by Lisker that while longer closure
durations tended to yield more "rapid" responses, no value of closure duration was
sufficiently long to be labelled as "rapid" if there was low-frequency buzz during the
entire closure (Lisker, 1978). Similarly, in most experiments the finding is that some cues will
have an effect only when the values of other cues are sufficiently ambiguous, while some
cues will have an effect regardless of the values of other cues. Cues such as the latter are
thought to have greater perceptual weight in the phonetic judgement than cues whose
influence is restricted to relatively ambiguous situations.

1.3.5 Cross-linguistic differences and distributional learning


Languages and language communities differ in how cues are weighted. For example, the
/i/-/I/ contrast is realized differently in Scottish English than in Southern British English
(Escudero & Boersma, 2004). In Scottish English, vowel length is a more important cue than
formant frequency. In Southern British English, formant frequency is a more important cue
than vowel length. This pattern can be seen in both perception data and the distributions of
cues in the language environment. One aspect of listeners' knowledge about their language
must therefore be the language-specific weightings of each cue.
Learning which cues are more important to a contrast is a more subtle aspect of the larger
problem for language learners of learning which cues signal a contrast in their language and
which cues do not. There is evidence that infants may use the distributional properties of
acoustic-phonetic cues to discover which cues signal a contrast in their language. Work by
Maye and colleagues has shown that after listening to tokens from a bimodal distribution of
cues along a particular acoustic-phonetic dimension that is not used in their language,
infants will discriminate between endpoint tokens from that dimension. After listening to
tokens from a unimodal distribution of acoustic-phonetic cues along the same dimension,
infants do not discriminate between endpoint tokens (Maye, Weiss, & Aslin, 2008; Maye,
Werker, & Gerken, 2002). The same distributional learning technique can be used to
improve discriminability of non-native contrasts in adults (Maye & Gerken, 2000).
This line of work provides evidence that both infants and adults are sensitive to distributions
of acoustic-phonetic cues in their environment and use these distributions to distinguish
relevant contrasts. There is also evidence that second language learners can learn the cue
weightings of a language they are exposed to as adults. Native Spanish speakers exposed
to either Scottish or Southern British English showed patterns of cue weighting for the /i/-/I/
contrast (not used in their native language) that were consistent with the dialect they
encountered (Escudero, 2000). American English listeners explicitly trained through
feedback can also learn to use the durational cue more than the spectral cue (the reverse of
their normal pattern) in making this native language distinction (Clayards, Aslin, &
Tanenhaus, 2005). A similar study trained American English listeners to use the burst cue
more than the formant transitions in distinguishing between different voiceless stops
(Francis, Baldwin, & Nusbaum, 2000). None of these studies used distributional information
as part of the training, but they demonstrate that adult listeners are flexible enough to learn
new cue weightings for even native language contrasts. Listeners can also learn new
distributions of a single cue in a native language contrast. Clarke and Luce (2005) showed
that after exposing native English listeners to sentences containing VOTs which had been
shifted towards those appropriate for a Spanish accent (shorter), listeners' category
boundaries on a VOT continuum were also shifted towards shorter VOTs.
Recent computational work has shown that distributional information can be used
successfully with unsupervised learning techniques to learn details about contrasts in a
language. Toscano and McMurray (2007) used a modified mixture model to show that the
number (and locations) of vowel categories in a language can be learned from the
distributions of first and second formant frequencies. Escudero and Boersma (2004) used a
stochastic optimality theory model to learn the correct cue weightings of the /i/-/I/ contrast for
Scottish and Southern British English.
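To make the mixture-model idea concrete, the following is a minimal sketch, in Python, of how the number and location of categories can be recovered from unlabeled distributional data. The synthetic F1/F2 values and the use of scikit-learn's GaussianMixture class are illustrative assumptions; this is not the model actually used by Toscano and McMurray (2007).

    # Minimal sketch: discover vowel categories from F1/F2 distributions alone.
    # Synthetic data and the two-cluster structure are invented for illustration.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Hypothetical F1/F2 values (Hz) for two vowel categories
    vowel1 = rng.normal(loc=[300, 2300], scale=[40, 120], size=(200, 2))
    vowel2 = rng.normal(loc=[430, 2000], scale=[40, 120], size=(200, 2))
    tokens = np.vstack([vowel1, vowel2])          # unlabeled tokens

    # Fit mixtures with different numbers of components and compare fit (BIC);
    # the inferred number of categories is the component count with lowest BIC.
    bics = {k: GaussianMixture(n_components=k, random_state=0).fit(tokens).bic(tokens)
            for k in (1, 2, 3, 4)}
    best_k = min(bics, key=bics.get)
    print("inferred number of vowel categories:", best_k)
    print("category means (F1, F2):",
          GaussianMixture(n_components=best_k, random_state=0).fit(tokens).means_)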

The task of learning the appropriate cues for a contrast is therefore not unlike the task of
learning which dimensions constitute a potential contrast in one's language. There is
growing evidence that distributional information is important in this process and that even
adult listeners are sensitive to changes in the distributional properties of the language
around them. They are flexible enough to at least partly learn new language categories and
the cues that distinguish them and to learn new cue weighting strategies.

1.3.6 Summary
Multiple cues are available to the listener for any phonetic contrast. These cues must be
interpreted with respect to the context that they occur in. Cues may also be combined and
some are given more perceptual weight than others. Furthermore, languages (or language
communities) vary in how they weight cues to a particular contrast. Cue weighting must
therefore be part of the linguistic knowledge that a listener (or speaker) has. Cue weightings
may be flexibly adapted throughout a listener's lifetime.

1.4 Chapter summary


In this chapter I have laid out two of the fundamental characteristics of the speech signal.
The first is that for any given category (linguistic unit), there is variability in the articulation
and thus in the acoustic signal. Some of this variability is caused by a number of known
factors and can be partialed out by the speech scientist and presumably also by the listener.
In particular, variation due to context may be an important type of information that is useful
to the listener. Such a process will never account for all the variability, however, and thus
variability remains a fundamental characteristic of the speech signal. The second
fundamental characteristic is that for any given phonetic contrast, multiple acoustic cues are
available to the listener. Listeners use multiple cues and some cues are given more
perceptual weight than others. Which cues are relevant to distinguish between contrasts and
which cues should be used more than others for a particular contrast varies by language
(and by dialect). Thus the speech signal is fundamentally probabilistic and multi-dimensional
and its distributional structure may be part of a listener's linguistic knowledge.

The listener has the task of recovering the message of the speaker from this probabilistic
and multidimensional signal. In the next chapter I will discuss how the listener may solve this
problem.


Chapter 2 A framework for variability and multiple cues


In the previous chapter I laid out the structure of the speech signal. In this chapter I will
focus on the goal of speech perception and the task of the listener. I will argue that, given
the probabilistic and multi-dimensional nature of the speech signal, what the listener needs
to know is the probabilistic distribution of each acoustic cue for each linguistic category they
are trying to identify.

2.1 The ideal listener


In the last chapter I established that the speech signal is inherently probabilistic and
multidimensional. If we then consider the task of choosing the most likely message given a
number of probabilistic acoustic-phonetic cues, we can ask what behaviour would optimize
this task. An increasingly popular way to investigate human behaviour is through normative
or optimal models that (a) formalize goals as mathematical criteria, (b) search for behaviours
that optimize the criteria, and (c) compare the optimal behaviours with human behaviours.
Ideal observer models are increasingly being applied to perception in many domains and at
multiple levels (Anderson, 1990; Barlow, 1957; Geisler, 1989; Griffiths & Tenenbaum, 2006;
Todorov, 2004). In this chapter I will lay out an ideal observer, or what I will call an ideal
listener, model of speech perception and word recognition. The implications for classic word
recognition effects of this kind of Bayesian description have been sketched in some detail
recently by Norris and colleagues (Norris, 2006; Norris & McQueen, 2008) and Flemming
(2007). Flemming in particular has explored some of the implications for the acoustic
realizations of speech. What the present discussion adds is a more detailed account of the
role of phonetic detail and the information available in the signal.
In an ideal observer/listener model one wants to know (i) what is the information available to
the observer/listener, (ii) what is the task that the listener wants to accomplish, and (iii) given
these two constraints, what is the optimal solution. In these models, finding the optimal
solution is guided by several basic principles. The first principle is to acknowledge that the
world provides only probabilistic information, which is inherently ambiguous at any given
time. For our task the probabilistic information consists of the acoustic-phonetic cues
available in the speech signal. In order to take full advantage of this probabilistic information
these models use the entire probability distribution for each information source. This is a
point I will return to shortly. A second principle is that decisions should be made using all the
available information. This includes prior knowledge as well as whatever sensory information
sources are currently available to the perceptual system, in our case the multiple cues
available for any given speech contrast. Crucially, each of these information sources is
evaluated and weighted according to how useful it is to distinguish members of the contrast.

2.1.1 The goal of the listener


The goal of the listener is to understand the message of the speaker. There are many
possible messages that the speaker could intend to utter and the listener must decide which
is the most likely. To define the message in a more precise way we must choose a linguistic
unit such as a phrase, word or syllable to work with. The message of the speaker is of
course more complex than any single linguistic unit, or string of linguistic units. Furthermore,
the goal of the listener is undoubtedly to understand the content of the message (the
intention or speaker meaning) and not to recognize a string of linguistic units. Nevertheless,
to characterize the goal of the listener as trying to identify a single linguistic unit is a
necessary first step, if an oversimplification.
As discussed earlier, the choice of linguistic unit is an important theoretical and
methodological issue. Potential units that have been argued for include short phrases
(Bybee, 2000), lexical items (McMurray, 2004), syllables (Mehler, Segui, & Frauenfelder,
1981; Pallier, Sebastian-Galles, Felguera, Christophe, & Mehler, 1993), phones (Norris &
Cutler, 1988; Pitt & Samuel, 1990) and features (Keyser & Stevens, 2006). Another possible
unit is the morpheme. It is also possible that the most appropriate unit varies with the task
and that multiple units operate simultaneously within a language. Given that the goal of the
listener is to recognize the message rather than the linguistic units, it seems appropriate to
keep the unit of analysis as close to the message as is justifiable. For the purposes of
discussion in this dissertation, I will use lexical items as the unit of analysis. For the most
part however I will discuss lexical contrasts which are simultaneously consistent with a
lexical, syllable, phone or feature level of contrast.

2.1.2 Optimizing the goal
As I stated previously, the task of the listener is to choose the most likely intended message
of the speaker. In order to do this they must decide how probable each of the alternatives is.
A straightforward way to evaluate the probability that some event occurred (such as a
speaker uttering a particular word) given some evidence (such as the speech signal) is
Bayes' rule, given in (1), where wordA is the word being evaluated, stimX is the acoustic
signal and i is one example of the set of all words being considered.

(1)    P(wordA | stimX) = P(stimX | wordA) P(wordA) / Σ_{word_i ∈ words} P(stimX | word_i) P(word_i)

This equation states that the probability that wordA was spoken given some speech
stimulus, stimX (the posterior probability), can be calculated if we know (a) the probability
that that stimulus would occur if the word was spoken, P(stimX | wordA); if stimX is a
particular acoustic cue, then this is the probability distribution of that cue for wordA, often
called the likelihood distribution; (b) the baseline or prior probability of that word without
considering the acoustic signal, P(wordA); and (c) this same calculation for all of the other
words being considered (the denominator in (1)).
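As a concrete illustration, equation (1) can be written directly as a short computation; the likelihood and prior values below are invented purely for the example.

    # Equation (1) as code: posterior probability of each candidate word given a
    # stimulus. Likelihood and prior values are invented for illustration.
    def posterior(likelihoods, priors):
        """likelihoods[w] = P(stimX | w); priors[w] = P(w). Returns P(w | stimX)."""
        evidence = sum(likelihoods[w] * priors[w] for w in likelihoods)
        return {w: likelihoods[w] * priors[w] / evidence for w in likelihoods}

    likelihoods = {"beach": 0.02, "peach": 0.15, "preach": 0.01}   # P(stimX | word)
    priors      = {"beach": 0.5,  "peach": 0.3,  "preach": 0.2}    # P(word)
    print(posterior(likelihoods, priors))   # here P(peach | stimX) comes out largest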

2.1.3 The likelihood distribution


One major advantage of the Bayesian model is that it is computationally explicit. What is a
potential information source and how useful it is, is explicit in the model. In this way it
provides straightforward predictions about how cues should be evaluated by an optimal
listener. That is, listeners should make use of the entire probability distribution of acoustic-phonetic cues and, in particular, the precision or amount of certainty about the world that a
particular cue provides should be inversely proportional to the variance of that cue in the
world.
The likelihood function (or probability density function) of an acoustic-phonetic cue for a
particular word (or other linguistic category) describes how often each value of the cue
occurs when that word is produced. Figure 2.1 shows two different hypothetical
probability distributions of a particular acoustic-phonetic cue, voice onset time (VOT) for a
particular word (e.g. peach). The solid line corresponds to a case in which the cue is
produced more consistently and thus has a narrower distribution. The dotted line
corresponds to a case in which the cue is produced less consistently and has a wider
distribution.

Figure 2.1: Two hypothetical probability distributions of voice onset time (VOT) for the word
peach.
In both of these cases, some values of the cue occur fairly often with this word. Other values
do not occur very often with this word and perhaps occur more often with another word,
such as beach. In the case of the narrow distribution there are a few values that are very
likely to occur with this word and most other values are very unlikely to occur. In the case of
the wider distribution, many values are somewhat likely, but no values are particularly likely.
So if the listener hears a particular value of VOT, such as 30ms, in the case of the narrow
distribution there will be higher confidence that it came from the peach category than in the
case of the wide distribution. This is what is meant by precision (amount of certainty) being
inversely proportional to the variance of the cue.

2.1.4 The prior


An important component in the calculation of Bayesian models is the prior probability of the
word in question. Even if there is very good evidence for a word in the acoustic signal, if that
word is very unlikely in general, then the probability that the word was the one produced is
lower. Figure 2.2 shows how acoustic evidence and prior probability interact to affect the
probability of a given word. StimX is an acoustic cue for wordA. WordA has some likelihood
distribution for this cue such that higher values of the cue occur frequently with wordA and
lower values occur with some other word, wordB. As the value of X increases, the probability
of word A increases. When the prior probability of word A is equal to the alternative (dotted
lines in Figure 2.2), then the posterior probability of wordA (from (1) above) is greater than
0.5 for all cue values over 25. When the prior probability of wordA is greater than the
alternative (solid lines in Figure 2.2), more values of X are above this 0.5 threshold. When
the prior probability of wordA is less than the alternative (dashed lines in Figure 2.2), fewer
values of X are above this threshold. The greater the amount of certainty (precision) that
stimX provides, the smaller the effect of prior probability. In panel A of Figure 2.2, the
probability distribution of cue X for wordA is relatively narrow (solid line in Figure 2.1),
providing a high degree of certainty (precision). In panel B of Figure 2.2, the probability
distribution of cue X for word A is relatively wide (dotted line in Figure 2.1), providing less
certainty. Thus the more information provided by the acoustic signal, the less the listener
should rely on the prior probability of the words to decide which is more likely.

Figure 2.2 : The effect of prior probability and stimulus uncertainty on the posterior
probability of lexical candidates. [A] High stimulus certainty, i.e., narrow probability
distributions [B] Low stimulus certainty, i.e., wide probability distributions. Dotted lines are
equal prior probability for wordA and the alternative. Solid lines are higher prior probability
for wordA. Dashed lines are lower prior probability for wordA.
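The interaction illustrated in Figure 2.2 can be sketched numerically as follows. The means, spreads and priors below are hypothetical and chosen only to show how the category boundary shifts with the prior under narrow versus wide likelihood distributions.

    # Sketch of Figure 2.2: the prior and the width of the likelihood distribution
    # jointly determine the posterior. All numbers are hypothetical.
    import numpy as np
    from scipy.stats import norm

    x = np.linspace(0, 50, 501)                      # values of cueX

    def posterior_wordA(x, sigma, prior_A):
        like_A = norm.pdf(x, loc=35, scale=sigma)    # wordA favoured by high cue values
        like_B = norm.pdf(x, loc=15, scale=sigma)    # the alternative, wordB
        num = like_A * prior_A
        return num / (num + like_B * (1 - prior_A))

    for sigma in (4.0, 10.0):                        # narrow vs. wide likelihoods
        for prior_A in (0.5, 0.75, 0.25):            # equal, higher, lower prior
            p = posterior_wordA(x, sigma, prior_A)
            boundary = x[np.argmin(np.abs(p - 0.5))] # cue value where P(wordA) = 0.5
            print(f"sigma={sigma:4.1f} prior={prior_A:.2f} boundary={boundary:5.1f}")
    # With narrow likelihoods the boundary barely moves with the prior;
    # with wide likelihoods the same change in prior shifts it much further.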

There are a number of ways that the prior probability could be calculated by the listener.
One simple way would be to use an estimate of the frequency of the word in the language.
All other things being equal, this is the probability that the word would be said by a speaker.
A number of studies have found effects of word frequency on measures of word recognition.
The most extreme version is the Ganong effect where only one alternative along an
acoustic-phonetic continuum is a word, such as a VOT continuum varying from dask to
task. In this case the categorization boundary is shifted so that there are more task
responses, whereas when the endpoints of the continuum are dash and tash, the
category boundary shifts in the opposite direction (Ganong, 1980).
The overall frequency of a word in a language may not be a very good indicator of how likely
it is to occur in a particular situation. For example, beach may be much more frequent than
peach overall, but at a farmers' market the frequency of peach may increase. Similarly,
peach may be more frequent than beach after the word ripe. One way that any kind of
top down information (situational knowledge, or probability given the linguistic context) could
be characterized is by changes in the prior probability of words. This may be functionally
equivalent to priming in some cases. The advantage of describing this phenomenon in terms
of prior probability is that it may be possible to make quantitative predictions about the
effects of situational context. If it is possible to estimate the prior probability of a word for a
given situational or linguistic context, then one can predict how much this should affect
listeners' interpretation for any given acoustic signal (as in the different lines in Figure 2.2).
The absolute magnitude of the effect depends on the certainty provided by the acoustic
signal (as in the different panels of Figure 2.2) but the relationship between different prior
probabilities will remain constant. Of course it may be difficult both for the listener and the
experimenter to make good estimates of prior probabilities in many real world contexts. It
may be the case that frequency or other estimates of probability given linguistic context
provide a good heuristic for most situations, and that only in strongly biasing situational
contexts, or in contexts where the listener has extensive experience, would an effect of the
situational context on the listener be seen. I will return to the idea of linguistic context in later
chapters to discuss in more detail how the kind of phonetic and prosodic context discussed
earlier could work into this model.
In this dissertation I will not test any predictions about prior probabilities and their effects on
speech perception or word recognition. For now I will assume that the listener does not have
any reason to believe that one word is more probable than another a priori, that is, all words
have equal prior probability. This is certainly not the case in real life, but in some
experimental paradigms (e.g., two-alternative forced choice) it may be a reasonable
assumption.

2.1.5 Competitor sets


The equation in (1) shows that to evaluate the posterior probability of even a single word, we
must take into account the other possible words. The denominator of the equation includes
calculations for the likelihood of these other words. The more probable the alternatives are,
the less probable wordA becomes. Conversely, in the absence of any alternatives, even a
word which does not fit the acoustic signal well can be very probable. It is clear from this that
the number of potential competitors will affect the posterior probability of a word. If there are
a number of words for which the acoustic signal in question has some non-zero probability,
then the probability mass will be distributed among them. The more likely (frequent) the
alternatives are, the more of the probability mass they will take up. From this we can explain
the classic competitor and neighbourhood density findings in word recognition and their
interactions with frequency. Flemming (2007) outlines in detail how these effects are
accounted for by the ideal listener model.

2.1.6 Contrast and certainty


To explore in more detail the relationship between acoustic-phonetic cues and the behaviour
of the listener, we will consider a hypothetical case where there are just two alternatives,
wordA and wordB, which vary along a single acoustic dimension, cueX, which is the only
correlate of the phonetic contrast which distinguishes them. As stated before, we will
assume for this hypothetical case that the prior probabilities of each word are equal. The
equation in (1) can now be restated as (2).

(2)    P(wordA | stimX) = P(stimX | wordA) / [P(stimX | wordA) + P(stimX | wordB)]
This simplified version of the Bayesian model is illustrated in Figure 2.3. In panel A are two
hypothetical likelihood distributions of cueX for the two categories (wordA, dark lines, and
wordB, light lines).

Figure 2.3: Hypothetical likelihood distributions for two words (panels A, B and C) and the
resulting posterior probabilities of wordA (panels D, E and F). [A] Hypothetical distributions
for wordA (dark lines) and wordB (light lines). [B] Hypothetical distributions in panel A (solid
lines) compared to two distributions with greater variance (dotted lines). [C] Hypothetical
distributions in panel A compared to two distributions with closer means (dotted lines). [D]
Posterior probability of wordA given the likelihood distributions in A. [E] Posterior probability
of wordA given the likelihood distributions in B. [F] Posterior probability of wordA given the
likelihood distributions in C.
Panel D shows, for each value of cueX, the posterior probability of wordA given these
likelihood distributions, calculated from (2). As the value of cueX increases, the posterior
probability of wordA increases. It does not increase linearly however. For most values of
cueX the posterior probability is either 0 or 1. There is a small range from 15 to 35 where
there is some uncertainty. This area corresponds to the range in panel A where the two
distributions overlap. In this range it is possible that either word generated the stimulus,
though one word is usually more probable than the other. The slope of the function in panel
D depends critically on the overlap in panel A. This overlap depends on two factors (1) the
variance of the two distributions and (2) the separation between their means. If the variance
of the distributions is increased, as in the dotted lines in panel B, the slope of the function
decreases as in panel E (dotted line). If the means of the distributions are brought closer
together, as in the dotted lines in panel C, the slope of the function also decreases as in
panel F (dotted line). Therefore, using the posterior probabilities we can predict for each
stimulus (value along the x axis), how it should be categorized by the listener, and how
certain they should be about that decision.
The comparisons in Figure 2.3 illustrate an important point. The sets of likelihood
distributions represent hypothetical patterns of acoustic-phonetic cues in a language. This is
the information in the signal in general that is available to distinguish a particular contrast.
Some patterns, such as the pattern in panel A provide more certainty about which member
of the contrast was probably produced. In this example, a listener hearing stimulus 20 would
calculate that the probability that the speaker produced wordA is approximately .05, and that
the probability that the speaker produced wordB is thus .95. If the structure of the
language was like the dotted lines in panels B or C, the listener hearing the same stimulus
would conclude that the probability that the speaker produced wordA is approximately .20,
and the probability that the speaker produced wordB is thus .80. In both cases the best
choice is wordB, but in the first case the listener should be more confident in that choice.
Thus while the amount of information that a particular acoustic-phonetic cue provides about
a particular category (e.g., word) depends only on the variance of that cue, the amount of
information that an acoustic-phonetic cue provides about a particular contrast (such as the
beach peach contrast) depends on the variance and the means of both likelihood
distributions.
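The dependence of the posterior on distributional overlap can be illustrated with a short sketch of equation (2) applied to Gaussian likelihoods; the means and standard deviations below are hypothetical and do not correspond to the exact values used in Figure 2.3.

    # Sketch of Figure 2.3: the posterior for wordA depends on the overlap of the
    # two likelihood distributions (their variances and means). Values are invented.
    from scipy.stats import norm

    def posterior_A(x, mean_A, mean_B, sd):
        like_A, like_B = norm.pdf(x, mean_A, sd), norm.pdf(x, mean_B, sd)
        return like_A / (like_A + like_B)            # equation (2): equal priors

    x = 20.0
    print(posterior_A(x, mean_A=35, mean_B=15, sd=5))    # little overlap: near 0
    print(posterior_A(x, mean_A=35, mean_B=15, sd=12))   # greater variance: closer to 0.5
    print(posterior_A(x, mean_A=30, mean_B=20, sd=5))    # closer means: closer to 0.5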
One straightforward way to quantify the amount of overlap between two distributions is to
borrow the calculation of D-prime from signal detection theory. Signal detection theory is
concerned with the task of deciding whether a stimulus came from one distribution (the
signal) or another (the noise). This is exactly the same problem that a listener faces in
deciding which of two words generated a speech signal. D-prime is a measure of how
precise a listener can be in this task. The less overlap between the distributions, the easier
the task becomes. Just as is illustrated by Figure 2.3, D-prime depends on the separation
(distance between the means) and the spread (variance) of each distribution. The formula
for D-prime is given in (3).

(3)    D′ = (μ1 − μ2) / [(σ1 + σ2) / 2] = separation / spread

Thus if we know the likelihood distributions of a particular acoustic-phonetic cue for both
categories of a contrast, we have a quantitative prediction about how useful the cue should
be in distinguishing that contrast.
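As a worked example of equation (3), the sketch below computes D-prime for a hypothetical VOT contrast under two different amounts of within-category spread; the numbers are invented.

    # Equation (3) as code: D-prime for a cue given the two category distributions.
    def d_prime(mean1, mean2, sd1, sd2):
        separation = mean1 - mean2
        spread = (sd1 + sd2) / 2.0
        return separation / spread

    # Hypothetical VOT distributions (ms) for a voiceless vs. voiced category
    print(d_prime(mean1=50, mean2=0, sd1=8, sd2=8))    # 6.25: highly separable cue
    print(d_prime(mean1=50, mean2=0, sd1=14, sd2=14))  # ~3.6: more overlap, less useful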

2.1.7 Phonetics
It is in fact possible to measure the likelihood distributions of a particular acoustic-phonetic
cue for a particular contrast. This is what phoneticians do when they conduct careful
acoustic studies of a contrast. Guided by knowledge of the articulations involved and their
acoustic consequences, phoneticians look for acoustic cues which seem to distinguish a
contrast. In general however, when statistical analyses are performed on the distributions of
a cue, they tend to compare the global means and the spread of individual speakers' means
for the two categories. This is not the same as measuring the variance of the cue for a
category overall. I argue that a better characterization of the information in the acoustics of a
language is to compare the distributions of tokens of multiple speakers. A notable example
of such an approach is Lisker and Abramson's seminal study of VOT (Lisker & Abramson,
1964). Here the authors established the importance of VOT in distinguishing stop categories
in several languages using a distributional analysis.

2.1.8 Predictions of the ideal listener


The ideal listener model predicts that the precision offered by a particular cue is inversely
proportional to the overlap between the likelihood distributions of the two categories. One
way to test this prediction is by explicitly manipulating the likelihood distributions of a
particular cue. In Chapter 4 I conducted just such a manipulation of the likelihood
distributions of VOT and compared listeners' responses to the posterior probabilities
predicted by the ideal observer model.

2.2 Multiple Cues
One of the important aspects of the speech signal described in Chapter 1 is
multidimensionality. For any contrast there are multiple acoustic-phonetic cues available to
the listener. One important question in the speech literature is how the listener combines
these multiple information sources. The simplest model is a linear combination of cues. We
could therefore ask whether a linear model is optimal given the speech signal and whether it
is the best characterization of listeners' behaviour.

2.2.1 The linear model


The Bayesian model described earlier can be expanded in a straightforward way to
incorporate multiple cues. If we consider the case of two cues, X and Y then the model in (2)
can be rewritten as (4) under the assumption that the cues are conditionally independent
(discussed in more detail in Chapter 5).
(4)    P(wordA | cueX, cueY) = P(cueX | wordA) P(cueY | wordA) / [P(cueX | wordA) P(cueY | wordA) + P(cueX | wordB) P(cueY | wordB)]

Figure 2.4 shows how the two cues interact in this model. In panel A are the likelihood
distributions of cueX. In panel B are the likelihood distributions of cueY. In this hypothetical
situation, cueX has less overlap between the categories than cueY. In panels C and D we
see two different ways to represent the posterior probability of stimuli which vary along both
dimensions. Each line in panel C represents a different value of cueY. The values of cueY
are represented by shaded circles in panel B. The lightest line corresponds to the lightest
shaded circle, etc. Panel C shows how the posterior probability of wordA changes with cueX
for each of the values of cueY. Just as before, we see that as cueX increases, the posterior
probability of wordA increases. As the value of cueY changes, the functions shift. If values of
cueY are more consistent with wordA (darker shaded circles), the functions shift so that the
posterior probability of wordA increases overall. If values of cueY are more consistent with
wordB (lighter shaded circles), the functions shift in the opposite direction. Panel D shows
the same data but this time each line varies with cueY and represents a single value of cueX
(corresponding to the shaded circles in panel A). From panels C and D in Figure 2.4 we can
see that cueY has a smaller effect on posterior probability and therefore a smaller predicted
effect on listeners' judgements than cueX. This is a direct result of the increased overlap for
cueY.

Figure 2.4: Hypothetical likelihood distributions of two cues and their combined posterior
probabilities predicted by the linear model. [A] Hypothetical likelihood distributions of cueX
for wordA (dark line) and wordB (light line). [B] Hypothetical likelihood distributions of cueY
for wordA (dark line) and wordB (light line). [C] Posterior probability of wordA for all values of
cueX and 5 different values of cueY. CueY values are indicated by shaded circles in panel
B. [D] Posterior probability of wordA for all values of cueY and 5 different values of cueX.
CueX values are indicated by shaded circles in panel A.
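Equation (4) can likewise be written out directly. In the sketch below the distributions are hypothetical, with cueX given less within-category spread (relative to the separation of the means) than cueY, mirroring Figure 2.4.

    # Equation (4) as code: combining two conditionally independent cues.
    # Means and standard deviations are hypothetical; cueX is the more reliable cue.
    from scipy.stats import norm

    def posterior_A(x, y):
        like_A = norm.pdf(x, 40, 5) * norm.pdf(y, 40, 15)   # P(cueX|A) * P(cueY|A)
        like_B = norm.pdf(x, 10, 5) * norm.pdf(y, 10, 15)   # P(cueX|B) * P(cueY|B)
        return like_A / (like_A + like_B)

    # Moving cueY from B-like to A-like shifts the response, but less than the same
    # change in cueX does, because cueY's distributions overlap more.
    print(posterior_A(x=25, y=10), posterior_A(x=25, y=40))
    print(posterior_A(x=10, y=25), posterior_A(x=40, y=25))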

2.2.2 Trading relations


The general pattern shown in Figure 2.4 is consistent with the trading relations data
described earlier. In those paradigms, the value of one cue is varied along a continuum
while the other cue takes on a number of values. Listeners are asked to categorize the
stimuli and they produce response curves not unlike those in Figure 2.4. Researchers using
this paradigm also interpret the data in a similar way. The amount of shift produced by the
second cue is interpreted as its strength. It is often noted that the effects of weaker cues are
greatest in the ambiguous region of stronger cues. This is also predicted by the linear model
(e.g., panel C in Figure 2.4). There have been no direct tests of the linear model in a trading
relations paradigm. A similar model has been used to predict trading relations effects from
the distributions of real acoustic data (Toscano & McMurray, to appear).

2.2.3 Context effects


As described earlier, many acoustic cues need to be interpreted with respect to their
surrounding context. One way that context could be represented in this model is for the
likelihood distributions to be conditionalized on the context. For cases of context within a
word (or other category) the context can be captured simply by using a word-sized category
as in the case of fricative rounding in words like sue described earlier. When hearing the
frication noise of the /s/ listeners might consider the likelihood of the spectral COG given the
category sue and the category see (as well as many other words). For fricatives with
higher spectral COG, the posterior probability of see will be higher. For fricatives with lower
spectral COG, the posterior probability of sue will be higher.
For context effects that occur between words (or other categories), another characterization
is necessary. In cases like English place assimilation (e.g. teen/team beat) the distribution
of formant frequencies for teen in the context of beat will be different than for teen in the
context of talk and will overlap with the distributions of team in either context. Thus given
these four possible lexical combinations, there are four relevant likelihood distributions
P(cueX | teen, beat), P(cueX | team, beat), P(cueX | teen, talk), and P(cueX | team, talk)
where cueX is the relevant formant frequency. Comparing the posterior probabilities of each
of the potential lexical combinations would allow the listener to both account for the
assimilation and predict the following word. In cases of assimilation, P(cueX |team, beat)
would be greater than any of the others. Adding a temporal component to the comparison,
the evidence at different points in the utterance could be compared. When there is further
evidence for the word beat (the case of regressive inference) the relevant likelihood
distributions for choosing between teen and team would be P(cueX | teen, beat)*P(cueY |
teen ,beat), P(cueX | team, beat)*P(cueY | team, beat), P(cueX | teen, talk)*P(cueY | teen,
talk) and P(cueX | team, talk)*P(cueY | team, talk), where cueY is the acoustic evidence for
beat. This approach can account for any contextual effect. It has the disadvantage
however of requiring knowledge of each conditional probability distribution. As a result, the
listener would have to store in memory a large set of conditional probability distributions for
every pair of words in the lexicon.
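The comparison of lexical combinations described in the previous paragraph can be sketched as follows; the conditional likelihood values are invented solely to show the form of the computation.

    # Sketch of the regressive-inference comparison: evaluate each lexical
    # combination by the product of its conditional likelihoods (equal priors).
    # All likelihood values are invented for illustration.
    candidates = {
        ("teen", "beat"): 0.05 * 0.30,   # P(cueX | teen, beat) * P(cueY | teen, beat)
        ("team", "beat"): 0.20 * 0.30,   # assimilated formants fit this pair better here
        ("teen", "talk"): 0.02 * 0.01,
        ("team", "talk"): 0.20 * 0.01,
    }
    total = sum(candidates.values())
    for pair, score in candidates.items():
        print(pair, round(score / total, 3))
    # With these invented numbers, (team, beat) receives the highest posterior.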
A third, and more satisfying alternative, is that assimilated tokens may simply represent the
tails of the distributions. A partially assimilated token of teen is somewhat less likely to be
teen and somewhat more likely to be team than an unassimilated token, though the
likelihoods may still be in favour of teen. Since coronals more often assimilate to labials
than vice versa, the distributions of formant values for teen will be broader than for team.
This in itself will create a bias towards reporting teen even for relatively assimilated forms.
Similarly, the distribution of formants in teen preceding beat will be different from that
preceding talk, so an assimilated formant makes beat more likely relative to talk than an
unassimilated one. In this account, assimilation is simply another cue to the following
context and listeners do not need to track or store any conditional probabilities or
relationships between words.
2.3 Conditional independence
This model is only appropriate if the cues are conditionally independent, meaning that the
values of cueX must be independent of (not correlated with) the values of cueY under
the given conditions. Thus if we are interested in P(wordA | cueX, cueY), then cueX and cueY
must be independent for wordA. If we were to rephrase the problem so that we were
interested in the probability of wordA followed by wordB given cueX and cueY , then cueX
and cueY would have to be independent when wordA was followed by wordB. Chapter 5
investigates the independence of multiple cues for single words. When independence is
found, this model may be an appropriate model of the perceptual problem of the listener. In
cases where conditional independence is not found, the model should either be amended
to reflect the correlations between cues, or the cues should be redefined so that they are
again independent. Chapter 5 argues for an example of the latter case. It is crucial to
remember that conditional independence is an assumption of this model which must be
tested before the model can be applied to a perceptual problem.
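A simple way to probe the conditional independence assumption is to ask whether two cues are correlated across productions of a single word. The sketch below does this with synthetic data; the real test, on production data, is the subject of Chapter 5.

    # Sketch of a conditional-independence check: within one word category, are two
    # cues correlated across productions? Synthetic data for illustration only.
    import numpy as np

    rng = np.random.default_rng(1)
    # Hypothetical productions of one word: VOT (ms) and F1 onset (Hz)
    vot = rng.normal(50, 8, size=500)
    f1  = rng.normal(400, 30, size=500)          # generated independently of VOT

    r = np.corrcoef(vot, f1)[0, 1]
    print(f"within-category correlation: {r:.2f}")
    # A correlation near zero is consistent with conditional independence for this
    # word; a reliably non-zero correlation means the model in (4) must be amended
    # or the cues redefined.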

2.3 Evidence for the linear model
The prediction made by the linear model is that individual cues will be combined as a
weighted sum such that their effect on the listener depends on the precision of each cue.
There are a number of studies which find evidence for this prediction in non-speech
domains.
Investigations of multi-modal cue integration have compared participants' precision in a
single modality with their precision when using both modalities. For example, Ernst and
Banks (2002) investigated the integration of visual and haptic cues to width judgements.
Participants first judged the width of various bars (by judging whether they were wider than a
standard) using only visual or only haptic information. From these judgements, the
researchers calculated the precision of each cue from the slope of the response curve. A
steeper slope corresponded to more precision (as discussed earlier). Participants then
judged the width of the bar using both visual and haptic information. Participants'
judgements using both cues were well predicted by their judgements using each individual
cue in a linear model such as the one in (4). Importantly, adding noise to one modality
decreased the precision of that modality and increased the reliance on the other modality.
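The linear combination tested by Ernst and Banks is often written as reliability (inverse-variance) weighting of the single-cue estimates. The following sketch uses hypothetical width estimates and variances; it is a standard formulation of that weighting scheme, not the authors' analysis code.

    # Reliability-weighted (inverse-variance) cue combination. Numbers are hypothetical.
    def combine(est_visual, var_visual, est_haptic, var_haptic):
        w_v = (1 / var_visual) / (1 / var_visual + 1 / var_haptic)
        w_h = 1 - w_v
        combined = w_v * est_visual + w_h * est_haptic
        combined_var = 1 / (1 / var_visual + 1 / var_haptic)
        return combined, combined_var

    print(combine(est_visual=55, var_visual=4, est_haptic=50, var_haptic=16))
    # Adding noise to vision (raising var_visual) shifts the combined estimate
    # toward the haptic estimate, as observed experimentally.
    print(combine(est_visual=55, var_visual=36, est_haptic=50, var_haptic=16))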
Other studies have found similar results for combining visual and auditory information for
sound localization (Battaglia, Jacobs, & Aslin, 2003). In this case, listeners relied more on
visual information, even when the visual information was degraded. This likely reflects the
overall higher reliability of the visual system over the auditory system for localization tasks.
The linear model of cue combination has also been tested for a non-speech auditory
categorization task (Holt & Lotto, 2006). In this task listeners were trained to categorize
noise stimuli that varied along two dimensions, the center frequency (CF) and the
modulation frequency (MF) of the noise. In this case, the distributions of each cue were
controlled by the experimenters and chosen so that each dimension was equally precise in
distinguishing between the two categories (pilot testing matched the steps along each
dimension for discriminability). After training, listeners were asked to discriminate stimuli that
had not been used in training and were not given feedback to shape their performance.
Listeners used both cues, but relied more on the CF than the MF cue, suggesting they had
some reason to prefer the CF cue prior to entering the experiment. The distributions were
then changed to favour the MF cue by either moving the means closer along the CF
dimension or by increasing the variance of each category along the CF dimension.
Interestingly, only the variance manipulation decreased listeners' reliance on the CF cue. It
is unclear why moving the means together did not have the same effect.
The predictions of the linear model for speech can be tested in a similar manner to the
foregoing auditory categorization task. In other cases of multi-cue integration, each cue is
tested in isolation and predictions from the single cue cases are compared to behaviour with
multiple cues. In the case of speech this is more difficult. For example, it is very difficult to
ask listeners to judge whether they heard beach or peach without presenting them with
stimuli that contain some value of VOT, burst amplitude, vowel length, F1 etc. It is simply not
possible to eliminate these cues without qualitatively altering the percept and thus taking the
task out of the domain of speech perception. Some cues, particularly those that rely on
specific frequency bands, may be selectively obscured by noise or band pass filtering of the
speech. Others, such as temporal cues, are much harder to obscure. This makes the
traditional paradigm difficult to apply to studies of speech perception.
There is a second way to test the linear model however. In the studies of visual and
multimodal perception it is difficult to measure the likelihood distributions directly. In part this
is because the structure of the world that the perceiver is trying to recover is ill defined (i.e.,
it does not fall into natural categories). In the case of speech perception we have the
opposite situation. While it is difficult to measure the effect of one cue in isolation, the
structure of the message is well defined, as are the phonetic categories, and thus we can
measure the likelihood distributions directly. This was discussed earlier in this chapter. Thus,
one way to test the model with speech data is to examine in detail the structure of the
likelihood distributions. From these distributions, predictions can be made about the amount
of precision offered by each cue, as in the auditory categorization task (Holt & Lotto, 2006),
and the amount that each cue should contribute to listeners' judgements. In Chapters 3 and
4 I will directly manipulate the distributions of one acoustic-phonetic cue, holding all other
cues constant.
An assumption of the linear model is that only distributions of individual dimensions are
important. No relationships that can only be represented in multiple dimensions are captured
by this model. An example of such a relationship is a case where each cue alone provides
only partial information about category identity, but two cues together provide complete
separation in multi-dimensional space because the cues are correlated within-category (this
relationship is further explained in Chapter 5). The linear model predicts that speakers will
not produce distributions with this kind of relationship and that listeners will not be able to
learn categories with this kind of relationship. Evidence from speech and non-speech
auditory categorization tasks suggests that listeners can easily learn two categories that are
separated in one dimension while varying in another dimension but have difficulty learning
two categories that are only separated in multidimensional space (Goudbeek, 2007;
Goudbeek, Cutler, & Smits, 2008; Goudbeek, Smits, Swingley, & Cutler, submitted). This
suggests that listeners use a linear model like the one outlined here and are not able to
represent multidimensional relationships. Chapter 5 investigates the multidimensional
structure of multiple cues produced by speakers in order to determine if speech contains
such special relationships.
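An extreme version of such a multidimensional relationship can be constructed with synthetic data, as in the sketch below: the two categories have identical marginal distributions on each cue, so neither cue alone is informative, yet opposite within-category correlations separate them well in the two-dimensional space.

    # Sketch of a relationship the linear model cannot represent: identical marginals,
    # different within-category correlations. Synthetic data for illustration.
    import numpy as np

    rng = np.random.default_rng(2)
    n = 1000
    cov_A = [[1.0,  0.9], [ 0.9, 1.0]]
    cov_B = [[1.0, -0.9], [-0.9, 1.0]]
    cat_A = rng.multivariate_normal([0, 0], cov_A, size=n)
    cat_B = rng.multivariate_normal([0, 0], cov_B, size=n)

    # Marginals are identical by construction, so either cue alone is uninformative ...
    print(cat_A[:, 0].std(), cat_B[:, 0].std())       # both near 1
    # ... but the sign of the product of the two cues separates the categories well.
    print((cat_A[:, 0] * cat_A[:, 1] > 0).mean())     # mostly True for category A
    print((cat_B[:, 0] * cat_B[:, 1] > 0).mean())     # mostly False for category B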

2.4 Chapter summary


In this chapter I have laid out a model of how acoustic-phonetic cues could be used by the
listener, the ideal listener model. In this model, multiple, independent, probabilistic cues are
used by the listener in proportion to their precision. The precision provided by a cue
depends on the distributions of that cue for the contrast being considered. In order to know
whether this model is appropriate we need to know if the structure of the acoustic-phonetic
cues in language fits the assumptions of the model and we must know whether listeners are
sensitive to the likelihood distributions in the way we predict. In Chapters 3 and 4 I present
data from two perceptual experiments testing the hypothesis that listeners are sensitive to
the precision in likelihood distributions when evaluating speech. Then in Chapter 5 I present
acoustic-phonetic data from two production experiments which test the validity of the
assumptions of the model. Chapter 5 also investigates the natural relationships between
cues, variability between speakers and the cumulative information available from multiple
cues in natural speech stimuli.


Chapter 3 Experimental manipulations of distributions


In Chapter 2 I described an ideal listener model of word recognition given probabilistic
speech cues. I argued that when deciding between two categories (words), the precision
offered by a particular cue is inversely proportional to the overlap between the likelihood
distributions of the categories (words). In this chapter I test the relationship between
precision and categorization certainty for the listener by directly manipulating the
distributions of VOT. If listeners are sensitive to the distributions of VOT as predicted by the
ideal listener model, then the amount of certainty they have in deciding between words like
beach and peach should be predicted by these distributions.
Fine-grained sensitivity to acoustic-phonetic cues is required for listeners to track the
distributions of the cues. Early models of speech perception treated within-category variance
as noise. Mechanisms such as categorical perception were thought to define ideal
boundaries along a continuum, with all exemplars within those boundaries treated as
identical category members (Liberman, 1996; Liberman et al., 1975). However, considerable
evidence has accumulated that listeners are sensitive to within-category differences
(Andruski et al., 1994; Blumstein et al., 2005; McMurray et al., 2002; Miller & Volaitis, 1989;
Pisoni & Tash, 1974). In addition both infants and adult listeners use distributional
information to find the number of categories along a continuum (Maye & Gerken, 2000;
Maye et al., 2008; Maye et al., 2002) and the optimal boundary between categories (Clarke
& Luce, 2005). These results are consistent with the ideal listener model. What has thus far
not been shown, however, is that listeners are sensitive to the entire probability distribution
of an acoustic-phonetic cue, and in particular the precision of that cue.
We tested this hypothesis by manipulating the probability distributions of tokens along a
VOT continuum in a category judgement task. The stimuli were tokens from two probability
distributions (shown in Figure 3.1A) centered around 0 ms and 50 ms (the prototypical
category means for beach and peach in American English). For one group of
participants, stimuli came from a pair of distributions with relatively wide variance (14 ms),
and for another group, stimuli came from a pair of distributions with relatively narrow
variance (8 ms). Importantly both pairs of distributions contain the same number of tokens
overall and the same category means. Participants categorized the stimuli by clicking on the
picture they thought was appropriate (e.g., a peach). Each trial was both a test and a
training trial; there were not separate training and testing phases. Using equation (2) from
Chapter 2 (repeated here),

(2)    P(wordA | stimX) = P(stimX | wordA) / [P(stimX | wordA) + P(stimX | wordB)]

we predicted the probability that listeners would choose the peach for each step along the
VOT continuum (Figure 3.1B).

Figure 3.1: [A] Probability distributions of tokens that listeners categorized in the narrow
condition (dark lines) and wide condition (light lines). [B] Optimal response curves calculated
from the probability distributions using Equation (2) for the narrow condition (dark lines) and
wide condition (light lines).
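The optimal response curves in Figure 3.1B follow directly from equation (2). The sketch below recomputes them, treating the 8 ms (narrow) and 14 ms (wide) spread values as standard deviations of Gaussian likelihoods with means at 0 ms and 50 ms; that parameterization is an assumption of the sketch.

    # Sketch of the optimal response curves in Figure 3.1B from equation (2).
    import numpy as np
    from scipy.stats import norm

    vots = np.arange(-30, 81, 10)

    def p_peach(vot, sd):
        like_p = norm.pdf(vot, 50, sd)
        like_b = norm.pdf(vot, 0, sd)
        return like_p / (like_p + like_b)

    for sd, label in ((8, "narrow"), (14, "wide")):
        curve = np.round(p_peach(vots, sd), 2)
        print(label, dict(zip(vots.tolist(), curve.tolist())))
    # The narrow condition yields a steeper predicted categorization function.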

In the categorization task described above (but without the distributional manipulation),
McMurray et al. (2002) found that looks to the competitor object (the beach when the
listener chose the peach) increased with increasing distance from the category boundary. If
the proportion of time listeners spend looking at each object reflects how strongly they are
considering that object as a potential referent, then according to our model, proportion of
looks should reflect the posterior probability of each object given a particular VOT value.
Figure 3.4A shows the posterior probability of each category (calculated from (2) as before) for
the less likely object (i.e., the competitor) for each VOT value given our distributions.
Posterior probability increases for VOT values closer to the category boundary, similar to the
increase in looks to the competitor object in McMurray et al. Importantly, our model makes
two new predictions. The first is that the increase in posterior probability is not linear, but
rather varies little around the category mean and then increases rapidly near the category
boundary. The second is that the posterior probability is a function of the uncertainty in the
distributions. For distributions with greater overlap (light lines), posterior probability
increases more quickly than for distributions with less overlap (dark lines). If the proportion
of looks to each object reflects the posterior probability, we expect to see different patterns
for listeners who are categorizing stimuli from distributions with different variances.

3.1 Experiment 1

3.1.1 Methods
Participants were 24 monolingual native English-speaking students from the University of
Rochester with no known hearing problems (12 each in the wide and narrow conditions).
Participants were tested individually in a quiet room. Sessions lasted approximately one
hour. Participants were given the opportunity to take breaks and were paid $7.50.
3.1.2 Materials
Auditory stimuli were synthesized using the Klattworks 1 interface to the 1988 Klatt
synthesizer (Klatt, 1980). VOT was manipulated in twelve 10 ms steps from -30 ms to 80
ms. Negative VOT values were created by adding voicing before the stop burst. Positive
VOT values were created by replacing successive frames of voicing after the stop burst with
aspiration. All other parameters were held constant across words and were modeled on
natural stimuli. Three continua were created with endpoints corresponding to beach-peach,
beak-peak and bees-peas. Each group of listeners heard 228 tokens, 76 from each of the 3
continua. The number of tokens at each step is shown in Table 3.1. Six filler items were also
synthesized: lake, rake, lace, race, lei and ray.

1 Available from Bob McMurray: bob-mcmurray@uiowa.edu
Table 3.1: Number of repetitions of each VOT value in the narrow and wide variance
conditions.

VOT (ms)   -30  -20  -10    0   10   20   30   40   50   60   70   80
Narrow       –    –   27   54   27    –    –   27   54   27    –    –
Wide         –   12   27   30   27   15   15   27   30   27   12    –

3.1.3 Procedure
Participants were seated in front of a computer screen at a comfortable viewing distance
and wore an SR EyeLink II head-mounted eye-tracker with a sampling rate of 250 Hz.
Auditory stimuli were presented over Sennheiser HD 570 headphones at a comfortable
listening level. The session began with 12 familiarization trials in which participants saw the
pictures and their corresponding written labels once each. No auditory stimuli were
presented during familiarization.
Each experimental trial began with a display containing four pictures, two test items and two
filler items, one in each quadrant (Figure 3.2). Visual stimuli are included in Appendix A.
One of the auditory stimuli was presented and participants chose the picture they thought
most appropriate by clicking on it with the mouse. Eye movements were monitored from the
onset of the auditory stimulus until participants made a response.


Figure 3.2: Example display screen containing the items beach, peach, lace and race.
Locations of items were randomized across trials. Actual displays were in color.
Each participant heard an equal number of test and filler items. For a particular display all
alternatives were equally likely. Trials were randomly ordered. Filler items were included to
provide some variety in the task and to make the design less obvious to the listeners.

3.2 Results

3.2.1 Categorization behaviour


Categorization functions were fit for each participant in the two cue-variability conditions
using a fitting algorithm designed for psychometric functions (Wichmann & Hill, 2001). 2
Participants were excluded and replaced if their fitted category boundaries were more than
15 ms different from the 25 ms boundary used in the distribution (two participants were
replaced in the wide condition). Figures 3.3 A and B show individual categorization functions

The function fit was

+ where corresponds to the boundary


f (x ) = (1 )
( (a x ) / b )
1+ e

(50% point), b to the slope (variance of the cumulative distribution). The last two variables, and ,
are the lapse rates (upper and lower asymptotes) and are included to model stimulus independent
errors (lapses) which are known to bias fits if not accounted for (Swanson & Birch 1992). These
parameters were constrained to be less than 5% which is thought to be the range of lapsing in
psychophysical paradigms (Wichmann & Hill, 2001).

42
for listeners in the narrow (mean RMSE = 0.07) and wide (mean RMSE = 0.05) conditions.
As predicted, categorization functions in the wide condition had shallower slopes 3 (mean =
6.2, sd = 0.89) than functions in the narrow condition (mean = 3.5, sd = 0.76). This
difference was significant (t(22) = -2.4, p = 0.02). The slopes of the functions in each
condition were compared to the optimal function given the distributions. Figure 3.3C shows
the optimal function given the narrow distributions (solid line) and the empirically obtained
function (dashed line) using the average slope of listeners in the narrow condition. Figure
3.3D shows the optimal function and empirically obtained function for the wide condition. As
predicted, listeners are less certain than the optimal observer given either of the
distributions.
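A minimal sketch of this kind of fit is given below. It uses a least-squares stand-in for the
maximum-likelihood procedure of Wichmann and Hill (2001), the response proportions are invented
for illustration, and the bounds on the lapse rates follow the 5% constraint described in footnote 2.

import numpy as np
from scipy.optimize import curve_fit

def psychometric(x, a, b, gam, lam):
    # Four-parameter logistic from footnote 2: a = boundary, b = slope,
    # gam and lam = lower and upper lapse rates.
    return gam + (1 - gam - lam) / (1 + np.exp((a - x) / b))

vot = np.array([-20., -10., 0., 10., 20., 30., 40., 50., 60., 70.])
prop_p = np.array([0.01, 0.02, 0.05, 0.10, 0.35, 0.70, 0.90, 0.97, 0.98, 0.99])  # invented

p0 = [25.0, 5.0, 0.01, 0.01]
bounds = ([-np.inf, 0.0, 0.0, 0.0], [np.inf, np.inf, 0.05, 0.05])
(a, b, gam, lam), _ = curve_fit(psychometric, vot, prop_p, p0=p0, bounds=bounds)
print(f"boundary = {a:.1f} ms, slope = {b:.1f}, lapse rates = {gam:.3f}, {lam:.3f}")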

Figure 3.3: Fitted response curves for individual participants in [A] narrow condition and [B]
wide condition. Optimal response curves (solid lines) and curves from average slope of
individuals (dashed lines) for participants in [C] narrow condition and [D] wide condition.

3. Slope in this case is beta from Wichmann & Hill (2001). This is not the same as the derivative at
the 50% point of the function; as the slope gets steeper, beta decreases while the derivative increases.

While the source of this additional uncertainty is unknown, and may differ from listener to
listener, it should not vary for the two conditions. We quantified the amount of additional
uncertainty using the observation of Feldman and Griffiths (2007) that, given the
categorization function

    p(CategoryA | stimX) = 1 / (1 + e^(g·stimX + b))                                (5)

the slope (g) is given by (6). The equation in (6) assumes that both categories have the
same variance (σ²_CategoryA,B) and that any additional uncertainty can be described as a
Gaussian distribution with zero mean and some variance (σ²_N).

    g = (μ_CategoryA - μ_CategoryB) / (σ²_CategoryA,B + σ²_N)                       (6)

Using (6), the σ²_N values for both groups (Narrow = 10.7, Wide = 10.8) were very similar,
suggesting that the same additional source of uncertainty affected responses in both groups
and was independent of the distributions themselves.
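Rearranging (6) gives the additional variance directly from the fitted slope, as in the short
sketch below. The category means of 0 and 50 ms are the design values; the within-category
variance and the fitted slope are placeholders for illustration, not the estimates reported above.

# sigma_N^2 = (mu_A - mu_B) / g  -  sigma_Category^2   (equation 6, rearranged)
mu_p, mu_b = 50.0, 0.0        # category means in ms (design values)
sigma_cat_sq = 8.0 ** 2       # within-category variance, placeholder
g = 0.35                      # slope of the fitted logistic, placeholder

sigma_n_sq = (mu_p - mu_b) / g - sigma_cat_sq
print(f"additional variance sigma_N^2 = {sigma_n_sq:.1f}")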

3.2.2 Eye movements


From the posterior probability functions in Figure 3.4A, we predicted that the largest
difference in looks to the competitor object (i.e., peach for short VOTs and beach for long
VOTs) between the two groups would be at 20 and 30 ms, a smaller difference at 10 and 40
ms, and no difference at other VOT values. Because there were so few trials at VOT values
of -20, 20, 30 and 70 ms (see Table 3.1), we could not analyze eye-movements for these
values. Figure 3.4B shows the proportion of looks for the remaining VOT values. A repeated
measures analysis of variance (ANOVA) was performed separately for the b (-10, 0, 10)
and p (40, 50, 60) sides of the continuum. On the b side there was a significant effect of
VOT (F(2,44) = 9.09, p <.0001), a significant effect of condition (F(1,22) = 5.2, p<.05), and
no interaction (F(2,44) = 1.08, p = 0.42). On the p side there was a significant effect of
VOT (F(2,44) = 13.4, p <.0001), no effect of condition (F(1,22) = 3.5, p =.07), but a
significant interaction (F(2,44) = 4.60, p < 0.05). As predicted, the largest effects were at 10
ms and 40 ms. Planned t-tests showed that the effect of group was significant both for 10
ms (t(22) = 2.10, p<.05) and for 40 ms (t(22) = -2.22, p<.05), but not for any other VOT

values. The size of the effect was slightly larger on the p side of the continuum. Natural
VOT values are more variable for the p than for the b category (Lisker & Abramson,
1964), and this pattern was confirmed for the words from the present study in Chapter 5.
This asymmetry may have made listeners more sensitive to our manipulation for the p
category.

Figure 3.4: Relationship between posterior probability and looks to the competitor object for
each VOT. [A] Posterior probabilities of the competitor words calculated using Equation 1 for
the narrow (dark lines) and wide (light lines) distributions. [B] Proportion of looks to the
competitor object for the narrow group (shaded bars) and wide group (open bars) for all
VOT values with sufficient trials to analyze. Error bars indicate SEM. * p <.05.
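The kind of posterior curves plotted in panel A can be sketched as follows, assuming equal priors
and Gaussian likelihoods with means at 0 and 50 ms. The standard deviations used here (8 ms for
the narrow and 16 ms for the wide distributions) are illustrative placeholders, not the
experiment's exact values.

import numpy as np
from scipy.stats import norm

def competitor_posterior(vot, sd, mu_b=0.0, mu_p=50.0):
    # Posterior probability of the less likely (competitor) category,
    # assuming equal priors and Gaussian likelihoods for the two categories.
    like_b = norm.pdf(vot, mu_b, sd)
    like_p = norm.pdf(vot, mu_p, sd)
    post_b = like_b / (like_b + like_p)
    return np.minimum(post_b, 1.0 - post_b)

vots = np.array([-10.0, 0.0, 10.0, 40.0, 50.0, 60.0])
print("narrow:", competitor_posterior(vots, sd=8.0).round(3))
print("wide:  ", competitor_posterior(vots, sd=16.0).round(3))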

3.2.3 Reaction time


Previous studies measuring reaction time (RT) to stimuli varying in VOT found that RT
increased as stimuli became more ambiguous (Pisoni & Tash, 1974). This suggests that RT
may also reflect the posterior probability of a stimulus in the same way as looks to the

competitor object. The prediction is that RT will increase for VOT values close to the
category boundary, and will increase more for listeners categorizing stimuli from the wide
distributions than for listeners categorizing stimuli from the narrow distributions (Figure 3.4).

Figure 3.5: Reaction time by VOT for listeners categorizing stimuli from narrow distributions
(dark lines) and wide distributions (light lines).
Average reaction time for each VOT step is shown in Figure 3.5. Reaction time increased as
VOT approached the category boundary. Reaction time was also higher overall for listeners
in the wide group than for listeners in the narrow group. To test whether listeners were
slower to react to more ambiguous stimuli, a mixed effects linear regression model was
used. As with eye-movements, each side of the continuum was analyzed separately.
Reaction times were log transformed (to better approximate a normal distribution) and trials were
excluded if RT was more than 2.5 standard deviations from the mean. Predictor variables
were VOT, group (wide or narrow) and their interaction. Predictor variables were centered to
remove correlations between the main effects and their interaction. Subject and item were
included as random effects.
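A sketch of this type of analysis in Python with statsmodels is given below, fit to synthetic
trials. The actual analysis treated subject and item as crossed random effects (as in lme4); that
is only approximated here by a subject grouping plus an item variance component, and all variable
names and generated values are illustrative.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "subject": rng.integers(0, 24, n),
    "item": rng.integers(0, 3, n),
    "vot": rng.choice([40.0, 50.0, 60.0], n),
})
df["wide"] = (df["subject"] % 2).astype(float)   # 1 = wide group, 0 = narrow group
df["log_rt"] = 7.4 - 0.002 * df["vot"] + 0.1 * df["wide"] + rng.normal(0, 0.15, n)

# exclude trials more than 2.5 SD from the mean, as described above
m, s = df["log_rt"].mean(), df["log_rt"].std()
df = df[(df["log_rt"] - m).abs() <= 2.5 * s].copy()

# centre the predictors so the main effects are not correlated with the interaction
df["vot_c"] = df["vot"] - df["vot"].mean()
df["wide_c"] = df["wide"] - df["wide"].mean()

# random intercept for subject, plus an item variance component
model = smf.mixedlm("log_rt ~ vot_c * wide_c", df, groups="subject",
                    vc_formula={"item": "0 + C(item)"})
print(model.fit().summary())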
On the p side of the continuum there was a significant negative effect of VOT (β = -0.0018,
SE = .0004): RT decreased as VOT increased. There was also a significant effect of group
(β = -0.14, SE = .046): RT was higher for listeners in the wide group than in the narrow
group. The interaction was also significant (β = 0.0023, SE = .0004). For listeners in the
narrow group, each 10 ms increase in VOT reduced RT by 50 ms on average. For listeners in
the wide group each 10 ms increase in VOT reduced RT by 10 ms on average.


On the b side of the continuum there was a significant positive effect of VOT (β = 0.0033,
SE = .0004): RT increased as VOT increased. There was also a significant effect of group (β
= -0.15, SE = .049): RT was higher for listeners in the wide group than in the narrow group.
The interaction was not significant (β = 0.0001, SE = .0008). For listeners in the narrow
group, each 10 ms increase in VOT increased RT by 58 ms on average. For listeners in the
wide group each 10 ms increase in VOT increased RT by 51 ms on average.
Reaction times mirror the results found for eye-movements. Reaction times increase closer
to the category boundary and this increase is greater for listeners in the wide group than in
the narrow group. The same asymmetry in the interaction was found as well. The difference
between the wide and narrow groups was smaller for the b side of the continuum than for
the p side of the continuum. This suggests that reaction time roughly reflects the posterior
probability of the two choices. As the difference in posterior probability between the two
candidates decreases, the choice becomes more difficult and reaction time increases.
Importantly, reaction time does not simply reflect distance from the category boundary; it
reflects something more like category prototypicality.

3.3 Chapter summary


In this chapter, I tested one claim of the ideal listener model: that listeners use the
probability distribution of an acoustic-phonetic cue (VOT) to estimate the probability that a
token was an example of a particular category (e.g., peach). The results provide two kinds
of evidence in support of this model. First, the average categorization slopes for the two
conditions were well predicted by the distributions of the cues, given some additional source
of uncertainty constant across both conditions. Second, participants' uncertainty about their
decision (as indexed by looks to the competitor object and reaction time) also followed the
pattern predicted by the distribution of cues.
In this chapter I have described the categories listeners were identifying as lexical items.
However, as discussed earlier, the data are also consistent with feature level categories
(e.g., voiced and voiceless), phoneme level categories (e.g., /b/ and /p/) or syllable level
categories (e.g., /bi/ and /pi/). In this study it is not possible to know at which level listeners
tracked these distributions or to what degree they would generalize to other examples (e.g.,

other voiced stops or words containing /bi/ and /pi/). It remains an important theoretical and
empirical question to determine over which categories listeners calculate distributions.
In summary, the close relationship between posterior probability and response choice,
probability of looks to an object, and reaction time suggests that these three measures
reflect listeners' estimates of posterior probability. Furthermore, it suggests that listeners are
acting in a manner consistent with the probability distributions they have heard when using
acoustic information to recognize words.


Chapter 4 Control experiments


In the previous chapter I claimed that listeners use the probability distributions of acoustic-phonetic cues to make decisions about which word they are hearing. The main evidence for
this claim is that when the distributions are more overlapping, listeners show more
uncertainty as indexed by shallower categorization functions, more looks to the competitor
object, and increased reaction times.
As discussed in Chapter 2 (and illustrated in Figure 2.3), two parameters determine the
amount of overlap between two distributions: the variances of the distributions and the
separation between their means. In the previous experiment only the variances were
manipulated while the means were kept constant between the two sets of distributions.
Moving the means closer together will also increase the overlap between the distributions
and should increase the amount of uncertainty for the listener. In this chapter I test this
prediction by exposing a new set of listeners to the narrow distributions used previously, but
with the means shifted closer together.
A second reason to test this prediction is to rule out an alternative explanation for the results
in the first experiment. The increased uncertainty of listeners in the wide condition could be
the result of the increase in overall variability in the cue rather than the increase in overlap
or the variability in the individual distributions. For example, listeners in the wide condition
heard more repetitions of extreme tokens (-20 and 70 ms) and a few repetitions of extreme
tokens that listeners in the narrow condition did not hear (-30 and 80 ms). Thus listeners
may have been responding to this global increase in variability rather than the variability in
the individual distributions. In this new condition, global variability has been reduced (there
are now fewer repetitions of extreme tokens and no repetitions of the most extreme tokens),
but the prediction is still that uncertainty should increase over the narrow condition because
of the increase in overlap.
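To make the reasoning concrete: with equal-variance Gaussian categories, the shared area under the
two likelihoods grows either when the standard deviation grows or when the means move together.
The standard deviations in the short sketch below are placeholders chosen only to illustrate the
expected ordering of the three conditions, not the experiment's values.

import numpy as np
from scipy.stats import norm

def overlap(mu1, mu2, sd):
    # Shared area of two equal-variance Gaussians: 2 * Phi(-|mu1 - mu2| / (2 * sd)).
    return 2 * norm.cdf(-abs(mu1 - mu2) / (2 * sd))

print("narrow (means 0, 50):  ", round(overlap(0, 50, 8.0), 3))
print("wide   (means 0, 50):  ", round(overlap(0, 50, 16.0), 3))
print("close  (means 10, 40): ", round(overlap(10, 40, 8.0), 3))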

4.1 Experiment 2
4.1.1 Methods
Participants were 12 new monolingual native English-speaking students from the University
of Rochester with no known hearing problems. Participants were tested individually in a
quiet room as before. Sessions lasted approximately one hour. Participants were given the
opportunity to take breaks and were paid $7.50.
Stimuli and procedure were identical to the previous experiment, except that the means of the
two distributions were each shifted one step (10 ms) closer together. The distribution of trials is
listed in Table 4.1 (Close condition).
Table 4.1: Number of repetitions of each VOT value in the Narrow and control conditions.
------------------------------------------------------------------------------------------------
VOT (ms)    -30   -20   -10     0    10    20    30    40    50    60    70    80
------------------------------------------------------------------------------------------------
Narrow              3    27    54    27     3     3    27    54    27     3
Close                     3    27    54    30    30    54    27     3
Flat         21    21    21    21    21    21    21    21    21    21    21    21
------------------------------------------------------------------------------------------------

4.2 Results

4.2.1 Categorization functions


As before, categorization functions were fit for each individual subject. Participants were
excluded and replaced if their categorization boundary was more than 10 ms away from the
expected boundary (25 ms). One participant was excluded. Figure 4.1 A shows
individual categorization functions for each participant. The categorization function with the
average slope is shown in Figure 4.1 B.


Figure 4.1: Categorization functions [A] individual participants [B] average fitted function
(solid line) and predicted function (dashed line).
As before, the average categorization function was shallower (mean = 3.8, SD = 1.8) than
the predicted categorization function (predicted = 2.2). Again, this is to be expected if we
assume there is some other source of uncertainty in listeners' categorization behaviour. The
amount of additional uncertainty was calculated as before. This time the variance accounted
for by this additional uncertainty (σ²_N) was estimated to be less than in the previous
experiment (Narrow Close = 7.1, Narrow = 10.7, Wide = 10.8).


There are (at least) two interpretations of this difference. The first is that this is the correct
estimate of the additional uncertainty and it was lower for listeners in this experiment than in
the previous experiments. This could be due to minor changes in experimental setup. At the
beginning of each trial a drift correction on the eye-tracker was performed. In the previous
experimental setup this process was not very accurate, requiring many re-corrections by
participants and possibly contributing some distraction. In the second experiment (reported
here) the drift correction process was made more accurate, potentially relieving this source
of distraction and or frustration.
The second explanation is that the additional uncertainty was the same for these
participants as for the participants in the previous experiment, but the uncertainty arising
from the distributions themselves is less than predicted. This would be consistent with the
hypothesis that listeners are responding to overall variability in the cue (range) and not to
the distributions of individual categories. I will return to these explanations in the discussion.


4.2.2 Eye-movements
Eye-movements were analysed as before. Looks to the less likely (competitor) object were
compared to the predicted posterior probability of the competitor object computed as before
(Figure 4.2). The pattern of looks is similar to that observed in the previous experiment.
Listeners looked more at the less likely object when its posterior probability was higher.

Figure 4.2: Looks to the less likely object [A] posterior probability of the less likely object [B]
proportion of looks.
There were more looks to the competitor object overall than in the previous experiment
(Figure 4.3). This may also have been due to the change in experimental setup. The
improved drift correction procedure may have encouraged participants to make more eye movements during the trial rather than using a strategy of maintaining central fixation for the
next drift correction. Such a strategy was informally observed for participants who had
particular difficulty with the drift correction in the previous experiment. Alternatively, the
increased looks to the competitor object may reflect increased overall uncertainty. Note that
this is the opposite pattern to that of the categorization functions.


Figure 4.3: Comparison of proportion of looks to the less likely object in all three conditions
in the two experiments.

4.2.3 Reaction time


Reaction time was also analyzed as in the last experiment. Figure 4.4 shows the average
RT for each VOT step in comparison to the previous experiment. As before, RT increased
for more ambiguous VOT steps. Reaction time was faster overall than in the previous
experiment. This is again consistent with either less uncertainty than predicted or less
distraction from the drift correction procedure.


Figure 4.4: Average reaction time for each condition in the two experiments. Error bars are
SE.

4.3 Discussion
This experiment was designed to test whether listeners' uncertainty is the result of the
amount of overlap between the distributions of acoustic cues (as hypothesized) or is a
function of the overall variability in the cue, or simply the within-category variability of each
distribution.
The distributions used had the same variance as the Narrow distributions used before but
their means were at 10 and 40 ms instead of 0 and 50 ms. This produced less overall
variability but more overlap. Therefore if listeners are responding to the overall variability
they should be less uncertain in this condition than in the Narrow condition tested before. If,
on the other hand, listeners are responding to the amount of overlap between the
distributions, they should be more uncertain than they were in the Narrow condition.
Results were mixed. Listeners' categorization functions were narrower (steeper) than predicted,
possibly reflecting less uncertainty than in the Narrow condition. Reaction times were also
faster, again possibly reflecting less uncertainty than predicted by the overlap of the
distributions. Listeners looked more at the competitor object than in the previous study,
however, suggesting more uncertainty than predicted by the overlap.

One possible reason for this conflict is the minor change in the drift correction procedure.
Because the drift correction was more accurate, listeners were less likely to have to click on
the fixation point two or more times. This may have allowed them to attend better to the
task, thus reducing the amount of uncertainty in their responses and decreasing their
reaction time. They may also have been less likely to adopt a strategy of maintaining central
fixation, thus increasing looks to the competitor object.
In order to test this hypothesis, an additional four participants categorized stimuli from a flat
distribution (21 trials from each step) using the new drift correction procedure. These
listeners heard stimuli from the widest range (-30 to 80 ms). Thus if uncertainty is a function
of the range (inclusion of extreme tokens) of stimuli they should be more uncertain than in
the Narrow Close condition tested above. Figure 4.5 shows the reaction time and proportion
of looks to the competitor for these participants.

Figure 4.5: Flat condition [A] reaction time, [B] proportion of looks to the competitor object.
Reaction times were similar to those in the narrow close condition (Flat: mean = 1564 ms,
SD = 105 ms; Narrow close: mean = 1561 ms, SD = 102 ms), as were the average proportions of
fixations (Flat: mean = 0.16, SD = 0.05; Narrow close: mean = 0.18, SD = 0.05). These two
results suggest that the decreased uncertainty in the narrow close condition compared to
the previous results arose because of the difference in the testing procedure and not
because of the decrease in overall variability in the Narrow close condition.

4.4 Chapter summary
In the previous chapter I showed that listeners' uncertainty increased when categorizing
distributions that were more variable. In this chapter I argue that uncertainty also increases
when distributions are more overlapping. In order to test this claim, listeners categorized
distributions which had the same variance as in the Narrow condition of the previous
experiment but whose means were shifted closer together. If uncertainty depends on the
amount of overlap between the distributions, listeners should be more uncertain than in the
previous Narrow condition. Three measures of uncertainty were examined: categorization
slope, proportion of looks, and reaction time. Both categorization slope and reaction time
suggested less uncertainty than in the Narrow condition. Proportion of looks, however,
suggested more uncertainty than in the Narrow condition. One explanation for this
inconsistent result is that the change in the experimental setup shifted listeners' behaviour in
the experiment. This hypothesis is supported by the pattern of reaction times and looks to
the competitor found for a new set of subjects who categorized an equal number of trials for
each VOT step. The different behavioural pattern found here makes direct comparison with
the previous experiments difficult. Replication of the Narrow condition with the new
experimental setup is necessary before conclusions can be drawn.


Chapter 5 Acoustic-phonetic analysis


In Chapter 2 I laid out an ideal observer model of speech perception using
multiple, probabilistic cues. In this chapter I present acoustic-phonetic data from two
production studies designed to test some of the assumptions and predictions of this model.

5.1 Importance of production data


If we are to understand the role of distributions in how listeners use acoustic-phonetic
information, then it is important to look at how those cues are in fact distributed in the real
world. There are two major assumptions of the linear model outlined in Chapter 2. One is
that cues are normally distributed. The second, and perhaps more important is that they are
independent. Is our assumption of normally distributed variability warranted? Are cues
independent? Furthermore, if the distributions of cues have implications for perception, then
they will have implications for phonological theories as well.
In general, phonetic studies do not report the entire distribution of the cue that is measured
(see Kessinger & Blumstein, 1997; Lisker & Abramson, 1964; Newman et al., 2001; van
Alphen & Smits, 2004 for notable exceptions). What is typically measured is the mean for
each cue for each speaker. The average mean and standard deviation of speakers' means
are then reported. This kind of analysis gives us some idea of individual variability of the
means of cues, but it does not give us any idea about the variances of the distributions as a
whole or for individual speakers. As we have shown, the variances as well as the means are
crucial for understanding the contribution a particular cue may make to identifying
categories. For this reason we conducted a phonetic study of labial stop voicing in English
and report here the distributions of several acoustic-phonetic cues in two syllable positions,
word-initially and word-medially before an unstressed syllable.

5.2 Assumptions of the linear model

5.2.1 Normal distributions


I argue that listeners track the likelihood distributions of acoustic-phonetic cues and that
these distributions are approximated well by normal or Gaussian distributions. This
assumption need only be approximately true and can easily be tested by measuring the
distributions of cues. If this is true, then our assumption of Gaussian likelihood distributions
is justified. There is some evidence from the phonetic literature that this is indeed the case.
The seminal study by Lisker and Abramson (1964) measured VOT values for word-initial
stop consonants in several languages. In reporting their results they provide not only the
mean and standard deviation of cues for the subjects but also histograms of their data.
These histograms appear to be roughly normally distributed for each category in each
language. More recently, Kessinger and Blumstein (1997) replicated these results for three
languages: English, Thai and French, in three different speaking styles: in isolation, and in a
carrier phrase at slow and fast speaking rates. Following Lisker and Abramson, they report
histograms for VOT in each case. They found that for each language and each speaking
style tokens are roughly normally distributed (Kessinger & Blumstein, 1997). Newman,
Clouse and Burnham (2001) also conducted a phonetic study of the spectral properties of
word-initial alveolar and palato-alveolar fricatives (/s/ and /sh/) in English. They report not
only histograms for the group data but also for several individuals. Again, each of these
distributions appears to be roughly normally distributed. Thus there is evidence that both
spectral and temporal cues conform to Gaussian distributions in a number of languages.
Here we replicate these findings for word-initial stop voicing in English and extend them to
word-medial stop voicing. We also extend these finding to include multiple cues. One study
which reports the distributions (histograms) of multiple cues is a recent report of Dutch wordinitial stops (van Alphen & Smits, 2004). Here the researchers measured the distributions of
prevoicing, burst amplitude, burst duration, F0 onset frequency and amount of F0 rise or fall
into the vowel. Of these cues only burst duration had a skewed distribution, again supporting
a Gaussian assumption for acoustic-phonetic cue distributions.

5.2.2 Cue independence
More importantly perhaps than the exact distributions of the cues is their relationship to each
other. Our model assumes that each cue is independent. That is, for a particular token,
knowing the value of one cue does not help you predict the value of a second cue. Because
the linear model outlined here uses only the distributions of individual cues and does not
represent the relationship between cues, it cannot capture any important relationships
between cues. Therefore, if there are any important relationships between cues, the linear
model will be insufficient.
Of course, any cue that has different means for two categories will be correlated with any
other cue that has different means for the same categories. These differences in means can
be captured by the linear model. Of interest here is therefore not this overall correlation but
whether for a particular category these cues are correlated. This is the kind of relationship
that cannot be captured by the linear model. To understand the consequences of such a
relationship we will again consider a hypothetical situation.
Figure 5.1 shows three examples of hypothetical relationships between two cues. In each
panel the x-axis represents one cue (acoustic-phonetic dimension) and the y-axis
represents another cue. Hypothetical category distributions for two categories are
represented in each panel by the ellipses (one standard deviation of a normal distribution).
The distributions of each cue in one dimension are shown along the side of the axes. In
each panel the one-dimensional distributions are held constant and what varies between
panels is the correlations between the cues. In panel A, the cues are correlated overall, but
for each individual category they are not. Thus there is no information available in the two-dimensional plots of these cues that is not available in the two one-dimensional plots.


Figure 5.1: Three hypothetical relationships between two cues (X and Y) for two categories
(dark and light lines). Ellipses represent one standard deviation around the mean of two-dimensional Gaussian distributions. One-dimensional Gaussian distributions of the same
categories are represented above (cue X) and to the right (cue Y) of the panels.

In panel B, the cues are again correlated overall but are also correlated for a particular
category. In this example the two categories are more separated in two-dimensional space
(have tighter distributions) than the categories in panel A. Thus, the relationship between the
two cues could allow the listener to better distinguish between the categories than would be
predicted from their one-dimensional distributions alone.
In panel C, the cues are correlated within-category as they were in panel B, but now the
correlation is in the same direction as the overall correlation. Values of cue Y can be
partially predicted from values of cue X, making the cues partially redundant. There is also
more overlap between the categories than in either of the other cases. In this example we
might expect listeners to have more difficulty distinguishing between the two categories than
predicted by the one-dimensional distributions.
In all three cases illustrated in Figure 5.1 the one-dimensional distributions are the same
despite the different relationships between the cues. Thus a model that only represents the
one-dimensional distributions will make the same predictions for all three cases, while a
model that represents the relationships between them will make different predictions.
Interestingly, listeners learning to categorize speech and non-speech sounds seem to have
difficulty when the categories are highly overlapping in each individual dimension but highly
separated in the two dimensions, an extreme example of Panel B (Goudbeek, 2007;
Goudbeek et al., 2008), suggesting that listeners do not represent the multi-dimensional
relationships well.
A linear model would be appropriate if speakers produce only distributions such as the ones
in Panel A or C above. It would be inappropriate if speakers produce distributions such as
the ones in Panel B. In order to determine the appropriateness of these two classes of
models, the nature of speakers' productions must be considered.
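The distinction can be made concrete with a small simulation (all parameters hypothetical): the
same one-dimensional distributions can arise whether the two cues are uncorrelated within each
category (panel A), correlated within each category in the opposite direction to the overall
correlation (panel B), or correlated in the same direction as the overall correlation (panel C).

import numpy as np

rng = np.random.default_rng(0)
n = 2000
means = {"b": np.array([0.0, 0.0]), "p": np.array([2.0, 2.0])}

def sample(within_corr):
    # identical marginal (one-dimensional) distributions; only the
    # within-category correlation between cue X and cue Y changes
    cov = np.array([[1.0, within_corr], [within_corr, 1.0]])
    return {c: rng.multivariate_normal(means[c], cov, n) for c in means}

for rho, label in [(0.0, "panel A"), (-0.7, "panel B"), (0.7, "panel C")]:
    data = sample(rho)
    pooled = np.vstack(list(data.values()))
    overall = np.corrcoef(pooled.T)[0, 1]
    within = np.mean([np.corrcoef(data[c].T)[0, 1] for c in data])
    print(f"{label}: overall r = {overall:.2f}, within-category r = {within:.2f}")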

5.2.3 Speaker control and enhancement


The potential relationships between cues illustrated in Figure 5.1 have implications for
theories of speech production and speech perception. As noted by Kingston and Diehl
(1994), some patterns of cues may occur for physiological reasons, but others may be under
speaker control. We might expect the pattern seen in Figure 5.1C in cases where the
relationship is the result of physiological constraints. For example, in order to maintain
sufficient sub-glottal pressure to produce voicing during an oral closure, the speaker may
lower the larynx (Bell-Berti, 1975). This lowering has the consequence of decreasing the
tension on the vocal folds, thus lowering F0 in the adjacent vowel (House & Fairbanks,
1953b). Because of this physiological consequence of larynx lowering, the amount of voicing
during closure may directly affect the F0 offset frequency, producing a correlation across
and within categories.
Alternatively, if the cues are both under articulatory control, speakers might produce
patterns such as in Figure 5.1B in order to make a contrast more distinct. Such behavior
might be considered a kind of enhancement. When the speaker produces an ambiguous
value of one cue, they would tend to produce a less ambiguous value of the other cue. Note
that just producing two different distributions of any cue provides information, so simply
producing the second cue would help disambiguate the first cue. Producing the extreme
values of the second cue provides additional information: thus, I am referring to it as a kind
of enhancement.
The term enhancement has been used by several other researchers in slightly different
ways. Stevens and Keyser (Keyser & Stevens, 2006; Stevens & Keyser, 1989) use
enhancement to refer to any gesture which is recruited by the speaker to improve the

distinction for the listener. This may include simply producing another cue, or it may mean
the kind of correlated relationship in Figure 5.1B. Kingston and Diehl argue that cues may
enhance each other through auditory processes (Kingston & Diehl, 1994, 1995; Kingston,
Diehl, Kirk, & Castleman, 2008). For example, lowered F0 and voicing during closure may
both contribute to the percept of low frequency vibration. These processes are argued to
be part of the perceptual system and do not depend on the structure of the cues in the
linguistic environment. Thus Kingston and Diehl would predict that even in the case with no
category internal correlation (Figure 5.1A), listeners should be better able to distinguish
between the categories than predicted by the one-dimensional distributions alone.

5.2.4 Individual speaker differences


There is much documentation of individual differences in speakers' production strategies
(e.g., Kühnert & Hoole, 2004). Sometimes these different strategies can produce the same
acoustic result. There are also many examples of individual differences in acoustic output.
Newman, Clouse and Burnham (2001) measured spectral center of gravity (COG) for the
fricatives /s/ and /sh/ in English and report histograms for individual speakers. They found
that some speakers produced distributions with more variability than others. Some speakers
also produced distributions with more overlap than others. Allen, Miller and DeSteno (2003)
conducted a study of individual talker differences in VOT in word-initial voiceless stops.
Speakers differed in their distributions of VOT even when differences in speaking rate were
accounted for. We therefore expect to find differences in the distributions of cues for
individual speakers. Furthermore we expect to find individual differences in the amount of
overlap between the distributions of the cues.
None of the studies investigating individual differences has investigated multiple cues. One
interesting question is whether speakers differ in their strategies of cue production.
Speakers who produce one cue less consistently (with relatively more overlap between the
categories) may produce another cue more consistently. This would be a kind of
compensatory enhancement at the individual level. Conversely, speakers may simply vary in
how consistent they are overall. If the model outlined here is correct, this should have
implications for how easily these speakers are understood. Newman, Clouse and Burnham
(2001) found that listeners were slower to categorize speech tokens from speakers who
produced the spectral COG cue with more overlap than tokens from speakers who produced

the cue with less overlap. This suggests that listeners are in fact sensitive to the degree of
overlap between the categories. What is unclear is whether listeners were sensitive to the
degree of overlap for individual speakers or whether they used a more general strategy and
some speakers simply produced more tokens in the ambiguous region.

5.2.5 Cue weighting


In Chapter 2 I argued from signal detection theory that the degree of overlap between the
distributions of two categories, quantified by D-prime, determines how useful a cue is.
Examining the D-prime values of multiple cues should therefore predict how strongly those
cues are used by the listeners. In this way, a prediction is made about the size of the trading
relation produced by each cue, not just their relative ranking.

5.3 Experiment 3
There are a number of questions to be addressed in the following production experiments.
The most important is the structure of the multidimensional acoustic-phonetic cues available
to the listener. If the cues are not correlated within categories, then the linear model
described here may be appropriate. If there are correlations within categories, this may
indicate either an enhancement strategy on the part of the speaker or an articulatory
constraint, depending on the nature of the correlation. Individual differences in cue
production are also examined.
In order to address these questions, two production experiments were conducted. The first
experiment investigated productions of voiced and voiceless word-initial labial stops (e.g.
beach and peach). The second experiment investigated productions of voiced and
voiceless word-medial labial stops (e.g. staple and stable).

5.3.1 Methods
Data were collected for the word-initial and word-medial voicing cases separately. Nine
speakers (7 Females) produced the word-initial items and eight speakers (5 Females)
produced the word-medial items. Three of the speakers participated in both production

tasks. All were native monolingual English speakers from a variety of North American dialect
regions (age 23-30). Speakers read the words from index cards, repeating each word three
times in succession. No carrier phrase was used. Word order was randomized for each
subject. Speakers were recorded in a sound attenuated booth using a Marantz portable
digital recorder (PMD 670) and a Technica lapel microphone worn approximately six inches
from the mouth. Sound files were digitized at a sampling rate of 32,000 Hz.

5.3.2 Word Lists


Table 5.1 contains the word lists for the two experiments. Words were originally chosen to
be part of a perceptual experiment. Words in the word-initial experiment were all chosen to
have the same vowel quality and to be picturable. Words in the word-medial experiment
varied in vowel quality and were chosen because they were minimal pairs and mostly
picturable.
Table 5.1: Words used in the production experiments.
------------------------------------------------------------------------------------------------
Word-initial                     Word-medial
------------------------------------------------------------------------------------------------
beach     peach                  Mabel      maple       dabble      dapple
bees      peas                   stable     staple      rabbits     rapids
beak      peak                   nibble     nipple      rabid       rapid
------------------------------------------------------------------------------------------------

5.3.3 Analysis
All acoustic analyses were performed using the Praat software package (Boersma &
Weenink, 2007). Temporal intervals (e.g. length of the vowel) were marked by hand by the
author according to characteristic markers in the waveforms and checked for consistency.
Temporal measurements
In both word-initial and word-medial stops, VOT was defined as the time from the beginning
of the stop burst to the first periodic pitch cycle. The start of the pitch cycle was defined as
the point where the waveform first crosses zero in the positive direction (see Figure 5.2A).

Word-initially, some speakers produced a period of voicing before the release burst (Figure
5.2B). Because this was always separated from the following release burst by a brief silence
and thus was not contiguous with the voicing in the vowel, this pre-voicing was not
considered part of VOT proper. The same was true of voicing during the closure period
word-medially. For some tokens voicing was continuous throughout the closure period but
for others it continued on from the previous vowel and stopped before the release burst
(Figure 5.2C). For this reason, voicing during closure was not treated as VOT proper and
VOT was again measured as the time between the release burst and the onset of voicing in
the following vowel.

Figure 5.2: Waveforms showing criteria for acoustic measurement [A] VOT [B] word-initial
VOT showing pre-voicing [C] Closure [D] Vowel offset before a stop.
The onset of the vowel after word-initial stops was defined as the same point as the end of
VOT (Figure 5.2A). The offset of the vowel was defined differently depending on the
following consonant. Before stops and affricates the end of the vowel was the last pitch
cycle before a significant drop in amplitude (Figure 5.2D). Before fricatives it was the last
pitch cycle before significant frication noise. Closure was the period between the end of the
previous vowel and the release burst (Figure 5.2C).
Figure 5.3 shows a portion of the word stable and the segmentation of the vowel, closure
and VOT. The second syllable of the bisyllabic words tended to be produced with a reduced
vowel or sometimes a syllabic sonorant.


Figure 5.3: Waveforms and spectrogram of a portion of stable showing the segmentation
of vowel, closure and VOT.
Spectral measurements
Along with the temporal measurements, several spectral analyses were also conducted. The
burst amplitude (BA) of each release burst was measured for the first 10 ms of VOT. All
energy below 500 Hz was first filtered out to eliminate any energy from vocal fold vibration
following van Alphen and Smits (2004). The amount of voicing during closure (voice
amplitude: VA) was measured by taking the mean intensity of the closure interval. Other
measures such as number of pitch cycles or proportion of the interval that was voiced were
considered. Mean intensity of the closure interval (in dB) was chosen because it was not a
proportion and therefore not bounded by 1 and 0; making it more like the other measures.
F0 onset was measured for the following vowel of word-initial and word-medial tokens. F0
offset was measured for the preceding vowel of word-medial tokens. F0 onset and offset
were the mean F0 in the 10 ms at the beginning or end of the vowel respectively.
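A sketch of the two intensity measures is given below in Python rather than Praat, assuming a
mono waveform x sampled at 32,000 Hz and hand-labelled interval times in seconds. The filter
order and the arbitrary dB reference are illustrative choices, not taken from the original analysis.

import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 32000

def mean_db(segment):
    # mean intensity in dB relative to an arbitrary reference amplitude of 1.0
    rms = np.sqrt(np.mean(segment ** 2) + 1e-12)
    return 20 * np.log10(rms)

def burst_amplitude(x, burst_onset):
    # energy below 500 Hz removed, then mean intensity of the first 10 ms of VOT
    sos = butter(4, 500, btype="highpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, x)
    seg = filtered[int(burst_onset * fs): int((burst_onset + 0.010) * fs)]
    return mean_db(seg)

def voicing_amplitude(x, closure_start, closure_end):
    # mean intensity (dB) over the closure interval
    return mean_db(x[int(closure_start * fs): int(closure_end * fs)])

# toy demonstration on one second of noise with made-up interval times
x = np.random.default_rng(1).normal(0.0, 0.01, fs)
print(round(burst_amplitude(x, 0.200), 1), round(voicing_amplitude(x, 0.100, 0.180), 1))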

5.4 Results
The results of the word-initial and word-medial experiments are presented together. One-dimensional distributions, D-prime, two-dimensional distributions, correlations between cues,
and individual differences in D-prime were all measured.

5.4.1 Normal distributions and overlap


The one-dimensional distributions of cues were first examined to determine if the cues are
normally distributed and to compare the overlap of distributions for each cue.
Word-initial position
Figure 5.4 shows the histograms of each cue measured in word-initial position for each
category. For VOT the two categories are non-overlapping and the p category has longer
VOT values and a wider distribution than the b category (Figure 5.4A). This is consistent
with previous research such as Lisker and Abramson (1964). All other cues have more
overlap between the categories, confirming Lisker and Abramson's claim that VOT is the
dominant cue to voicing in word-initial position.


Figure 5.4: Word-initial productions. Histograms of the distribution of each cue for each
category for all speakers. Black lines are b words, grey lines are p words. [A] VOT. [B]
Vowel duration, solid lines voiced offsets, dashed lines voiceless offset. [C] Burst Amplitude.
[D] F0 Onset, solid lines female speakers, dashed lines male speakers.
For vowel duration it was necessary to divide the samples according to the voicing of the
final consonant. The words beach, beak, peach and peak all end in voiceless
consonants while bees and peas end in voiced consonants. Figure 5.4B shows that the
words ending in voiced consonants have longer vowels than those ending in voiceless
consonants. This has been reported previously in the literature (House & Fairbanks, 1953a).
If we consider just the tokens that have the same final voicing we see considerable overlap
between the categories. In fact, the effect of final voicing on vowel duration is much greater
than the effect of initial voicing on vowel duration. This suggests that vowel duration is not
likely to be a strong cue to word-initial voicing.
Burst amplitude was slightly higher for b words than for p words. There is considerable
overlap between the categories suggesting that this is also not a useful cue to word-initial
voicing.
Because on average F0 is lower for males than females, the data was divided according to
the gender of the speaker. F0 was found to be lower for b words than p words, consistent

with the literature on the effects of voicing and F0. Again there is considerable overlap
between the categories.
In summary, by this analysis, VOT is a much stronger cue than any of the others. Vowel
length, burst amplitude and F0 onset and offset all appear to be very weak cues at best.
Means and standard deviations for each category are summarized in Table 5.2.
Table 5.2: Means and SD (in brackets) across speakers for each of the cues. VOT, VWL and CLR
are in ms; BA and VOICE in dB; F0 on and F0 off in Hz. (v/vl: voiced/voiceless final
consonant; f/m: female/male speakers.)
----------------------------------------------------------------------------------------------------
Position  Category  VOT      VWL           BA      F0 on         F0 off        CLR      VOICE
----------------------------------------------------------------------------------------------------
Initial   B         11(7)    v 372(75)     63(5)   f 218(49)     ---           ---      ---
                             vl 169(36)            m 148(22)
          P         82(17)   v 333(59)     61(6)   f 248(39)     ---           ---      ---
                             vl 145(52)            m 168(12)
----------------------------------------------------------------------------------------------------
Medial    B         12(7)    113(31)       61(6)   f 227(80)     f 188(29)     61(14)   66(4)
                                                   m 143(67)     m 119(20)
          P         19(8)    96(26)        60(6)   f 205(54)     f 193(40)     87(15)   60(3)
                                                   m 141(50)     m 123(21)
----------------------------------------------------------------------------------------------------

Word-medial position
In word-medial position we expect a different pattern of results. From the literature it is
predicted that VOT will be a weaker cue and closure duration and voicing during closure will
be much stronger.
Figure 5.5 shows the histograms of each cue measured in word-medial position for each
category. For word-medial VOT, b words have shorter VOT values than p words as
before, but there is significantly more overlap than in word-initial position (Figure 5.5A). For
burst amplitude, there is no difference between the two groups (Figure 5.5B). For closure
duration, b words have shorter closures than p words, as expected (Figure 5.5C). Also as
predicted, voicing amplitude is greater for b words than p words (Figure 5.5D). Both
closure duration and voicing amplitude have greater separation between the categories than
VOT or burst amplitude.


Figure 5.5: Word-medial productions. Histograms of the distribution of each cue for each
category, all speakers. Black lines are b words, grey lines are p words. [A] VOT [B] Burst
amplitude [C] Closure duration [D] Voicing amplitude.
F0 onset and offset are again separated by speaker gender (Figure 5.6A, B). Numerically
b words have slightly higher F0 than p words in F0 onset and slightly lower in F0 offset.
The high amount of overlap suggests that this difference is not significant. For vowel
duration, we see slightly longer vowels before voiced consonants (Figure 5.6C) similar to
what was found in the previous experiment between words with voiced and voiceless final
consonants (e.g. peak and peas). Interestingly, the size of the effect is much smaller than
for the word-initial voicing words which were monosyllabic. This word list had three different
vowels (while the word-initial voicing list had only one vowel) and these vowels have
inherent length differences. Even when we break vowel duration down by vowel however,
we still see only a small effect of voicing on vowel duration (Fig 5.6D). Means and standard
deviations for each category are again summarized in Table 5.2.


Figure 5.6: Word-medial productions. Histograms of the distribution of each cue for each
category, all speakers. Black lines are b words, grey lines are p words. [A] F0 onset by
gender, solid lines female, dashed lines male. [B] F0 offset by gender, solid lines female,
dashed lines male. [C] Vowel duration. [D] Vowel duration by vowel, solid lines a, dotted
line ei, dashed lines ih.
In summary, as expected, the two strongest cues appear to be closure duration and voicing
intensity. VOT and vowel duration are predicted to be weak cues while burst amplitude and
F0 onset and offset frequency seem to provide very little information.
Looking at the histograms in Figures 5.4, 5.5 and 5.6 it seems that each cue is roughly
normally distributed. Some cues have less normal distributions, such as word-medial closure
duration. This cue has more tokens in the overlapping region that would be predicted by a
Gaussian distribution. In general, however, it seems that the assumption of normal
distributions is warranted.
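That eyeballed claim could also be checked numerically, for example per cue and category with a
Shapiro-Wilk test and the sample skewness; the values below are synthetic stand-ins for one cue
(closure durations) rather than the measured data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
closure_ms = rng.normal(61, 14, 120)   # synthetic closure durations for one category

w, p = stats.shapiro(closure_ms)
print(f"Shapiro-Wilk W = {w:.3f}, p = {p:.3f}")
print(f"skew = {stats.skew(closure_ms):.2f}, excess kurtosis = {stats.kurtosis(closure_ms):.2f}")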

5.4.2 Cue weighting: D-prime


In order to compare how much precision each cue provides, the amount of overlap between
the cues must be compared directly. To do this, D-prime was computed for each cue. D-prime
is the difference between the means (separation) divided by the average standard
deviation (spread). The higher the D-prime, the less overlap between the categories. Table 5.3
shows the D-prime values for each cue.
Table 5.3: D-prime values for each cue.
------------------------------------------------------------------------------------------------
           VOT     VWL     BA      F0 on     F0 off     CLR     VOICE     SUM
------------------------------------------------------------------------------------------------
Initial    6.0     0.6     0.4     0.8       ---        ---     ---       7.8
Medial     0.9     0.6     0.2     0.2       0.2        1.8     1.8       5.7
------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------For word-initial voicing we see that VOT has a D-prime value more than six times greater
than the next largest value (F0 onset frequency) and that the D-prime values for vowel
duration and burst amplitude are very small. Note that the small D-prime values for vowel
duration and F0 onset frequency are not a result of the variability due to final consonant
voicing and speaker gender. D-prime values were calculated for each of the groups
separately (as laid out in Table 5.2) and averaged to give the total D-prime value for the cue.
Thus these D-prime values assume that the listener is able to partial out the variability due
to these other factors. If this assumption is not correct, the D-prime values reported here
may overestimate the contributions of the two cues. It is clear from Table 5.3 that VOT is
overwhelmingly the strongest cue available to word-initial voicing. This is not surprising but
this analysis quantifies the dominance of VOT.
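A sketch of the D-prime calculation with the group-wise averaging just described is given below.
The data frame is synthetic (F0 onset values split by speaker gender) and the column names are
illustrative, so the numbers it prints are not those in Table 5.3.

import numpy as np
import pandas as pd

def dprime(b_values, p_values):
    # difference between category means divided by the average standard deviation
    spread = (np.std(b_values, ddof=1) + np.std(p_values, ddof=1)) / 2
    return (np.mean(p_values) - np.mean(b_values)) / spread

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "gender": np.repeat(["f", "m"], 60),
    "category": np.tile(np.repeat(["b", "p"], 30), 2),
})
base = np.where(df["gender"] == "f", 220.0, 150.0)
df["f0_onset"] = rng.normal(base + np.where(df["category"] == "p", 25.0, 0.0), 40.0)

# D-prime computed within each gender group, then averaged across groups
per_group = [dprime(g.loc[g["category"] == "b", "f0_onset"],
                    g.loc[g["category"] == "p", "f0_onset"])
             for _, g in df.groupby("gender")]
print("per-group D-prime:", np.round(per_group, 2), " averaged:", round(float(np.mean(per_group)), 2))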
Comparing the word-medial cues reveals a very different pattern of cue strengths. Closure
duration and voicing amplitude are the strongest cues. VOT and vowel duration follow in
decreasing strength, and F0 onset, offset and burst amplitude are very weak cues. These
results roughly mirror early intuitions of researchers (Lisker, 1957, 1978) but are quantified
here for the first time.
When we compare the D-prime values for the cues in the two structural positions, word-initially versus word-medially, we see two different profiles. Word-initially, VOT is by far the
strongest cue. Word-medially, however, VOT is not the strongest cue (also noted by Lisker,
1978). Instead, closure duration and voicing during closure are strongest. They are not
nearly as dominant as VOT in word-initial position, with D-prime values of less than 2 and
only twice the next strongest cue. In word-medial position all the cues have smaller D-prime
values than they did word-initially, though there are more of them. If we sum up the D-prime
values of the cues in each structural position we see that word-initial VOT alone is as strong
as all of the word-medial cues together. Thus we would predict that word-initial voicing is
easier to discriminate than word-medial voicing. However, the word list used for word-initial
voicing was much more homogeneous than the word-medial voicing list. Some of the extra
variability we see in the word-medial voicing could be due to other factors such as phonetic
context.

5.4.3 Conditional independence


Word-initial voicing
I now turn to the important question of correlations between the cues. Table 5.4 shows the
correlation coefficients for the word-initial data. Correlations were calculated for all the data
together in order to determine the direction of any overall correlation between cues. These
overall correlations are expected when both cues have different means for the different
categories (as illustrated in Figure 5.1). Correlations are then calculated within each
category separately to determine if there is a special relationship between the cues (such as
those illustrated by Figure 5.1 panels B and C).
Table 5.4: Correlations between cues. Word-initial voicing.
------------------------------------------------------------------------------------------------
              All                       B                         P
          VOT     VWL     BA        VOT     VWL     BA        VOT     VWL     BA
------------------------------------------------------------------------------------------------
VWL      -0.05     -       -       -0.06     -       -        0.37     -       -
BA       -0.18   -0.05     -       -0.05   -0.07     -       -0.05   -0.08     -
F0 on     0.31    0.08   -0.10      0.05    0.07    0.14      0.31    0.17    0.18
------------------------------------------------------------------------------------------------
Figure 5.7 shows two-dimensional scatter plots of the data. Each point represents one
measurement. The only cues with any correlation overall are VOT and F0 onset (R2 = 0.31).
This correlation reflects the fact that the F0 cue had slightly different means for the two
categories. Other cues are not highly correlated overall as expected (R2 <0.2). Within the b
category there are no significant correlations (R2 <0.15). For the p words, VOT and vowel
duration are somewhat correlated (R2 = 0.37). This replicates findings in the literature of an
effect of speaking rate on the p category (Kessinger & Blumstein, 1997). Because no
carrier phrase was used in this experiment it is impossible to factor out speaking rate as a
cause. There is also a correlation for the p category between VOT and F0 onset (R2 =
0.31). In this case the correlation is in the same direction as the overall correlation (R2 =
0.31), suggesting redundancy. However, because VOT is such a strong cue with clear
separation between the two categories, the effect of the small correlations with the other
cues is negligible and does not produce more or less overlap between the categories (Figure
5.7).

Figure 5.7: Scatter plots of individual tokens in two dimensions for word-initial voicing. Filled
circles are b words, open circles are p words. [A] VOT and vowel duration. [B] VOT and
F0 onset.
Word-medial voicing
There were 247 tokens used in the correlational analysis. Table 5.5 shows the correlation
coefficients for all 247 tokens from the word-medial productions.

Table 5.5: Correlations between cues. Word-medial voicing. Correlations of note are discussed
in the text.

All
---------------------------------------------------------------------------
            VA      CLR     VOT     VWL     BA      F0 onset
---------------------------------------------------------------------------
CLR        -0.47
VOT        -0.37    0.26
VWL         0.24   -0.23   -0.16
BA          0.12    0.05   -0.22    0.12
F0 onset    0.08    0.10    0.09    0.00   -0.07
F0 offset   0.00    0.33    0.08    0.03    0.03    0.43
---------------------------------------------------------------------------

B
---------------------------------------------------------------------------
            VA      CLR     VOT     VWL     BA      F0 onset
---------------------------------------------------------------------------
CLR        -0.10
VOT        -0.17    0.14
VWL        -0.03   -0.16   -0.07
BA          0.18    0.15   -0.20    0.01
F0 onset    0.08    0.18    0.15   -0.11   -0.07
F0 offset   0.23    0.24    0.10    0.04    0.09    0.34
---------------------------------------------------------------------------

P
---------------------------------------------------------------------------
            VA      CLR     VOT     VWL     BA      F0 onset
---------------------------------------------------------------------------
CLR        -0.00
VOT        -0.09   -0.20
VWL         0.10    0.14    0.02
BA          0.07    0.07   -0.25    0.02
F0 onset   -0.09    0.32    0.17    0.11   -0.08
F0 offset  -0.06    0.48   -0.00    0.08   -0.01    0.64
---------------------------------------------------------------------------

There were overall correlations among several of the cues. Correlations between the cues
overall should reflect the strengths of the two cues involved. For the most part this is what
we see. The strongest correlation was between the two strongest cues, closure duration and
voicing amplitude (R2 = -0.47). Figure 5.8A shows the distributions of these two cues. There
was no correlation between these cues for the individual categories. In general, the
correlations decrease as the strengths of the individual cues decrease. Exceptions to this
pattern are of particular interest. These cases are shown in bold in Table 5.5.
F0 onset and offset were highly positively correlated, both overall and within categories.
Higher F0 offset frequencies corresponded to higher onset frequencies. This is not
unexpected since they are both measures of F0 within a relatively short time window. The
correlation within each category was in the same direction as the overall correlation
producing almost complete overlap between the cues and suggesting that these two
measures are redundant (Figure 5.8B). Closure duration and F0 offset were positively
correlated overall (R2 = 0.33) as well as within the b category (R2 = 0.24) and even more
so within the p category (R2 = 0.48). Longer closure durations corresponded to higher F0
offset. The distribution of these two cues is shown in Figure 5.8D. Closure duration was also
correlated with F0 onset for the p category. It is not clear why these cues should be
correlated; however, given how little information F0 onset and offset seem to provide, this
correlation is unlikely to be important. VOT and burst amplitude were also more highly
correlated than would be expected (R2 = -0.22). Longer VOT values corresponded to lower
burst amplitudes. The cues were also correlated within-category in the same way. This
suggests that there may be an articulatory basis for this correlation. Burst amplitude only
measured the energy in the first 10 ms of the burst. It may be that when the release is
longer (longer VOT), the energy is spread out over a longer time period. This would produce
the correlation found here. The correlation does not seem to be controlled by the speaker to
create better separation between the categories.
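As a concrete illustration, the token-level correlations reported in Table 5.5 can be computed in a few lines of R. The sketch below is hypothetical in its details: it assumes a data frame tokens with one row per measured token, a column word_type coding the voicing category (b vs. p), and one column per cue; the column names are invented for the example.

    # Correlations between cues, overall and within each voicing category
    # (compare Table 5.5). 'tokens' and its column names are assumed.
    cues <- c("VA", "CLR", "VOT", "VWL", "BA", "F0onset", "F0offset")

    cor_all <- cor(tokens[, cues])                        # all 247 tokens
    cor_b   <- cor(tokens[tokens$word_type == "b", cues]) # b words only
    cor_p   <- cor(tokens[tokens$word_type == "p", cues]) # p words only

    round(cor_all, 2)   # lower triangle corresponds to the ALL panel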


Figure 5.8: Scatter plots of individual tokens in two dimensions for word-medial voicing.
Filled circles are b words, open circles are p words.
There were no other correlations between cues. In general then, it appears that cues are
independent from each other. The exceptions (F0 onset and offset, VOT and burst
amplitude) seem to be the result of production constraints and do not produce greater
separation between the categories. Importantly, the strongest cues (voicing amplitude,
closure duration, VOT and vowel duration) do not show any correlations between them
when we look within either category.

5.4.4 Individual differences


In looking at the relationships between cues there was no evidence that speakers as a
group compensate for an ambiguous value of one cue by producing a more extreme value
of another cue. However, it may be the case that individual speakers use such a
compensatory strategy on a cue-by-cue basis. Speakers who produce one cue in a less
consistent way may produce another cue in a more consistent way. To investigate this
question, D-prime was calculated for each cue for each speaker separately. Figure 5.9
shows the D-prime values of each speaker's productions in the word-initial experiment
(panel A) and the word-medial experiment (panel B). Note that three of the speakers
participated in both experiments (Speakers 2, 3 and 6).
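The speaker-by-speaker separation measure can be sketched in R as follows. This continues the hypothetical tokens data frame above, adds an assumed speaker column, and uses one common definition of D-prime (the difference between the category means divided by the root-mean-square of the two category standard deviations); the exact formula used in Chapter 5 may differ in detail.

    # D-prime of each cue for each speaker (assumed data frame and formula).
    dprime <- function(x, category) {
      m <- tapply(x, category, mean)
      s <- tapply(x, category, sd)
      as.numeric(abs(diff(m)) / sqrt(mean(s^2)))
    }

    cues <- c("VA", "CLR", "VOT", "VWL", "BA", "F0onset", "F0offset")
    dp_by_speaker <- t(sapply(split(tokens, tokens$speaker), function(spk)
      sapply(cues, function(cue) dprime(spk[[cue]], spk$word_type))))
    dp_by_speaker   # one row per speaker, one column per cue (cf. Figure 5.9)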

Figure 5.9: D-prime values of each cue for each speaker. [A] Word-initial voicing. [B] Word-medial voicing. Speakers 2, 3 and 6 participated in both experiments.
Panel A of Figure 5.9 shows the D-prime values for each of the cues and each speaker in
the word-initial experiment. For each speaker, the D-prime value of VOT is much greater
than any of the other cues. There is also considerable variability among speakers in the
D-prime values of VOT. We do not see any real differences in the D-prime values of other
cues. Those speakers who produce more overlapping distributions of VOT do not produce
less overlapping distributions of the other cues. This may be because even for Speaker 1
with the lowest D-prime value, there is actually no overlap between the categories, and thus
very little ambiguity to be resolved by other cues. Figure 5.10 shows the histograms for
Speaker 1 and Speaker 9, the speaker with the highest D-prime value.

Figure 5.10: Histograms for the distribution of VOT values in word-initial position for two
speakers, 1 and 9, who have the lowest and highest D-prime values, respectively.
Figure 5.9 panel B shows the D-prime values of each cue for each speaker for the word-medial
experiment. Because F0 onset and offset values were so highly correlated, D-prime
values are averaged here for each speaker. This situation is more interesting than word-initial
voicing because there is more overlap (lower D-prime) for every cue, providing more
opportunity for compensatory strategies. However, speakers still don't seem to be using a
cue-based compensation strategy. For example, Speaker 6 has the lowest D-prime for the
two strongest cues (voice amplitude and closure duration), but the D-prime values for the
other cues are similar to those of Speaker 13, who has some of the highest D-prime values
for the two strongest cues. Therefore, Speaker 6 doesn't seem to be providing more
information via the secondary cues than Speaker 13.
There are, however, some relationships between how speakers use the individual cues.
D-prime values for the two strongest cues are highly correlated. Speakers who produce
relatively well separated voice amplitude distributions also tend to produce well separated
closure duration distributions (R2 = 0.67). There is also a strong correlation between how
speakers use VOT and burst amplitude (R2 = 0.88). The D-prime values of these two cues
are very similar for individual speakers. This correlation probably has the same origins as
the correlations between individual tokens of these cues seen before, strengthening the
possibility that this correlation is due to articulatory factors. There is also a strong negative
correlation between how speakers use VOT and F0 (R2 = -0.92) and burst amplitude and F0
(R2 = -0.71). In general, speakers have higher D-prime values for VOT and burst amplitude
than for F0. This is particularly true for Speakers 11, 13 and 14. The one exception is
Speaker 12, who produces F0 with a high D-prime value and VOT and burst amplitude with
low D-prime values. If we exclude this one speaker there is still a strong negative correlation
between VOT and F0 (R2 = -0.85). It is unclear why these two cues should have such a
strong correlation across speakers but no correlation across individual tokens. It is possible
that speakers choose to use one or the other as a kind of cue trading strategy.

5.5 Chapter summary


This chapter investigated the relationship between different cues to the same contrast,
voicing in two different structural positions. Very different patterns of cue distribution were
found for the two different positions, as predicted by the literature. For word-initial voicing,
VOT was the best cue and all other cues had highly overlapping distributions. For word-medial
voicing, closure duration and voicing amplitude were the best cues, followed by VOT
and preceding vowel length. Burst amplitude, F0 onset and F0 offset were not strong cues.
Two assumptions of the linear model were tested, normality of distributions and
independence of cues. In all cases cues were roughly normally distributed. Only one case of
non-independence was found. This was between VOT and burst amplitude in the word-medial
position. This is likely due to a physical relationship between the length of the release
(VOT) and the amount of energy released (burst amplitude). A similar correlation across
speakers supports this hypothesis. Such a relationship is not found in word-initial position.
This may be because the lack of carrier phrase resulted in variable amounts of pressure
being built up in the oral cavity prior to the word-initial release. Such a situation would
obscure any relationship between VOT and burst amplitude in this case.

The implication for the linear model is that for the most part it is appropriate to assume that
individual cues are independent. In the cases where this is not true, such as VOT and burst
amplitude, or F0 onset and offset, these should perhaps not be treated as separate cues for
a perceptual model. Instead they should be combined into a new cue such as F0 near the
closure or burst energy over time.
Speakers' potential compensation strategies were also tested. The lack of correlation
between cues suggests that on a token by token basis, speakers do not produce less
ambiguous values of one cue to compensate for more ambiguous values of another cue.
Individual speakers do employ different cue use strategies however. Some speakers
produce well separated distributions of the strongest cues and others produce overlapping
distributions of the strongest cues. Those speakers who produce overlapping distributions of
the strongest cues do not, however, produce less overlapping distributions of weaker cues
to compensate. In other words, speakers with weak "strong" cues do not produce strong "weak"
cues. In fact, speakers who produced less overlapping distributions of one of the strong
cues (closure duration) tended to also produce less overlapping distributions of the other
strong cue (voicing amplitude). Thus speakers do not seem to compensate on a cue-by-cue
basis either. This predicts that some speakers should be easier to understand than others.
One exception was the inverse relationship between the overlap of F0 and VOT. In this case
speakers do seem to use one cue more than the other, and the more they use one cue the
less they use the other.
The structure of the speech signal is therefore one of multiple, independent, probabilistic
cues. These cues vary in their usefulness (amount of overlap between categories) between
cues, between structural positions, and between speakers. The interesting question then
becomes whether listeners are able to track these distributions in each of these situations.
The literature on trading relations suggests that listeners are sensitive to the differences
between cues and the differences between structural positions. What is not yet clear is
whether it is possible for listeners to track these distributions for individual speakers.


Chapter 6 Natural, multi-cue experiment


In previous chapters I outlined a simple linear model for cue combination (Chapter 2) and
tested some of the assumptions of the model by looking at production data (Chapter 5). I
showed that for a single cue, listeners seem to be sensitive to the distributions of that cue
(Chapters 3 and 4). I now ask whether we can find evidence that listeners use multiple cues
in the way that the simple linear model predicts.
The linear model predicts that listeners should combine cues as a simple weighted average.
Cues which are more informative (higher D-prime) should receive more weight than cues
which are less informative (lower D-prime). This predicts that more informative cues should
have a greater effect on listeners' behaviour as they are hearing and identifying words.
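One standard way to make this prediction concrete is the log-odds form of a linear model for independent, equal-variance Gaussian cues, in which each cue's contribution is scaled by its category separation, so better-separated (higher D-prime) cues automatically receive more weight. The R sketch below uses invented category means and standard deviations and is an illustration of this general idea, not the exact formulation of Chapter 2.

    # Linear combination of independent Gaussian cues into the log odds of p.
    # mu_b, mu_p and sigma are per-cue category means and SDs (hypothetical).
    log_odds_p <- function(x, mu_b, mu_p, sigma) {
      sum((mu_p - mu_b) / sigma^2 * (x - (mu_b + mu_p) / 2))
    }
    prob_p <- function(x, mu_b, mu_p, sigma)
      plogis(log_odds_p(x, mu_b, mu_p, sigma))

    # Two hypothetical cues, e.g. closure duration (ms) and voicing amplitude (dB):
    mu_b <- c(60, 55); mu_p <- c(110, 40); sigma <- c(25, 10)
    prob_p(c(85, 47), mu_b, mu_p, sigma)   # ambiguous token, close to 0.5

In this formulation the weight on each cue, (mu_p - mu_b) / sigma^2, grows with the cue's separation, which is the sense in which more informative cues should have a larger effect on the predicted response.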
In order to test this hypothesis, I conducted a simple perceptual experiment using the words
recorded in the production experiment. A new set of participants heard and categorized a
subset of the minimal pairs recorded. Their responses, reaction times, and eye-movements
were recorded. These behavioural responses were compared to the predictions made from
the acoustic measurements and D-prime analysis.
Since the stimuli were natural productions, we expected listeners to have little trouble
categorizing them correctly. This is why we used measures known to be sensitive to small
differences in stimulus goodness, i.e., reaction time and eye-movements. Furthermore,
noise was added to half of the stimuli to make the task more difficult. This noise was
expected to reduce some of the acoustic information available and to have a greater impact
on spectral cues than temporal cues. We predicted that for the stimuli with noise, spectral
cues would be weighted less on behavioural measures than for the stimuli without noise.
We also expected that listeners would have a harder time categorizing tokens from some
speakers than from other speakers. Previous work found that listeners are slower to
categorize tokens when they were produced by speakers with more overlapping
distributions of two categories (/s/ and /sh/) than when they were produced by speakers with
less overlapping distributions of categories (Newman et al., 2001).

The speakers who produced tokens for this study were evaluated for the amount of overlap
in their distributions for multiple cues (D-prime discussed in Chapter 5). In Chapter 5 I
argued that speakers do not seem to compensate for low D-prime values in one cue by
producing higher D-prime values in another cue. This predicts that some speakers will be
harder to understand than others. Listeners should be slower to categorize tokens from
speakers who have lower D-prime values overall.

6.1 Experiment 4

6.1.1 Participants
Participants were eight monolingual native English speakers from the University of
Rochester (ages 18-23) with no known hearing problems. They were paid $10.00 for each
day of the experiment.

6.1.2 Stimuli
Stimuli were natural productions recorded for the acoustic study described in Chapter 5.
Speakers produced the words in isolation in quiet conditions. The productions from two of
the speakers were not used in the perception experiment. They were excluded because
they were involved in recruiting and running participants in the experiment and thus listeners
may have been familiar with their voices. The six remaining speakers were expected to be
unfamiliar to the listeners.
Two sets of minimal pairs were used: staple-stable and maple-Mabel. Only four unique
words were used in order to have multiple repetitions of each token. This enabled
predictions to be made about individual tokens.
Each token was segmented from the recorded files and saved as a separate file with
approximately 10 ms of silence before and after the word. All stimulus manipulations were
done using Praat (Boersma & Weenink, 2007). Tokens were normalized to have the same
average amplitude. A second copy of each stimulus was created containing broadband
noise. The broadband noise was speech-shaped (pink noise), i.e., it had more energy in the
lower frequencies than the higher frequencies. This was done by creating broadband noise
with a flat spectrum and de-emphasizing higher frequencies such that the energy decreased
6 dB per octave. The noise was then scaled to be 5 dB less than the speech and added to
each speech token. The noisy stimuli thus had a signal-to-noise ratio (SNR) of 5 dB. Figure
6.1 shows spectrograms for a sample token with and without noise.
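The noise manipulation can be sketched in R as follows, assuming speech is a numeric vector of samples; the actual stimuli were built in Praat, so this is only an illustration of the same recipe (a spectral tilt of -6 dB per octave, then scaling so the speech is 5 dB more intense than the noise).

    # Add speech-shaped noise to a waveform at a given SNR (illustrative only).
    add_speech_shaped_noise <- function(speech, snr_db = 5) {
      n <- length(speech)
      spec <- fft(rnorm(n))                 # white noise spectrum
      k <- 0:(n - 1)
      f <- pmin(k, n - k)                   # symmetric frequency index
      f[1] <- 1                             # avoid dividing by zero at DC
      # Amplitude proportional to 1/f gives a 6 dB per octave drop in energy.
      noise <- Re(fft(spec / f, inverse = TRUE)) / n
      rms <- function(x) sqrt(mean(x^2))
      # Scale the noise so that the speech RMS is snr_db above the noise RMS.
      noise <- noise * rms(speech) / (rms(noise) * 10^(snr_db / 20))
      speech + noise
    }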

Figure 6.1: Waveforms and spectrograms (no pre-emphasis) for Mabel. Left panel is the
original, right panel has noise added (SNR 5 dB). Shaded area represents the stop closure,
burst and VOT.
Adding noise was expected to interfere most with low frequency spectral information. In
particular it was expected to interfere with the voice amplitude cue. To confirm the effect of
noise on this cue, the acoustic analysis of voice amplitude was repeated for the noisy
stimuli. The amplitude of noise in the closure interval was increased for all noisy stimuli
(Figure 6.2), as expected. Furthermore, the distributions of voicing amplitude were more
overlapping for the noisy stimuli (Figure 6.2B) than for the natural stimuli (Figure 6.2A).


Figure 6.2: Distributions of voicing amplitude for the b tokens (dark lines) and p tokens
(light lines). [A] Natural stimuli. [B] Noisy stimuli.
Wherever possible, three tokens of each of the four words were used from each of the 6
speakers. Four tokens were not used because they contained extraneous noise or because
complete acoustic analyses were not available. This left 68 unique items. An additional 68
items were created with noise for a total of 136 items.

6.1.3 Procedure
Participants heard each item 15 times for a total of 2040 trials. Trials were split over four
sessions on consecutive days (510 trials per session). Each session lasted approximately
one hour. Participants were seated in front of a computer in a quiet room. Eye-movements
were recorded using an Eyelink II head-mounted eye-tracker with a sampling rate of 250 Hz.
Sounds were presented over Sennheiser HD 570 headphones at a comfortable listening
level.
On each trial participants saw a visual display containing four pictures, one in each quadrant
(see Figure 3.2). Each picture corresponded to one of the words (staple, stable, maple or
Mabel). Pictures are included in Appendix B. Participants first clicked on a fixation point in
the center of the screen. After a delay of 100 ms one of the tokens was played over the
headphones. Participants chose the picture they thought was most appropriate by clicking
with the mouse. Reaction times (time to click) were recorded from the onset of the stimulus.

Eye-movements were recorded from the onset of the stimulus to the end of the trial (mouse-click response).

6.2 Results

6.2.1 Categorization data


Listeners were extremely accurate in categorizing the naturally produced tokens, as
predicted. Overall, listeners chose the correct picture on 98.6% of trials. Accuracy was
similarly high for each of the four words. On trials where listeners chose the wrong picture,
80% of the time they chose the other member of the minimal pair (e.g. maple for Mabel).
Listeners were just as accurate on trials with noisy stimuli (98.3%) as on trials with natural
stimuli (98.9%). Table 6.1 shows the response patterns. Analysis of individual items showed
that no single item was incorrectly identified more than 23% of the time (only 3 items were
identified incorrectly more than 5% of the time).
Table 6.1: Number of trials on which subjects chose each picture.

                          Response
Noise   Word      Mabel   maple   stable   staple   Total
no      Mabel      2387      14                      2403
no      maple        31    2109                      2147
no      stable       18             2042             2075
no      staple                        11     2690    2702
yes     Mabel      1907      15                      1928
yes     maple        29    2017                      2048
yes     stable                      1964       81    2047
yes     staple                               2156    2162

6.2.2 Reaction time


It was predicted that although listeners may be highly accurate, they may be slower to
identify some items than others. In particular, items which are more ambiguous should lead
to slower reaction times (RT). The acoustic analysis in Chapter 5 made four main
predictions:
1) Cues with a high D-prime (closure duration and voice amplitude) should be better
predictors of RT than those with a low D-prime (vowel duration, VOT, and burst
amplitude)
2) Listeners should be faster to respond to speakers with high D-prime values than
speakers with low D-prime values.
3) Listeners should be slower to identify noisy stimuli than natural stimuli
4) Noise should affect spectral cues (voice amplitude and burst amplitude) more than
temporal cues (vowel duration, closure duration and VOT).

Linear mixed model regression analyses (lmer function from the languageR package for the
R stats program) were used to test these predictions. Mixed models were used because
they provide a way to include both subject and item effects in the same model allowing for
better model comparison and the ability to observe subtle effects. Separate models were
used to test the effects of the acoustic variables (individual cue D-prime) and speaker
variables (individual speaker D-prime). The effect of noise was included in all models as was
the effect of time (day of the experiment).
For all models, RT data were log transformed to make them more normally distributed.
Outliers more than 2.5 standard deviations from the mean of the log-transformed data were
removed.
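This preprocessing amounts to a few lines of R; the sketch below assumes a data frame d with a reaction time column rt in milliseconds (the names are hypothetical).

    d$logRT <- log(d$rt)                                    # log transform
    keep <- abs(d$logRT - mean(d$logRT)) <= 2.5 * sd(d$logRT)
    d <- d[keep, ]                                          # drop outliers beyond 2.5 SD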

6.2.3 Speaker models


The speaker model included both fixed and random effects. Random effects (intercepts) for
subjects and items were included to model different overall reaction times for each subject
and item. Fixed effects included noise (yes or no), day of test (1 through 4), and one or more
D-prime measures. An interaction term between noise and the D-prime measures was also
considered. All continuous fixed effects were centered and rescaled to have a mean of 0
and a standard deviation of 0.5 (for ease of interpreting the coefficients of the resulting
model). Binary fixed effects were centered. Analyses performed on both the rescaled and
non-rescaled predictor variables produced similar results.
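A minimal sketch of this speaker model in R is given below. It assumes the data frame d from the preprocessing sketch, with centered noise, a day predictor, a per-speaker average D-prime, and subject and item identifiers; all of these names are hypothetical. lmer itself comes from lme4, the package on which the languageR tools cited above are built.

    library(lme4)

    rescale <- function(x) (x - mean(x)) / (2 * sd(x))   # mean 0, SD 0.5
    d$day_c    <- rescale(d$day)
    d$dprime_c <- rescale(d$speaker_dprime)

    m_speaker <- lmer(logRT ~ noise + day_c + dprime_c +
                        (1 | subject) + (1 | item), data = d)
    summary(m_speaker)

An interaction term (noise * dprime_c) can be added to the formula in the same way; as reported below, it was not significant and was dropped from the final model.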
The first model used average D-prime as the speaker predictor variable. The interaction of
noise and average D-prime was not significant (β = -0.004, SE = 0.005). The final model did
not include an interaction term. In this and all other models tested, there was a significant
effect of day (β = -0.060, SE = 0.003). Listeners' reaction times decreased by an estimated
210 ms between day 1 and day 4. Similarly, in this and all other models tested, there was a
significant effect of noise (β = 0.016, SE = 0.003). Listeners were estimated to be 21 ms
slower to respond to noisy stimuli than to natural stimuli. The main effect of D-prime was
also significant (β = -0.031, SE = 0.003) but in the opposite direction to our prediction. When
listening to speakers whose D-prime values were on average higher, listeners were slower
than when listening to speakers whose D-prime values were on average lower.
This result was surprising given the previous finding that reaction time increased for more
overlap. One possible difference between this study and the previous one is that in the
previous study the lexical contrast was at word onset, while this contrast is in the middle of
the word. This could result in different amounts of time between the onset of the word and
the relevant acoustic information. In particular, if speakers who produce less overlap in their
distributions do so by speaking more slowly, then increases in D-prime would correlate with
longer time to hear the relevant acoustic information. To test this hypothesis, average time
to the onset of the first vowel (/ei/), time from the onset of the vowel to the onset of the burst
(the relevant acoustic information), the duration of the second syllable, and the total word
duration were measured for each speaker. Indeed, speakers with high average D-prime
values (above 2) had longer overall word durations than speakers with low average D-prime
values. Average D-prime and word duration were highly positively correlated (R2 = .98) as
were average D-prime and onset duration (R2 = .76), duration of the relevant acoustic
information (R2 = .92), and the duration of the second syllable (R2 = .94). Figure 6.3
illustrates these trends. A more extensive study of all the items recorded for the production
experiment revealed that D-prime was most correlated with the relevant acoustic
information. Together they mean that speakers with high average D-prime values achieve
this clarity by selectively extending the time spent articulating the contrasting information. It
is quite likely that speakers were aware of the contrast given the word list they read. This
result is interesting in and of itself and mirrors recent work demonstrating that segments that
carry more information tend to be longer (Aylett & Turk, 2004, 2006). The implication for the
present analysis is that it may explain why higher average D-prime values produce longer
RTs. In order to remove this confound, RT on each trial was adjusted to reflect the onset of
the burst rather than the onset of the stimulus. The previous analysis was repeated with this
new RT (burst RT).
Figure 6.3: Relationship between speakers' average D-prime and time spent articulating
each portion of the word. Onset is time to the onset of the vowel. Relevant is time from the
onset of the vowel to the onset of the burst. Offset is time from the onset of the burst to the
end of the word.
In the analysis predicting burst RT, noise (β = 0.019, SE = 0.003) and day (β = -0.074, SE =
0.003) were again significant predictors with effects of the same magnitude. Average D-prime
was also a significant predictor (β = -0.021, SE = 0.013), but this time when listening to
speakers with a high average D-prime, listeners were faster to respond than when listening
to speakers with a low average D-prime (an estimated difference of 32 ms from the highest
to the lowest D-prime). Thus after hearing the relevant acoustic information, listeners were
faster to respond to speakers who produced less overlapping distributions.

6.2.4 Acoustic model


The second set of models tested the hypothesis that RT should depend on the individual
acoustic cues available in the signal. Cues with higher overall D-prime should be better
predictors of RT than cues with lower overall D-prime. Furthermore, for cues that affect RT,
values of those cues that are ambiguous should have longer RTs and values that are more
extreme should have shorter RTs. In general cue values in the middle of the continuum will
be the ambiguous values, thus the direction of the effect of each cue should switch for
different sides of the continuum. In order to fit a linear model, the data were split in half
according to which word was intended. The b half of the data included only the words
stable and Mabel and the p half only the words staple and maple. For the b half,
increasing closure duration and VOT should make b the less likely category - increasing
reaction times - while increasing vowel length, burst amplitude and voicing amplitude should
make b the more likely category - decreasing reaction times. For the p half the
predictions are reversed. For both halves of the data, the size of the effect for each of these
cues should correspond to the average D-prime for that cue. Noise should also affect burst
amplitude and voice amplitude more than the other cues.
To test these hypotheses, separate models were fit to each half of the data. Predictor
variables were centered between -0.5 and 0.5 separately for each half. Following the analysis
above, the overall RT as well as the RT from burst onset (burst RT) were tested as outcome
variables. In these analyses item was not used as a random effect as it was in the speaker
analyses. This is because individual differences in the acoustic cues (which varied by item)
were exactly the effects of interest. Therefore item effects were predicted and not a random
variable. Differences in RT as a result of hearing maple versus staple were not of interest
however. Therefore a modified item effect was included (maple/Mabel vs. staple/stable).
B model
The first model tested included each of the acoustic predictor variables (vowel duration,
closure duration, VOT, burst amplitude and voice amplitude) as well as day and noise as
before. Subject and item were included as random effects as before. The model predicting
RT was much worse than the model predicting burst RT (log-likelihood: RT = 2231, burst RT =
536). Given this and the result of the previous model, burst RT was used as the outcome in all
subsequent models.
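The b-half model just described has the following general shape. This is a sketch only, assuming a data frame with burst-aligned log reaction times (burst_logRT), cue predictors rescaled to the -0.5 to 0.5 range (vwl, clr, vot, va, ba), centered noise, the rescaled day predictor, and a pair factor standing in for the modified item effect; all of these names are invented for the example.

    library(lme4)

    b_half <- subset(d, word %in% c("stable", "Mabel"))   # the b half of the data
    m_b <- lmer(burst_logRT ~ (vwl + clr + vot + va + ba) * noise +
                  day_c + pair + (1 | subject),
                data = b_half)
    summary(m_b)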
In the model containing all the acoustic predictor variables and all the interactions, VOT and
burst amplitude were highly correlated (R2 = .80) as was found in the acoustic study
suggesting collinearity between these factors. Removing VOT had little effect on the size of
the burst amplitude effect and no effect on the sign, while removing burst amplitude had a
large effect on the magnitude of the VOT effect and reversed its sign. For this reason VOT
was excluded from the model. Closure duration and voicing amplitude were also correlated
(R2 = .59). Removing closure duration from the model did not have much of an effect on
voice amplitude, but when voice amplitude was removed from the model closure duration
was no longer significant. However, removing either factor from the model significantly
reduced the fit of the model (factor removal: closure duration χ2(1) = 10.3, p < .01, voice
amplitude χ2(1) = 25.7, p < .001). Both were included in the final model.
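Continuing the sketch above, the factor-removal tests reported here can be run as likelihood-ratio comparisons between nested models, for example for closure duration:

    m_no_clr <- update(m_b, . ~ . - clr)   # refit without the closure duration main effect
    anova(m_no_clr, m_b)                   # likelihood-ratio (chi-squared) test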
In the best model, all the main effects were significant. Noise significantly increased reaction
time (β = 0.021, SE = 0.005) by an estimated 23 ms. Reaction time decreased over the
course of the experiment (β = -0.075, SE = 0.005) by an estimated 250 ms. Vowel duration
had a significant positive effect (β = 0.026, SE = 0.006). Listeners responded faster to
shorter vowels. Closure duration had a significant negative effect (β = -0.032, SE = 0.007).
Listeners responded faster to longer closures. Both of these effects were in the opposite
direction to what was predicted. Voice amplitude had a significant negative effect (β = -0.047,
SE = 0.006). Listeners responded faster to higher voice amplitudes. Burst amplitude had a
significant negative effect (β = -0.030, SE = 0.006). Listeners responded faster to higher burst
amplitudes. Both of these effects were in the predicted direction.
Vowel duration had the most significant positive interaction with noise (β = 0.049, SE =
0.013). Closure duration had a marginally significant negative interaction (β = -0.027, SE =
0.015). Voice amplitude also had a marginally significant negative interaction (β = -0.030, SE
= 0.016). The predicted effects on RT for noise, vowel duration, closure duration, voice
amplitude and their interactions are shown in Figure 6.4. In all cases the addition of noise
increased the main effects. Only voice amplitude and burst amplitude were predicted to be
affected by noise.
P model
In the model containing all the acoustic predictor variables and all the interactions, VOT and
burst amplitude were again highly correlated (R2 = .77). Neither effect was significant on its
own or after removal of the other effect. Removal of either had no significant effect on the
model. Both were removed from the final model.


Figure 6.4: Estimated effects of noise and vowel duration, closure duration and voice
amplitude on reaction time. Left panels are p words, right panels are b words.

In the best model, noise and day were again significant. Noise significantly increased
reaction time (β = 0.017, SE = 0.005) by an estimated 18 ms. Reaction time decreased over the
course of the experiment (β = -0.072, SE = 0.005) by an estimated 235 ms. Vowel duration
had no significant effect (β = -0.008, SE = 0.006). Closure duration had a significant negative
effect (β = -0.015, SE = 0.006). Listeners responded faster to longer closures. This effect
was in the predicted direction. Voice amplitude did not have a significant main effect (β = -0.009,
SE = 0.006) but did have a significant interaction with noise (β = -0.030, SE = 0.011).
When there was no noise listeners were slightly slower to respond to stimuli with high voice
amplitude (as predicted). When noise was added, listeners were much slower to respond to
stimuli with low voice amplitude (Figure 6.4E), consistent with the increase in voice amplitude
observed for these stimuli (as described earlier in this chapter).
Figure 6.4 shows that for vowel duration, the main effects are in the opposite direction from
those predicted, though most of this main effect comes from the stimuli with noise. Noise
seems to increase RT for those stimuli that are most consistent with the category (short
vowels for p and long vowels for b). The effect of closure duration is in the predicted
direction for p and in the opposite direction of the prediction for b stimuli. In both cases,
noise increases RT for stimuli with a short closure. The effect of voicing amplitude on natural
stimuli is in the right direction for both categories and the effect of noise increases RT for low
amplitude closures in both categories. This is consistent with the observation earlier in this
chapter that noise shifted the lower amplitude stimuli into the range of the higher amplitude stimuli.

6.3 Chapter summary


Listeners were highly accurate at categorizing these natural stimuli, even with the addition of
noise. This suggests there is significant acoustic information in these stimuli for listeners to
accurately determine the intended word, even for speakers with low average D-prime. Thus,
any predicted differences between items and between speakers will be subtle. Mixed model
regression analysis of reaction time revealed some of these subtle effects. Four predictions
were made for reaction time:
First, it was predicted that cues with a high D-prime (closure duration and voice amplitude)
would be better predictors of RT than the other cues. VOT was removed from both models
because it did not have a significant effect. Burst amplitude was removed from the p model
for the same reason. The most predictive cues were therefore closure duration, voice
amplitude and vowel duration. Of these, only voice amplitude had an effect in the right
direction for both categories, and closure had an effect in the right direction for the p
category. This suggests that voice amplitude may be the most reliable predictor of RT in this
study, followed by closure duration. These two had the highest D-prime values and so were
expected to have the most reliable effects. More convincing data would be necessary to
show that the effects of each cue were predicted by their D-prime values.
The second prediction was that listeners should be faster to respond to speakers with high
D-prime values than speakers with low D-prime values. Perhaps the most interesting
observation from this study is that speakers who produce less overlapping distributions
(higher average D-prime) do so by expanding the time they take to articulate the relevant
acoustic information. Once listeners have heard the relevant acoustic information, they are
faster to respond to words produced by speakers with higher average D-prime values.
The third prediction was that listeners should be slower to identify noisy stimuli. This was the
most robust finding, with reaction times increasing by approximately 20 ms for noisy stimuli.
A second robust and predictable finding was that listeners' reaction times decreased over the
course of the experiment (by approximately 200 ms on average).
The final prediction was that noise should affect spectral cues more than temporal cues.
This does not seem to be the case. While the only interpretable interaction between noise
and an acoustic variable was with voice amplitude (a spectral cue), there were significant
interactions between noise and each of the other (temporal) variables.
In general, this experiment may have failed to find robust effects of the individual acoustic
cues because the variance in reaction times was small. Repeated presentations of the
stimuli were expected to give good estimates of the relative ambiguity in each stimulus, but
may have given the listeners too much exposure to the speech. Alternatively, natural speech
may simply contain enough cues to make most words unambiguous. The analysis in
Chapter 5 suggested that most of the cues contribute only small amounts of information to
the contrast. Nevertheless, their incremental contributions may be enough to disambiguate
most naturally produced stimuli.


Chapter 7 Summary and conclusions


In this dissertation I have described a way to quantify the information available in acoustic-phonetic
cues for word recognition. I have argued that for a listener to interpret a particular
acoustic-phonetic utterance, they must know the relevant likelihood distributions for any
referents they are considering. The likelihood of that cue is then compared for each of the
potential referents and the listener can then decide which referent has the highest
probability of having been uttered. The listener may also know how certain they should be
about that choice. In cases where one alternative is much more probable, they may be very
certain about their choice. In cases where two or more alternatives are similarly probable,
the listener may be less certain about their choice. This uncertainty is revealed in increased
reaction times and more looks to candidate objects.
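For a single cue, this comparison can be written down directly. The R sketch below assumes Gaussian likelihoods and equal prior probability for the two candidate words; the means and standard deviations are invented for illustration. A posterior near 1 corresponds to a confident choice, and a posterior near 0.5 to an ambiguous token.

    # Posterior probability of the p alternative given one cue value x.
    posterior_p <- function(x, mu_b, sd_b, mu_p, sd_p) {
      lik_b <- dnorm(x, mu_b, sd_b)
      lik_p <- dnorm(x, mu_p, sd_p)
      lik_p / (lik_b + lik_p)
    }

    posterior_p(20, mu_b = 0, sd_b = 15, mu_p = 50, sd_p = 20)   # ambiguous token
    posterior_p(70, mu_b = 0, sd_b = 15, mu_p = 50, sd_p = 20)   # clearly p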
The foregoing view of the speech signal provides a detailed account of category structure. It
is not enough for the listener to know which side of a category boundary a token is on, or
even how far from the mean or category boundary it is. The results presented here argue
that it is the token's place in the relevant probability distribution that matters. This
interpretation is consistent with previous work showing that listeners are sensitive to within-category
variability. It further suggests that categories are not merely graded but internally
structured. Categories should also vary in how gradient or categorical they appear to be
depending on how broadly distributed and overlapping the probability distributions are.
Categories which are defined by cues with well separated distributions may behave in a
more categorical way because the difference in posterior probability between them and their
competitors is large. Categories which are defined only by cues with distributions which
are highly overlapping with their competitors may be treated with more uncertainty. It is
interesting to note however, that even in the case of word-initial VOT as a cue to stop
voicing, one of the least overlapping distributions, there is still evidence of within-category
sensitivity.
The model of speech perception outlined here relies heavily on the language-specific
knowledge of the listener. The relevant categories (words or otherwise) must be learned
and the relevant distributions of acoustic-phonetic cues must also be learned. The skills
required to learn these distributions are thought to be domain general and apply equally well
to the task of learning to perceive and act in the world in general as outlined in the
introduction. This is in contrast to models which rely on specialized linguistic skills and is
also in contrast to models which rely on knowledge-neutral auditory processing
mechanisms. For example, the model outlined here would predict that listeners are best
able to accommodate patterns of assimilation/coarticulation found commonly in their native
language. Models of compensation for assimilation that posit knowledge-neutral auditory
processing mechanisms would predict that all listeners should be able to compensate for
patterns of assimilation regardless of experience with it in their language.
More direct tests of this approach are needed. The cross language prediction for
assimilation behaviour is one such test. More controlled studies of multi-cue integration
behaviour are another. For example, the predictions of relative cue weightings from the
acoustic-phonetic study could be more precisely tested in a trading relations study using
more controlled stimuli. Another direct test of the linear model of cue combination would be
to manipulate the distributions of two cues, as in Experiment 1, to determine if the relative
weightings of cues change in the predicted way.
If further testing supports the claims made here, this model may be a useful way to think
about the problem of language learning, both for first and second language learning. It is
known that cue weightings change over the course of development (Mayo & Turk, 2004;
Nittrouer & Miller, 1997). What is not understood is what predicts the patterns of early cue
weighting. Some researchers have argued that transitional cues are favoured by young
children over briefer spectral-temporal cues (Nittrouer & Miller, 1997). Other research has
not supported this claim (Mayo & Turk, 2004). It may be the case that developmental trends
can be predicted by taking into account the distributional properties of cues. For example,
cues with clearly separated distributions may be learned early and may be learned from the
input without knowledge of the categories to which they belong (Maye & Gerken, 2000),
while more subtle cues may require labelled data, i.e. knowledge from the context that
there are two categories. The reverse might also be true. Knowing which cues are relevant
to contrasts in your language might require learning which cues are not relevant. The results
from the non-speech auditory categorization task (Holt & Lotto, 2006) suggest that it is
experience with high variability in some cues which teaches listeners to rely less on those
cues rather than experience with low variability which teaches listeners to rely more on a
cue.


In summary, the ideal observer (listener) model outlined here is a powerful and precise way
to characterize the information available to the listener in the speech signal. It puts the focus
on the listener's knowledge of the relationship between acoustic-phonetic cues and the
categories they signal. These categories may be defined flexibly depending on the contrast
of interest and the relevant context. Thus, the knowledge invoked is entirely linguistic and
specific to a listener's language community. The inference process described to use this
knowledge is, however, domain general. It is the same inference process thought to guide
all perception and action in the world.

Bibliography
Allen, J. S., & Miller, J. L. (1999). Effects of syllable-initial voicing and speaking rate on the
temporal characteristics of monosyllabic words. Journal of the Acoustical Society of
America, 106(4), 2031-2039.
Allen, J. S., Miller, J. L., & DeSteno, D. (2003). Individual talker differences in voice-onset-time. Journal of the Acoustical Society of America, 113(1), 544-552.
Anderson, J. R. (1990). The adaptive character of thought. Hillsdale, NJ: Lawrence Erlbaum
Associates.
Andruski, J. E., Blumstein, S. E., & Burton, M. (1994). The effect of subphonemic differences
on lexical access. Cognition, 52, 163-187.
Aylett, M., & Turk, A. (2004). The smooth signal redundancy hypothesis: A functional
explanation for relationships between redundancy, prosodic prominence, and duration in
spontaneous speech. Language and Speech, 47(1), 31-56.
Aylett, M., & Turk, A. (2006). Language redundancy predicts syllabic duration and the
spectral characteristics of vocalic syllable nuclei. Journal of the Acoustical Society of
America, 119(5), 3048-3054.
Barlow, H. B. (1957). Increment thresholds at low intensities considered as signal/noise
discriminations. Journal of Physiology, I36, 469-488.
Battaglia, P. W., Jacobs, R. A., & Aslin, R. N. (2003). Bayesian integration of visual and
auditory signals for spatial localization. Journal of the Optical Society of America, 20(7),
1391-1397.
Beddor, P. S., & Hawkins, S. (1990). The influence of spectral prominence on perceived
vowel quality. Journal of the Acoustical Society of America, 87(6), 2684-2704.
Bell-Berti, F. (1975). Control of pharyngeal cavity size for English voiced and voiceless
stops. Journal of the Acoustical Society of America, 57, 456-461.
Bell-Berti, F., & Harris, K. S. (1979). Anticipatory coarticulation: Some implications from a
study of lip rounding. Journal of the Acoustical Society of America, 65(5), 1268-1270.
Bell, A., Jurafsky, D., Fosler-Lussier, E., Girard, C., Gregory, M., & Gildea, D. (2003). Effects
of disfluencies, predictability, and utterance position on word form variation in English
conversation. The Journal of the Acoustical Society of America, 113(2), 1001-1024.
Blumstein, S. E., Myers, E. B., & Rissman, J. (2005). The perception of Voice Onset Time:
An fMRI investigation of phonetic category structure. Journal of Cognitive Neuroscience,
17(9), 1353-1366.
Boersma, P., & Weenink, D. (2007). Praat: doing phonetics by computer (Version 4.6.09).
Retrieved June 24th, 2007

Boucher, V. J. (2002). Timing relations in speech and the identification of voice-onset times:
A stable perceptual boundary for voicing categories across speaking rates. Perception and
Psychophysics, 64(1), 121-130.
Bybee, J. (2000). Lexicalization of sound change and alternating environments. In M. Broe & J.
Pierrehumbert (Eds.), Papers in Laboratory Phonology V: Acquisition and the Lexicon.
Cambridge: Cambridge University Press.
Chomsky, N., & Halle, M. (1968). The sound pattern of English. New York: Harper & Row.
Clarke, C. M., & Luce, P. A. (2005). Perceptual adaptation to speaker characteristics: VOT
boundaries in stop voicing categorization. In C. T. McLennan, P. A. Luce, G. Mauner & J.
Charles-Luce (Eds.), University at Buffalo Working Papers on Language and Perception, 2
(pp. 362-366).
Clayards, M., Aslin, R. N., & Tanenhaus, M. K. (2005). Experience mediated cue integration.
Paper presented at the ISCA workshop on Plasticity in Speech Perception.
Coenen, E., Zwitserlood, P., & Bolte, J. (2001). Variation and assimilation in German:
Consequences for lexical access and representation. Language and Cognitive Processes,
16, 535-564.
Cole, J., Linebaugh, G., Munson, C., & McMurray, B. (submitted). Vowel-to-vowel
coarticulation across words in English: Acoustic evidence.
Dahan, D., Magnuson, J. S., Tanenhaus, M. K., & Hogan, E. M. (2001). Subcategorical
mismatches and the time course of lexical access: Evidence for lexical competition.
Language and Cognitive Processes, 16(5/6), 507-534.
Diehl, R. L., & Walsh, M. A. (1989). An auditory basis for the stimulus-length effect in the
perception of stops and glides. Journal of the Acoustical Society of America, 85(5), 2154-2164.
Ernst, M. O., & Banks, M. S. (2002). Humans integrate visual and haptic information in a
statistically optimal fashion. Nature, 415(24), 429-433.
Escudero, P. (2000). Developmental patterns in the adult L2 acquisition of new contrasts:
The acoustic cue weighting in the perception of Scottish tense/lax vowels in Spanish
speakers. Unpublished M.Sc. thesis, University of Edinburgh.
Escudero, P., & Boersma, P. (2004). Bridging the gap between L2 speech perception
research and phonological theory. Studies in Second Language Acquisition, 26(4), 551-585.
Fant, G. (1960). Acoustic theory of speech production. The Hague: Mouton.
Feldman, N. H., & Griffiths, T. L. (2007, August). A rational account of the perceptual
magnet effect. Paper presented at the Twenty-Ninth Annual Conference of the Cognitive
Science Society, Nashville Tennessee.
Flemming, E. (2007). Commentary: Modeling listeners. Paper presented at the Laboratory
Phonology, Paris.

Fougeron, C., & Keating, P. (1997). Articulatory strengthening at the edges of prosodic
domains. Journal of the Acoustical Society of America, 101(6), 3728-3740.
Francis, A., Baldwin, K., & Nusbaum, H. C. (2000). Effects of training on attention to
acoustic cues. Perception and Psychophysics, 62(8), 1668-1680.
Ganong, W. F. (1980). Phonetic categorization in auditory word perception. Journal of
Experimental Psychology: Human Perception and Performance, 6(1), 110-125.
Gaskell, G. (2003). Modeling regressive and progressive effects of assimilation in speech
perception. Journal of Phonetics, 31, 447-463.
Gaskell, G., & Marslen-Wilson, W. D. (1996). Phonological variation and inference in lexical
access. Journal of Experimental Psychology: Human Perception and Performance, 22, 144-158.
Gaskell, G., & Marslen-Wilson, W. D. (1998). Mechanisms of phonological inference in
speech perception. Journal of Experimental Psychology: Human Perception and
Performance, 24, 380-396.
Gaskell, G., & Marslen-Wilson, W. D. (2001). Lexical ambiguity resolution and spoken word
recognition: Bridging the gap. Journal of Memory and Language, 44, 325-349.
Geisler, W. S. (1989). Sequential ideal-observer analysis of visual discriminations.
Psychological Review, 96(2), 267-314.
Goldinger, S. D., & Azuma, T. (2003). Puzzle-solving science: the quixotic quest for units in
speech perception. Journal of Phonetics, 31, 305-320.
Goudbeek, M. (2007). The acquisition of auditory categories. Unpublished PhD thesis,
Radboud University of Nijmegen.
Goudbeek, M., Cutler, A., & Smits, R. (2008). Supervised and unsupervised learning of
multidimensionally varying non-native speech categories. Speech Communication, 50, 109-125.
Goudbeek, M., Smits, R., Swingley, D., & Cutler, A. (submitted). Acquiring auditory and
phonetic categories.
Gow, D. (2001). Assimilation and anticipation in continuous spoken word recognition.
Journal of Memory and Language, 45(1), 133-159.
Gow, D. (2002). Does English coronal place assimilation create lexical ambiguity?
Perception and Psychophysics, 65, 575-590.
Gow, D. (2003). Feature parsing: Feature cue mapping in spoken word recognition.
Perception and Psychophysics, 65(4), 575-590.
Gow, D., & Im, A. M. (2004). A cross-linguistic examination of assimilation context effects.
Journal of Memory and Language, 51, 279-296.

Gow, D., & McMurray, B. (in press). Word recognition and phonology. Papers in Laboratory
Phonology 9.
Greenberg, S., Carvey, H., Hitchcock, L., & Chang, S. (2003). Temporal properties of
spontaneous speech - a syllable-centric perspective. Journal of Phonetics, 31, 465-485.
Griffiths, T. L., & Tenenbaum, J. B. (2006). Optimal predictions in everyday cognition.
Psychological Science, 17(9), 767-773.
Grossberg, S. (2003). Resonant neural dynamics of speech perception. Journal of
Phonetics, 31, 423-445.
Hattori, S., Yamamoto, K., & Fujimura, O. (1958). Nasalization of vowels in relation to
nasals. Journal of the Acoustical Society of America, 30, 267-274.
Hawkins, S. (2003). Roles and representations of systematic fine phonetic detail in speech
understanding. Journal of Phonetics, 31, 373-405.
Hawkins, S., & Nguyen, N. (2004). Influence of syllable-coda voicing on the acoustic
properties of syllable-onset /l/ in English. Journal of Phonetics, 32, 199-231.
Holt, L. L., & Lotto, A. J. (2006). Cue weighting in auditory categorization: Implications for
first and second language acquisition. Journal of the Acoustical Society of America, 119(5),
3059-3071.
House, A. S., & Fairbanks, G. (1953a). The influence of consonant environment upon the
secondary acoustical characteristics of vowels. Journal of the Acoustical Society of America,
25(1), 105-113.
House, A. S., & Fairbanks, G. (1953b). The influence of consonantal environments upon the
secondary acoustical characteristics of vowels. Journal of the Acoustical Society of America,
25, 105-113.
House, A. S., & Stevens, K. N. (1956). Analog studies of the nasalization of vowels. Journal
of Speech Hearing Disorders, 21(2), 218-232.
Johnson, K., Ladefoged, P., & Lindau, M. (1993). Individual differences in vowel production.
Journal of the Acoustical Society of America, 94(2), 701-714.
Kessinger, R. H., & Blumstein, S. E. (1997). Effects of speaking rate on voice-onset time in
Thai, French and English. Journal of Phonetics, 25, 143-168.
Keyser, S. J., & Stevens, K. N. (2006). Enhancement and overlap in the speech chain.
Language, 81(1), 33-63.
Kingston, J., & Diehl, R. L. (1994). Phonetic knowledge. Language, 70(2), 419-454.
Kingston, J., & Diehl, R. L. (1995). Intermediate properties in the perception of distinctive
feature values. In B. Connell & A. Arvaniti (Eds.), Papers in Laboratory Phonology IV:
Phonology and Phonetic Evidence (pp. 7-27). Cambridge: Cambridge University Press.

Kingston, J., Diehl, R. L., Kirk, C. J., & Castleman, W. A. (2008). On the internal perceptual
structure of distinctive features: The [voice] contrast. Journal of Phonetics, 36, 28-54.
Klatt, D. (1980). Software for a cascade/parallel formant synthesizer. Journal of the
Acoustical Society of America, 67(3), 971-995.
Kuhnert, B., & Hoole, P. (2004). Speaker-specific kinematic properties of alveolar reductions
in English and German. Clinical and Linguistic Phonetics, 18(6), 559-575.
Lahiri, A., & Marslen-Wilson, W. D. (1991). The mental representation of lexical form: A
phonological approach to the recognition lexicon. Cognition, 38(3), 245-294.
Liberman, A. M. (1996). Speech: a special code. Cambridge, MA: MIT Press.
Liberman, A. M., Harris, K. S., Hoffman, H. S., & Griffith, B. C. (1957). The discrimination of
speech sounds within and across phoneme boundaries. Journal of Experimental Psychology,
54(5), 358-368.
Lindblom, B. (1990). Explaining phonetic variation: a sketch of the H&H theory. In W.
Hardcastle & A. Marchal (Eds.), Speech production and speech modeling (pp. 403-439):
Springer.
Lisker, L. (1957). Closure duration and the intervocalic voiced-voiceless distinction in
English. Language, 33(1), 42-49.
Lisker, L. (1978). Rapid vs. rabid: A catalogue of acoustic features that may cue the
distinction. Haskins Laboratories Status Report on Speech Research.
Lisker, L., & Abramson, A. S. (1964). Cross-language study of voicing in initial stops. Word,
20, 384-422.
Lisker, L., Liberman, A. M., Erikson, D. M., Dechovitz, D., & Mandler, R. (1977). On pushing
the voice-onset-time (VOT) boundary about. Language and Speech, 20(3), 209-216.
Magen, H. (1997). The extent of vowel to vowel coarticulation in English. Journal of
Phonetics, 25, 187-205.
Mann, V. A., & Repp, B. H. (1980). Influence of vocalic context on perception of the /sh/-/s/
distinction. Perception and Psychophysics, 28(3), 213-228.
Maye, J., & Gerken, L. (2000). Learning phoneme categories without minimal pairs. Paper
presented at the 24th Annual Boston University Conference on Language Development.
Maye, J., Weiss, D. J., & Aslin, R. N. (2008). Statistical phonetic learning in infants:
Facilitation and feature generalization. Developmental Science, 11(1), 122-134.
Maye, J., Werker, J. F., & Gerken, L. (2002). Infant sensitivity to distributional information
can affect phonetic discrimination. Cognition, 82, B101-B111.
Mayo, C., & Turk, A. (2004). Adult-child differences in acoustic cue weighting are influenced
by segmental context: Children are not always perceptually biased toward transitions.
Journal of the Acoustical Society of America, 115(6), 3184-3194.

McMurray, B. (2004). Within-category variation is used in spoken word recognition:
Temporal integration at two time scales. Unpublished doctoral thesis, University of
Rochester.
McMurray, B., Tanenhaus, M. K., & Aslin, R. N. (2002). Gradient effects of within-category
phonetic variation on lexical access. Cognition, 86, B33-B42.
Mehler, J., Segui, J., & Frauenfelder, U. (1981). The role of the syllable in language
acquisition and perception. In J. L. T. Myers, & J. Anderson (Ed.), The cognitive
representation of speech (pp. 295-305). Amsterdam: North-Holland.
Miller, J. L., & Volaitis, L. E. (1989). Effect of speaking rate on the perceptual structure of a
phonetic category. Perception and Psychophysics, 46(6), 505-512.
Mitterer, H., & Blomert, L. (2003). Coping with phonological assimilation in speech
perception: Evidence for early compensation. Perception and Psychophysics, 56(6), 956-969.
Mitterer, H., Csepe, V., & Blomert, L. (2006). The role of perceptual integration in the
recognition of assimilated word forms. The Quarterly Journal of Experimental Psychology,
59(8), 1395-1424.
Mitterer, H., Csepe, V., Honbolygo, F., & Blomert, L. (2006). The recognition of
phonologically assimilated words does not depend on specific language experience.
Cognitive Science, 30(3), 451-471.
Moon, S. J., & Lindblom, B. (1994). Interaction between Duration, Context, and Speaking
Style in English Stressed Vowels. Journal of the Acoustical Society of America, 96(1), 40-55.
Newman, S. R., Clouse, S. A., & Burnham, J. L. (2001). The perceptual consequences of
within-talker variability in fricative production. Journal of the Acoustical Society of America,
109(3), 1181-1196.
Nittrouer, S., & Miller, M. E. (1997). Predicting developmental shift in perceptual weighting
schemes. Journal of the Acoustical Society of America, 101, 2253-2266.
Norris, D. (2006). The Bayesian reader: Explaining word recognition as an optimal Bayesian
decision process. Psychological Review, 113(2), 327-357.
Norris, D., & Cutler, A. (1988). The relative accessibility of phonemes and syllables.
Perception and Psychophysics, 43, 541-550.
Norris, D., & McQueen, J. M. (2008). Shortlist B: A Bayesian model of continuous speech
recognition. Psychological Review, 115(2), 357-395.
Ohman, S. E. G. (1966). Coarticulation in VCV utterances: Spectrographic measurements.
Journal of the Acoustical Society of America, 39(1), 151-168.
Pallier, C., Sebastian-Galles, N., Felguera, T., Christophe, A., & Mehler, J. (1993).
Attentional allocation within the syllabic structure of spoken words. Journal of Memory and
Language, 32(3), 373-389.

Perkell, J. S., Zandipour, M., Matthies, M. L., & Lane, H. (2002). Economy of effort in
different speaking conditions. I. A preliminary study of intersubject differences and modeling
issues. Journal of the Acoustical Society of America, 112(4), 1627-1641.
Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of the vowels.
Journal of the Acoustical Society of America, 24(2), 175-184.
Pierrehumbert, J. B. (2003). Probabilistic Phonology: Discrimination and Robustness. In R.
Bod, J. Haye & S. Jannedy (Eds.), Probabilistic Linguistics (pp. 127-228). Cambridge MA:
MIT Press.
Pisoni, D. B., & Tash, J. (1974). Reaction times to comparisons within and across category.
Perception and Psychophysics, 15(2), 285-290.
Pitt, M., & Samuel, A. (1990). Attentional allocation during speech perception: How fine is
the focus? Journal of Memory and Language, 29(5), 611-632.
Repp, B. H. (1982). Phonetic trading relations and context effects: New experimental
evidence for a speech mode of perception. Psychological Bulletin, 92(1), 81-110.
Salverda, A. P., Dahan, D., & McQueen, J. M. (2003). The role of prosodic boundaries in the
resolution of lexical embedding in speech comprehension. Cognition, 90, 51-89.
Salverda, A. P., Dahan, D., Tanenhaus, M. K., Crosswhite, K., Masharov, M., &
McDonough, J. (2007). Effects of prosodically-modulated sub-phonetic variation on lexical
competition. Cognition, 105, 466-476.
Stevens, K. N., & Keyser, S. J. (1989). Primary features and their enhancement in
consonants. Language, 65(1), 81-106.
Stevens, K. N., & Klatt, D. (1974). Role of formant transitions in the voiced-voiceless
distinction for stops. Journal of the Acoustical Society of America, 55(3), 653-659.
Summerfield, Q. (1975). Aerodynamics versus mechanics in the control of voicing onset in
consonant-vowel syllables. In Speech Perception (No. 4). Belfast: Queen's University,
Department of Psychology.
Summerfield, Q. (1981). Articulatory rate and perceptual constancy in phonetic perception.
Journal of Experimental Psychology: Human Perception and Performance, 7(5), 1074-1095.
Summerfield, Q., & Haggard, M. (1977). On the dissociation of spectral and temporal cues
to the voicing distinction in initial stop consonants. Journal of the Acoustical Society of
America, 62(2), 436-448.
Toscano, J., & McMurray, B. (to appear). Paper presented at the Cognitive Science,
Maryland.
Todorov, E. (2004). Optimality principles in sensorimotor control (review). Nature
Neuroscience, 7(9), 907-915.

Toscano, J., & McMurray, B. (2007). Taking Statistical Learning to the Next Level: A
Computational Approach to the Acquisition of Multi-Dimensional Categories Paper
presented at the Biennial Meeting of the Society for Research in Child Development,
Boston, MA.
van Alphen, P. M., & Smits, R. (2004). Acoustical and perceptual analysis of the voicing
distinction in Dutch initial plosives: the role of prevoicing. Journal of Phonetics, 32, 455-491.
Wayland, S. C., & Miller, J. L. (1994). The influence of sentential speaking rate on the
internal structure of phonetic categories. Journal of the Acoustical Society of America, 95(5),
2694-2701.
Wichmann, F. A., & Hill, N. J. (2001). The psychometric function: I. Fitting, sampling, and
goodness of fit. Perception and Psychophysics, 63(8), 1293-1313.
Wouters, J., & Macon, M. W. (2002). Effects of prosodic factors on spectral dynamics. I.
Analysis. Journal of the Acoustical Society of America, 111(1), 417-427.

Appendix A: Visual stimuli for Experiments 1 and 2.
Experimental stimuli were in color.

beak

bees

beach

peas

peach

lake

lace

lei

rake

race

ray

peak

Appendix B: Visual stimuli for Experiment 4.
Experimental stimuli were in color.

Mabel

maple

stable

staple