Contents
Section A Gesture
1 An introduction to task dynamics
SARAH HAWKINS 9
Section B Segment
6 An introduction to feature geometry
MICHAEL BROE 149
Section C Prosody
13 An introduction to intonational phonology
D. ROBERT LADD 321
14 Downstep in Dutch: implications for a model
ROB VAN DEN BERG, CARLOS GUSSENHOVEN, AND
TONI RIETVELD 335
Comments on chapter 14 NINA GRØNNUM 359
References 424
Name index 452
Subject index 457
Contributors
Acknowledgments
Commentaries
Stephen R. Anderson
Gösta Bruce
Stephen D. Isard
Björn Lindblom
Joan Mascaró
Posters
Anne Cutler
Janet Fletcher
Sarah Hawkins
Jill House
Daniel Recasens
Jacques Terken
Ian Watson
Section A
Gesture
1
An introduction to task dynamics
SARAH HAWKINS
*I thank Thomas Baer, Catherine Browman, and Elliot Saltzman for helpful comments on
earlier versions of this paper.
several discrete, abstract tasks in this model. Speech requires a succession of
skilled movements, each of which is modeled as a number of tasks. For
example, to produce English [ʃ], the precise action of the tongue is critical,
the lips are somewhat rounded and protruded, and the vocal folds are moved
well apart to ensure voicelessness and a strong flow of air. In addition to
controlling respiratory activity, therefore, there may be at least five distinct
tasks involved in saying an isolated [ʃ]: keeping the velopharyngeal port
closed, and controlling the degree of tongue constriction and its location, the
degree of lip protrusion, the size of the lip aperture, and the size of the glottal
aperture. The same articulator may be involved in more than one task. In this
example, the jaw contributes to producing both the correct lip aperture and
the correct tongue constriction. To add to the complexity, when the [ʃ] is
spoken as part of a normal utterance, each of its tasks must be carried out
while the same articulators are finishing or starting tasks required for nearby
sounds.
The sort of complicated tasks we habitually do with little conscious effort -
reaching for an object, lifting a cup to the mouth without spilling its contents,
speaking - are difficult to explain using physiological models in which the
angles of the joints involved in the movement are controlled directly. These
models work well enough when there is only one joint involved (a "single-
degree-of-freedom" task), because such movements are arc-shaped. But even
simple tasks usually involve more than one joint. To bring a cup to the
mouth, for example, the shoulder, elbow and wrist joints are used, and the
resultant movement trajectory is not an arc, but a more-or-less straight line.
Although describing how a straight-line trajectory could be controlled
sounds like a simple problem, it turns out to be quite complicated. Task
dynamics offers one way of modeling quasi-straight lines (Saltzman and
Kelso 1987).
A further complication is the fact that in these skilled "multi-degree-of-
freedom" tasks (involving more than one joint) the movement trajectory
usually has similar characteristics no matter where it is located in space.
When reaching outwards for an object, the hand tends to move in a straight
line, regardless of where the target is located with respect to the body before
the movement begins, and hence regardless of whether the arm starts by
reaching away from the body, across the body, or straight ahead. This ability
to do the same task using quite different joint angles and muscle contractions
is a fundamental characteristic of skilled movement, and has been called
"motor equivalence" (Lashley 1930; Hebb 1949).
One important property related to motor equivalence is immediate
compensation, whereby if a particular movement is blocked, the muscles
adjust so that the movement trajectory continues without major disruption
to attain the final goal. Immediate compensation has been demonstrated in
"perturbation" experiments, in which a moving articulator - usually an
elbow, hand, lip, or jaw - is briefly tugged in an unpredictable way so that the
movement is disrupted (e.g. Folkins and Abbs 1975; Kelso et al. 1984). The
articulators involved in the movement immediately compensate for the tug
and tend to return the movement to its original trajectory by adjusting the
behavior of the untugged as well as the tugged articulators. Thus, if the jaw is
tugged downwards during an upward movement to make a [b] closure, the
lips will compensate by moving more than usual, so that the closure occurs at
about the same time in both tug and no-tug conditions. These adjustments
take place more or less immediately (15-30 msec after the perturbation
begins), suggesting an automatic type of reorganization rather than one that
is under voluntary, attentional control. Since speaking is a voluntary action,
however, we have a reflexive type of behavior within a clearly nonreflexive
organization. These reflexive yet flexible types of behavior are called
"functional reflexes."
Task dynamics addresses itself directly to explaining both the observed
quasi-straight-line movement trajectories and immediate compensation of
skilled limb movements. Both properties can be reasonably simply explained
using a model that shapes the trajectories implicitly as a function of the
underlying dynamics, using as input only the end goal and a few parameters
such as "stiffness," which are discussed below. The model automatically
takes into account the conditions at the start of the movement. The resulting
model is elegant and uses simple constructs like masses and springs.
However, it is hard to understand at first because it uses these constructs in
highly abstract ways, and requires sophisticated mathematics to translate the
general dynamical principles into movements of individual parts of the body.
Descriptions typically involve a number of technical terms which can be
difficult to understand because the concepts they denote are not those that
are most tangible when we think about movement. These terms may seem to
be mere jargon at first sight, but they have in fact been carefully chosen to
reflect the concepts they describe. In this paper, I try to explain each term
when I first use it, and I tend to use it undefined afterwards. New terms are
printed in bold face when they are explained.
hence the effector systems and terminal devices) are the same. But the
physical details of how the task is achieved differ (see e.g. Sussman,
MacNeilage, and Hanson 1973).
The observed movements are governed by the dynamics underlying a series
of discrete, abstract gestures that overlap in time and space. At its present
stage of development, the task-dynamic model does not specify the sequenc-
ing or relative timing of the gestures, although work is in progress to
incorporate such intergestural coordination into the model (Saltzman and
Munhall 1989). The periods of activation of the tract variables are thus at
present controlled by a gestural score which is either written by the
experimenter or generated by rule (Browman et al. 1986). In addition to
giving information about the timing of gestures, the gestural score also
specifies a set of dynamic parameters (such as the spatial target) that govern
the behavior of each tract variable in a gesture. Thus the gestural score
contains information on the relative timing and dynamic parameters
associated with the gestures in a given utterance; this information is the input
to the task-dynamic model proper, which determines the behavior of
individual articulators. In this volume, Browman and Goldstein assign to the
gestural score all the language-particular phonetic/phonological structure
necessary to produce the utterance.
The system derives part of its success from its assumption that continuous
movement trajectories can be analyzed in terms of discrete gestures. This
assumption is consistent with the characterization of speech as successions of
concurrently occurring tasks described above. Two important consequences
of using gestures are, first, that the basic units of speech are modeled as
movements towards targets rather than as static target positions, and second,
that we can work with abstract, essentially invariant units in a way that
produces surface variability. While the gestures may be invariant, the
movements associated with them can be affected by other gestures, and thus
vary with context. Coarticulation is thus modeled as inherent within the
speech-production process.
no "target overshoot." One consequence of using critically damped trajec-
tories is that the controlled component of each gesture is realized only as a
movement towards the target.
The assumption of constant mass and the tendency not to vary the degree
of damping mean that we only need consider how changes in the state of the
spring affect the pattern of movement of a tract variable. The rate at which
the mass moves towards the target is determined by how much the spring is
stretched, and by how stiff the spring is. The amount of stretch, called
displacement, is the difference between the current location of the mass and
the new target location. The greater the displacement, the greater the peak
velocity (maximum speed) of movement towards equilibrium, since, under
otherwise constant conditions, peak velocity is proportional to displacement.
A stiff spring will move back to its resting position faster than a less stiff
spring. Thus, changes in the spring's stiffness affect not only the duration of a
movement, but also the ratio of its peak velocity to peak displacement. This
ratio is often used nowadays in work on movement.
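These relations can be checked numerically. The sketch below is an illustration, not part of the model itself: it uses the standard closed-form solution for a critically damped mass-spring released from rest at displacement d from its target, x(t) = d(1 + ωt)e^(−ωt) with ω = √(k/m), for which peak velocity works out to dω/e. The parameter values are arbitrary.

```python
import math

def peak_velocity(displacement, k, m=1.0):
    """Peak speed of a critically damped mass-spring released from rest.

    From the closed form x(t) = d*(1 + w*t)*exp(-w*t), the speed
    |v(t)| = d*w**2 * t * exp(-w*t) is maximal at t = 1/w, giving d*w/e.
    """
    w = math.sqrt(k / m)
    return displacement * w / math.e

# Peak velocity is proportional to displacement when stiffness is constant:
assert math.isclose(peak_velocity(2.0, k=4.0), 2 * peak_velocity(1.0, k=4.0))

# The peak-velocity/displacement ratio depends only on stiffness (sqrt(k)/e),
# so stiffening the spring speeds up the whole movement:
assert math.isclose(peak_velocity(1.0, k=9.0), 3 * peak_velocity(1.0, k=1.0))
```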
Displacement in the model can be quite reasonably related to absolute
physical displacement, for example to the degree of tongue body/jaw
displacement in a vocalic opening gesture. Stiffness, in contrast, represents a
control strategy relevant to the behavior of groups of muscles, and is not
directly equatable to the physiological stiffness of individual muscles
(although Browman and Goldstein [1989] raise the possibility that there may
be a relationship between the value of the stiffness parameter and bio-
mechanical stiffness).
In phonetic terms, changes in stiffness affect the duration of articulatory
movement. Less stiffness results in slower movement towards the target. The
phonetic result depends partly on whether there are also changes in the
durations and relative timing of the gestures. For example, if there were no
change in gestural activation time when stiffness was reduced, then the slower
movement could result in gestures effectively undershooting their targets. In
general, stiffness can be changed within an utterance to affect, for example,
the degree of stress of a syllable, and it can be changed to affect the overall
rate of speech. Changes in displacement (stretch) with no change in stiffness,
on the other hand, affect the maximum speed of movement towards the
target, but not the overall duration of the movement.
To generate movement trajectories, the task-dynamic model therefore
needs to know, for each relevant task variable, the current state of the system
(the current position and velocity of the mass), the new target position, and
values for the two parameters representing, respectively, the stiffness of the
hypothetical spring associated with the task variable and the type of friction
(damping) in the system. The relationships between these various parameters
are described by the following differential equation for a damped mass-
spring system.
mẍ + bẋ + k(x − x₀) = 0
Since, in the model, the mass m has a value of 1.0, and the damping
ratio, b/(2√(mk)), is also usually set to 1.0, then once the stiffness k and
target x₀ are specified, the equation can be solved for the motion over time
of the task variable x. (Velocity and acceleration are the first and second
time derivatives, respectively, of x, and so can be calculated when the time
function for x is known.) Solving for x at each successive point in time
determines the trajectory and rate of movement of the tract variable. Any
transient perturbation of the ongoing movement, as long as it is not too
great, is immediately and automatically compensated for because the
point-attractor equation specifies the movement characteristics of the
tract variable rather than of individual articulators.
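Solving the equation at successive points in time can be sketched by direct numerical integration. The fragment below is a minimal illustration, not the model's actual implementation: the parameter values are arbitrary, and a simple semi-implicit Euler integrator stands in for whatever numerical scheme the task-dynamic model uses.

```python
import math

def simulate(x_init, x0, k, m=1.0, dt=1e-4, t_end=5.0):
    """Integrate m*x'' + b*x' + k*(x - x0) = 0 from rest, with critical
    damping imposed by setting the damping ratio to 1, i.e. b = 2*sqrt(m*k).
    Returns the trajectory of x sampled every dt."""
    b = 2.0 * math.sqrt(m * k)
    x, v = x_init, 0.0
    traj = [x]
    for _ in range(int(t_end / dt)):
        a = -(b * v + k * (x - x0)) / m   # acceleration from the equation
        v += a * dt                        # semi-implicit Euler step
        x += v * dt
        traj.append(x)
    return traj

traj = simulate(x_init=0.0, x0=1.0, k=10.0)
assert abs(traj[-1] - 1.0) < 1e-3   # the tract variable settles at the target
assert max(traj) <= 1.0 + 1e-6      # and, being critically damped, never overshoots
```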
Matrix transformations in the task-dynamic system determine how
much each component articulator contributes to the movement of the
tract variable. These equations use a set of articulator weightings which
specify the relative contributions of the component articulators of the
tract variable to a given gesture. These weightings comprise an additional
set of parameters in the task-dynamic model. They are gesture-specific,
and so are included in the gestural score.
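The effect of the weightings can be illustrated with a deliberately simplified one-dimensional sketch. The weights, names, and the minimum-weighted-effort rule below (each articulator moves in inverse proportion to its weight) are illustrative assumptions, not the model's actual matrix transformations.

```python
def distribute(delta_z, weights):
    """Split a tract-variable change delta_z among articulators.

    Minimizes the weighted sum of squared articulator movements
    sum(w_i * a_i**2) subject to sum(a_i) == delta_z; the solution moves
    each articulator in inverse proportion to its weight."""
    inv_total = sum(1.0 / w for w in weights.values())
    return {name: (1.0 / w) / inv_total * delta_z
            for name, w in weights.items()}

# Hypothetical weights: a heavily weighted jaw moves less than the lower lip.
moves = distribute(delta_z=6.0, weights={"jaw": 2.0, "lower_lip": 1.0})
assert abs(sum(moves.values()) - 6.0) < 1e-9       # contributions sum to the change
assert abs(moves["lower_lip"] - 2 * moves["jaw"]) < 1e-9   # inverse to weight
```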
The gestural score thus specifies all the gestural parameters: the
equilibrium or target position, x₀; the stiffness, k; the damping ratio that,
together with the stiffness, determines the damping factor b; and the
articulator weightings. It also specifies how successive gestures are co-
ordinated in time.
As mentioned above, the issue of how successive gestures are coordi-
nated in time is difficult and is currently being worked on (Saltzman and
Munhall 1989). A strategy that has been used is to specify the phase in one
gesture with respect to which a second gesture is coordinated. The
definition of phase for these purposes has involved the concept of phase
space - a two-dimensional space in which velocity and displacement are
the coordinates (Kelso and Tuller 1984). Phase space allows a phase to be
assigned to any kind of movement, but for movements that are essentially
sinusoidal, phase has its traditional meaning. So, for example, starting a
second gesture at 180 degrees in the sinusoidal movement cycle of a first
gesture would mean that the second one began when the first had just
completed half of its full cycle. But critically damped movements do not
lend themselves well to this kind of strategy: under most conditions, they
never reach a phase of even 90 degrees, as defined in phase space.
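The phase-space definition can be made concrete with a little trigonometry. In the sketch below (an illustration with arbitrary values, using one common sign convention), phase is the angle of the point (x, −v/ω) in the plane; this recovers the traditional phase ωt for a sinusoid x = A·cos(ωt), while for a critically damped gesture started from rest the same definition stays short of 90 degrees throughout.

```python
import math

def phase_deg(x, v, omega):
    """Phase angle in degrees from a point in phase space (displacement x
    measured from the target, velocity v), on the convention that
    x = A*cos(w*t), v = -A*w*sin(w*t) yields the ordinary phase w*t."""
    return math.degrees(math.atan2(-v / omega, x)) % 360.0

w = 2.0

# A pure sinusoid: the phase-space definition recovers traditional phase.
t = 0.5
x = math.cos(w * t)
v = -w * math.sin(w * t)
assert math.isclose(phase_deg(x, v, w), math.degrees(w * t))

# A critically damped gesture from rest, x(t) = (1 + w*t)*exp(-w*t),
# never reaches 90 degrees at any point along its trajectory:
for t in (0.01 * i for i in range(1, 1000)):
    x = (1 + w * t) * math.exp(-w * t)
    v = -w * w * t * math.exp(-w * t)
    assert phase_deg(x, v, w) < 90.0
```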
Browman and Goldstein's present solution to this problem uses the
familiar notion of phase of a sinusoidal movement, but in an unusual way.
They assume that each critically damped gesture can also be described in
terms of a cycle of an underlying undamped sinusoid. The period of this
undamped cycle is calculated from the stiffness associated with the
particular gesture; it represents the underlying natural frequency of the
gesture, whose realization is critically damped. Two gestures are coordi-
nated in time by specifying a phase in the underlying undamped cycle for
each one, and then making those (underlying) phases coincide in time. For
example, two gestures might be coordinated such that the point in one
gesture that is represented by an underlying phase of 180 degrees
coincided in time with the point represented by an underlying phase of 240
degrees in the other. This approach differs from the use of phase
relationships described above in that phase in a gesture does not depend
on the actual movement associated with the gesture, but rather on the
stiffness (i.e. underlying natural frequency) for that gesture. This ap-
proach, together with an illustration of critical damping, is described in
Browman and Goldstein (1990: 346-8).
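This phasing scheme is easy to state computationally. The sketch below uses illustrative stiffness and phase values, and the function names are my own, not the model's: each gesture's underlying undamped period is T = 2π/ω with ω = √(k/m), a phase of φ degrees corresponds to time (φ/360)·T after gesture onset, and the second gesture's onset is chosen so that the two specified phase points coincide.

```python
import math

def time_of_phase(phase_deg, k, m=1.0):
    """Time after gesture onset at which the underlying undamped cycle
    (natural frequency w = sqrt(k/m), period T = 2*pi/w) reaches phase_deg."""
    period = 2.0 * math.pi / math.sqrt(k / m)
    return (phase_deg / 360.0) * period

def onset_of_second(onset_a, phase_a, k_a, phase_b, k_b):
    """Start time for gesture B such that its phase_b point coincides in
    time with gesture A's phase_a point."""
    meeting_time = onset_a + time_of_phase(phase_a, k_a)
    return meeting_time - time_of_phase(phase_b, k_b)

# The example from the text: align 180 degrees in one gesture's underlying
# cycle with 240 degrees in the other's (hypothetical stiffnesses).
onset_b = onset_of_second(onset_a=0.0, phase_a=180.0, k_a=4.0,
                          phase_b=240.0, k_b=9.0)
meet_a = 0.0 + time_of_phase(180.0, 4.0)
meet_b = onset_b + time_of_phase(240.0, 9.0)
assert math.isclose(meet_a, meet_b)   # the two phase points coincide in time
```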
Coordinating gestures in terms of underlying phases means that the
timing of each gesture is specified intrinsically rather than by using an
external clock to time movements. Changing the phases specified by the
gestural score can therefore affect the number of gestures per unit time,
but not the rate at which each gesture is made. If no changes in stiffness
are introduced, then changing the phase specifications will change the
amount of overlap between gestures. In speech, this will affect the amount
of coarticulation, or coproduction as it is often called (Fowler 1980),
which can affect the style of speech (e.g. the degree of casualness) and,
indirectly, the overall rate of speech.
The discussion so far has described how movement starts - by the
gestural score feeding the task-dynamic model with information about
gestural targets, stiffnesses, and relative timing - but what causes a
movement to stop has not been mentioned. There is no direct command
given to stop the movement resulting from a gesture. The gestural score
specifies that a gesture is either activated, in which case controlled
movement towards the target is initiated, or else not activated. When a
gesture is no longer activated, the movements of the articulators involved
are governed by either of two factors: first, an articulator may participate
in a subsequent gesture; second, each articulator has its own inherent rest
position, described as a neutral attractor, and moves "passively" towards
this rest position whenever it is not involved in an "actively" controlled
tract variable that is participating in a gesture. This rest position should
not be confused with the resting or equilibrium position that is the target
of an actively controlled tract variable and is specified in the gestural
score. The inherent rest position is specific to an articulator, not a tract
variable, and is specified by standard equations in the task-dynamic
model. It may be language-specific (Saltzman and Munhall 1989) and
hence correspond to the "base of articulation" - schwa for English - in
which case it seems possible that it could also contribute towards
articulatory setting and thus be specific to a regional accent or an
individual.
A factor that has not yet been mentioned is how concurrent gestures
combine. The gestural score specifies the periods when the gestures are
activated. The task-dynamic model governs how the action of the articula-
tors is coordinated within a single gesture, making use of the articulator
weightings. When two or more gestures are concurrently active, they may
share a tract variable, or they may involve different tract variables but
affect a common articulator. In both cases, the influences of the overlap-
ping gestures are said to be blended. For blending within a shared tract
variable, the parameters associated with each gesture are combined either
by simple averaging, weighted averaging, or addition. (See Saltzman and
Munhall [1989] for more detailed discussion of gestural blending both
within and across tract variables.)
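The three blending rules for a shared tract variable can be sketched directly. The numbers below are hypothetical targets and the function is an illustration, not the model's code.

```python
def blend(targets, mode="average", weights=None):
    """Combine the target parameters of concurrently active gestures that
    share a tract variable: simple averaging, weighted averaging, or addition."""
    if mode == "add":
        return sum(targets)
    if mode == "weighted":
        return sum(w * t for w, t in zip(weights, targets)) / sum(weights)
    return sum(targets) / len(targets)   # simple average

# Two overlapping gestures with hypothetical targets for one tract variable:
assert blend([1.0, 3.0]) == 2.0                      # simple average
assert blend([1.0, 3.0], "weighted", [3, 1]) == 1.5  # pulled toward the
                                                     # more heavily weighted gesture
assert blend([1.0, 3.0], "add") == 4.0               # addition
```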
Critical damping represents a simple compromise between the faster but overshooting
underdamped case, and the slower but non-overshooting overdamped
case. Critical damping is straightforward to use because, as mentioned
above, the damping factor b is specified indirectly via the damping ratio,
b/(2√(mk)), which is constrained to equal 1.0 in a critically damped
system. But since it includes the independent stiffness parameter k, then if
the ratio is constrained, b is a function of k. This dependence of damping
on stiffness may be reasonable for movement of a single articulator, as
used for the decay to the neutral rest position of individual uncontrolled
articulators, but it seems less reasonable for modeling gestures. Moreover,
although the damping factor is crucially important, the choice of critical
damping is not much discussed and seems relatively arbitrary in that other
types of damping can achieve similar asymptotic trajectories. (Fujimura
[1990: 377-81] notes that there are completely different ways of achieving
the same type of asymptotic trajectories.) I would be interested to know
why other choices have been rejected. One question is whether the same
damping factor should be used for all gestures. Is the trajectory close to
the target necessarily the same for a stop as for a fricative, for example? I
doubt it.
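The trade-off at issue can be seen by simulating the three damping regimes side by side. The sketch below uses arbitrary parameters, a crude Euler integrator, and a simple 2% settling criterion, all my own choices rather than anything from the model: the underdamped system overshoots, while the overdamped system avoids overshoot but settles more slowly than the critically damped one.

```python
import math

def settle(zeta, k=1.0, m=1.0, dt=1e-3, t_end=30.0, tol=0.02):
    """Integrate m*x'' + b*x' + k*(x - 1) = 0 from rest at x = 0, with
    damping ratio zeta (b = 2*zeta*sqrt(m*k)).  Returns (peak, settle_time):
    the maximum excursion and the last time |x - 1| was outside tol."""
    b = 2.0 * zeta * math.sqrt(m * k)
    x, v, t = 0.0, 0.0, 0.0
    peak, settled_at = 0.0, t_end
    while t < t_end:
        a = -(b * v + k * (x - 1.0)) / m
        v += a * dt
        x += v * dt
        t += dt
        peak = max(peak, x)
        if abs(x - 1.0) >= tol:
            settled_at = t          # still outside the tolerance band
    return peak, settled_at

under = settle(0.5)   # underdamped: fast, but overshoots the target
crit = settle(1.0)    # critically damped: no overshoot
over = settle(3.0)    # overdamped: no overshoot, but slower to settle
assert under[0] > 1.01                              # visible overshoot
assert crit[0] <= 1.0 + 1e-6 and over[0] <= 1.0 + 1e-6
assert crit[1] < over[1]    # critical damping settles faster than overdamping
```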
The method of coordinating gestures by specifying relative phases poses
problems for a model based on critically damped sinusoids. As mentioned
above, a critically damped system does not lend itself to a description in
terms of phase angles. Browman and Goldstein's current solution of
mapping the critically damped trajectory against an undamped sinusoid is
a straightforward empirical solution that has the advantage of being
readily understandable. It is arguably justifiable since they are concerned
more with developing a comprehensive theory of phonology-phonetics,
for which they need an implementable model, than with developing the
task-dynamic model itself. But the introduction of an additional level of
abstractness in the model - the undamped gestural cycle - to explain the
relative timing of gestures seems to me to need independent justification if
it is to be taken seriously.
Another issue is the relationship between the control of speech move-
ments and the vegetative activity of the vocal tract. The rest position
towards which an articulator moves when it is not actively controlled in
speech is unlikely to be the same as the rest position during quiet
breathing, for example. Similarly, widening of the glottis to resume
normal breathing after phonation is an active gesture, not a gradual
movement towards a rest position. These properties could easily be
included in an ad hoc way in the current model, but they raise questions of
how the model should account for different task-dynamic systems acting
on the same articulators - in this case the realization of volitional
linguistic intention coordinated with automatic behavior that is governed
by neural chemoreceptors and brainstem activity.
The attempt to answer questions such as how the task-dynamic model
accounts for different systems affecting the same articulators partly
involves questions about the role of the gestural score: is the gestural score
purely linguistic, and if so why; and to what extent does it (or something
like it) include linguistic intentions, as opposed to implementing those
intentions? One reason why these questions will be hard to answer is that
they include some of the most basic issues in phonology and phonetics.
The status of the neutral attractor (rest position of each articulator) is a
case in point. Since the neutral attractor is separately determined for each
articulator, it is controlled within the task-dynamic model in the current
system. But if the various neutral attractors define the base of articulation,
and if, as seems reasonable, the base of articulation is language-specific
and hence is not independent of the phonology of the language, then it
should be specified in the gestural score if Browman and Goldstein are
correct in stating (this volume) that all language-specific phonetic/phono-
logical structure is found there.
A related issue is how the model accounts for learning - the acquisition
of speech motor control. Developmental phonologists have tended to
separate issues of phonological competence from motoric skill. But the
role that Browman and Goldstein, for example, assign to the gestural
score suggests that these authors could take a very different approach to
the acquisition of phonology. The relationship between the organization
of phonology during development and in the adult is almost certainly not
simple, but to attempt to account for developmental phonology within the
task-dynamic (or articulatory phonology) model could help clarify certain
aspects of the model. It could indicate, for example, the extent to which
the gestural score can reasonably be considered to encompass phonologi-
cal primitives, and whether learned, articulator-specific skills like articula-
tory setting or base of articulation are best controlled by the gestural score
or within the task-dynamic system. Browman and Goldstein (1989) have
begun to address these issues.
Finally, there is the question of how much the model offers explanation
rather than description. The large number of variables and parameters
makes it likely that some observed movement patterns can be modelled in
more than one way. Can the same movement trajectory arise from
different parameter values, types of blending, or activation periods and
degree of gestural overlap? If this happens, how does one choose between
alternatives? At the tract variable level, should there be only one possible
way to model a given movement trajectory? In other words, if there are
Concluding remarks
Task dynamics offers a systematic, general account of the control of
skilled movement. As befits the data, it is a complicated system involving
many parameters and variables. In consequence, there is tension between
the need to explore the system itself, and the need to use it to explore
problems within (in our case) phonology and phonetics. I restrict these
final comments to the application of task dynamics to phonological and
phonetic processes of speech production, rather than to details of the
execution of skilled movement in general.
As a system in itself, I am not convinced that task dynamics will solve
traditional problems of phonetics and phonology. It is not clear, for
example, that its solutions for linguistic intentions, linguistic units, and the
invariance-variability issue will prove more satisfactory than other solu-
tions. Moreover, there are arguments for seeking a model of speech motor
control that is couched in physiological rather than dynamic terms,
and that accounts for the learning of skilled movements as well as for their
execution once they are established. Nevertheless, partly because it is
nonspecific, ambitious enough to address wide-ranging issues, and explicit
enough to allow alternative solutions to be tried within the system, task
dynamics is well worth developing because it brings these issues into focus.
The connections it draws between the basic organizing principles of skilled
movement and potential linguistic units raise especially interesting ques-
tions. It may be superseded by other models, but those future models are
likely to owe some of their properties to research within the task-dynamic
approach. Amongst the attributes I find particularly attractive are the
emphasis on speech production as a dynamic process, and the treatment
of coarticulation, rephrased as coproduction, as an inherent property of
gestural dynamics, so that changes in rate and style require relatively
simple changes in global parameter values, rather than demanding new
targets and computation of new trajectories for each new type of
utterance.
It is easier to evaluate the contribution of task dynamics in exploring
general problems of phonology and phonetics. The fact that the task-
dynamic model is used so effectively testifies to its value. Browman and
Goldstein, for example, use the model to synthesize particular speech
patterns, and then use the principles embodied in the model to draw out
the implications for the organization of phonetics and phonology. But
beyond the need to get the right general effects, it seems to me that the
details of the implementation are not very important in this approach. The
value of the model in addressing questions like those posed by Browman
and Goldstein is therefore as much in its explicitness and relative ease of
use as in its details. The point, then, at this early stage, is that it does not
really matter if particular details of the task dynamics are wrong. The
value of the task-dynamic model is that it enables a diverse set of problems
in phonology and phonetics to be studied within one framework in a way
that has not been done before. The excitement in this work is that it offers
the promise of new ways of thinking about phonetic and phonological
theory. Insofar as task dynamics allows description of diverse phenomena
in terms of general physical laws, it provides insights that are as near
as we can currently get to explanation.
2
"Targetless" schwa: an articulatory analysis
2.1 Introduction
One of the major goals for a theory of phonetic and phonological structure
is to be able to account for the (apparent) contextual variation of phonologi-
cal units in as general and simple a way as possible.* While it is always
possible to state some pattern of variation using a special "low-level" rule
that changes the specification of some unit, recent approaches have
attempted to avoid stipulating such rules, and instead propose that variation
is often the consequence of how the phonological units, properly defined, are
organized. Two types of organization have been suggested that lead to the
natural emergence of certain types of variation: one is that invariantly
specified phonetic units may overlap in time, i.e., they may be coproduced
(e.g., Fowler 1977, 1981a; Bell-Berti and Harris 1981; Liberman and Matt-
ingly 1985; Browman and Goldstein 1990), so that the overall tract shape and
acoustic consequences of these coproduced units will reflect their combined
influence; a second is that a given phonetic unit may be unspecified for some
dimension(s) (e.g., Öhman 1966b; Keating 1988a), so that the apparent
variation along that dimension is due to continuous trajectories between
neighboring units' specifications for that dimension.
A particularly interesting case of contextual variation involves reduced
(schwa) vowels in English. Investigations have shown that these vowels are
particularly malleable: they take on the acoustic (Fowler 1981a) and articula-
tory (e.g., Alfonso and Baer 1982) properties of neighboring vowels. While
Fowler (1981a) has analyzed this variation as emerging from the coproduc-
tion of the reduced vowels and a neighboring stressed vowel, it might also be
*Our thanks to Ailbhe Ní Chasaide, Carol Fowler, and Doug Whalen for criticizing versions of
this paper. This work was supported by NSF grant BNS 8820099 and NIH grants HD-01994
and NS-13617 to Haskins Laboratories.
2 Catherine P. Browman and Louis Goldstein
the case that schwa is completely unspecified for tongue position. This would
be consistent with analyses of formant trajectories for medial schwa in
trisyllabic sequences (Magen 1989) that have shown that F2 moves (roughly
continuously) from a value dominated by the preceding vowel (at onset) to
one dominated by the following vowel (at offset). Such an analysis would
also be consistent with the phonological analysis of schwa in French
(Anderson 1982) as an empty nucleus slot. It is possible (although this is not
Anderson's analysis), that the empty nucleus is never filled in by any
specification, but rather there is a specified "interval" of time between two
full vowels in which the tongue continuously moves from one vowel to
another.
The computational gestural model being developed at Haskins Laborator-
ies (e.g. Browman et al. 1986; Browman and Goldstein, 1990; Saltzman et al.
1988a) can serve as a useful vehicle for testing these (and other) hypotheses
about the phonetic/phonological structure of utterances with such reduced
schwa vowels. As we will see, it is possible to provide a simple, abstract
representation of such utterances in terms of gestures and their organization
that can yield the variable patterns of articulatory behavior and acoustic
consequences that are observed in these utterances.
The basic phonetic/phonological unit within our model is the gesture,
which involves the formation (and release) of a linguistically significant
constriction within a particular vocal-tract subsystem. Each gesture is
modeled as a dynamical system (or set of systems) that regulates the time-
varying coordination of individual articulators in performing these constric-
tion tasks (Saltzman 1986). The dimensions along which the vocal-tract goals
for constrictions can be specified are called tract variables, and are shown in
the left-hand column of figure 2.1. Oral constriction gestures are defined in
terms of pairs of these tract variables, one for constriction location, one for
constriction degree. The right-hand side of the figure shows the individual
articulatory variables whose motions contribute to the corresponding tract
variable.
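The pairing of tract variables with the articulators that move them can be represented schematically as follows. This is an illustrative sketch, not the model's exact articulator sets: the variable names follow the text (and figure 2.12's caption), but the articulator groupings are assumptions based on the examples given (e.g. the jaw contributing to both lip aperture and tongue constrictions).

```python
# Schematic tract-variable inventory. Names follow the text; the
# articulator groupings are illustrative assumptions, not the model's
# exact definitions.
TRACT_VARIABLES = {
    "LP":   {"desc": "lip protrusion",                   "articulators": ["upper lip", "lower lip"]},
    "LA":   {"desc": "lip aperture",                     "articulators": ["upper lip", "lower lip", "jaw"]},
    "TTCL": {"desc": "tongue-tip constriction location", "articulators": ["tongue tip", "tongue body", "jaw"]},
    "TTCD": {"desc": "tongue-tip constriction degree",   "articulators": ["tongue tip", "tongue body", "jaw"]},
    "TBCL": {"desc": "tongue-body constriction location","articulators": ["tongue body", "jaw"]},
    "TBCD": {"desc": "tongue-body constriction degree",  "articulators": ["tongue body", "jaw"]},
    "VEL":  {"desc": "velic aperture",                   "articulators": ["velum"]},
    "GLO":  {"desc": "glottal aperture",                 "articulators": ["glottis"]},
}

# Oral constriction gestures pair a location variable with a degree variable:
ORAL_GESTURE_PAIRS = [("TTCL", "TTCD"), ("TBCL", "TBCD")]
```

Note how the same articulator (here the jaw) appears under several tract variables; this is what allows one gesture's activity to perturb an uncontrolled tract variable, as discussed below.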
The computational system sketched in figure 2.2 (Browman and Gold-
stein, 1990; Saltzman et al. 1988a) provides a representation for arbitrary
(English) input utterances in terms of such gestural units and their organiza-
tion over time, called the gestural score. The layout of the gestural score is
based on the principles of intergestural phasing (Browman and Goldstein
1990) specified in the linguistic gestural model. The gestural score is input to
the task-dynamic model (Saltzman 1986; Saltzman and Kelso 1987), which
calculates the patterns of articulator motion that result from the set of active
gestural units. The articulatory movements produced by the task-dynamic
model are then input to an articulatory synthesizer (Rubin, Baer, and
Mermelstein 1981) to calculate an output speech waveform.
[Figure 2.1: tract variables (left-hand column) and the articulators whose motions contribute to each (upper lip, lower lip, velum, glottis). Figure 2.2: the computational system, comprising the linguistic gestural model, the task-dynamic model, and the articulatory synthesizer.]
The operation of this gestural specification can be seen in figure 2.3, which shows the gestural score for the
utterance /pam/. Here, the shaded boxes indicate the gestures, and are
superimposed on the tract-variable time functions produced when the
gestural score is input to the task-dynamic model. The horizontal dimension
of the shaded boxes indicates the intervals of time during which each of the
gestural units is active, while the height of the boxes corresponds to the
"target" or equilibrium position parameter of a given gesture's dynamical
control regime. See Hawkins (this volume) for a more complete description
of the model and its parameters.
Note that during the activation interval of the initial bilabial closure
gesture, Lip Aperture (LA - vertical distance between the two lips) gradually
decreases, until it approaches the regime's target. However, even after the
regime is turned off, LA shows changes over time. Such "passive" tract-
variable changes result from two sources: (1) the participation of one of the
(uncontrolled) tract variable's articulators in some other tract variable which
is under active gestural control, and (2) an articulator-specific "neutral" or
"rest" regime, that takes control of any articulator which is not currently
active in any gesture. For example, in the LA case shown here, the jaw
contributes to the Tongue-Body constriction degree (TBCD) gesture (for the
vowel) by lowering, and this has the side effect of increasing LA. In addition,
the upper and lower lips are not involved in any active gesture, and so move
towards their neutral positions with respect to the upper and lower teeth,
thus further contributing to an increase in LA. Thus, the geometric structure
of the model itself (together with the set of articulator-neutral values)
predicts a specific, well-behaved time function for a given tract variable, even
[Figure 2.3 panels, top to bottom: velic aperture, tongue-body constriction degree, lip aperture, glottal aperture; input string /paam/]
Figure 2.3 Gestural score and generated motion variables for /pam/. The input is specified in
ARPAbet, so /pam/ = ARPAbet input string /paam/. Within each panel, the height of the box
indicates degree of opening (aperture) of the relevant constriction: the higher the curve (or box)
the greater the amount of opening
activation intervals, neither target value would be achieved; rather, the value
of the tract variable at the end would be the average of their targets.
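The behavior just described, a tract variable driven toward a gesture's "target" (equilibrium position) and, under overlapping activations, toward the average of the competing targets, can be sketched as a critically damped second-order system. This is a minimal illustration, not the actual task-dynamic implementation; the parameter values (omega, time step) and the simple target-averaging blend rule are assumptions.

```python
import numpy as np

def simulate_tract_variable(z0, regimes, dt=0.001, t_end=0.5):
    """Integrate one tract variable under second-order gestural regimes.

    Each regime is (t_on, t_off, target, omega): while active, it drives
    the variable toward `target` as a critically damped point attractor.
    When regimes overlap, their targets and frequencies are simply
    averaged here, one blending rule consistent with the text.
    """
    n = int(round(t_end / dt))
    z, v = float(z0), 0.0
    traj = np.empty(n)
    for i in range(n):
        t = i * dt
        active = [(tgt, om) for (t_on, t_off, tgt, om) in regimes
                  if t_on <= t < t_off]
        if active:
            tgt = sum(a[0] for a in active) / len(active)
            om = sum(a[1] for a in active) / len(active)
            acc = -(om ** 2) * (z - tgt) - 2.0 * om * v  # critical damping
        else:
            acc = 0.0  # no "neutral"/rest regime in this sketch
        v += acc * dt        # semi-implicit Euler step
        z += v * dt
        traj[i] = z
    return traj
```

With a single active regime the variable settles at that regime's target; with two simultaneously active regimes (targets 0 and 10, say) it settles at their average, 5, as the text describes.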
In this paper, our strategy is to analyze movements of the tongue in
utterances with schwa to determine if the patterns observed provide evidence
for a specific schwa tongue target. Based on this analysis, specific hypotheses
about the gestural overlap in utterances with schwa are then tested by means
of computer simulations using the gestural model described above.
(approximately 0.33 mm); thus, movements of a single unit in one direction
and back again did not constitute extrema. Only the interval that included
the full vowels and the medial schwa was analyzed; final schwas were not
analyzed. In general, an extremum was found that coincided with each full
vowel, for each pellet dimension, while such an extremum was missing for
schwa in over half the cases. The pellet positions at these extrema were used
as the basic measurements for each vowel. In cases where a particular pellet
dimension had no extremum associated with a vowel, a reference point was
chosen that corresponded to the time of an extremum of one of the other
pellets. In general, MY was the source of these reference points for full
vowels, and RY was the source for schwa, as these were dimensions that
showed the fewest missing extrema. After the application of this algorithm,
each vowel in each utterance was categorized by the value at a single
reference point for each of the four pellet dimensions. Since points were
chosen by looking only at data from the tongue pellets themselves, these are
referred to as the "tongue" reference points.
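The extremum-finding procedure described above, in which reversals of only a single quantization unit (approximately 0.33 mm) do not count as extrema, might be sketched as follows. The exact criterion used in the study is not given in full, so the threshold logic here is an assumption.

```python
def find_extrema(x, min_excursion=2):
    """Indices of local maxima/minima in a sampled trajectory, ignoring
    reversals smaller than `min_excursion` quantization units (the text
    discounts single-unit wiggles; the exact criterion is an assumption).
    Returns a list of (index, "max" | "min") pairs."""
    extrema = []
    i_anchor = 0       # index of the current candidate turning point
    direction = 0      # +1 rising, -1 falling, 0 not yet established
    for i in range(1, len(x)):
        delta = x[i] - x[i_anchor]
        if direction == 0:
            if abs(delta) >= min_excursion:
                direction = 1 if delta > 0 else -1
                i_anchor = i
        elif direction * delta > 0:
            i_anchor = i   # still moving the same way: advance the anchor
        elif abs(delta) >= min_excursion:
            # reversal large enough to count: the anchor was an extremum
            extrema.append((i_anchor, "max" if direction > 0 else "min"))
            direction = -direction
            i_anchor = i
    return extrema
```

On a trajectory that rises to a peak, dips by one unit, returns, and then falls away, only the true peak (and the following valley, if deep enough) is reported; the one-unit dip is ignored.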
To illustrate this procedure, figure 2.4a shows the time courses of the M, R,
and L pellets (only vertical for L) for the utterance /pipə'pipə/ with the
extrema marked with dashed lines. The acoustic waveform is displayed at the
top. For orientation, note that there are four displacement peaks marked for
LY, corresponding to the raising of the lower lip for the four bilabial-closure
gestures for the consonants. Between these peaks three valleys are marked,
corresponding to the opening of the lips for the three vowels. For MX, MY,
and RX, an extremum was found associated with each of the full vowels and
the medial schwa. For RY, a peak was found for schwa, but not for V1.
While there is a valley detected following the peak for schwa, it occurs during
the consonant closure interval, and therefore is not treated as associated with
V2. Figure 2.4b shows the same utterance with the complete set of "tongue"
reference points used to characterize each vowel. Reference points that have
been copied from other pellets (MY in both cases) are shown as solid lines.
Note that the consonant-closure interval extremum has been deleted.
Figure 2.5 shows the same displays for the utterance /pipə'papə/. Note
that, in (a), extrema are missing for schwa for MX, MY, and RX. This is
typical of cases in which there is a large pellet displacement between V1 and
V2. The trajectory associated with such a displacement moves from V1 to V2,
with no intervening extremum (or even, in some cases, no "flattening" of the
curve).
As can be seen in figures 2.4 and 2.5, the reference points during the schwa
tend to be relatively late in its acoustic duration. As we will be evaluating the
relative contributions of V1 and V2 in determining the pellet positions for
schwa, we decided also to use a reference point earlier in the schwa. To
obtain such a point, we used the valley associated with the lower lip for the
2 Catherine P. Browman and Louis Goldstein
Figure 2.4 Pellet time traces for /pipə'pipə/. The higher the trace, the higher (vertical) or more
fronted (horizontal) the corresponding movement. (a) Position extrema indicated by dashed
lines. (b) "Tongue" reference points indicated by dashed and solid lines (for Middle and Rear
pellets)
Figure 2.5 Pellet time traces for /pipə'papə/. The higher the trace, the higher (vertical) or more
fronted (horizontal) the corresponding movement. (a) Position extrema indicated by dashed
lines. (b) "Tongue" reference points indicated by dashed and solid lines (for Middle and Rear
pellets)
schwa - that is, approximately the point at which the lip opening is maximal.
This point, called the "lip" reference, typically occurs earlier in the (acoustic)
vowel duration than the "tongue" reference point, as can be seen in figures
2.4 and 2.5. Another advantage of the "lip" reference point is that all tongue
pellets are measured at the same moment in time. Choosing points at
different times for different dimensions might result in an apparent differen-
tial influence of VI and V2 across dimensions. Two different reference points
were established only for the schwa, and not for the full vowels. That is, since
the full vowels provided possible environmental influences on the schwa, the
measure of that influence needed to be constant for comparisons of the "lip"
and "tongue" schwa points. Therefore, in analyses to follow, when "lip" and
"tongue" reference points are compared, these points differ only for the
schwa. In all cases, full vowel reference points are those determined using the
tongue extremum algorithm described above.
2.2.1 Results
Figure 2.6 shows the positions of the M (on the right) and R (on the left)
pellets for the full vowels plotted in the mid-sagittal plane such that the
speaker is assumed to be facing to the right. The ten points for a given vowel
are enclosed in an ellipse indicating their principal components (two stan-
dard deviations along each axis). The tongue shapes implied by these pellet
positions are consistent with cinefluorographic data for English vowels (e.g.,
Perkell 1969; Harshman, Ladefoged, and Goldstein, 1977; Nearey 1980). For
example, /i/ is known to involve a shape in which the front of the tongue is
bunched forward and up towards the hard palate, compared, for example, to
/e/, which has a relatively unconstricted shape. This fronting can be seen in
both pellets. In fact, over all vowels, the horizontal components of the
motion of the two pellets are highly correlated (r = 0.939 in the full vowel
data, between RX and MX over the twenty-five utterances). The raising for
/i/ can be seen in M (on the right), but not in R, for which /i/ is low - lower,
for example, than /a/. The low position of the back of the tongue dorsum for
/i/ can, in fact, be seen in mid-sagittal cinefluorographic data. Superimposed
tongue surfaces for different English vowels (e.g. Ladefoged 1982) reveal that
the curves for /i/ and /a/ cross somewhere in the upper pharyngeal region, so
that in front of this point, /i/ is higher than /a/, while behind this point, /a/ is
higher. This suggests that the R pellet in the current experiment is far enough
back to be behind this cross-over point. /u/ involves raising of the rear of the
tongue dorsum (toward the soft palate), which is here reflected in the raising
of both the R and M pellets. In general, the vertical components of the two
pellets are uncorrelated across the set of vowels as a whole (r = 0.020),
Figure 2.6 Pellet positions for full vowels, displayed in mid-sagittal plane with head facing to the
right: Middle pellets on the right, Rear pellets on the left. The ellipses indicate two standard
deviations along axes determined by principal-component analysis. Symbols I = IPA /i/, U =
/u/, E = /e/, X = /A/, and A = /a/. Units are X-ray units (= 0.33 mm)
Figure 2.7 Pellet positions for schwa at "tongue" reference points, displayed in right-facing
mid-sagittal plane as in figure 2.6. The ellipses are from the full vowels (figure 2.6), for
comparison. Symbols I = IPA /i/, U = /u/, E = /e/, X = /A/, and A = /a/. Units are X-ray
units ( = 0.33 mm). (a) Schwa pellet positions labeled by the identity of the following vowel (V2).
(b) Schwa pellet positions labeled by the identity of the preceding vowel (V1)
grand mean for both the M and R pellets. This pattern of distribution of
schwa points is exactly what would be expected if there were no independent
target for schwa but rather a continuous tongue trajectory from VI to V2.
Given all possible combinations of trajectory endpoints (V1 and V2), we
would expect the mean value of a point located at (roughly) the midpoint of
Figure 2.8 Pellet positions for schwa at "lip" reference points, displayed as in figure 2.7
(including ellipses from figure 2.6). Symbols I = IPA /i/, U = /u/, E = /e/, X = /A/, and A = /a/.
Units are X-ray units ( = 0.33 mm). (a) Schwa pellet positions labeled by the identity of the
following vowel (V2). (b) Schwa pellet positions labeled by the identity of the preceding vowel
(V1)
these twenty-five trajectories to have the same value as the mean of the
endpoints themselves.
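The expectation stated here, that the mean of the midpoints of all V1-to-V2 trajectories equals the grand mean of the endpoints, is a simple arithmetic fact, and can be verified directly. The pellet values below are invented, one dimension only; the argument does not depend on the numbers chosen.

```python
import itertools
import statistics

# Hypothetical single-dimension pellet values for the five full vowels
# (invented numbers, not the study's data):
vowel_pos = {"i": 280.0, "e": 260.0, "A": 245.0, "a": 230.0, "u": 275.0}

# Midpoints of straight V1-to-V2 trajectories for all 25 combinations:
midpoints = [(v1 + v2) / 2
             for v1, v2 in itertools.product(vowel_pos.values(), repeat=2)]

# The mean midpoint equals the grand mean of the endpoints:
assert abs(statistics.mean(midpoints)
           - statistics.mean(vowel_pos.values())) < 1e-9
```

The identity holds because averaging (v1 + v2)/2 over all pairs is just (mean(V1) + mean(V2))/2, and both factors equal the grand mean.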
If it is indeed the case that the schwa can be described as a targetless point
on the continuous trajectory from V1 to V2, then we would expect that the
schwa pellet positions could be predicted from knowledge of V1 and V2
Figure 2.9 Mean pellet positions for full vowels and schwa, displayed in right-facing mid-
sagittal plane as in figure 2.6. The grand mean of all the full vowels is indicated by a circled
square. Units are X-ray units ( = 0.33 mm)
notice that, although the differences are small, there is more spread between
the schwa pellets at the "lip" point (figure 2.11) than at the "tongue" point
(figure 2.10). This indicates that the schwa pellet was more affected by V1 at
the "lip" point. There is also somewhat less cross-over for MX and RX in the
"lip" figure, indicating increased systematicity of the V1 effect.
In summary, it appears that the tongue position associated with medial
schwa cannot be treated simply as an intermediate point on a direct tongue
trajectory from V1 to V2. Instead, there is evidence that this V1-V2
trajectory is warped by an independent schwa component. The importance of
this warping can be seen, in particular, in utterances where V1 and V2 are
identical (or have identical values on a particular pellet dimension). For
example, returning to the utterance /pipə'pipə/ in figure 2.4, we can clearly
see (in MX, MY, and RX) that there is definitely movement of the tongue
away from the position for /i/ between V1 and V2. This effect is most
pronounced for /i/. For example, for MY, the prediction error for the
Figure 2.10 Relation between full vowel pellet positions and "tongue" pellet positions for
schwa. The top row displays the pellet positions for utterances with the indicated initial vowels,
averaged across five utterances (each with a different final vowel). The bottom row displays the
averaged pellet positions for utterances with the indicated final vowels. Units are X-ray units ( =
0.33 mm)
equation without a constant is worse for /pipə'pipə/ than for any other
utterance (followed closely by utterances combining /i/ and /u/; MY is very
similar for /i/ and /u/). Yet, it may be inappropriate to consider this warping
to be the result of a target specific to schwa, since, as we saw earlier, the mean
tongue position for schwa is indistinguishable from the mean position of the
tongue across all vowels. Rather the schwa seems to involve a warping of the
trajectory toward an overall average or neutral tongue position. Finally, we
saw that V1 and V2 affect schwa position differentially at two points in time.
The influence of the V1 endpoint is strong and consistent at the "lip" point,
relatively early in the schwa, while V2 influence is strong throughout. In the
next section, we propose a particular model of gestural structure for these
utterances, and show that it can account for the various patterns that we
have observed.
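The regression comparisons referred to here and reported in tables 2.1 and 2.2, predicting the schwa measurement from V1, V2, and a constant "schwa component," and comparing two-term with three-term fits by their prediction error, can be sketched as follows. The data are synthetic and the generating rule is invented; only the form of the analysis follows the text.

```python
import numpy as np

def fit_rmse(y, columns):
    """Least-squares fit of y on the given predictor columns;
    returns the root-mean-square prediction error."""
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(np.sqrt(np.mean(resid ** 2)))

# Hypothetical illustration (not the paper's data): a schwa position
# built mostly from V2 plus a constant component, with no V1 influence.
rng = np.random.default_rng(0)
v1 = rng.uniform(200.0, 300.0, 25)
v2 = rng.uniform(200.0, 300.0, 25)
schwa = 0.6 * v2 + 100.0 + rng.normal(0.0, 1.0, 25)   # synthetic rule

ones = np.ones_like(v1)
three_term = fit_rmse(schwa, [v1, v2, ones])   # V1 + V2 + constant
two_term_v2 = fit_rmse(schwa, [v2, ones])      # V2 + constant
two_term_v1 = fit_rmse(schwa, [v1, ones])      # V1 + constant
```

Under this generating rule the V2-plus-constant fit is nearly as good as the full three-term fit, while the V1-plus-constant fit is far worse, the same diagnostic pattern used in the text to argue for a V2-dominated "tongue" reference point.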
Figure 2.11 Relation between full vowel pellet positions and "lip" pellet positions for schwa.
The top row displays the pellet positions for utterances with the indicated initial vowels,
averaged across five utterances (each with a different final vowel). The bottom row displays the
averaged pellet positions for utterances with the indicated final vowels. Units are X-ray units
( = 0.33 mm)
complete temporal overlap of this gesture and the gesture for the following
vowel. The blending caused by this overlap should yield the V2 effect on
schwa, while the V1 effects should emerge as a passive consequence of the
differing initial conditions for movements out of different preceding vowels.
An example of this type of organization is shown in figure 2.12, which is
the gestural score we hypothesized for the utterance /pipə'papə/. As in figure
2.3, each box indicates the activation interval of a particular gestural control
regime, that is, an interval of time during which the behavior of the particular
tract variable is controlled by a second-order dynamical system with a fixed
"target" (equilibrium position), frequency, and damping. The height of the
box represents the tract-variable "target." Four LA closure-and-release
gestures are shown, corresponding to the four consonants. The closure-and-
release components of these gestures are shown as separate boxes, with the
closure components having the smaller target for LA, i.e., smaller interlip
distance. In addition, four tongue-body gestures are shown, one for each of
the vowels - V1, schwa, V2, schwa. Each of these gestures involves
simultaneous activation of two tongue-body tract variables, one for constric-
tion location and one for constriction degree. The control regimes for the V1
and medial schwa gestures are contiguous and nonoverlapping, whereas the
V2 gesture begins at the same point as the medial schwa and thus completely
Figure 2.12 Gestural score for /pipə'papə/. Tract variable channels displayed, from top to
bottom, are: velum, tongue-tip constriction location and constriction degree, tongue-body
constriction location and constriction degree, lip aperture, lip protrusion, and glottis. Horizon-
tal extent of each box indicates duration of gestural activation; the shaded boxes indicate
activation for schwa. For constriction-degree tract variables (VEL, TTCD, TBCD, LA, GLO),
the higher the top of the box, the greater the amount of opening (aperture). The constriction-
location tract variables (TTCL, TBCL) are defined in terms of angular position along the curved
vocal tract surface. The higher the top of the box, the greater the angle, and further back and
down (towards the pharynx) the constriction
overlaps it. In other words, during the acoustic realization of the schwa
(approximately), the schwa and V2 gestural control regimes both control the
tongue movements; the schwa relinquishes active control during the follow-
ing consonant, leaving only the V2 tongue gesture active in the next syllable.
While the postulation of an explicit schwa gesture overlapped by V2 was
motivated by the particular results of section 2.2, the general layout of
gestures in these utterances (their durations and overlap) was based on
stiffness and phasing principles embodied in the linguistic model (Browman
and Goldstein, 1990).
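A minimal representation of a gestural score of the kind shown in figure 2.12, per-tract-variable activation intervals each carrying a target parameter, might look as follows. The numerical values are invented for illustration, not taken from the tract-variable dictionary, and the timing is schematic.

```python
from dataclasses import dataclass

@dataclass
class Gesture:
    tract_var: str   # e.g. "LA", "TBCL", "TBCD", "GLO"
    t_on: float      # activation onset (msec)
    t_off: float     # activation offset (msec)
    target: float    # equilibrium position of the control regime

# Fragment of a score in the spirit of figure 2.12 (values invented):
score = [
    Gesture("LA",     0,  80,   0.0),   # bilabial closure component
    Gesture("LA",    80, 160,  10.0),   # release component (larger LA target)
    Gesture("TBCL",  60, 260,  95.0),   # V1 tongue-body gesture
    Gesture("TBCD",  60, 260,   8.0),
    Gesture("TBCL", 260, 460, 115.0),   # schwa gesture (mean of full vowels)
    Gesture("TBCD", 260, 460,  10.0),
]

def active(score, tract_var, t):
    """Gestures controlling `tract_var` at time t (overlap is allowed,
    as when the V2 gesture completely overlaps the schwa gesture)."""
    return [g for g in score
            if g.tract_var == tract_var and g.t_on <= t < g.t_off]
```

Changing only the TBCL/TBCD targets while keeping every activation interval fixed reproduces, in miniature, how the twenty-five scores described below differed from one another.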
Gestural scores for each of the twenty-five utterances were produced. The
activation intervals were identical in all cases; the scores differed only in the
TBCL and TBCD target parameters for the different vowels. Targets used
for the full vowels were those in our tract-variable dictionary. For the schwa,
the target values (for TBCL and TBCD) were calculated as the mean of the
targets for the five full vowels. The gestural scores were input to the task-dynamic model.
Figure 2.13 Gestural score for /pipə'pipə/. Generated movements (curves) are shown for the
tongue center and lower lip. The higher the curve, the higher (vertical) or more fronted
(horizontal) the corresponding movement. Boxes indicate gestural activation; the shaded boxes
indicate activation for schwa. CX is superimposed on TBCL, CY on TBCD, lower lip on LA.
Note that the boxes indicate the degree of opening and angular position of the constriction (as
described in figure 2.12), rather than the vertical and horizontal displacement of articulators, as
shown in the curves
Figure 2.14 Gestural score for /pipə'papə/. Generated movements (curves) are shown for the
tongue center and lower lip. The higher the curve, the higher (vertical) or more fronted
(horizontal) the corresponding movement. Boxes indicate gestural activation; the shaded boxes
indicate activation for schwa. Superimposition of boxes and curves as in figure 2.13
/i/ to schwa to /a/, and the target for /a/ tends to be achieved relatively late,
compared to CX. Movements corresponding to RY motions are not found in
the displacement of the tongue-body circle, but would probably be reflected
by a point on the part of the model tongue's surface that is further back than
that section lying on the arc of a circle.
The model articulator motions were analyzed in the same manner as the X-
ray data, once the time points for measurement were determined. Since for
the X-ray data we assumed that displacement extrema indicated the effective
target for the gesture, we chose the effective targets in the simulated data as
the points to measure. Thus, points during V1 and V2 were chosen that
corresponded to the point at which the vowel gestures (approximately)
reached their targets and were turned off (right-hand edges of the tongue
boxes in figures 2.13 and 2.14). For schwa, the "tongue" reference point was
chosen at the point where the schwa gesture was turned off, while the "lip"
reference was chosen at the lowest point of the lip during schwa (the same
criterion as for the X-ray data).
The distribution of the model full vowels in the mid-sagittal plane (CX x
CY) is shown in figure 2.15. Since the vowel gestures are turned off only after
they come very close to their targets, there is very little variation across the
ten tokens of each vowel. The distribution of schwa at the "tongue" reference
point is shown in figure 2.16, labeled by the identity of V2 (in a) and V1 (in b),
with the full vowel ellipses added for comparison. At this reference point,
which occurs relatively late, the vowels are clustered almost completely by V2, and
the tongue center has moved a substantial portion of the way towards the
Figure 2.15 Tongue-center (C) positions for model full vowels, displayed in mid-sagittal plane
with head facing to the right. The ellipses indicate two standard deviations along axes
determined by principal-component analysis. Symbols I = IPA /i/, U = /u/, E = /e/, X = /A/,
and A = /a/. Units are ASY units ( = 0.09 mm), that is, units in the vocal tract model, measured
with respect to the fixed structures.
following full vowel. The distribution of schwa values at the "lip" reference
point is shown in figure 2.17, labeled by the identity of V2 (in a), and of V1 (in
b). Comparing figure 2.17(a) with figure 2.16(a), we can see that there is
considerably more scatter at the "lip" point than at the later "tongue" point.
We tested whether the simulations captured the regularities of the X-ray
data by running the same set of regression analyses on the simulations as
were performed on the X-ray data. The results are shown in table 2.2, which
has the same format as the X-ray data results in table 2.1. Similar patterns
are found for the simulations as for the data. At the "tongue" reference
point, for both CX and CY the best two-term prediction involves the schwa
component (constant) and V2, and this prediction is nearly as good as that
using all three terms. Recall that this was the case for all pellet dimensions
except for MY, whose differences were attributed to differences in the time
point at which this dimension was measured. (In the simulations, CX and CY
were always measured at the same point in time.) These results can be seen
graphically in figure 2.18, where the top row of panels shows the relation
between V1 and schwa, and the bottom row shows the relation between V2
Figure 2.16 Tongue-center (C) positions for model schwa at "tongue" reference points,
displayed in right-facing mid-sagittal plane as in figure 2.15. The ellipses are from the model full
vowels (figure 2.15), for comparison. Symbols I = IPA /i/, U = /u/, E = /e/, X = /A/, and
A = /a/. Units are ASY units ( = 0.09 mm). (a) Model schwa positions labeled by the identity
of the following vowel (V2). (b) Model schwa positions labeled by the identity of the preceding
vowel (V1)
Figure 2.17 Tongue-center (C) positions for model schwa at "lip" reference points, displayed
as in figure 2.16 (including ellipses from figure 2.15). Symbols I = IPA /i/, U = /u/, E = /e/,
X = /A/, and A = /a/. Units are ASY units ( = 0.09 mm). (a) Model schwa positions labeled
by the identity of the following vowel (V2). (b) Model schwa positions labeled by the identity
of the preceding vowel (V1)
Table 2.2 Regression results of simulations
and schwa. The same systematic relation between schwa and V2 can be seen
in the bottom row as in figure 2.10 for the X-ray data, that is, no crossover.
(The lack of systematic relations between V1 and schwa in the X-ray data,
indicated by the cross-overs in the top row of figure 2.10, is captured in the
simulations in figure 2.18 by the lack of variation for the schwa in the top
row.) Thus, the simulations capture the major statistical relation between the
schwa and the surrounding full vowels at the "tongue" reference point,
although the patterns are more extreme in the simulations than in the
data.
At the earlier "lip" reference point, the simulations also capture the
patterns shown by the data. For both CX and CY, the three-term predictions
in table 2.2 show substantially less error than the best two-term prediction.
This was also the case for the data in table 2.1 (except for RX), where V1, V2,
and a schwa component (constant) all contributed to the prediction of the
value during schwa. This can also be seen in the graphs in figure 2.19, which
shows a systematic relationship with schwa for both VI and V2.
In summary, for the simulations, just as for the X-ray data, V1 contributed
to the pellet position at the "lip" reference, but not to the pellet position at
the "tongue" point, while V2 and an independent schwa component
contributed at both points. Thus, our hypothesized gestural structure accounts for
the major regularities observed in the data (although not for all aspects of the
data, such as its noisiness or differential behavior among pellets). The
gestural-control regime for V2 begins simultaneously with that for schwa and
overlaps it throughout its active interval. This accounts for the fact that V2
and schwa effects can be observed throughout the schwa, as both gestures
Figure 2.18 Relation between model full vowel tongue-center positions and tongue-center
positions at "tongue" reference point for model schwas. The top row displays the tongue-center
positions for utterances with the indicated initial vowels, averaged across five utterances (each
with a different final vowel). The bottom row displays the averaged tongue-center positions for
utterances with the indicated final vowels. Units are ASY units (= 0.09 mm)
Figure 2.19 Relation between model full vowel tongue-center positions and tongue-center
positions at "lip" reference point for model schwas. The top row displays the tongue-center
positions for utterances with the indicated initial vowels, averaged across five utterances (each
with a different final vowel). The bottom row displays the averaged tongue-center positions for
utterances with the indicated final vowels. Units are ASY units ( = 0.09 mm)
exemplified in figure 2.20, the gestural scores took the same form as in figure
2.12, except that the schwa tongue-body gestures were removed. Thus, active
control of V2 began at the end of V1, and, without a schwa gesture, the
tongue trajectory moved directly from V1 to V2. During the acoustic interval
corresponding to schwa, the tongue moved along this V1-V2 trajectory. The
resulting simulations in most cases showed a good qualitative fit to the data,
and produced utterances whose medial vowels were perceived as schwas. The
problems arose in utterances in which V1 and V2 were the same (particularly
when they were high vowels). Figure 2.20 portrays the simulation for
/pipə'pipə/: the motion variables generated can be compared with the data in
figure 2.4. The "dip" between V1 and V2 was not produced in the simulation,
and, in addition, the medial vowel sounded like /i/ rather than schwa. This
organization does not, then, seem possible for utterances where both V1 and
V2 are high vowels.
We investigated the worst utterance (/pipə'pipə/) from the above set of
Figure 2.20 Gestural score plus generated movements for /pipə'pipə/, with no activations for
schwa. The acoustic interval between the second and third bilabial gestures is perceived as an /i/.
Generated movements (curves) are shown for the tongue center and lower lip. The higher the
curve, the higher (vertical) or more fronted (horizontal) the corresponding movement. Super-
imposition of boxes and curves as in figure 2.13
Figure 2.21 The same gestural score for /pipə'pipə/ as in figure 2.20, but with the second and
third bilabial gestures closer together than in figure 2.20. The acoustic interval between the
second and third bilabial gestures is perceived as a schwa. Generated movements (curves) are
shown for the tongue center and lower lip. The higher the curve, the higher (vertical) or more
fronted (horizontal) the corresponding movement. Superimposition of boxes and curves as in
figure 2.13
displayed in figure 2.22, showed that the previous problem with /pipə'pipə/
was solved, since during the unspecified interval between the two full vowels,
the tongue-body lowered (from /i/ position) and produced a perceptible
schwa. Unfortunately, this "dip" between V1 and V2 was seen for all
combinations of V1 and V2, which was not the case in the X-ray data. For
example, this dip can be seen for /papə'papə/ in figure 2.23; in the X-ray
data, however, the tongue raised slightly during the schwa, rather than
lowering. (The "dip" occurred in all the simulations because the neutral
position contributing to the tongue-body movement was that of the tongue-
body articulators rather than that of the tongue-body tract variables;
consequently the dip was relative to the jaw, which, in turn, was lowering as
part of the labial release). In addition, because the onset for V2 was so late, it
would not be possible for V2 to affect the schwa at the "lip" reference point,
as was observed in the X-ray data. Thus, this hypothesis also failed to
capture important aspects of the data. The best hypothesis remains the one
tested first - where schwa has a target of sorts, but is still "colorless," in that
its target is the mean of all the vowels, and is completely overlapped by the
following vowel.
2.4 Conclusion
We have demonstrated how an explicit gestural model of phonetic structure,
embodying the possibilities of underspecification ("targetlessness") and
[Figure: waveform, tongue center horizontal (CX), tongue center vertical (CY), and lower lip vertical traces, against time (msec.)]
Figure 2.22 The same gestural score for /pipə'pipə/ as in figure 2.20, but with the onset of the second full vowel /i/ delayed. The acoustic interval between the second and third bilabial gestures is perceived as a schwa. Generated movements (curves) are shown for the tongue center and lower lip. The higher the curve, the higher (vertical) or more fronted (horizontal) the corresponding movement. Superimposition of boxes and curves as in figure 2.13
[Figure: waveform, tongue center horizontal (CX), tongue center vertical (CY), and lower lip vertical traces; time axis 200-800 msec.]
Figure 2.23 The same gestural score as in figure 2.22, except with tongue targets appropriate for the utterance /papə'papə/. The acoustic interval between the second and third bilabial gestures is perceived as a schwa. Generated movements (curves) are shown for the tongue center and lower lip. The higher the curve, the higher (vertical) or more fronted (horizontal) the corresponding movement. Superimposition of boxes and curves as in figure 2.13
Comments on Chapter 2
SARAH HAWKINS
The speaker's task is traditionally conceptualized as one of producing
successive articulatory or acoustic targets, with the transitions between them
being planned as part of the production process.* A major goal of studies of
coarticulation is then to identify the factors that allow or prevent
coarticulatory spread of features, and so influence whether or not targets are reached.
In contrast, Browman and Goldstein offer a model of phonology that is
couched in gestural terms, where gestures are abstractions rather than
movement trajectories. In their model, coarticulation is the inevitable
*The structure of this discussion is influenced by the fact that it originally formed part of a joint
commentary covering this paper and the paper by Hewlett and Shockey. Since the latter's paper
was subsequently considerably revised, mention of it has been removed and a separate
discussion prepared.
consequence of coproduction of articulatory gestures. Coarticulation is
planned only in the sense that the gestural score is planned, and traditional
notions of target modification, intertarget smoothing, and look-ahead
processes are irrelevant as explanations, although the observed properties they
are intended to explain are still, of course, of central concern.
Similarly, coarticulation is traditionally seen as a task of balancing
constraints imposed by the motoric system and the perceptual system - of
balancing ease of articulation with the listener's need for acoustic clarity.
These two opposing needs must be balanced within constraints imposed by a
third factor, the phonology of the particular language. Work on
coarticulation often tries to distinguish these three types of constraint.
For me, one of the exciting things about Browman and Goldstein's work is
that they are being so successful in linking, as opposed to separating,
motoric, perceptual, and phonological constraints. In their approach, the
motoric constraints are all accounted for by the characteristics of the task-
dynamic model. But the task-dynamic model is much more than an expres-
sion of universal biomechanical constraints. Crucially, the task-dynamic
model also organizes the coordinative structures. These are flexible,
functional groupings of articulators whose organization is not an inevitable
process of maturation, but must be learned by every child. Coordinative
structures involve universal properties and probably some language-specific
properties. Although Browman and Goldstein assign all language-specific
information to the gestural score, I suspect that the sort of things that are
hard to unlearn, like native accent and perhaps articulatory setting, may be
better modeled as part of the coordinative structures within the task
dynamics. Thus the phonological constraints reside primarily in the gestural
score, but also in its implementation in the task-dynamic model.
Browman and Goldstein are less explicitly concerned with modeling
perceptual constraints than phonological and motoric ones, but they are, of
course, concerned with what the output of their system sounds like. Hence
perceptual constraints dictate much of the organization of the gestural score.
The limits set on the temporal relationships between components of the
gestural score for any given utterance represent in part the perceptual
constraints. Variation in temporal overlap of gestures within these limits will
affect how the speech sounds. But the amount of variation possible in the
gestural score must also be governed by the properties and limits on
performance of the parameters in the task-dynamic model, for it is the task-
dynamic model that limits the rate at which each gesture can be realized. So
the perceptual system and the task-dynamic model can be regarded as in
principle imposing limits on possible choices in temporal variation, as
represented in the gestural score. (In practice, these limits are determined
from measurement of movement data.) Greater overlap will result in greater
measurable coarticulation; too little or too much overlap might sound like
some dysarthric or hearing-impaired speakers. Browman and Goldstein's
work on schwa is a good demonstration of the importance of appropriate
temporal alignment of gestures. It also demonstrates the importance to
acceptable speech production of getting the right relationships between the
gestural targets and their temporal coordination.
Thus Browman and Goldstein offer a model in which perception and
production, and universal and language-specific aspects of the phonology,
are conceptually distinguishable yet interwoven in practice. This, to my way
of thinking, is as it should be.
The crucial issue in work on coarticulation, however, is not so much to say
what constraints affect which processes, as to consider what the controlled
variables are. Browman and Goldstein model the most fundamental
controlled variables: tongue constriction, lip aperture, velar constriction, and so
on. There are likely to be others. Some, like fundamental frequency, are not
strongly associated with coarticulation but are basic to phonology and
phonetics, and some, like aerodynamic variables, are very complex.
Let us consider an example from aerodynamics. Westbury (1983) has
shown allophonic differences in voiced stops that depend on position in
utterance and that all achieve cavity enlargement to maintain voicing. The
details of what happens vary widely and depend upon the place of
articulation of the stop, and its phonetic context. For example, for initial /b/, the
larynx is lowered, the tongue root moves backwards, and the tongue dorsum
and tip both move down. For final /b/, the larynx height does not change, the
tongue root moves forward, and the dorsum and tip move slightly upwards.
In addition, the rate of cavity enlargement, and the time function, also vary
between contexts. Does it make sense to try to include these differences? If
the task-dynamic system is primarily universal, then details of the sort
Westbury has shown are likely to be in the gestural score. But to include them
would make the score very complicated. Do we want that much detail in the
phonology, and if so, how should it be included? Browman and Goldstein
have elsewhere (1990) suggested a tiered system, and if that solution is
pursued, we could lose much of the distinction between phonetics and
phonology. While I can see many advantages in losing that distinction, we
could, on the other hand, end up with a gestural score of such detail that
some of the things phonologists want to do might become undesirably
clumsy. The description of phonological alternations is a case in point.
So to incorporate these extra details, we will need to consider the structure
and function of the gestural score very carefully. This will include consider-
ation of whether the gestural score really is the phonology-phonetics, or
whether it is the interface between them. In other words, do we see in the
gestural score the phonological primitives, or their output? Browman and
Goldstein say it is the former. I believe they are right to stick to their strong
hypothesis now, even though it may need to be modified later.
Another issue that interests me in Browman and Goldstein's model is
variability. As they note, the values of schwa that they produce are much less
variable than in real speech. There are a number of ways that variability
could be introduced. One, for schwa in particular, is that its target should not
be the simple average of all the vowels in the language, as Browman and
Goldstein suggest, but rather a weighted average, with higher weighting
given to the immediately preceding speech. How long this preceding domain
might be I do not know, but its length may depend on the variety of the
preceding articulations. Since schwa is schwa basically because it is
centralized relative to its context, schwa following a lot of high articulations could
be different from schwa in the same immediate context but following a
mixture of low and high articulations.
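A toy formalization of this weighted-average idea might look as follows. This is an illustrative sketch only: the exponential decay over preceding vowel targets, the single tongue-height scale, and the function name are my assumptions, not part of Browman and Goldstein's model or of the proposal as stated.

```python
def schwa_target(recent_vowel_heights, decay=0.5):
    """Weighted mean of preceding vowel targets, with the most recent
    weighted highest, in place of a simple grand mean over the vowel
    inventory.  recent_vowel_heights[0] is the immediately preceding vowel."""
    weights = [decay ** i for i in range(len(recent_vowel_heights))]
    total = sum(w * h for w, h in zip(weights, recent_vowel_heights))
    return total / sum(weights)

# Schwa after a run of high articulations vs. after a mixed run
# (tongue-height values on an arbitrary 0-1 scale):
high_context = [0.9, 0.8, 0.9]
mixed_context = [0.9, 0.2, 0.8]
print(schwa_target(high_context))   # pulled towards the high context
print(schwa_target(mixed_context))
```

On this sketch the two schwas differ even though the immediately preceding vowel is the same, which is the behavior the weighted average is meant to capture.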
A second possibility, not specific to schwa, is to introduce errors. The
model will ultimately need a process that generates errors in order to produce
real-speech phenomena like spoonerisms. Perhaps the same type of system
could produce articulatory slop, although I think this is rather unlikely.
If the variability we are seeking for schwa is a type of articulatory slop, it
could also be produced by variability in the temporal domain. In Browman
and Goldstein's terms, the phase relations between gestures may be less
tightly tied together than at present.
A fourth possibility is that the targets in the gestural score could be less
precisely specified. Some notion of acceptable range might add the desired
variability. This idea is like Keating's (1988a) windows, except that her
windows determine an articulator trajectory, whereas Browman and
Goldstein's targets are realized via the task-dynamic model, which adds its own
characteristics.
Let me finish by saying that one of the nice things about Browman and
Goldstein's work is how much it tells us that we know already. Finding out
what we already know is something researchers usually hope to avoid. But in
this case we "know" a great number of facts of acoustics, movement, and
phonology, but we do not know how they fit together. Browman and
Goldstein's observations on intrusive schwa, for example, fit with my own on
children's speech (Hawkins 1984: 345). To provide links between disparate
observations seems to me to achieve a degree of insight that we sorely need in
this field.
Comments on Chapter 2
JOHN KINGSTON
Introduction
Models are valued more for what they predict, particularly what they predict
not to occur, than what they describe. While the capacity of Browman and
Goldstein's gestural model to describe articulatory events has been
demonstrated in a variety of papers (see Browman et al. 1984; Browman and
Goldstein 1985, 1986, 1990; Browman et al. 1986), and there is every reason
to hope that it will continue to achieve descriptive success, I am less sanguine
about its predictive potential. The foundation of my pessimism is that
gestural scores are not thus far constructed in terms of independent
principles which would motivate some patterns of gestural occurrence and
coordination, while excluding others.
"Independent principles" are either such as constrain nonspeech and
speech movement alike, or such as arise from the listener's demands on the
speaker. That such principles originate outside the narrowly construed events
of speaking themselves guards models built on them from being hamstrung
by the ad hoc peculiarities of speech movements. The scores' content is
constrained by the limited repertoire of gestures used, but because gestures'
magnitude may be reduced in casual speech, even to the point of deletion
(Browman and Goldstein 1990), the variety of gestures in actual scores is
indefinitely large. Further constraints on the interpretation of scores come
from the task dynamics, which are governed by principles that constrain
other classes of movements (see Kelso et al. 1980; Nelson 1983; Ostry, Keller,
and Parush 1983; Saltzman and Kelso 1987). The task dynamics rather than
the gestural score also specify which articulatory movements will produce a
particular gesture. The gestural score thus represents the model's articulatory
goals, while the specific paths to these goals are determined by entirely
dynamical means. Gestural coordination is not, however, constrained by the
task dynamics and so must be stipulated, and again the number of possible
patterns is indefinitely large. Despite this indefiniteness in the content and
coordination of scores, examining the articulation of schwa should be
informative about what is in a score and how the gestures are coordinated,
even if in the end Browman and Goldstein's account does not extend beyond
description.
The next two sections of this commentary examine Browman and
Goldstein's claim that English schwa has an articulation of its own. This
examination is based on an extension of their statistical analysis and leads to
a partial rejection of their claim. In the final section, the distinction between
predictive vs. descriptive models is taken up again.
Table 2.3 Variances for lip and tongue reference positions (MX, MY, RX, RY)
Regression models compared: k + V1 + V2, k + V1, k + V2, V1 + V2, k, V1, V2
between dependent variables, the absolute magnitude of the SEs for models
with different dependent variables cannot be compared. Accordingly, to
evaluate how well the various regression models fare across the pellet
positions, a measure of variance should be used that is independent of the
effect of different variances among the dependent variables, i.e. R² rather
than SE. More to the point, the R²s can be employed in significance tests of
differences between models of the same dependent variable with different
numbers of terms. The equation in (1) can be solved for R², but only if one
knows Σ(Y − Ȳ)² (solving this equation for R² shows that the values listed
by Browman and Goldstein must be the squared standard error of estimate).
This variance was obtained from Browman and Goldstein's figures, which
plot the four tongue coordinates; measurements were to the nearest division
along each axis, and their precision is thus ±1 mm for MY and RY,
±0.55 mm for RX, and ±0.625 mm for MX (measurement error in either
direction is roughly equal to half a division for each of the pellet
coordinates). The variances obtained do differ substantially for the four measures
(see table 2.3), with the variances for vertical position consistently larger than
for horizontal position at both reference points. The resulting shrunken R²s
for the various regression models at lip and tongue reference positions are
shown in tables 2.4 and 2.5 (the gaps in these tables are for regression models
not considered by the stepwise procedure). Shrunken R²s are given in these
tables because they are a better estimate of the proportion of variance
accounted for in the population from which the sample is taken when the ratio
of independent variables q to n is large, as here, and when independent
variables are selected post hoc, as in the stepwise regression. (Shrunken R²s
were calculated according to formula (3.6.4) in Cohen and Cohen (1983:
106-7), in which q was always the total number of independent variables
from which selections were made by the stepwise procedure, i.e. 3.) The
various models were tested for whether adding a term to the equation
significantly increased the variance, assuming Model I error (see Cohen and
Cohen 1983: 145-7). Comparisons were made of k + V1 + V2 with k + V1 or
k + V2 and with V1 + V2.
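The two formulas involved here, shrunken (adjusted) R² and the F test for an increment in R² between nested regression models under Model I error, can be sketched as follows. The numeric inputs are placeholders chosen only to reproduce the (2, 19) degrees of freedom seen in the comparisons below, not Browman and Goldstein's data.

```python
def shrunken_r2(r2, n, q):
    """Shrunken (adjusted) R2, as in Cohen and Cohen's formula (3.6.4);
    q is the total number of candidate independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - q - 1)

def f_increment(r2_full, r2_reduced, n, k_full, k_extra):
    """F statistic for adding k_extra predictors to reach the full model
    with k_full predictors.  Degrees of freedom: (k_extra, n - k_full - 1)."""
    df2 = n - k_full - 1
    f = ((r2_full - r2_reduced) / k_extra) / ((1 - r2_full) / df2)
    return f, (k_extra, df2)

# Placeholder comparison of a three-term with a two-term model:
print(round(shrunken_r2(0.80, 26, 3), 3))
f, df = f_increment(r2_full=0.80, r2_reduced=0.75, n=26, k_full=6, k_extra=2)
print(round(f, 3), df)   # compare f against the critical F for these df
```

The obtained F is then compared against the critical value for its degrees of freedom, exactly as in the comparisons reported in the next paragraph.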
The resulting F-statistics confirmed Browman and Goldstein's contention
that adding V1 to the k + V2 model does not increment the variance
significantly at MX, RX, or RY at the tongue reference positions (for
k + V1 + V2 vs. k + V2: MX F(2,19) = 0.637, p > 0.05; RX F(2,19) = 0, p > 0.05;
and RY F(2,19) = 0.668, p > 0.05) and also supports their observation that
both V1 and V2 increment R² substantially for MY at the tongue reference
positions (for k + V1 + V2 vs. k + V1, F(2,19) = 4.025, p < 0.05).
However, their claim that for the lip reference position, the two-term
models k + V1 or k + V2 account for substantially less variance than the
three-term model k + V1 + V2 is not supported, for any dependent variable
(for k + V1 + V2 vs. k + V1: MX F(2,19) = 2.434, p > 0.05 and MY F(2,19) =
2.651, p > 0.05, and for k + V1 + V2 vs. k + V2: RX F(2,19) = 0.722, p > 0.05
and RY F(2,19) = 2.915, p > 0.05). Comparisons of the other two-term
model, V1 + V2, with k + V1 + V2 yielded significant differences for MY
(F(2,19) = 8.837, p < 0.01) and RX (F(2,19) = 8.854, p < 0.01), but for neither
dependent variable was V1 + V2 the second-best model. At MX and RY,
the differences in amount of variance accounted for by V1 + V2 vs. k + V1
(MX) or k + V2 (RY) are very small (less than half of 1 percent in each
case), so choosing the second-best model is impossible. In any case, there
is no significant increment in the variance in the three-term model,
k + V1 + V2, with respect to the V1 + V2 two-term model at MX (F(2,19) =
2.591, p > 0.05) or RY (F(2,19) = 2.483, p > 0.05). Thus at the lip refe-
rence position, schwa does not coarticulate strongly with V2 at MX
or MY, nor does it coarticulate strongly with V1 at RX or RY. There is
evidence for an independent schwa target at MY and RX, but not MX or
RY. Use of R²s rather than SEs to evaluate the regression models has thus
weakened Browman and Goldstein's claims regarding both schwa's having
a target of its own and the extent to which it is coproduced with flanking
vowels.
gestural overlap? On the other hand, if more overlap is always observed
between the schwa and the following full vowel, why should anticipatory
coarticulation be more extensive than carry-over? And in this case, what does
greater anticipatory coarticulation indicate about the relationship between
the organization of gestures and the trochaic structure of stress feet in
English? All of these are questions that we might expect an explanatory or
predictive theory of gestural coordination to answer.
The gestural theory developed by Browman and Goldstein may have all
the pieces needed to construct a machine that will produce speech, indeed, it
is already able to produce particular speech events, but as yet there is no
general structure into which these pieces may be put which would produce
just those kinds of speech events that do occur and none of those that do not.
Browman and Goldstein's gestural theory is not incapable of incorporating
general principles which would predict just those patterns of coordination
that occur; the nature of such principles is hinted at by Kelso, Saltzman, and
Tuller's (1986a) replication of Stetson's (1951) demonstration of a shift from
a VC to CV pattern of articulatory coordination as rate increased. Kelso,
Saltzman, and Tuller suggest that the shift reflects the greater stability of CV
over VC coordination, but it could just as well be that place and perhaps
other properties of consonants are more reliably perceived in the transition
from C to V than from V to C (see Ohala 1990 and the references cited there,
as well as Kingston 1990 for a different view). If this latter explanation is
correct, then the search for the principles underlying the composition of
gestural scores must look beyond the facts of articulation, to examine the
effect the speaker is trying to convey to the listener and in turn what
articulatory liberties the listener allows the speaker (see Lindblom 1983,
Diehl and Kluender 1989, and Kingston and Diehl forthcoming for more
discussion of this point).
Comments on Chapter 2
WILLIAM BARRY
In connection with Browman and Goldstein's conclusion that schwa is
"weak but not completely targetless," I should like to suggest that they reach
it because their concept of schwa is not totally coherent with the model
within which the phenomenon "neutral vowel" is being examined. The two
"nontarget" simulations that are described represent two definitions:
1 A slot in the temporal structure which is empty with regard to vowel quality,
the vowel quality being determined completely by the preceding and
following vowel targets. This conflicts, in spirit at least, with the basic
concept of a task-dynamic system, which explicitly evokes the
physiologically based "coordinative structures" of motor control (Browman and
Goldstein 1986). A phonologically targetless schwa could still not escape the
residual dynamic forces of the articulatory muscular system, i.e. it would be
subject to the relaxation forces of that system.
2 A relaxation target. The relaxation of the tongue-height parameter in the
second simulation is an implicit recognition of the objection raised in
point 1, but it still clashes with the "coordinative" assumption of
articulatory control, which argues against one gesture being relaxed independent of
other relevant gestural vowel parameters.
for many different functions, a total physiological independence of one
functional system from another using the same muscles cannot be expected.
A critical differentiation within speech, for example, is between possible
vocalic vs. consonantal functional subsystems. It has long been postulated as
descriptively convenient and physiologically supportable that consonantal
gestures are superimposed on an underlying vocalic base (Öhman 1966a;
Perkell 1969; Hardcastle 1976). Browman and Goldstein's gestural score is
certainly in accordance with this view. A resolution of the problem within the
present discussion is not necessary, however, since the bilabial consonantal
context is maximally independent of the vocalic system and is kept constant.
3
Prosodic structure and tempo in
a sonority model of articulatory dynamics
MARY BECKMAN, JAN EDWARDS, AND JANET FLETCHER
3.1 Introduction
One of the most difficult facts about speech to model is that it unfolds in
time.* The phonological structure of an utterance can be represented in
terms of a timeless organization of categorical properties and entities -
phonemes in sequence, syllables grouped into stress feet, and the like. But a
phonetic representation must account for the realization of such structures as
physical events. It must be able to describe, and ultimately to predict, the
time course of the articulators moving and the spectrum changing.
Early studies in acoustic phonetics demonstrated a plethora of influences
on speech timing, with seemingly complex interactions (e.g. Klatt 1976). The
measured acoustic durations of segments were shown to differ widely under
variation in overall tempo, in the specification of adjacent segments, in stress
placement or accentuation, in position relative to phrase boundaries, and so
on. Moreover, the articulatory kinematics implicated in any one linguistic
specification - tempo or stress, say - showed a complicated variation across
speakers and conditions (e.g. Gay 1981). The application of a general model
of limb movement (task dynamics) shows promise of resolving this variation
by relating the durational correlates of tempo and stress to the control of
dynamic parameters such as gestural stiffness and amplitude (e.g. Kelso et al.
1985; Ostry and Munhall 1985). However, the mapping between these
3 M. Beckman, J. Edwards, and J. Fletcher
parameters and the underlying phonological specification of prosodic
structure is not yet understood.
A comparison of the articulatory dynamics associated with several different
lengthening effects suggests an approach to this mapping. This paper explores
how the general task-dynamic model can be applied to the durational
correlates of accent, of intonation-phrase boundaries, and of slower overall
speaking tempo. It contrasts the descriptions of these three different effects in a
corpus of articulatory measurements of jaw movement patterns in [pap]
sequences. We will begin by giving an overview of the task-dynamic model
before presenting the data, and conclude by describing what the data suggest
concerning the nature of timing control. We will propose that, underlying the
more general physical representation of gestural stiffness and amplitude, there
must be an abstract language-specific phonetic representation of the time
course of sonority at the segmental and various prosodic levels.
Figure 3.1 Predicted relationships among kinematic measures for sequences of gestures with
(b)-(d) differing stiffnesses, (e)-(g) differing displacements, and (h)-(j) differing intergestural
phasings
durations. In this case, the ratio of the displacement to the peak velocity
should be a good linear predictor of the observed duration (fig. 3.1d). In a
pure amplitude change as well, peak velocity should change, but here
observed displacement should also change, in constant proportion to the
velocity change (fig. 3.1f). In accordance with the constant displacement-
velocity ratio, the observed duration is predicted to be fairly constant (fig.
3.1g). Finally, in a phase change, peak velocity and displacement might
remain relatively unchanged (fig. 3.1i), but the observed duration would
change as the following gesture is phased earlier or later; it would be shorter
or longer than predicted by the displacement-velocity ratio (fig. 3.1j). If the
following gesture is phased early enough, the effective displacement might
also be measurably smaller for the same peak velocity ("truncated" tokens
in figs. 3.1i and 3.1j).
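The logic behind these predictions can be illustrated with a toy gesture modeled as a half-cycle of a cosine, a common idealization of a unit-mass mass-spring movement. The specific waveform and parameter values are my simplifying assumptions, not the task-dynamic model itself.

```python
import math

def gesture_kinematics(amplitude, stiffness):
    """Kinematics of one half-cycle of x(t) = (A/2)(1 - cos(w t)),
    with w = sqrt(stiffness) for an assumed unit mass.
    Returns (duration, displacement, peak_velocity)."""
    w = math.sqrt(stiffness)
    duration = math.pi / w             # half-period of the oscillation
    displacement = amplitude           # total excursion over the half-cycle
    peak_velocity = amplitude * w / 2  # reached at the movement midpoint
    return duration, displacement, peak_velocity

# Pure stiffness change: the ratio displacement/peak_velocity (= 2/w)
# scales in proportion to the observed duration (cf. fig. 3.1d).
for k in (25.0, 100.0):
    d, a, v = gesture_kinematics(amplitude=10.0, stiffness=k)
    print(round(d, 3), round(a / v, 3))

# Pure amplitude change: duration and the ratio stay constant,
# while displacement and peak velocity covary (cf. figs. 3.1f-g).
for amp in (5.0, 10.0):
    d, a, v = gesture_kinematics(amplitude=amp, stiffness=100.0)
    print(round(d, 3), round(a / v, 3))
```

A phasing change is the one manipulation this toy cannot show from a single gesture: it leaves the ratio unchanged while the measured duration departs from it, which is exactly the diagnostic used below.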
3.3 Methods
In our experiment, we measured the kinematics of movements into and out
of a low vowel between two labial stops. These [pap] sequences occurred
in the words pop versus poppa in two different sentence types, shown in
table 3.1. In the first type, the target word is set off as a separate intonation
phrase, and necessarily bears a nuclear accent. In the other type, the noun is
likely to be part of a longer intonation phrase with nuclear accent falling
later.
We had four subjects read these sentences at least five times in random
order at each of three self-selected tempi. We used an optoelectronic tracking
system (Kay et al. 1985) to record jaw height during these productions. We
looked at jaw height rather than, say, lower-lip height because of the jaw's
contribution both to overall openness in the vowel and to the labial closing
gesture for the adjacent stops. We defined jaw opening and closing gestures
as intervals between moments of zero velocity, as shown in figure 3.2, and we
measured their durations, displacements, and peak velocities. We also made
Figure 3.2 Sample jaw height and velocity traces showing segmentation points for vowel
opening gesture and consonant closing gesture in Poppa
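The segmentation just described can be sketched computationally: differentiate the height trace, take sign changes of the velocity as gesture boundaries, and measure duration, displacement, and peak velocity per interval. The synthetic trace, sampling rate, and function names below are my illustrative assumptions, not the authors' actual procedure or data.

```python
import math

def segment_gestures(height, dt):
    """Split a sampled height trace into gestures at moments of zero
    velocity; return (duration, displacement, peak_velocity) per gesture."""
    velocity = [(height[i + 1] - height[i]) / dt for i in range(len(height) - 1)]
    # indices where velocity changes sign = moments of zero velocity
    bounds = [0] + [i for i in range(1, len(velocity))
                    if velocity[i - 1] * velocity[i] < 0] + [len(velocity)]
    gestures = []
    for b0, b1 in zip(bounds, bounds[1:]):
        duration = (b1 - b0) * dt
        displacement = abs(height[b1] - height[b0])
        peak_velocity = max(abs(v) for v in velocity[b0:b1])
        gestures.append((duration, displacement, peak_velocity))
    return gestures

# One opening-plus-closing cycle of a synthetic jaw trace,
# sampled for 1 s at 200 Hz:
dt = 0.005
trace = [math.cos(2 * math.pi * t * dt) for t in range(201)]
for gesture in segment_gestures(trace, dt):
    print(gesture)
```

On this synthetic trace the procedure recovers two gestures, an opening (downward) and a closing (upward) movement, each with its own duration, displacement, and peak velocity, mirroring the measures reported below.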
[Figure 3.3: fundamental frequency contours (Hz), with tone labels such as H*+L, H*, H+L*, L, and H%, for (a) subject KAJ, "Pop, opposing the question strongly"; (b) subject KAJ, "Pop opposed the question strongly"; (c) subject JRE]
first that the gestures are longer in the nuclear-accented syllable in Poppa,
posing. Note also the distribution of the durational increase; at all three
tempi, it affects both the opening and closing gesture of the syllable, although
it affects the closing gesture somewhat more. In the next row we see that the
gestures are larger in the accented syllable. Again, the increase in the
kinematic measure affects both the opening and the closing gesture; both
move about 2 mm further. Finally, in the last row we see the effect of accent
on the last kinematic measure. Here, by contrast, there is no consistent
Figure 3.4 Mean durations, displacements, and peak velocities of opening and closing gestures
for nuclear-accented syllables (in Poppa, posing) vs. unaccented syllables (in Poppa posed) for
subject JRE
pattern. The opening gesture is faster in the accented syllable, but the closing
gesture is clearly not.
The overall pattern of means agrees with Summers's (1987) results for
accented and unaccented monosyllabic nonsense words. Interpreting this
pattern in terms of intragestural dynamics alone, we would be forced to
conclude that accent is not realized as a uniform change in a single
specification for the syllable as a whole. A uniform increase in the amplitude
specification for the accented syllable would be consistent with the greater
[Figure axes: observed syllable duration against predicted syllable duration (sec.) = Σ displacement/velocity (mm/[mm/sec]); accented O vs. unaccented •]
Figure 3.5 Observed syllable durations against predicted syllable durations for the accented vs.
unaccented syllables in figure 3.4
displacements of both gestures and with the greater velocity of the opening
gesture, but not with the velocity of the closing gesture. A decrease in
stiffness for the accented syllable could explain the increased durations of the
two gestures, but it must be just enough to offset the velocity increase caused
by the increased displacement of the closing gesture.
If we turn to the intergestural dynamics, however, we can explain both the
displacement and the length differences in terms of a single specification
change: a different phasing for the closing gesture relative to the opening
gesture. That is, the opening gesture is longer in the accented syllable because
its gradual approach towards the asymptotic target displacement is not
interrupted until later by the onset of the closing gesture. Consequently, its
effective displacement is larger because it is not truncated before reaching its
target value in the vowel. The closing gesture, similarly, is longer because the
measured duration includes a quasi-steady-state portion where its rapid
initial rise is blended together with the gradual fall of the opening gesture's
asymptotic tail. And its displacement is larger because it starts at a greater
distance from its targeted endpoint in the following consonant.
Figure 3.5 shows some positive evidence in favor of this interpretation.
The value plotted along the y-axis in this figure is the observed overall
duration of each syllable token, calculated by adding the durations of the
opening and closing gestures. The value along the x-axis is a relative measure
Figure 3.6 Mean durations, displacements, and peak velocities of opening and closing gestures
for phrase-final syllables (in Pop, opposing) versus nonfinal syllables (in Poppa, posing) for
subject JRE
[Figure axes: observed syllable duration against predicted syllable duration (sec.) = Σ displacement/velocity (mm/[mm/sec]); final O vs. nonfinal •]
Figure 3.7 Observed syllable durations against predicted syllable durations for the final vs.
nonfinal syllables in figure 3.6
lengthening makes a syllable primarily slower rather than bigger. That is,
phrase-final syllables are longer not because their closing gestures are phased
later, but rather because they are less stiff. In terms of the underlying
dynamics, then, we might describe final lengthening as an actual targeted
slowing down, localized to the last gesture at the edge of a phrase.
Applying the reasoning that we used to interpret figure 3.5 above, this
description predicts that the relationship between observed duration and
predicted duration should be the same for final and nonfinal closing gestures.
That is, in figure 3.7, which plots each syllable's duration against the sum of
its two displacement-velocity ratios, the phrase-final tokens should be part of
the same general trend, differing from nonfinal tokens only in lying further
towards the upper-right corner of the graph.
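The quantity on the x-axis can be written out explicitly. The sketch below uses invented numbers, not the authors' measurements; `predicted_syllable_duration` is a hypothetical helper implementing the sum of the two displacement/peak-velocity ratios.

```python
# Predicted syllable duration as the sum of the opening and closing
# gestures' displacement/peak-velocity ratios (the x-axis of figure 3.7).
def predicted_syllable_duration(opening, closing):
    """Each gesture is a (displacement_mm, peak_velocity_mm_per_sec) pair."""
    return sum(disp / vel for disp, vel in (opening, closing))

# Invented tokens: (opening gesture, closing gesture, observed duration in sec).
tokens = [
    ((11.0, 110.0), (12.0, 120.0), 0.210),  # nonfinal, normal tempo
    ((11.5, 105.0), (13.0, 100.0), 0.245),  # final, normal tempo
    ((12.0,  80.0), (13.5,  85.0), 0.380),  # final, slow tempo
]
for opening, closing, observed in tokens:
    pred = predicted_syllable_duration(opening, closing)
    print(f"predicted {pred:.3f} s, observed {observed:.3f} s")
```

Tokens lying on a common observed-versus-predicted trend can differ only in stiffness; tokens lying far off the trend require some additional mechanism.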
However, the figure does not show this predicted pattern. While the fast
tempo and normal tempo tokens are similar for the two phrasal positions,
the five slow-tempo phrase-final tokens are much longer than predicted by
their displacement-velocity ratios, making the regression curve steeper and
pulling it away from the curve for nonfinal tokens. The meaning of this
unexpected pattern becomes clear when we compare our description of final
lengthening as a local slowing down to the overall slowing down of tempo
change.
Figure 3.8 Mean syllable duration for fast, normal, and slow tempi productions of the final and
nonfinal syllables in figure 3.6
Figure 3.9 Mean durations, displacements, and peak velocities of opening and closing gestures
for fast, normal, and slow tempi productions of final and nonfinal syllables shown previously in
figure 3.8
(These are the same data as in fig. 3.6, replotted to emphasize the effect of tempo.) For the
opening gesture, shown in the left-hand column, slowing down tempo
resulted in longer movement durations and lower peak velocities, unaccom-
panied by any change in movement size. Conversely, speeding up tempo
resulted in overall shorter movement durations and higher peak velocities,
again unaccompanied by any general increase in displacement. (The small
increase in displacement for the nonfinal syllables is statistically significant,
but very small when compared to the substantial increase in velocity.) These
patterns suggest that, for JRE, the primary control parameter in tempo
variation is gestural stiffness. She slows down tempo by decreasing stiffness
to make the gestures slower and longer, whereas she speeds up tempo by
increasing stiffness to make the gestures faster and shorter. This general
pattern was true for the opening gestures of the three other subjects as well,
although it is obscured somewhat by their differing abilities to produce three
distinct rates.
The closing gestures shown in the right-hand column of figure 3.9, by
contrast, do not show the same consistent symmetry between speeding up
and slowing down. In speeding up tempo, the closing gestures pattern like the
opening gestures; at fast tempo the gesture has shorter movement durations
and higher peak velocities for both the final and nonfinal syllables. In slowing
down tempo, however, there was an increase in movement duration and a
substantial decrease in movement velocity only for syllables in nonfinal
position; phrase-final closing gestures were neither longer nor significantly
slower at slow than at normal overall tempo. Subject CDJ showed this same
asymmetry.
The asymmetry can be summarized in either of the following two ways:
subjects JRE and CDJ had little or no difference between normal and slow
tempo durations and velocities for final syllables; or, these two subjects had
little or no difference between final and nonfinal closing gesture durations
and velocities at slow tempo. It is particularly noteworthy that, despite this
lack of any difference in closing gesture duration, the contrast in overall
syllable duration was preserved, as was shown above for JRE in figure 3.8. It
is particularly telling that these two subjects had generally longer syllable
durations than did either subject KAJ or KDJ.
We interpret this lack of contrast in closing gesture duration for JRE and
CDJ as indicating some sort of lower limit on movement velocity or stiffness.
That this limit is not reflected in the overall syllable duration, on the other
hand, suggests that the subjects use some other mechanism - here probably a
later phasing of the closing gesture - to preserve the prosodic contrast in the
face of a limit on its usual dynamic specification. This different treatment of
slow-tempo final syllables would explain the steeper regression curve slope in
figure 3.7 above. Suppose that for fast and normal tempi, the nonfinal and
final syllables have the same phasing, and that the difference in observed
overall syllable duration results from the closing gestures being longer
because they are less stiff in the phrase-final position. For these two tempi,
the relationship between observed duration and predicted duration would be
the same. At slow tempo, on the other hand, the phrase-final gestures may
have reached the lower limit on gestural stiffness. Perhaps this is a physio-
logical limit on gestural speed, or perhaps the gesture cannot be slowed any
further without jeopardizing the identity of the [p] as a voiceless stop. In
order to preserve the durational correlates of the prosodic contrast, however,
the closing gesture is phased later, making the observed period of the syllable
longer relative to its predicted period, and thus preserving the prosodic
contrast in the face of this apparent physiological or segmental limit.
Our understanding of sonority owes much to earlier work on tone scaling
(Liberman and Pierrehumbert 1984; Pierrehumbert and Beckman 1988). We
see the inherent phonological sonority of a segment as something analogous
to the phonological specification of a tone, except that the intrinsic scale is
derived from the manner features of a segment, as proposed by Clements
(1990a): stops have L (Low) sonority and open vowels have H (High)
sonority. These categorical values are realized within a quantitative space
that reflects both prosodic structure and paralinguistic properties such as
overall emphasis and tempo. That is, just as prosodic effects on Fo are
specified by scaling H and L tones within an overall pitch range, prosodic
effects on phonetic sonority are specified by scaling segmental sonority values
within a sonority space.
This sonority space has two dimensions. One dimension is Silverman and
Pierrehumbert's (1990) notion of sonority: the impedance of the vocal-tract
looking forward from the glottis. We index this dimension by overall vocal
tract openness, which can be estimated by jaw height in our target [pap]
sequences. The other dimension of the sonority space is time; a vertical
increase in overall vocal-tract openness is necessarily coupled to a horizontal
increase in the temporal extent of the vertical specification. In this two-
dimensional space, prosodic constituents all can be represented as rectangles
having some value for height and width, as shown in figure 3.10b for the
words poppa and pop. The phonological properties of a constituent will help
to determine these values. For example, in the word poppa, the greater
prosodic strength of the stressed first syllable as compared to the unstressed
second syllable is realized by the larger sizes of the outer box for this syllable
and of the inner box for its bimoraic nucleus. (Figure 3.10a shows the
phonological representation for these constituents, using the moraic analysis
of heavy syllables first proposed by Hyman 1985.)
We understand the lengthening associated with accent, then, as part of an
increase in overall sonority for the accented syllable's nucleus. The larger
mean displacements of accented gestures reflect the vertical increase, and the
later phasing represents the coupled horizontal increase, as in figure 3.10c.
This sort of durational increase is analogous to an increase in local tonal
prominence for a nuclear pitch accent within the overall pitch range.
Final lengthening and slowing down tempi are fundamentally different
from this lengthening associated with accent, in that neither of these effects is
underlyingly a sonority change. Instead, both of these are specified as increases
in box width uncoupled to any change in box height; they are strictly
horizontal increases that pull the sides of a syllable away from its centre, as
shown in figure 3.10d and e.
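The box geometry described above can be summarized as a minimal data structure. This is only a sketch of figure 3.10's representation as described in the text, not an implementation by the authors; the names and scaling factors are invented.

```python
from dataclasses import dataclass

@dataclass
class SonorityBox:
    height: float  # overall vocal-tract openness
    width: float   # temporal extent

def accent(box: SonorityBox, factor: float) -> SonorityBox:
    # Accentual lengthening is a sonority increase: the vertical growth
    # is necessarily coupled to a horizontal one.
    return SonorityBox(box.height * factor, box.width * factor)

def stretch(box: SonorityBox, factor: float) -> SonorityBox:
    # Final lengthening and tempo slowing are strictly horizontal increases,
    # differing only in whether they are local to the phrase edge or global.
    return SonorityBox(box.height, box.width * factor)

nucleus = SonorityBox(height=2.0, width=1.0)  # invented units
print(accent(nucleus, 1.5), stretch(nucleus, 1.5))
```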
The two strictly horizontal effects differ from each other in their locales.
Slowing down tempo overall is a more global effect that stretches out a
Figure 3.10 (a) Prosodic representations and (b) sonority specifications for the words pop and
poppa. Effects on sonority specification of (c) accentual lengthening, (d) final lengthening, and
(e) slowing down tempo. (f) Prosodic representation and sonority representation of palm
syllable on both sides of its moraic center. It is analogous in tone scaling to a
global change of pitch range for a phrase. Final lengthening, by contrast, is
local to the phrase edge. It is analogous in tone scaling to final lowering.
In proposing that these structures in a sonority-time space mediate
between the prosodic hierarchy and the gestural dynamics, we do not mean
to imply that they represent a stage of processing in a hypothetical derivatio-
nal sequence. Rather, we understand these structures as a picture of the
rhythmic framework for interpreting the dynamics of the segmental gestures
associated to a prosodic unit. For example, in our target [pap] sequences,
accentual lengthening primarily affected the phasing of the closing gesture
into the following consonant. We interpret this pattern as an increase in the
sonority of the accented syllable's moraic nucleus. Since the following [p] is
associated to the following syllable, the larger sonority decreases its overlap
with the vowel gesture. If the accented syllable were [pam] (palm), on the
other hand, we would not predict quite so late a phasing for the [m] gesture,
since the syllable-final [m] is associated to the second mora node in the
prosodic tree, as shown in figure 3.10f.
This understanding of the sonority space also supports the interpretation
of Fo alignment patterns. For example, Steele (1987) found that the relative
position of the Fo peak within the measured acoustic duration of a nuclear-
accented syllable remains constant under durational increases due to overall
tempo change. This is just what we would expect in our model of the
sonority-time space if the nuclear accent is aligned to the sonority peak for
the syllable nucleus. We would model this lengthening as a stretching of the
syllable to both sides of the moraic center. In phrase-final position, on the
other hand, Steele found that the Fo peak comes relatively earlier in the
vowel. Again, this is just what we would predict from our representation of
final lengthening as a stretching that is local to the phrase edge. This more
abstract representation of the rhythmic underpinnings of articulatory
dynamics thus allows us to understand the alignment between the Fo pattern
and the segmental gestures.
In further work, we hope to extend this understanding to other segmental
sequences and other prosodic patterns in English. We also hope to build on
Vatikiotis-Bateson's (1988) excellent pioneering work in cross-linguistic
studies of gestural dynamics, to assess the generality of our conclusions to
analogous prosodic structures in other languages and to rhythmic structures
that do not exist in English, such as phonemic length distinctions.
3 Comments
Comments on chapter 3
OSAMU FUJIMURA
The paper by Beckman, Edwards, and Fletcher has two basic points: (1) the
sonority contour defines temporal organization; and (2) mandible height is
assumed to serve as a measure of sonority. In order to relate mandible
movement to temporal patterns, the authors propose to use the task-
dynamics model. They reason as follows. Since task dynamics, by adopting a
given system time constant ("stiffness," see below), defines a fixed relation
between articulatory movement excursion ("amplitude") and the duration of
each movement, measuring the relation between the two quantities should
test the validity of the model and reveal the role of adjusting control variables
of the model for different phonetic functions. For accented vs. unaccented
syllables, observed durations deviate from the prediction when a fixed
condition for switching from one movement to the next is assumed, while
under phrase-final situations, data conform with the prediction by assuming
a longer time constant of the system itself. (Actually, the accented syllable
does indicate some time elongation in the closing gesture as well.)
Based on such observations, the authors suggest that there are two
different mechanisms of temporal control: (1) "stiffness," which in this model
(as in Browman and Goldstein 1986, 1990) means the system time constant;
and (2) "phase," which is the timing of resetting of the system for a new
target position. This resetting is triggered by the movement across a preset
threshold position value, which is specified in terms of a percentage of the
total excursion. The choice between these available means of temporal
modulation depends on phonological functions of the control (amplitude
should not interact with duration in this linear model). This is a plausible
conclusion, and it is very interesting. There are some other observations that
can be compared with this; for example, Macchi (1988) demonstrated that
different articulators (the lower lip vs. the mandible) carry differentially
segmental and suprasegmental functions in lip-closing gestures.
I have some concern about the implication of this work with respect to the
model used. Note that the basic principle of oscillation in task dynamics most
naturally suggests that the rest position of the hypothetical spring-inertia
system is actually the neutral position of the articulatory mechanism. In a
sustained repetition of opening and closing movements for the same syllable,
for example, this would result in a periodic oscillatory motion which is
claimed to reflect the inherent nature of biological systems. In the current
model, however (as in the model proposed by Browman and Goldstein
[1986, 1990]), the rest position of the mass, which represents the asymptote of
a critically damped system, is not the articulatory-neutral position but the
target position of the terminal gesture of each (demisyllabic) movement.
Thus the target position must be respecified for each new movement. This
requires some principle that determines the point in the excursion at which
target resetting should take place. For example, if the system were a simple
undamped oscillatory system consisting of a mass and a spring, it could be
argued that one opening gesture is succeeded immediately by a closing
gesture after the completion of the opening movement (i.e. one quarter cycle
of oscillation); this model would result in a durational property absolutely
determined by the system characteristics, i.e. stiffness of the spring (since
inertia is assumed to equal unity), for any amplitude of the oscillation. In
Beckman, Edwards, and Fletcher's model, presumably because of the need to
manipulate the excursion for each syllable, a critically damped second-order
(mass-spring) system is assumed. This makes it absolutely necessary to
control the timing of resetting. However, this makes the assertion of task
dynamics - that the biological oscillatory system dictates the temporal
patterning of speech - somewhat irrelevant. Beckman, Edwards, and
Fletcher's model amounts to assuming a critically damped second-order
linear system as an impulse response of the system. This is a generally useful
mathematical approximation for implementing each demisyllabic movement
on the basis of a command. The phonetic control provides specifications of
timing and initial (or equivalently target) position, and modifies the system
time constant.
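Fujimura's point about the relation between stiffness, amplitude, and duration can be checked numerically. The sketch below (with invented parameter values and hypothetical helper names) demonstrates the property at issue: in a critically damped linear system with unit inertia, the time needed to cover any fixed fraction of the excursion depends only on the stiffness, never on the amplitude.

```python
import math

def gesture_position(t, x0, target, omega):
    """Critically damped unit-mass response from rest at x0 toward target;
    omega is the natural frequency ("stiffness")."""
    return target + (x0 - target) * (1.0 + omega * t) * math.exp(-omega * t)

def time_to_fraction(x0, target, omega, frac=0.9, dt=1e-4):
    """Time to cover `frac` of the total excursion from x0 to target."""
    total = abs(target - x0)
    t = 0.0
    while abs(gesture_position(t, x0, target, omega) - x0) < frac * total:
        t += dt
    return t

t_small = time_to_fraction(0.0, 5.0, omega=20.0)   # 5 mm excursion
t_large = time_to_fraction(0.0, 15.0, omega=20.0)  # 15 mm excursion
t_soft  = time_to_fraction(0.0, 5.0, omega=10.0)   # half the stiffness
print(t_small, t_large, t_soft)
```

The first two times come out identical, and the third essentially twice as long; this is why durational modulation in such a linear model must be attributed to stiffness or to the timing of resetting ("phase"), not to amplitude.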
The duration of each (upgoing or downgoing) mandibular movement is
actually measured as the interval between the two extrema at the end of each
(demisyllabic) movement. Since the model does not directly provide such
smooth endpoints, but rather a discontinuous switching from an incomplete
excursion towards the target position to the next movement (presumably
starting with zero velocity), there has to be some ad hoc interpretation of the
observed smooth function relative to the theoretical break-point representing
the system resetting. Avoiding this difficulty, this study measures peak
velocity, excursion, and duration for each movement. Peak velocity is
measurable relatively accurately, assuming that the measurement does not
suffer from excessive noise. The interpretation of excursion, i.e. the observed
displacement, as the distance between the initial position and the terminating
position (at the onset of next movement) is problematic (according to the
model) because the latter cannot be related to the former unless a specific
method is provided of deriving the smooth time function to be observed.
A related problem is that the estimation of duration is not accurate.
Specifically, Beckman, Edwards, and Fletcher's method of evaluating end-
points is not straightforward for two reasons. (1) Measuring the time value
for either endpoint at an extremum is inherently inaccurate, due to the nature
of extrema. Slight noise and small bumps, etc. affect the time value
considerably. In particular, an error seems to transfer a portion of the
opening duration to the next closing duration according to Beckman,
Edwards, and Fletcher's algorithm. The use of the time derivative zero-
crossing is algorithmically simple, but the inherent difficulty is not resolved.
(2) Such measured durations cannot be compared accurately with prediction
of the theory, as discussed above. Therefore, while the data may be useful on
their own merit, they cannot evaluate the validity of the model assumed.
If the aim is to use specific properties of task dynamics and determine
which of its particular specifications are useful for speech analyses, then one
should match the entire time function by curve fitting, and forget about the
end regions (and with them amplitude and duration, which depend too much
on arbitrary assumptions). In doing so, one would probably face hard
decisions about the specific damping condition of the model. More impor-
tantly, the criteria for newly identifying the rest position of the system at each
excursion would have to be examined.
The finding that phrase-final phenomena are different from accent or
utterance-speed control is in conformity with previous ideas. The commonly
used term "phrase-final (or preboundary) elongation" (Lehiste 1980) implies
qualitatively and intuitively a time-scale expansion. The value of Beckman,
Edwards, and Fletcher's work should be in the quantitative characterization
of the way this modification is done in time. The same can be said about the
conclusion that in phrase-final position, the initial and final parts of the
syllable behave differently. One interesting question is whether such an
alteration of the system constant, i.e. the time scale, is given continuously
towards the end, or uniformly for the last phonological unit, word, foot,
syllable, or demisyllable, in phrase-final position. Beckman, Edwards, and
Fletcher suggest that if it is the latter, it may be smaller than a syllable, but a
definitive conclusion awaits further studies.
My own approach is different: iceberg measurement (e.g. Fujimura 1986)
uses the consonantal gesture of the critical articulator, not the mandible, for
each demisyllable. It depends on the fast-moving portions giving rather
accurate assessment of timing. Some of the results show purely empirically
determined phrasing patterns of articulatory movements, and uniform
incremental compression/stretching of the local utterance speed over the
pertinent phrase as a whole, depending on prominence as well as phrasing
control (Fujimura 1987).
I think temporal modulation in phrasal utterances is a crucial issue for
phonology and phonetics. I hope that the authors will improve their
techniques and provide definitive data on temporal control, and at the same
time prove or refute the validity of the task-dynamics model.
4
Lenition of /h/ and glottal stop
JANET PIERREHUMBERT AND DAVID TALKIN
4.1 Introduction
In this paper we examine the effect of prosodic structure on how segments are
pronounced. The segments selected for study are /h/ and glottal stop /?/.
These segments permit us to concentrate on allophony in source characteris-
tics. Although variation in oral gestures may be more studied, source
variation is an extremely pervasive aspect of obstruent allophony. As is well
known, /t/ is aspirated syllable-initially, glottalized when syllable-final and
unreleased, and voiced throughout when flapped in an intervocalic falling
stress position; the other unvoiced stops also have aspirated and glottalized
variants. The weak voiced fricatives range phonetically from essentially
sonorant approximants to voiceless stops. The strong voiced fricatives
exhibit extensive variation in voicing, becoming completely devoiced at the
end of an intonation phrase. Studying /h/ and /?/ provides an opportunity to
investigate the structure of such source variation without the phonetic
complications related to presence of an oral closure or constriction. We hope
that techniques will be developed for studying source variation in the
presence of such complications, so that in time a fully general picture
emerges.
Extensive studies of intonation have shown that phonetic realization rules
for the tones making up the intonation pattern (that is, rules which model
what we do as we pronounce the tones) refer to many different levels of
prosodic structure. Even for the same speaker, the same tone can correspond
to many different Fo values, depending on its prosodic environment, and a
given Fo value can correspond to different tones in different prosodic
environments (see Bruce 1977; Pierrehumbert 1980; Liberman and Pierre-
humbert 1984; Pierrehumbert and Beckman 1988). This study was motivated
by informal observations that at least some aspects of segmental allophony
4 Janet Pierrehumbert and David Talkin
Figure 4.1 Wide-band spectrogram and waveform of the word hibachi produced with contras-
tive emphasis. Note the evident aspiration and the movement in F1 due to the spread glottis
during the /h/. The hand-marked segment locators and word boundaries are indicated in the
lower panel: m is the /m/ release; v marks the vowel centers; h the /h/ center; b the closure onset
of the /b/ consonant. The subject is DT
behave in much the same way. That is, we suspected that tone has no special
privilege to interact with prosody; phonetic realization rules in general can be
sensitive to prosodic structure. This point is illustrated in the spectrograms
and waveforms of figures 4.1 and 4.2. In figure 4.1 the word hibachi carries
contrastive stress and is well articulated. In figure 4.2, it is in postnuclear
position and the /h/ is extremely lenited; that is, it is produced more like a
vowel than the /h/ in figure 4.1. A similar effect of sentence stress on /h/
articulation in Swedish is reported in Gobl (1988).
Like the experiments which led to our present understanding of tonal
realization, the work reported here considers the phonetic outcome for
particular phonological elements as their position relative to local and
nonlocal prosodic features is varied. Specifically, the materials varied pos-
ition relative to the word prosody (the location of the word boundary and the
word stress) and relative to the phrasal prosody (the location of the phrase
boundary and the phrasal stress as reflected in the accentuation). Although
there is also a strong potential for intonation to affect segmental source
characteristics (since the larynx is the primary articulator for tones), this
issue is not substantially addressed in the present study because the difficul-
Figure 4.2 Wide-band spectrogram and waveform of the word hibachi in postnuclear position.
Aspiration and F1 movement during /h/ are less than in figure 4.1. The subject is DT
4.2 Background
4.2.1 /h/ and glottal stop
Both /h/ and glottal stop /?/ are produced by a laryngeal gesture. They make
no demands on the vocal-tract configuration, which is therefore determined
by the adjacent segments. They are both less sonorous than vowels, because
both involve a gesture which reduces the strength of voicing. For /h/, the
folds are abducted. /?/ is commonly thought to be produced by adduction
(pressing the folds together), as is described in Ladefoged (1982), but
examination of inverse filtering results and electroglottographic (EGG) data
raised doubts about the generality of this characterization. We suggest that a
braced configuration of the folds produces irregular voicing even when the
folds are not pressed together (see further discussion below).
frequency lead us to look for sensitivity to boundaries (Is the segment at a
boundary or not? If so, what type of boundary?) and to the strength of the
nodes above the segment.
Pronunciation rules are also sensitive to the substantive context. For
example, in both Japanese and English, downstep or catathesis applies only
when the tonal sequence contains particular tones. Similarly, /h/ has a less
vocalic pronunciation in a consonantal environment than in a vocalic one.
Such effects, widely reported in the literature on coarticulation and assimila-
tion, are not investigated here. Instead, we control the segmental context in
order to concentrate on the less well understood prosodic effects.
Although separate autosegmental tiers are phonologically independent,
there is a strong potential for phonetic interaction between tiers in the case
examined here, since both tones and laryngeal consonants make demands on
the laryngeal configuration. This potential interaction was not investigated,
since our main concern was the influence of prosodic structure on segmental
allophony. Instead, intonation was carefully controlled to facilitate the
interpretation of the acoustic signal.
4.3.2 Materials
In the materials for the experiment, the position of /h/ and /?/ relative to
word-level and phrase-level prosodic structure is varied. We lay out the full
experimental design here, although we will only have space to discuss some
subsets of the data which showed particularly striking patterns.
In the materials /h/ is exemplified word-initially and -medially, before both
vowels with main word stress and vowels with less stress:
This set of words was generated using computerized searches of several on-
line dictionaries. The segmental context of the target consonant was designed
to have a high first formant value and minimize formant motion, in order to
simplify the acoustic phonetic analysis. The presence of a nasal in the vicinity
of the target consonant is undesirable, because it can introduce zeroes which
complicate the evaluation of source characteristics. This suboptimal choice
was dictated by the scarcity of English words with medial /h/, even in the
educated vocabulary. We felt that it was critical to use real words rather than
nonsense words, in order to have accurate control over the word-level
prosody and in order to avoid exaggerated pronunciations.
Words in which the target consonant was word-initial were provided with
a /ma/ context on the left by appropriate choice of the preceding word.
Phrases such as the following were used:
Oklahoma August lima abundance figures
plasma augmentation
The position of the target words in the phrasal prosody was also manipulated.
The phrasal positions examined were (1) accented without special
focus, (2) accented and under contrast, (3) accented immediately following
an intonational phrase boundary, and (4) postnuclear. In order to maximize
the separation of Fo and F1, the intonation patterns selected to exemplify
these positions all had L tones (leading to low Fo) at the target location. This
allows the source influences on the low-frequency region to be seen more
clearly. The intonation patterns were also designed to display a level (rather
than time-varying) intonational influence on Fo, again with a view to
minimizing artifacts. The accented condition used a low-rising question
pattern (L* H H% in the transcription of Pierrehumbert [1980]):
(1) Is he an Oklahoma hogfarmer?
The vocative had a H* L pattern (that is, it had a falling intonation and was
followed by an intermediate phrase boundary rather than a full intonation
break). Non-final list items had a L* H pattern, while the final list item (which
was not examined) had a H* L L% pattern. The juncture of the H* L vocative
pattern with the L* H pattern of the first list item resulted in a low, level Fo
configuration at the target consonant. Subjects were instructed to produce
the sentences without a pause after the vocative, and succeeded in all but a
few utterances (which were eliminated from the analysis). No productions
lacked the desired intonational boundary.
In the "postnuclear" condition, the target word followed a word under
contrast:
(4) They're Alabama hogfarmers, not Oklahoma hogfarmers.
In this construction, the second occurrence of the target word was the one
analyzed.
conclusions about the laryngeal configuration from anything less. When the
analysis window is matched to the cycle in both length and phase, the results
are well behaved. In contrast, when analysis windows the length of a cycle are
applied in arbitrary phase to the cycle, extensive signal-processing artifacts
result. Therefore non-pitch-synchronous moving-window analyses are typi-
cally forced to a much longer window length in order to show well-behaved
results. The longer window lengths in turn obscure the speech events, which
can be extremely rapid.
Pitch-synchronous analysis is feasible for segments which are voiced
throughout because the point of glottal closure can be determined quite
reliably from the acoustic waveform (Talkin 1989). We expected it to be
applicable in our study since the regions of speech to be analyzed were
designed to be entirely voiced. For subject DT, our expectations were
substantially met. For subject MR, strong aspiration and glottalization in
some utterances interfered with the analysis.
Talkin's algorithm for determining epochs, or points of glottal closure,
works as follows: speech, recorded and digitized using a system with known
amplitude and phase characteristics, is amplitude- and phase-corrected and
then inverse-filtered using a matched-order linear predictor to yield a rough
approximation to the derivative of the glottal volume velocity (U'). The
points in the U' signal corresponding to the epochs have the following
relatively stable characteristics: (1) Constant polarity (negative), (2) Highest
amplitude within each cycle, (3) Rapid return to zero after the extremum, (4)
Periodically distributed in time, (5) Limited range of inter-peak intervals, and
(6) Similar size and shape to adjacent peaks. A set of peak candidates is
generated from all local maxima in the U' signal. Dynamic programming is
then used to find the subset of these candidates which globally best matches
the known characteristics of U' at the epoch. The algorithm has been
evaluated using epochs determined independently from simultaneously
recorded EGG signals and was found to be quite accurate and robust. The
only errors that have an impact on the present study occur in strongly
glottalized or aspirated segments.
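As an illustration of the dynamic-programming step, the following toy sketch picks a chain of epochs from the negative local extrema of a U'-like signal. The scoring terms (amplitude, similarity of adjacent peak sizes, a limited range of inter-peak intervals) paraphrase characteristics (1)-(6) above; they are our simplification, not Talkin's implementation.

```python
import numpy as np

def pick_epochs(u_prime, fs, f0_min=75.0, f0_max=300.0):
    """Toy DP epoch picker over negative local extrema of U' (illustrative)."""
    x = np.asarray(u_prime, dtype=float)
    # Candidates: negative local extrema (constant negative polarity,
    # highest amplitude within each cycle).
    cand = [i for i in range(1, len(x) - 1)
            if x[i] < x[i - 1] and x[i] <= x[i + 1] and x[i] < 0.0]
    if not cand:
        return []
    t_min, t_max = fs / f0_max, fs / f0_min   # limited inter-peak intervals
    amp = {i: -x[i] for i in cand}
    best = {i: amp[i] for i in cand}          # best chain score ending at i
    prev = {i: None for i in cand}
    for j_idx, j in enumerate(cand):
        for i in cand[:j_idx]:
            if t_min <= j - i <= t_max:
                # Reward amplitude; penalize dissimilar adjacent peak sizes.
                score = best[i] + amp[j] - abs(amp[j] - amp[i])
                if score > best[j]:
                    best[j], prev[j] = score, i
    # Trace back the globally best-matching subset of candidates.
    end = max(cand, key=lambda i: best[i])
    chain = []
    while end is not None:
        chain.append(end)
        end = prev[end]
    return chain[::-1]
```

On a clean pulse train this recovers every pulse; the point of the DP is that on noisy U' signals it can skip spurious local maxima that break periodicity or size similarity.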
"n
34 dB
^^Y /v Y r ^ / V X/ V u ^^
Figure 4.3 HR and RMS measured for each glottal period throughout the target intervals of the
utterances introduced in figures 4.1 and 4.2. Note that the difference of ~34 dB between the HR
in the /h/ and following vowel for the well-articulated case (top) is much greater than the ~2 dB
observed in the lenited case (bottom). The RMS values discussed in the text are based on the
(linear) RMS displayed in this figure
A Liljencrants-Fant voice source run at 12 kHz sampling frequency was used
to generate synthetic voiced-speech-like sounds. These signals were then
processed using the procedures outlined above. F1 and F0 were orthogonally
varied over the ranges observed in the natural speech tokens. F1 bandwidth
was held constant at 85 Hz while its frequency took on values of 500 Hz, 700
Hz and 800 Hz. For each of these settings the source fundamental was swept
from 75 Hz to 150 Hz during one second with the open quotient and leakage
time held constant. The bandwidths and frequencies of the higher formants
were held constant at nominal 17 cm, neutral vocal-tract values. Note that
the extremes in F1 and F0 were not simultaneously encountered in the natural
data, so that this test should yield conservative bounds on the artifactual
effects.
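The verification just described can be mimicked with a deliberately crude stand-in: a single digital resonator for F1 excited by an impulse train whose fundamental sweeps 75-150 Hz over one second. The paper used a Liljencrants-Fant source and full neutral vocal-tract values; everything below (impulse-train source, one formant only) is our simplification.

```python
import numpy as np

def resonator(x, f_hz, bw_hz, fs):
    """Second-order digital resonator approximating one formant."""
    r = np.exp(-np.pi * bw_hz / fs)
    a1, a2 = 2.0 * r * np.cos(2.0 * np.pi * f_hz / fs), -r * r
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] + a1 * (y[n - 1] if n >= 1 else 0.0) \
                    + a2 * (y[n - 2] if n >= 2 else 0.0)
    return y

fs = 12000                                 # 12 kHz sampling frequency
n = fs                                     # one second
f0 = np.linspace(75.0, 150.0, n)           # F0 sweep, 75 -> 150 Hz
phase = np.cumsum(f0) / fs                 # elapsed cycles
src = np.zeros(n)
src[np.searchsorted(phase, np.arange(1, int(phase[-1]) + 1))] = 1.0
out = resonator(src, 500.0, 85.0, fs)      # F1 = 500 Hz, bandwidth 85 Hz
eps = np.flatnonzero(src)                  # epochs = pulse positions
rms_db = [20.0 * np.log10(np.sqrt(np.mean(out[a:b] ** 2)))
          for a, b in zip(eps[:-1], eps[1:])]
excursion = max(rms_db) - min(rms_db)      # peak-to-valley RMS variation
```

The quantity of interest is `excursion`: how much per-period RMS ripples purely because harmonics sweep across the formant, with no change in source strength.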
As expected, PDEV did not vary significantly throughout the test signal.
The range of variation in HR for these test signals was less than 3 dB. The
maximum peak-to-valley excursion in RMS due to F0 harmonics crossing the
formants was 2 dB, with a change in F0 from 112 Hz to 126 Hz and F1 at 500
Hz. This is small compared to RMS variations observed in the natural speech
tokens under study.
intensity behaviour of the /ʔ/s did not support the segmentation scheme
developed for the /h/s. Durations for /ʔ/ were not estimated.
4.5 Results
After mentioning some characteristics of the two subjects' speech, we first
present results for /h/ and then make some comparisons to /ʔ/.
Figure 4.4 A schema for interpreting the relation of RMS in the /h/ to RMS in the following
vowel. Greater values of RMS correspond to more vowel-like articulations, and lesser values
correspond to more /h/-like articulations. The line y = x represents the case in which the /h/ and
the following vowel do not contrast in terms of RMS. Distance perpendicular to this line
represents the degree of contrast between the /h/ and the vowel. Distance parallel to this line
cannot be explained by gestural magnitude, but is instead attributed to a background effect on
both the /h/ and the vowel. The area below and to the right of y = x is predicted to be empty
more vocalic than the following vowel, so that the lower-right half is
expected to be empty. In the upper-left half, the distance from the diagonal
describes the degree of contrast between the /h/ and the vowel. Situations in
which both the /h/ and the vowel are more fully produced would exhibit
greater contrast, and would therefore fall further from the diagonal. Note
again that a given magnitude of contrast can correspond to many different
values for the /h/ and vowel RMS.
Figure 4.5 shows a corresponding schema for HR relations. The structure
is the same except that higher, rather than lower, x and y values correspond
to more /h/-like articulations.
In view of this discussion, RMS and HR data will be interpreted with
respect to the diagonal directions of each plot. Distance perpendicular to the
y = x line (shown as a dotted line in each plot) will be related to the strength
or magnitude of the CV gesture. Location parallel to this line, on the other
hand, is not related to the strength of the gesture, but rather to a background
effect on which the entire gesture rides. One of the issues in interpreting the
data is the linguistic source of the background effects.
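The two diagonal coordinates can be written as a simple rotation of the (RMS in /h/, RMS in vowel) point; the function and variable names below are ours, not the authors'.

```python
import math

def diagonal_coords(rms_h, rms_v):
    """Rotate an (h, vowel) point into the two diagonal directions:
    signed distance from y = x (degree of CV contrast) and position
    along y = x (background effect shared by /h/ and vowel)."""
    contrast = (rms_v - rms_h) / math.sqrt(2.0)    # perpendicular to y = x
    background = (rms_v + rms_h) / math.sqrt(2.0)  # parallel to y = x
    return contrast, background
```

For RMS plots, points with negative `contrast` (below y = x) are predicted to be empty; for HR the sign convention flips, since there the higher values are the /h/-like ones.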
Figures 4.6 and 4.7 compare the RMS relations in word-initial stressed /h/,
when accented in questions and when deaccented. The As are farther from
the y = x line than the Ds, indicating that the magnitude of the gesture is
greater when /h/ begins an accented syllable. For subject DT, the two clouds
of points can be completely separated by a line parallel to y = x. Subject MR
shows a greater range of variation in the D case, with the most carefully
articulated Ds overlapping the gestural magnitude of the As.
Figures 4.8 and 4.9 make the same comparison for word-medial /h/
preceding a weakly stressed or reduced vowel. These plots have a conspi-
cuously different structure from figures 4.6 and 4.7. First, the As are above
and to the right of the Ds, instead of above and to the left. Second, the As
and Ds are not as well separated by distance from the y = x line; whereas this
separation was clear for word-initial /h/s, there is at most a tendency in this
direction for the medial reduced /h/s.
The HR data shown for DT in figures 4.10 and 4.11 further exemplifies this
contrast. Word-initial /h/ shows a large effect of accentuation on gestural
magnitude. For medial reduced /h/ there is only a small effect on magnitude;
however, the As and Ds are still separated due to the lower HR values during
the vowel for the As. HR data is not presented for MR because strong
aspiration rendered the measure a nonmonotonic function of abduction.
Since the effect of accentuation differs depending on position in the word,
we can see that both phrasal prosody and word prosody contribute to
determining how segments are pronounced. In decomposing the effects, let us
first consider the contrasts in gestural magnitude, that is, those perpendicular to the
y = x line. In the case of hawkweed and hogfarmer, the difference between As
A accented in questions; D deaccented; lines: y = x and y = -x + b
Figure 4.6 RMS in /h/ of hawkweed and hogfarmer plotted against RMS in the following vowel,
when the words are accented in questions (the As) and deaccented (the Ds). The subject is DT
A accented in questions; D deaccented; lines: y = x and y = -x + b
Figure 4.7 RMS in /h/ of hawkweed and hogfarmer plotted against RMS in the following vowel,
when the words are accented in questions (the As) and deaccented (the Ds). The subject is MR
A accented in questions; D deaccented; lines: y = x and y = -x + b
Figure 4.8 RMS in /h/ of Omaha and tomahawk plotted against RMS in the following vowel,
when the words are accented in questions and deaccented. The subject is DT
A accented in questions; D deaccented; lines: y = x and y = -x + b
Figure 4.9 RMS in /h/ of Omaha and tomahawk plotted against RMS in the following vowel,
when the words are accented in questions and deaccented. The subject is MR
A accented in questions; D deaccented; lines: y = x and y = -x + b
Figure 4.10 HR in /h/ of hawkweed and hogfarmer plotted against HR in the following vowel,
when the words are accented in questions and deaccented. The subject is DT
A accented in questions; D deaccented; lines: y = x and y = -x + b
Figure 4.11 HR in /h/ of Omaha and tomahawk plotted against HR in the following vowel,
when the words are accented in questions and deaccented. The subject is DT
and Ds is predominantly in this direction. The Omaha and tomahawk As and
Ds exhibit a small difference in this direction, though this is not the most
salient feature of the plot. From this we deduce that accentuation increases
gestural magnitude, making vowels more vocalic and consonants more
consonantal. The extent of the effect depends on location with respect to the
word prosody; the main stressed word-initial syllable inherits the strength of
accentuation, so to speak, more than the medial reduced syllable does. At the
same time we note that in tomahawk and Omaha, the As are shifted relative
to the Ds parallel to the y = x line. That is, both the consonant and the vowel
are displaced in the vocalic direction, as if the more vocalic articulation of the
main stressed vowel continued into subsequent syllables. The data for
tomahawk and Omaha might thus be explicated in terms of the superposition
of a local effect on the magnitude of the CV gesture and a larger-scale effect
which makes an entire region beginning with the accented vowel more
vocalic.
The present data do not permit a detailed analysis of what region is
affected by the background shift in a vocalic direction. Note that the effect of
a nuclear accent has abated by the time the deaccented postnuclear target
words are reached, since these show a more consonantal background effect
than the accented words do. In principle, data on the word mahogany would
provide critical information about where the effect begins, indicating, for
example, whether the shift in the vocalic direction starts at the beginning of
the first vowel in an accented word or at the beginning of the stressed vowel
in a foot carrying an accent. Unfortunately, the mahogany data showed
considerable scatter and we are not prepared at this point to make strong
claims about their characterization.
Figure 4.12 Duration vs. RMS in /h/ for hawkweed and hogfarmer when accented at a phrase
boundary, accented but phrase-medial in questions, and deaccented. The subject is DT
Figure 4.13 Duration vs. RMS in /h/ for hawkweed and hogfarmer when accented at a phrase
boundary, accented but phrase-medial in questions, and deaccented. The subject is MR
consonants in weak positions. For MR, the effect of the phrase boundary is
thus greater than the effect of accentual status.
A subset of the data set, the sentences involving tomahawk, makes it
possible to extend the result to a nonlaryngeal consonant. The aspiration
duration for the /t/ was measured in the four prosodic positions. The results
are displayed in figures 4.14 and 4.15. The lines represent the total range of
observations for each condition, and each individual datum is indicated with
a tick. For DT, occurring at a phrase boundary approximately doubled the
aspiration length, and there was no overlap between the phrase-boundary
condition and the other conditions. For MR, the effect was somewhat
smaller, but the overlap can still be attributed to only one point, the smallest
value for the phrase-boundary condition. For both subjects, a smaller effect
of accentuation status can also be noted.
The effect of the phrase boundary on gestural magnitude can be investigated
by plotting the RMS in the /h/ against RMS in the vowel for the word-initial
accented /h/ in phrase-initial and phrase-medial position. This comparison,
shown in figures 4.16 and 4.17, indicates that the gestural magnitude
was greater in phrase-initial position. The main factor was lower RMS (that
is, a more extreme consonantal outcome) for the /h/ in phrase-initial
position; the vowels differed slightly, if at all. Returning to the decomposition
in terms of gestural-magnitude effects and background effects, we would
suggest that the phrase boundary triggers both a background shift in a
consonantal direction (already observed in preboundary position in the
"deaccented" cases) and an increase in gestural magnitude. The effect on
gestural magnitude must be either immediately local to the boundary, or
related to accentual strength, if deaccented words in the middle of the
postnuclear region are to be exempted as observed.
It is interesting to compare our results on phrase-initial articulation with
Beckman, Edwards, and Fletcher's results (this volume) on phrase-final
articulation. Their work has shown that stress-related lengthening is asso-
ciated with an increase in the extent of jaw movement while phrase-final
lengthening is not, and they interpret this result as indicating that stress
triggers an underlying change in gestural magnitude while phrase-final
lengthening involves a change in local tempo but not gestural magnitude.
Given that our data do show an effect of phrase-initial position on gestural
magnitude, their interpretation leads to the conclusion that phrase-initial and
phrase-final effects are different in nature.
However, we feel that the possibility of a unified treatment of phrase-
peripheral effects remains open, pending the resolution of several questions
about the interpretation of the two experiments. First, it is possible that the
gestural-magnitude effect observed in our data is an artifact of the design of
the materials, since the Now Emma sentences may have seemed more unusual
conditions, top to bottom: questions, deaccented, contradiction, phrase boundary; x-axis: duration in seconds
Figure 4.14 Voice-onset time in /t/ of tomahawk for all four prosodic positions; subject DT.
Ticks indicate individual data points
conditions, top to bottom: questions, deaccented, contradiction, phrase boundary; x-axis: duration in seconds
Figure 4.15 Voice-onset time in /t/ of tomahawk for all four prosodic positions; subject MR.
Ticks indicate individual data points
I phrase-initial; M phrase-medial; lines: y = x and y = -x + b
Figure 4.16 RMS of /h/ and final vowel for subject DT in accented phrase-initial (I) and phrase-
medial (M) question contexts
I phrase-initial; M phrase-medial; lines: y = x and y = -x + b
Figure 4.17 RMS of /h/ and final vowel for subject MR in accented phrase-initial (I) and
phrase-medial (M) question contexts
or semantically striking than those where the target words were phrase-
internal. If this is the case, semantically matched sentences would show a
shift towards the consonantal direction in the vowel following the consonant,
as well as the consonant itself. Second, it is possible that an effect on intended
magnitude is being obscured in Beckman, Edwards, and Fletcher's (this
volume) data by the nonlinear physical process whose outcome serves as an
index. Possibly, jaw movement was impeded after the lips contacted for the
labial consonant in their materials, so that force exerted after this point did
not result in statistically significant jaw displacement. If this is the case,
measurements of lip pressure or EMG (electromyography) might yield
results more in line with ours. Third, it is possible that nonlinearities in the
vocal-fold mechanics translate what is basically a tempo effect in phrase-
initial position into a difference in the extent of the acoustic contrast. That is,
it is possible that the vocal folds are no more spread for the phrase-initial /h/s
than for otherwise comparable /h/s elsewhere but that maintaining the
spread position for longer is in itself sufficient to result in greater suppression
of the oscillation. This possibility could be evaluated using high-speed optical
recording of the laryngeal movements.
Figure 4.18 PDEV for /ʔ/ beginning August and awkwardness plotted against PDEV in the
following vowel, for subject DT
Comments on chapter 4
OSAMU FUJIMURA
First of all, I must express my appreciation of the careful preparation by
Pierrehumbert and Talkin of the experimental material. Subtle phonetic
interactions among various factors such as F0, F1, and vocal-tract constriction
are carefully measured and assessed using state-of-the-art signal-
processing technology. This makes it possible to study subtle but critical
effects of prosodic factors on segmental characteristics with respect to vocal-
source control. In this experiment, every technical detail counts, from the
way the signals are recorded to the simultaneous control of several phono-
logical conditions.
Effects of suprasegmental factors on segmental properties, particularly of
syntagmatic or configurational factors, have been studied by relatively few
investigators beyond qualitative or impressionistic description of allophonic
variation. It is difficult to prepare systematically controlled paradigms of
contrasting materials, partly because nonsense materials do not serve the
purpose in this type of work, and linguistic interactions amongst factors to be
controlled prohibit an orthogonal material design. Nevertheless, this work
exemplifies what can be done, and why it is worth the effort. It is perhaps a
typical situation of laboratory phonology.
The general point this study attempts to demonstrate is that so-called
"segmental" aspects of speech interact strongly with "prosodic" or "supra-
segmental" factors. Paradoxically, based on the traditional concept of
segment, one might call this situation "segmental effects of suprasegmental
conditions." As Pierrehumbert and Talkin note, such basic concepts are
being challenged. Tones as abstract entities in phonological representations
manifest themselves in different fundamental frequencies. Likewise, pho-
nemes or distinctive-feature values in lexical representations are realized with
different phonetic features, such as voiced and voiceless or with and without
articulatory closure, depending on the configuration (e.g. syllable- or word-
initial vs. final) and accentual situations in which the phoneme occurs. The
same phonetic segments, to the extent that they can be identified as such, may
correspond to different phonological units. Pierrehumbert and Talkin,
clarifying the line recently proposed by Pierrehumbert and Beckman (1988),
use the terms 'structure' and 'content' to describe the general framework of
phonological/phonetic representations of speech. The structure, in my inter-
pretation (Fujimura 1987, 1990), is a syntagmatic frame (the skeleton) which
Jakobson, Fant, and Halle (1952) roughly characterized by configurational
features. The content (the melody in each autosegmental tier) was described
in more detail in distinctive (inherent and prosodic) features in the same
classical treatise. Among different aspects of articulatory features, Pierre-
humbert and Talkin's paper deals with voice-source features, in particular
with glottal consonants functioning as the initial margin of syllables in
English.
What is called a glottal stop is not very well understood and varies greatly.
The authors interpret acoustic and EGG signal characteristics to be due to braced
configurations of the vocal folds. What they mean by "braced" is not clear to
me. They "surmise" that the subject DT in particular produces the glottal
stop by bracing or tensing partially abducted vocal folds in a way that tends
to create irregular vibration without a fully closed phase. Given the current
progress of our understanding of the vocal-fold vibration mechanism and its
physiological control, and the existence of advanced techniques for direct
and very detailed optical observation of the vocal folds, such qualitative and
largely intuitive interpretation will, I hope, be replaced by solid knowledge in
the near future. Recent developments in the technique of high-speed optical
recording of laryngeal movement, as reported by Kiritani and his co-workers
at the University of Tokyo (RILP), seem to promise a rapid growth of our
knowledge in this area.
A preliminary study using direct optical observation with a fiberscope
(Fujimura and Sawashima 1971) revealed that variants of American English
/t/ were accompanied by characteristic gestures of the false vocal folds.
Physiologically, laryngeal control involves many degrees of freedom, and
EGG observations, much less acoustic signals, reveal little information about
specific gestural characteristics. What is considered in the sparse distinctive-
feature literature about voice-source features tends to be grossly impression-
istic or even simply conjectural with respect to the production-control
mechanisms. The present paper raises good questions and shows the right
way to carry out an instrumental study of this complex issue. Particularly in
this context, Pierrehumbert and Talkin's detailed discussion of their speech
material is very timely and most welcome, along with the inherent value of
the very careful measurement and analysis of the acoustic-signal characteris-
tics. This combination of phonological (particularly intonation-theoretical)
competence and experimental-phonetic (particularly speech-signal engineer-
ing) expertise is a necessary condition for this type of study, even just for
preparing effective utterances for examination. Incidentally, it was in these
carefully selected sample sentences that the authors recently made the
striking discovery that a low-tone combination of voice-source characteris-
tics gives rise to a distinctly different spectral envelope (personal
communication).
One of the points of this paper that particularly attracts my attention is the
apparently basic difference between the two speakers examined. In recent
years, I have been impressed by observations that strong interspeaker
variation exists even in what we may consider to be rather fundamental
control strategies of speech production (see Vaissiere 1988 on velum move-
ment strategies, for example). One may hypothesize that different production
strategies result in the same acoustic or auditory consequence. However, I do
not believe this principle explains the phenomena very well, even though in
some cases it is an important principle to consider. In the case of the "glottal
stop," it represents a consonantal function in the syllable onset in opposition
to /h/, from a distributional point of view. Phonetically (including acousti-
cally), however, it seems that the only way to characterize this consonantal
element of the onset (initial demisyllable) is that it lacks any truly consonan-
tal features. This is an interesting issue theoretically in view of some of the
ideas related to nonlinear phonology, particularly with respect to underspeci-
fication. The phonetic implementation of unspecified features is not neces-
sarily empty, being determined by coarticulation principles only, but can
have some ad hoc processes that may vary from speaker to speaker to a
large extent. In order to complete our description of linguistic specifica-
tion for sound features, this issue needs much more attention and serious
study.
In many ways this experimental work is the first of its kind, and it may
open up, together with some other pioneering work of similar nature, a new
epoch in speech research. I could not agree more with Pierrehumbert and
Talkin's conclusion about the need for work on articulatory representation
to include a proper treatment of hierarchical structure and its manifestations.
Much attention should be directed to their assertion that a quantitative
articulatory description will still fail to capture the multidimensional char-
acter of lenition if it handles only the local phonological and phonetic
properties. But the issue raised here is probably not limited to the notion of
lenition.
Comments on chapters 3 and 4
LOUIS GOLDSTEIN
Introduction
The papers in this section, by Pierrehumbert and Talkin and by Beckman,
Edwards, and Fletcher, can both be seen as addressing the same fundamental
question: namely, how are the spatiotemporal characteristics of speech
gestures modulated (i.e., stretched and squeezed) in different prosodic
environments?* One paper examines a glottal gesture (laryngeal abduction/
adduction for /h/- Pierrehumbert and Talkin), the other an oral gesture
(labial closure/opening for /p/- Beckman, Edwards, and Fletcher). The
results are similar for the different classes of gestures, even though differences
in methods (acoustic analysis vs. articulator tracking) and materials make a
point-by-point comparison impossible. In general, the studies find that
phrasal accent increases the magnitude of a gesture, in both space and
time, while phrasal boundaries increase the duration of a gesture without
a concomitant spatial change. This striking similarity across gestures
that employ anatomically distinct (and physiologically very different) struc-
tures argues that general principles are at work here. This similarity (and
its implications) is the focus of my remarks. I will first present additional
evidence showing the generality of prosodic effects across gesture
type. Second, I will examine the oral gestures in more detail, asking
how the prosodic effects are distributed across the multiple articula-
tors whose motions contribute to an oral constriction. Finally, I will ask
whether we yet have an adequate understanding of the general principles
involved.
*This work was supported by NSF grant BNS 8820099 and NIH grants HD-01994 and HD-13617 to Haskins Laboratories.
the two gesture types had the same velocity profile, a mathematical charac-
terization of the shape of curve showing how velocity varies over time in the
course of the gestures. On the basis of this identity of velocity profiles, the
authors conclude that "the tongue and vocal folds share common principles
of control" (1985: 468).
Glottal gestures involving laryngeal abduction and adduction may occur
with a coordinated oral-consonant gesture, as in the case of the /k/s analyzed
by Munhall, Ostry, and Parush, or without such an oral gesture, as in the /h/s
analyzed by Pierrehumbert and Talkin. It would be interesting to investigate
whether the prosodic influences on laryngeal gestures show the same patterns
in these two cases. There is at least one reason to think that they might
behave differently, due to the differing aerodynamic and acoustic conse-
quences. Kingston (1990) has argued that the temporal coordination of
laryngeal and oral gestures could be more tightly constrained when the oral
gesture is an obstruent than when it is a sonorant, because there are critical
aerodynamic consequences of the glottal gesture in obstruents (allowing
generation of release bursts and frication). By the same logic, we might
expect the size (in time and space) of a laryngeal gesture to be relatively
more constrained when it is coordinated with an oral-obstruent gesture
than when it is not (as in /h/). The size (and timing) of a laryngeal gesture
coordinated with an oral closure will determine the stop's voice-onset
time (VOT), and therefore whether it is perceived as aspirated or not,
while there are no comparable consequences in the case of /h/. On the
other hand, these differences may prove to be irrelevant to the prosodic
effects.
In order to test whether there are differences in the behavior of the
laryngeal gesture in these two cases, I compared the word-level prosodic
effects in Pierrehumbert and Talkin's /h/ data (some that were discussed
by the authors and others that I estimated from their graphs) with the data
of a recent experiment by Cooper (forthcoming). Cooper had subjects
produce trisyllabic words with varying stress patterns (e.g. percolate,
passionate, Pandora, permissive, Pekingese), and then reproduce the pro-
sodic pattern on a repetitive /pipipip/ sequence. The glottal gestures in
these nonsense words were measured by means of transillumination. I
was able to make three comparisons between Cooper and Pierrehumbert
and Talkin, all of which showed that the effects generalized over the
presence or absence of a coordinated oral gesture. (1) There is very
little difference in gestural magnitude between word-initial and word-
medial positions for a stressed /h/, hawkweed vs. mahogany. (2) There is,
however, a word-position effect for unstressed syllables (hibachi shows a
larger gesture than Omaha or tomahawk). (3) The laryngeal gesture in
word-initial position is longer when that syllable is stressed than when it
is unstressed (hawkweed vs. hibachi). All of these effects can be seen in
Cooper's data.
In addition, Cooper's data show a very large reduction of the laryngeal
gesture in a reduced syllable immediately following the stressed vowel (in the
second /p/ of utterances modeled after percolate and passionate). In many
cases, no laryngeal spreading was observable at all. While this environment
was not investigated by Pierrehumbert and Talkin, they note that such /h/s
have been considered by phonologists as being deleted altogether (e.g. vehicle
vs. vehicular). The coincidence of these effects is again striking. Moreover,
this is an environment where oral gestures may also be severely reduced:
tongue-tip closure gestures reduce to flaps (Kahn 1976). Thus, there is strong
parallelism between prosodic effects on laryngeal gestures for /h/ and on
those that are coordinated with oral stops. This similarity is particularly
impressive in the face of the very different acoustic consequences of laryngeal
gesture in the two cases: generation of breathy voice (/h/) and generation
of voiceless intervals. It would seem, therefore, that it is the gestural
dynamics themselves that are being directly modulated by stress and
position, rather than the output variables such as VOT. The changes
can be stated most generally at the level of gestural kinematics and/or
dynamics.
phonological units; each is modeled as a dynamical system, or control
regime, whose spatial goals are defined in terms of tract variables such as
these. When a given gesture is active, the individual articulatory compo-
nents that can contribute to a given tract variable constitute a "coordina-
tive structure" (Kelso, Saltzman, and Tuller, 1986) and cooperate to
achieve the tract-variable goal. Individual articulators may compensate for
one another, when one is mechanically restrained or involved in another
concurrent speech gesture. In this fashion, articulatory differences in
different contexts are modeled (Saltzman 1986). With respect to the pro-
sodic effects discussed by Beckman, Edwards, and Fletcher, it is important
to know whether the gesture's tract-variable goals are being modified,
or rather if some individual articulator's motions are being amplified or
reduced, and if so, whether these changes are compensated for by other
articulators.
Since Beckman, Edwards, and Fletcher present data only for the jaw,
the question cannot be answered directly. However, Macchi (1988)
has attempted to answer this question using data similar to the type em-
ployed by them. Macchi finds that prosodic variation (stress, syllable
structure) influences primarily the activity of the jaw, but that, unlike
variation introduced by vowel environment, the prosodic effects on the
jaw are not compensated for by the displacement of the lower lip with
respect to the jaw. The displacement remains invariant across prosodic
contexts. Thus, the position of the lower lip in space does vary as a
function of prosodic environment, but this variation is caused almost
exclusively by jaw differences. That is, the locus of the prosodic effects is
the jaw.
laryngeal data. Here, as we have seen, the effects parallel those for oral
gestures, in terms of changes in gesture magnitude and duration. Yet it would
be hard to include the laryngeal effects under the rubric of sonority, at least
as traditionally defined. Phrasal accent results in a more open glottis, but a
more open glottis would result in a less sonorous output. Thus the sonority
analysis fails to account, in a unified way, for the parallel prosodic
modulations of the laryngeal and oral gestures. An alternative analysis
would be to examine the effects in terms of the overall amount of energy
expended by the gestures, accented syllables being more energetic. However,
this would not explain why the effects on oral gestures seem to be restricted
to the jaw (which is explained by the sonority account). Finding
a unified account of laryngeal and oral effects remains an exciting
challenge.
This was presented at the conference as a commentary on several papers, but because of the
organization of the volume appears here with chapters 3 and 4.
3 and 4 Comments
groups speech sounds into the following set of constituents (from the word
up):
(1) phonological utterance;
intonational phrase;
phonological phrase;
clitic group;
phonological word.
Phonological constituents referred to in this volume, however, include the
following:
(2) (a) Pierrehumbert and Talkin:
(intonational) phrase
(phonological) word
(b) Beckman, Edwards, and Fletcher:
(intonation) phrase
(phonological word)
(c) Kubozono:
major phrase
minor phrase
(d) van den Berg, Gussenhoven, and Rietveld:
association domain
association domain'
definitions they are using for their constituents. In Kubozono's paper,
however, we find major phrase and minor phrase, and, since neither is
explicitly defined, we do not know how they relate to the other proposed
constituents. Similarly, van den Berg, Gussenhoven, and Rietveld explicitly
claim that their association domain and association domain' do not coincide
with any phonological constituents proposed elsewhere in the literature.
Does this mean we are, in fact, adopting the position that essentially
anything goes, where we just create phonological constituent structure as we
need it? Given the impressive cross-linguistic insights that have been gained
in phonology by identifying a small finite set of prosodic constituents along
the lines of (1), it would be surprising, and dismaying, if phonetic investi-
gation yielded significantly different results.
defining characteristics of phonology. Of course, if the goal of phonetics
is to model specific phenomena, as is often the case, we do need to end
up with abstractions again. We must still ask, though, what is actually
being modeled when this model itself is based on such limited sets of
data.
5
On types of coarticulation
5.1 Introduction
With few exceptions, both phonetics and phonology have used the "seg-
ment" as a basis of analysis in the past few decades.* The phonological
segment has customarily been equated with a phonemic unit, and the nature
of the phonetic segment has been different for different applications. Coarti-
culation has, on the whole, been modeled as intersegmental influences, most
frequently for adjacent segments but also at a greater distance, and the
domain of coarticulatory interactions has been assumed to be controlled by
cognitive (language-specific, structural, phonological) factors as well as by
purely mechanical ones. As such, the concept of coarticulation has been a
useful conceptual device for explaining the gap between unitary phonological
representations and variable phonetic realizations. Less segment-based views
of phonological representation and of speech production/perception have led
to different ways of talking about coarticulation (see, for example, Browman
and Goldstein, this volume).
We are concerned here with examining some segment-based definitions of
coarticulation, asking some questions about the coherence of these defini-
tions, and offering some experimental evidence for reevaluation of
terminology.
*We are grateful to Peter Ladefoged, John Ohala, and Christine Shadle for their comments and
advice about this work. We also thank Colin Watson of Queen Margaret College for his work in
designing and building analysis equipment.
5 Nigel Hewlett and Linda Shockey
approaches to measuring these influences. We will compare two of these
approaches. One could be called "testing for allophone separation." A
number of studies (Turnbaugh et al. 1985; Repp 1986; Sereno and Lieberman
1987; Sereno et al. 1987; Hewlett 1988; Nittrouer et al. 1989) have examined
spectral characteristics of the consonant in CV syllables produced by adults and
children with a view to estimating the amount of influence of the vowel on the
realization of the consonant (typically a voiceless stop), i.e. to determine
whether and to what extent realizations of the same consonant phoneme
were spectrally distinct before different vowels. The precise techniques are
not at issue here. Some have involved simply measuring the frequency of the
tallest peak in the spectrum or identifying whether the vowel F2 is anticipated
in the consonant spectrum, others have used more elaborate techniques. A
recent and extensive investigation of this type is that of Nittrouer, Studdert-
Kennedy, and McGowan (1989). In this study, the centroids of /s/ spectra in
/si/ and /su/ syllables, as spoken by young children and adults, were
calculated and compared. The authors conclude that the fricative spectrum is
more highly influenced by a following vowel in the children's productions
than in the adults'.
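The centroid measure used in such studies is, in essence, the first moment of the amplitude spectrum. A minimal sketch follows; the frequencies and amplitudes are invented for illustration and are not data from Nittrouer et al. (1989).

```python
# Sketch of a spectral-centroid computation of the kind used in
# allophone-separation studies. All values below are hypothetical.

def spectral_centroid(freqs_hz, amps):
    """First moment of the amplitude spectrum: sum(f*a) / sum(a)."""
    return sum(f * a for f, a in zip(freqs_hz, amps)) / sum(amps)

# Toy /s/ spectra: energy concentrated higher before /i/ than before /u/.
freqs = [2000, 3000, 4000, 5000, 6000, 7000]
s_before_i = [0.1, 0.2, 0.5, 1.0, 0.9, 0.6]
s_before_u = [0.4, 0.9, 1.0, 0.6, 0.3, 0.1]

# A larger centroid difference is read as greater vowel influence on /s/.
sep = spectral_centroid(freqs, s_before_i) - spectral_centroid(freqs, s_before_u)
print(round(sep))
```

On this reading, "more coarticulation" in the children's speech simply means a larger centroid separation between the /si/ and /su/ fricative spectra.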
Another approach could be termed "finding the target-locus relationship"
(Lindblom and Lindgren 1985; Krull 1987, 1989). According to this criter-
ion, the difference between F2 at vowel onset (the "locus") and F2 at mid-
vowel (the "target") is inversely related to the amount of CV coarticulation
present. The reasoning here is that coarticulation is a process which makes
two adjacent sounds more similar to each other, so if it can be shown that
there is a shallower transition between consonant and vowel in one condition
than in another, then that condition can be thought of as more coarticulated.
This approach has been applied to a comparison of more careful and less
careful speech styles and the findings have indicated greater coarticulation
associated with a decrease in care.
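The target-locus criterion reduces to an equally simple computation: the smaller the F2 excursion from vowel onset to mid-vowel, the flatter the transition and, by this criterion, the greater the coarticulation. The F2 values below are hypothetical, not Krull's or Lindblom and Lindgren's data.

```python
# Sketch of the target-locus criterion: a smaller |F2(target) - F2(locus)|
# difference, i.e. a flatter CV transition, is read as greater
# coarticulation. All values are hypothetical.

def locus_target_difference(f2_onset_hz, f2_mid_hz):
    """Absolute F2 excursion from vowel onset ('locus') to mid-vowel ('target')."""
    return abs(f2_mid_hz - f2_onset_hz)

citation = locus_target_difference(f2_onset_hz=1800, f2_mid_hz=2300)  # steep transition
reading = locus_target_difference(f2_onset_hz=2050, f2_mid_hz=2250)   # shallow transition

# By this criterion the reading condition counts as more coarticulated.
print(citation, reading)
```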
It is difficult to use the first approach (measuring allophone separation) in
dealing with connected speech because of yet another phenomenon, which is
often termed "coarticulation": that of vowel centralization. It is well known
that on average the vowel space used in conversation is much reduced from
that found for citation-form speech. Sharf and Ohde (1981) distinguish
two types of coarticulation which they call "feature spreading" and
"feature reduction," the latter of which includes vowel centralization. If
one wants to compare coarticulation in citation form with coarticulation
in connected speech by testing for allophone separation, one's task is very
much complicated by the fact that two variables are involved: difference
in phonological identity of the vowel and difference in overall vowel-space
shape. The same objection cannot be made to the method of finding the
target-locus relationship because it is insensitive to the difference
between (for example) a "change up" in target and a "change down" in
the locus.
As far as we know, no previous study has attempted to determine whether
the application of the two approaches to the same data leads to a similar
conclusion concerning the relative degree of coarticulation in each case.
[Figure not reproduced: frequencies in Hz (approx. 1,600-2,000) for ki (citation), ki (reading), ku (citation), and ku (reading)]
in each case. Where there was more than one very prominent peak, spectra of
subsequent time frames were examined in order to determine which pole was
most robust overall.
The frequencies of the first three formants of each vowel were measured at
onset and in the middle, using the continuous spectrogram facility of a
Spectrophonics system, which produces traditional broad-band spectral
displays on a CRT (cathode ray tube). Where frequencies were in doubt due
Table 5.2 Mean frequencies in Hz of the most prominent peak of consonant burst, second formant of the vowel at onset, and second formant of the vowel at steady state
[Data not reproduced: burst-peak frequencies in Hz (approx. 3,500-7,000) for ti (citation), ti (reading), tu (citation), and tu (reading)]
Table 5.3 T-test results for comparisons described in column 1 (df = 75 throughout)
[Data not reproduced: values for ki, ti, tu, and ku in citation and reading conditions]
Table 5.4 Difference between the frequencies of F2 of the vowel steady state and vowel onset (Hz)
The question which immediately arises is: how can reading and citation
forms be simultaneously not different and different with respect to coarticu-
lation? A similar anomalous situation can be adduced from the literature on
acquisition by children: studies such as that by Nittrouer et al. (1989), based
on allophone separation, suggest that greater coarticulation is found in
earlier stages of language development and that coarticulation decreases with
age. Kent (1983), however, points to the fact that if greater coarticulation is
associated with greater speed and fluency of production, it would be liable to
increase with greater motor skill and hence with the age of children. He
observes that this is compatible with the finding that children's speech is
slower than adults', and he offers as evidence the fact that the F2 trajectory in
sequences such as /oju/ (in the phrase We saw you) is flatter in adults'
pronunciation than in children's, indicating greater coarticulation in adult
speech. The criterion used in this argument is comparable to that of the
target-locus relation.
These contradictory findings suggest the possibility that F2 trajectories and
allophone separation are really measuring two different kinds of coarticula-
tion which behave differently in relation to phonetic development as well as
in relation to speech style.
Pertinent to this point is a fact which is well known but often not
mentioned in discussions of coarticulation: very carefully articulated speech
manifests a great deal of allophone separation. Striking examples are
frequently seen to come from items pronounced in isolation or in frame
sentences (e.g. Daniloff, Schuckers, and Feth 1980: 328-33). Further evidence
of a large amount of allophone separation in careful speech comes from the
search for acoustic invariance (Blumstein and Stevens 1979). The extent of
the differences in the spectra of velar consonants in particular before
different vowels testifies to the amount of this type of coarticulation in
maximally differentiated tokens. Whatever sort of coarticulation may be
measured by allophone separation, it seems very unlikely to be the sort which
is increased by increases in speech rate or in degree of casualness. We assume
it to reflect local variation in lip/tongue/jaw movement. As such, it could be
easily accommodated in the gestural framework suggested by Browman and
Goldstein (1986, this volume).
Our results show the characteristic vowel centralization which is normally
attributed to connected speech. We have found, in addition, another quite
general effect which might, if found to be robust, come to be considered
coarticulation of the same sort as vowel centralization; this is the marked
lowering of burst frequencies in read speech. The cause of this lowering has
yet to be discovered experimentally. Suggested reasons are: (1) that in
connected speech /ki/ and /ku/ are liable to be produced with a more open
jaw and therefore a larger cavity in front of the release, which would have the
effect of lowering the frequency (Ladefoged, personal communication); and
(2) that since there is probably greater energy expended in the production of
citation forms, it is very likely that there is a greater volume velocity of
airflow (Ohala, personal communication). Given a higher rate of flow
passing through approximately the same size aperture, the result is a higher
center frequency in the source spectrum of the burst. These explanations are
not incompatible and both are amenable to experimental investigation.
It is quite likely that vowel centralization, lowering of burst frequencies,
and flattening of locus-target trajectories in connected speech are all parts of
the same type of long-term effect (which we may or may not want to term
"coarticulation") which can be attributed to larger mandible opening
combined with smaller mandible movements. It certainly appears in our data
that lowered vowel-onset frequencies (which are directly linked to lowered
burst frequencies) and vowel centralization conspire to produce flatter
trajectories, but this hypothesis requires further investigation. Such long-
term effects would be more difficult to explain using a gestural model, but
could possibly be described as a style-sensitive overall weighting on articula-
tor movement. How style can, in practice, be used to determine this
weighting is an open question.
Our results and the others discussed above support Sharf and Ohde's
(1981) notion that it is fruitful to divide what is currently called "coarticula-
tion" into at least two separate areas of study: relatively short-term effects
and longer-term settings. In addition, our results suggest that the former may
not be much influenced by differences in style while the latter show strong
style effects.
such details helped to steady his nerves. To the same end, he studied the titles
of the books that were propping the broken sash window open at the bottom,
providing a welcome draught of air into the room. The Collected Poems of
T.S. Eliot were squashed uncomfortably between A Tale of Two Cities and
Teach Yourself Icelandic.
"Perhaps you'd like to tell me the purpose of this unexpected visit?" Courts
smiled, but only with his teeth. The eyes above them remained perfectly
expressionless.
"You know already. I want my client's document back. And the key that
goes with it."
"Now what key would that be, I wonder?"
"About two inches long, grey in colour and with BT5024 stamped on the
barrel."
"Oh, that key."
"So, you know about this too," Courts mused. "Well, we've got lots of time
to talk about it. As it happens, I'm willing to devote the rest of my afternoon
to your client's little problem."
He laughed again, a proper laugh this time, which revealed a badly
chipped tooth. That might have been Stella's handiwork of the previous day
with the teapot. There was a photograph of her which was lying on its back
on the highly polished teak surface of the desk. Next to it was another
photograph, of a teenage boy who didn't seem to bear any resemblance to
Courts.
"I'm not too keen on spending the rest of the afternoon cooped up in your
pokey little office," said Keith.
He tried to think of something more interesting to say, something that was
guaranteed to keep Court's attention directed towards him. For just along
from the Collected Poems of T.S. Eliot, a face was peering through the gap at
the bottom of the window. This was somewhat surprising, since to the best of
his recollection they were two floors up.
Comments on chapter 5
WILLIAM G. BARRY and SARAH HAWKINS
Hewlett and Shockey are concerned with coarticulation from three different
angles. Firstly, they present data from an experiment designed to allow the
comparison of single-syllable CV utterances with the same CV sequences
produced in a continuous-speech passage. The question is whether the degree
of coarticulation in these sequences differs between speech-production
conditions. Secondly, they are concerned with the methodological question
5 Comments
of how to quantitatively assess degree of coarticulation. Thirdly, following
their title, they suggest different types of coarticulation based on different
speech-production factors underlying the measured coarticulatory
phenomena.
The simultaneous concern with three aspects of coarticulation studies is
very illuminating. The application for the first time of two different ways of
quantifying CV coarticulation, using the same speech data, provides a clear
illustration of the extent to which theoretical models can be the product of a
particular methodological approach. Given this dependency, however, it is
rather a bold step to conclude the discussion by postulating two categories of
coarticulation, neatly differentiated by analytic method.
This question mark over the conclusions results more from the structure of
the material on which the discussion of coarticulation types is based than
from the interpretation of the data. The contradictory trends found by the
two analysis methods for the two production conditions might be related to
separate aspects of production, namely a "local" tongue/lip/jaw effect and a
"long-term" effect due to jaw movement. This is an interesting way of
looking at the data, although we agree with Hewlett and Shockey's comment
that the latter effect may be an articulatory setting rather than due to
coarticulation. But in order to place these observations into a wider
theoretical perspective, a discussion of types of coarticulation is needed to
avoid post hoc definitions of type along the divisions of analytic method. We
discuss two aspects of (co)articulation that are particularly relevant to
Hewlett and Shockey's work: first, the types of articulatory processes that are
involved in coarticulation; and second, the domains coarticulation can
operate over.
In an early experimental investigation of speech, Menzerath and De
Lacerda (1933) distinguish "Koartikulation," involving the preparatory or
perseveratory activity of an articulator not primarily involved in the current
segment, and "Steuerung" ("steering" or "control"), which is the deviation
of an articulator from its target in one segment due to a different target for
the same articulator in a neighboring segment. This distinction not only
illuminates aspects of Hewlett and Shockey's experimental material, but also
points to a potential criterion for distinguishing (one type of) longer-term
coarticulatory effect from more local effects. The /ki ku ti tu/ syllables can be
seen to carry both factors: (1) the lip-rounding for /u/ is completely
independent of the consonantal tongue articulation for /k/ and /t/, and is
therefore "Koartikulation" in the Menzerath and De Lacerda sense; (2) the
interaction between the tongue targets for /k/ and the two vowels, and for /t/
and the two vowels is a clear case of "Steuerung." "Steuerung" is likely to be
a more local effect, in that the trajectory of a single articulator involved in
consecutive "targets" will depend upon the relation between those targets.
The "independent" articulator, on the other hand, is free to coarticulate with
any segments to which it is redundant (see Benguerel and Cowan 1974;
Lubker 1981). Note that Menzerath and De Lacerda's distinction, based on
articulatory criteria, encourages different analyses from Hewlett and
Shockey's, derived from acoustic measurements. The latter suggest that their
more long-term effect stems from differences in jaw height, whereas in the
former's system jaw involvement is primarily regarded as a local (Steuerung)
effect. Thus Hewlett and Shockey's "local" effect seems to incorporate both
Menzerath and De Lacerda's terms, and their own "long-term" effect is not
included in the earlier system. The danger of defining coarticulatory types
purely in terms of the acoustic-analytic method employed is now clear. The
acoustically defined target-locus relationship cannot distinguish between
"Koartikulation" and "Steuerung," which are worth separating.
The emphasis on the method of acoustic analysis rather than on articula-
tion leads to another design problem, affecting assessment of allophone
separation. Allophone separation can only be assessed in the standard way
(see Nittrouer et al. 1989) if there are no (or negligible) differences in the
primary articulation. In Menzerath and De Lacerda's terminology, we can
only assess allophone separation for cases of Koartikulation rather than
Steuerung. Allophone separation can thus be assessed in the standard way
for /ti/ and /tu/ but not for /ki/ and /ku/. In careful speech, the position of the
tongue tip and blade for /t/ is much the same before /i/ as before /u/, but
vowel-dependent differences due to the tongue body will have stronger
acoustic effects on the burst as the speech becomes faster and less careful. For
/ki/ and /ku/, on the other hand, different parts of the tongue body form the
/k/ stop closure: the allophones are already separated as [k̟] and [k̠], even
(judging from the acoustic data) for the speaker with his fronted /u/. Thus, as
Hewlett and Shockey hint in their discussion, velar and alveolar stops in
these contexts represent very different cases in terms of motor control, and
should not be lumped together. We would therefore predict that the
difference in the frequency of the peak amplitude of alveolar burst should
increase as the /t/ allophones become more distinct, but should decrease for
velar bursts as the /k/ allophones become less distinct. This is exactly what
Hewlett and Shockey found. Hence we disagree with their claim that
allophone-separation measures fail to differentiate citation and reading
forms.
Seen in this light, citation and reading forms in Hewlett and Shockey's
data seem to differ in a consistent way for both the target-locus and the
allophone-separation measure. These observations need to be verified statis-
tically: a series of independent T-tests is open to the risk of false significance
errors and, more importantly, does not allow us to assess interactions
between conditions. The factorial design of the experiment described in
chapter 5 is ideally suited to analysis of variance which would allow a more
sensitive interpretation of the data.
Turning now to the domain of coarticulation, we question whether it is
helpful to distinguish between local and long-term effects, and, if it is,
whether one can do so in practice. First, we need to say what local and long
term mean. One possibility is that local coarticulatory effects operate over a
short timescale, while long-term effects span longer periods. Another is that
local influences affect only adjacent segments while long-term influences
affect nonadjacent segments. The second possibility is closest to traditional
definitions of coarticulation in terms of "the overlapping of adjacent
articulations" (Ladefoged 1982: 281), or "the influence of one speech
segment upon another... of a phonetic context upon a given segment"
(Daniloff and Hammarberg 1983: 239). The temporal vs. segmental defini-
tions are not completely independent, and we could also consider a four-way
classification: coarticulatory influences on either adjacent or nonadjacent
segments, each extending over either long or short durations. Our attempts
to fit data from the literature into any of these possible frameworks lead us to
conclude that distinguishing coarticulatory influences in terms of local versus
long-term effects is unlikely to be satisfactory.
There are historical antecedents for a definition of coarticulation in terms
of local effects operating over adjacent segments in Kozhevnikov and
Chistovich's (1965) notion of "articulatory syllable," and in variants of
feature-spreading models (e.g. Henke 1966; Bell-Berti and Harris 1981) with
which Hewlett and Shockey's focus on CV structures implicitly conforms.
There is also plenty of evidence of local coarticulatory effects between
nonadjacent segments. Cases in point are Öhman's (1966a) classic study of
vowels influencing each other across intervening stops and Fowler's (1981a)
observations on variability in schwa as a function of stressed-vowel context.
(In these examples of vowel-to-vowel influences, there are also, of course,
coarticulatory effects on the intervening consonant or consonants.)
The clarity of this division into adjacent and nonadjacent depends on how
one defines "segment." If the definition involved mapping acoustic segments
onto phone or phoneme strings, the division could be quite clear, but if the
definition is articulatory, which it must be in any production model, then the
distinction is not at all clear: units of articulatory control could overlap such
that acoustically nonadjacent phonetic-phonological segments influence one
another. This point also has implications for experimental design: to assess
coarticulatory influences between a single consonant and vowel in connected
speech, the context surrounding the critical acoustic segments must either be
constant over repetitions, or sufficiently diverse that it can be treated as a
random variable. Hewlett and Shockey's passage has uneven distributions of
sounds around the critical segments. For example, most of the articulations
surrounding /ku/ were dental or alveolar (five before and seven after the
syllable, of eight repetitions), whereas less than half of those surrounding /ti/
and /tu/ were dental or alveolar.
The above examples are for effects over relatively short periods of time;
segmentally induced modifications to features or segments can also extend
over longer stretches of time. For example, Kelly and Local (1989) have
described the spread of cavity features such as velarity and fronting over the
whole syllable and the foot. This spreading affects adjacent (i.e. uninter-
rupted strings of) segments and is reasonably regarded as segment-induced,
since it occurs in the presence of particular sounds (e.g. /r/ or /l/); it is
therefore categorizable as coarticulation. Long-term effects on nonadjacent
segments have also been observed. Slis (personal communication) and
Kohler, van Dommelen, and Timmermann (1981) have found for Dutch and
French, respectively, that devoicing of phonologically voiced obstruents is
more likely in a sentence containing predominantly voiceless consonants.
Such sympathetic voicing or devoicing in an utterance is the result of a
property of certain segments spreading to other (nonadjacent) segments.
However, classifying these longer-term effects as coarticulation runs the
risk that the term becomes a catch-all category, with a corresponding loss in
usefulness. In their articulatory (and presumably also acoustic) manifes-
tation there appears to be nothing to differentiate at least these latter cases of
long-term effects from the acquired permanent settings of some speakers, and
Hewlett and Shockey in fact point out that their longer-term effect (mandible
movement) may be best regarded as an articulatory setting. Articulatory
setting is probably a useful concept to retain, even though in terms of
execution it may not always be very distinct from coarticulation.
There are also similarities between the above long-term effects and certain
connected-speech processes of particular languages, such as the apparently
global tendency in French and Russian for voicing to spread into segments
that are phonologically voiceless, which contrasts with German, English, and
Swedish, where the opposing tendency of devoicing is stronger. If these
general properties of speech are classed as coarticulation, it seems a relatively
small step to include umlaut usage in German, vowel harmony, and various
other phonological prosodies (e.g. Firth 1948; Lyons 1962) as forms of
coarticulation. Many of these processes may have had a coarticulatory basis
historically, but there are good grounds in synchronic descriptions for
continuing to distinguish aspects of motor control from phonological rules
and sociophonetic variables. We do not mean to advocate that the term
coarticulation should be restricted to supposed "universal" tendencies and
all language-, accent-, or style-dependent variation should be called some-
thing else. But we do suggest that there is probably little to be gained by
describing all types of variation as coarticulation, unless that word is used as
a synonym for speech motor control.
We suggest that the type of long-term effect that Hewlett and Shockey
have identified in their data is linked with the communicative redundancy of
individual segments in continuous speech compared to their communicative
value in single-syllable citation. This relationship between communicative
context and phonetic differentiation (Kohler 1989; Lindblom 1983) is
assumed to regulate the amount of articulatory effort invested in an
utterance. Vowel centralization has been shown to disappear in communica-
tive situations where greater effort is required, e.g. noise (Schindler 1975;
Summers et al. 1988). Such "communicative sets" would presumably also
include phenomena such as extended labiality in "pouted" speech, but not
permanent settings that characterize speakers or languages (see the comment
on voicing and devoicing tendencies above), though these too are probably
better not considered to be "coarticulation."
The exclusion of such phenomena from a precise definition of coarticula-
tion does not imply that they have no effect on coarticulatory processes. The
"reduction of effort" set, which on the basis of Hewlett and Shockey's data
might account for vowel undershoot and lowered release-burst frequencies,
can also be invoked to explain indirectly the assimilation of alveolars
preceding velars and labials. By weakening in some sense the syllable-final
alveolar target, it allows the anticipated velar or labial to dominate acousti-
cally. Of course, this goes no way towards explaining the fact (Gimson 1960;
Kohler 1976) that final alveolars are somehow already more unstable in their
definition than other places of articulation and therefore susceptible to
coarticulatory effects under decreased effort.
To conclude, then, we suggest that although there seems to be a great deal
of physiological similarity between segment-induced modifications and the
settings mentioned above that are linked permanently to speakers or
languages, or temporarily to communicative context, it is useful to
distinguish them conceptually. This remains true even though a model
of motoric execution might treat them all similarly. To constrain the use
of the term coarticulation, we need to include the concept of "source
segment(s)" and "affected segments" in its definition. The existence of
a property or feature extending over a domain of several segments, some
of which are not characterized by that feature, does not in itself indicate
coarticulation.
Our final comment concerns Hewlett and Shockey's suggested acoustic
explanation for their finding that the most prominent peaks of the burst
spectra were all lower in frequency in the read speech than in the citation
utterances. One of their suggested possibilities is that less forceful speech
could be associated with a lower volume velocity of flow through the
constriction at the release, which would lower the center frequency of the
noise-excitation spectrum. The volume velocity of flow might well have
been lower in Hewlett and Shockey's connected-speech condition, but we
suggest it was probably not responsible for the observed differences. The
bandwidth of the noise spectrum is so wide that any change in its center
frequency is unlikely to have a significant effect on the output spectrum at
the lips.
The second suggestion is that a more open jaw could make a larger cavity
in front of the closure. It is not clear what is being envisaged here, and it is
certainly not simple to examine this claim experimentally. We could measure
whether the jaw is more open, but if it were more open, how would this affect
the burst spectrum? Hewlett and Shockey's use of the word "size" suggests
that they may be thinking of a Helmholtz resonance, but this seems unlikely,
given the relationship between the oral-cavity volume and lip opening that
would be required. A more general model that reflects the detailed area
function (length and shape) of the vocal tract is likely to be required (e.g.,
Fant 1960). The modeling is not likely to be simple, however, and it is
probably inappropriate to attribute the observed pattern to a single cause.
The most important section of the vocal tract to consider is the cavity from
the major constriction to the lips. It is unclear how front-cavity length and
shape changes associated with jaw height would produce the observed
patterns. However, if an explanation in terms of cavity size is correct, the
most likely explanation we know of is not so much that the front cavity itself
is larger, as that the wider lip aperture (that would probably accompany a
lowered jaw position) affected the radiation impedance.1 Rough calcula-
tions following Flanagan (1982: 36) indicate that changes in lip aperture
might affect the radiation impedance by appropriate amounts at the relevant
frequencies.
Other details should also be taken into consideration, such as the degree to
which the cavity immediately behind the constriction is tapered towards the
constriction, and the acoustic compliance of the vocal-tract walls. There is
probably more tapering in citation than in reading forms of alveolars and
fronted /k/, and wall compliance is probably less in citation than in reading
forms. Changes in these parameters due to decreased effort could contribute
to lowering the frequencies and changing the intensities of vocal-tract
resonances. The contribution of each influencing factor is likely to depend on
the place of articulation of the stop.
To summarize, we feel that in searching for explanations of motor
1
We are indebted to Christine Shadle for pointing this out to us.
5 Comments
behavior from acoustic measurements, it is important to use models (acoustic
and articulatory) that can represent differences as well as similarities between
superficially similar things. The issues raised by Hewlett and Shockey's study
merit further research within a more detailed framework.
Section B
Segment
6
An introduction to feature geometry
MICHAEL BROE
6.0 Introduction
This paper provides a short introduction to a theory which has in recent
years radically transformed the appearance of phonological representations.
The theory, following Clements's seminal (1985) paper, has come to be known
as feature geometry.1 The rapid and widespread adoption of this theory as
the standard mode of representation in mainstream generative phonology
stems from two main factors. On the one hand, the theory resolved a debate
which had been developing within nonlinear phonology over competing
modes of phonological representation (and resolved it to the satisfaction of
both sets of protagonists), thus unifying the field at a crucial juncture. But
simultaneously, the theory corrected certain long-standing and widely
acknowledged deficiencies in the standard version of feature theory, which
had remained virtually unchanged since The Sound Pattern of English
(Chomsky and Halle 1968; hereafter SPE).
The theory of feature geometry is rooted firmly in the tradition of nonlinear
phonology - an extension of the principles of autosegmental phonology to
the wider phonological domain - and in section 6.1 I review some of this
essential background. I then show, in section 6.2, how rival modes of
representation developed within this tradition. In section 6.3 I consider a
related problem, the question of the proper treatment of assimilation
phenomena. These two sections prepare the ground for section 6.4, which
1
Clements (1985) is the locus classicus for the theory, and the source of the term "feature
geometry" itself. Earlier suggestions along similar lines (in unpublished manuscripts) by
Mascaro (1983) and Mohanan (1983) are cited by Clements, together with Thrainsson (1978)
and certain proposals in Lass (1976), which can be seen as an early adumbration of the leading
idea. A more detailed survey of the theory may be found in McCarthy (1988), to which I am
indebted. Pulleyblank (1989) provides an excellent introduction to nonlinear phonology in
general.
shows how feature geometry successfully resolves the representation prob-
lem, and at the same time provides a more adequate treatment of assimila-
tion. I then outline the details of the theory, and in section 6.5 show how the
theory removes certain other deficiencies in the standard treatment of
phonological features. I close with an example of the theory in operation: a
treatment of the Sanskrit rule of n-Retroflexion or Nati.
[diagrams: a single V or C slot linked to two melodic units, illustrating contour tone, affricate, and prenasalized stop]
(2) [diagrams: a single melodic unit linked to two slots, illustrating tonal spread, long vowel, and geminate consonant]
has observed that, for the most part, rules which treat geminates as atoms are
rules affecting segment quality, while rules which require a sequence
representation are generally "prosodic," affecting stress, tone, and length itself. Within
a nonlinear framework, the distinction can be captured in terms of rule
application on the prosodic (quantitative) tier or the melodic (qualitative)
tier respectively.
Further exceptional properties of geminates also find natural expression in
the nonlinear mode of representation. One of the most significant of these is
the property of geminate integrity: the fact that, in languages with both
geminates and rules of epenthesis, epenthetic segments cannot be inserted
"within" a geminate cluster. In nonlinear terms, this is due to the fact that the
structural representation of a geminate constitutes a "closed domain"; it is
thus impossible to construct a coherent representation of an epenthesized
geminate:
(3) [diagram: inserting an epenthetic V slot inside a geminate; the starred representation must link the single consonant melody to slots on both sides of the inserted vowel, yielding crossing association lines]
Here, the starred representation contains crossing association lines. Such
representations are formally incoherent: they purport to describe a state of
affairs in which one segment simultaneously precedes and follows another.
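The no-crossing condition is simple enough to state as a predicate. The following sketch (the representation and names are mine, not the chapter's) encodes each association line as a pair of tier indices:

```python
# Illustrative sketch: an association line links a unit on one tier to a
# slot on another, so it can be encoded as a pair (melody_index, slot_index).

def lines_cross(a, b):
    """Two association lines cross iff one melody unit precedes the
    other while its skeletal slot follows the other's slot."""
    (m1, s1), (m2, s2) = a, b
    return (m1 - m2) * (s1 - s2) < 0

def well_formed(assocs):
    """A representation is coherent iff no two association lines cross."""
    return not any(lines_cross(a, b)
                   for i, a in enumerate(assocs)
                   for b in assocs[i + 1:])

# The starred case: one consonant melody (index 0) linked to skeletal
# slots 0 and 2, with an epenthetic vowel melody (index 1) on slot 1.
print(well_formed([(0, 0), (0, 2), (1, 1)]))   # False: crossing lines
print(well_formed([(0, 0), (0, 1)]))           # True: a plain geminate
```

This makes the formal incoherence a computable property of the representation rather than a stipulation about diagrams.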
It was quickly noted that assimilated clusters also exhibit the properties of
true geminates. The following Palestinian data from Hayes (1986) show
epenthesis failing to apply both in the case of underlying geminates and in the
case of clusters arising through total assimilation:
(4) /ʔakl/ → [ʔakil]  food
    /ʔimm/ → [ʔimm]  mother
    /jisr kbiir/ → [jisrikbir]  bridge big
    /l-walad l-zyiir/ → [lwaladizzyiir]  DEF-boy DEF-small
on the melodic tier itself. They provide the following formulation of Hausa
regressive assimilation:
(5) /littaf + taaf + ai/ → [littattaafai]
[melody linked to the CV skeleton: c v c c v c c v v c v v]
Here - after the regressive spread of the [t] melody and delinking of the [f] -
the output of the assimilation process is structurally identical to a geminate
consonant, thus accounting for the similarities in their behavior. This in turn
opens up the possibility of a generalized account of assimilation phenomena
as the autosegmental spreading of a melody to an adjacent CV-slot (see the
articles by Hayes, Nolan, and Local this volume).
However, a problem immediately arises when we attempt to extend this
spreading account to cases of partial assimilation, as in the following
example from Hausa (Halle and Vergnaud 1980):
(6) /gaddam + dam + ii/ -> [gaddandamii]
The example shows that, in their words, "it is possible to break the link
between a skeleton slot and some features on the melody tier only while
leaving links with other features intact" (p. 92): in this case, to delink place
features while leaving nasality and sonority intact. Halle and Vergnaud do
not extend their formalism to deal with such cases.
The problem is simply to provide an account of this partial spreading
which is both principled and formally adequate. The radical autosegmental-
ism of phonological properties gives rise to a general problem of organiza-
tion. As long as there is just one autosegmental tier - the tonal tier -
organized in parallel to the basic segmental sequence, the relation between
the two is straightforward. But when the segmental material is itself
autosegmentalized on a range of tiers - syllabicity, nasality, voicing, contin-
uancy, tone, vowel-harmony features, place features, all of which have been
shown to exhibit autosegmental behavior - then it is no longer clear how the
tiers are organized with respect to each other. It is important to notice that
this issue is not resolved simply by recourse to a "syllabic core" or CV-
skeleton, although the orientation this provides is essential. As Pulleyblank
(1986) puts it: "Are there limitations on the relating of tiers? Could a tone
tier, for example, link directly to a nasality tier? Or could a tone tier link
directly to a vowel harmony tier?"
[diagram fragments: the manner-features and place-features tiers under the two rival arrangements, including the coplanar representation cited as (8) below]
These two solutions induce very different types of representation, and
make different empirical predictions. Note first that, in the limiting case, a
multiplanar approach would produce a representation in which every feature
constituted a separate tier. Elements are associated directly with the skeletal
core, ranged about it in a so-called "paddlewheel" or "bottlebrush" forma-
tion (Pulleyblank 1986, Sagey 1986b). Here, then, the limiting case would
exhibit the complete structural autonomy of every feature:
(9) [diagram: the features [coronal], [consonantal], [sonorant], and [continuant], each on its own tier, linked directly to a single skeletal slot]
In the coplanar model, on the other hand, tiers are arrayed in parallel, but
within the same plane, with certain primary features associated directly with
skeletal slots, and secondary or subordinate features associated with a CV-
position only indirectly, mediated through association to the primary
features. This form of representation, then, exhibits intrinsic featural depen-
dencies. Compare the coplanar representation in (8) above with the following
reformulation:
(10) [coplanar diagram: place features linked directly to the skeletal tier, with manner features linked to structural positions only through them]
Here, counterfactually, place features have been represented as primary,
associated directly to the skeletal tier, while manner features are secondary,
linked only indirectly to a structural position. But under this regime of
feature organization it would be impossible to represent place assimilation
independently of manner assimilation. Example (10) is properly interpreted
as total assimilation: any alteration in the association of the place features is
necessarily inherited by features subordinate to them. The coplanar model
thus predicts that assimilation phenomena will display implicational asym-
metries. If place features can spread independently of manner features, then
they must be represented as subordinate to them in the tree; but we then
predict that it will be impossible to spread manner features independently.
The limiting case here, then, would be the complete structural dependency of
every feature. In the case of vowel-harmony features, for example, Archan-
geli (1985: 370) concludes: "The position we are led to, then, is that Universal
Grammar provides for vowel features arrayed in tiers on a single plane in the
particular fashion illustrated below."
[diagram: the vowel features [back], [round], and [high] arrayed in nested tiers on a single plane above V]
Which model of organization is the correct one? It quickly becomes
apparent that both models are too strong: neither complete dependence nor
complete independence is correct. The theory of feature geometry brings a
complementary perspective to this debate, and provides a solution which
displays featural dependence in the right measure.
The problem is in fact again due to the complete formal independence of the
features involved. This forces us to state assimilation as a set of pairwise
agreements, rather than agreement of a set as a whole (Lass 1976: 163).
PLACE = [ cor = +, ant = - ]
Here, values have simply been written to the right of their respective features.
We may extend the same principle to manner features; and further, group
together place and manner features under a superordinate ROOT node, in
order to provide a parallel account of total assimilation (that is, in terms of
agreement of ROOT specifications):
(15) ROOT = [ PLACE = [ cor = +, ant = - ],
              MANNER = [ nas = - ] ]
Such a representation has the appearance of a set of equations between
features and their subordinate values, nested in a hierarchical structure. This
is in fact more than analogy: any such category constitutes a mathematical
function in the strict sense, where the domain is the set of feature names, and
the range is the set of feature values. Now just as coordinate geometry
provides us with a graphical representation of any function, the better to
represent certain salient properties, so we may give a graph-theoretic
representation of the feature structure above:
(16) [tree diagram: a ROOT node dominating PLACE, with ant and its sister features as valued leaves]
Here, each node represents a feature, and dependent on the feature is its
value, be it atomic - as at the leaves of the tree - or category-valued.
Formally, then, the unstructured feature matrix of the standard theory has
been recast in terms of a hierarchically structured graph, giving direct
expression to the built-in feature taxonomy.
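Read this way, the category is a function in the programmer's sense as well. A minimal sketch (the nesting mirrors the PLACE/MANNER example above; the access function and variable names are illustrative, not part of the theory):

```python
# A hedged sketch: a hierarchically structured category as a nested
# mapping, i.e. a function from feature names to atomic or
# category-valued results. Node names follow the text.
segment = {
    "ROOT": {
        "PLACE": {"cor": "+", "ant": "-"},
        "MANNER": {"nas": "-"},
    }
}

def value_at(structure, path):
    """Follow a path of feature names down the graph to its value."""
    node = structure
    for name in path:
        node = node[name]
    return node

print(value_at(segment, ["ROOT", "PLACE", "cor"]))   # prints: +

# Total assimilation as agreement of ROOT specifications: sharing the
# ROOT node transfers the whole substructure in one operation.
other = {"ROOT": segment["ROOT"]}
print(other["ROOT"]["MANNER"]["nas"])                # prints: -
```

The design point is that class-node operations (spreading, delinking) become single steps over a subtree rather than pairwise statements over individual features.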
[tree-diagram fragments: a ROOT node dominating MANNER (including nasal) and PLACE class nodes; and a ROOT node dominating SUPRALARYNGEAL and RADICAL class nodes]
looks like figure 6.3. At the highest level of branching, the framework is able
to express Lass's (1976: 152f.) suggestion that feature matrices can be
internally structured into laryngeal and supralaryngeal gestures:
in certain ways [?] is similar to (the closure phase of) a postvocalic voiceless stop.
There is a complete cut-off of airflow through the vocal tract: i.e. a configuration that
can reasonably be called voiceless, consonantal (if we rescind the ad hoc restriction
against glottal strictures being so called), and certainly noncontinuant. In other
words, something very like the features of a voiceless stop, but MINUS SUPRA-
LARYNGEAL ARTICULATION . . . Thus [?] and [h] are DEFECTIVE . . . they are missing an
entire component or parameter that is present in "normal" segments.
[diagram: a feature matrix partitioned into [oral] and [laryngeal] components]
But while the [+cor] class is frequently attested in phonological rules, the
[-cor] class is never found. The problem here is that standard feature theory
embodies an implicit claim that, if one value of a feature denotes a natural
class, then so will the opposite value. This is hard-wired into the theory: it is
impossible to give oneself the ability to say [+F] without simultaneously
giving oneself the ability to say [-F].
Consider now a classification based on active articulators:
(20)  labial   alveolar   palatal   retroflex   velar   uvular
      LAB      COR        COR       COR         DOR     DOR
Under this approach, the problematic class mentioned above simply cannot
be mentioned - the desired result.
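The contrast between the two classificatory regimes can be sketched in code (the inventory and its set representation are illustrative choices of mine, not the chapter's):

```python
# Sketch: a privative, articulator-based classification names only the
# classes that some articulator label picks out; the complement class
# has no label of its own and can only be reached by set difference.
places = {
    "labial":    {"LAB"},
    "alveolar":  {"COR"},
    "palatal":   {"COR"},
    "retroflex": {"COR"},
    "velar":     {"DOR"},
    "uvular":    {"DOR"},
}

def has(articulator):
    """The natural class of places using the given articulator."""
    return {p for p, arts in places.items() if articulator in arts}

# The attested coronal class is directly nameable:
print(sorted(has("COR")))        # ['alveolar', 'palatal', 'retroflex']

# The unattested [-cor] class is expressible only as a complement,
# for which the privative system provides no label:
print(sorted(set(places) - has("COR")))   # ['labial', 'uvular', 'velar']
```

Binary [±coronal] would make both sets equally nameable; the privative encoding builds the asymmetry into the representation itself.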
Consider now the same argument with respect to the feature [anterior]; we
predict the following classes:
(22) [+ant]: *{labial, alveolar}
     [-ant]: *{palatal, retroflex, velar, uvular}
Here the problem is even worse. As commonly remarked in the literature,
there seems to be no phonological process for which this feature denotes a
3
The following discussion is based on Yip (1989).
6.6 An example
[figure 6.4 rule fragments: the feature specifications [+nas] and [-ant, -dist]]
Table 6.1 Data exemplifying the Sanskrit
n-Retroflexion rule shown in figure 6.4
Present
1 mṛd-naa-       be gracious
2 iṣ-ṇaa-        seek
3 pṛ-ṇaa-        fill
Passive
4 bhug-na-       bend
5 puur-ṇa-       fill
6 vṛk-ṇa-        cut up
Middle participle 1
7 marj-aana-     wipe
8 kṣved-aana-    hum
9 pur-aaṇa-      fill
10 kṣubh-aaṇa-   quake
11 cakṣ-aaṇa-    see
Middle participle 2
12 kṛt-a-maana-  cut
13 kṛp-a-maaṇa-  lament
We may thus figure to ourselves the rationale of the process: in the marked proclivity
of the language toward lingual utterance, especially of the nasal, the tip of the tongue,
when once reverted into the loose lingual position by the utterance of non-contact
lingual element, tends to hang there and make its next nasal contact in that position:
and does so, unless the proclivity is satisfied by the utterance of a lingual mute, or the
organ is thrown out of adjustment by the utterance of an element which causes it to
assume a different posture. This is not the case with the gutturals or labials, which do
not move the front part of the tongue. (Whitney 1879: 65)
Note that, with respect to the relevant (CORONAL) tier, target and trigger are
[figure 6.5 tree fragments: LARYNGEAL and SUPRALARYNGEAL nodes; PLACE dominating LABIAL and CORONAL, with anterior and distributed under CORONAL]
Figure 6.5 The operation of the rule shown in figure 6.4, illustrating the transparency of an
intervening labial node
adjacent. This gives rise to the notion that all harmony rules are "local" in an
extended sense: adjacent at some level of representation.
It may be helpful to conclude with an analogy from another cognitive
faculty, one with a long and sophisticated history of notation: music. The
following representation is a perceptually accurate transcription of a piece of
musical data:
(24) [musical transcription on a single staff; the notes labeled x and y are nonadjacent]
In this transcription, x and y are clearly nonadjacent. But now consider this
performance score, where the bass line is represented "autosegmentalized" on
a separate tier:
(25) [performance score with the bass line on a separate staff; x and y are adjacent on that tier]
Here, x and y are adjacent, on the relevant tier. Note, too, that there is an
articulatory basis to this representation, being a transcription of the gestures
of the right and left hands respectively: x and y are adjacent in the left hand.
An articulator-bound notion of feature geometry thus lends itself naturally
to a conception of phonological representation as gestural score (Browman
and Goldstein 1989 and this volume).
The most notable achievement of feature geometry, then, is the synthesis it
achieves between a theory of feature classification and taxonomy, on the one
hand, and the theory of autosegmental representation - in its extended
application to segmental material - on the other. Now as Chomsky (1965:
172) points out, the notion of the feature matrix as originally conceived gives
direct expression to the notion of "paradigm" - the system of oppositions
operating at a given place in structure: "The system of paradigms is simply
described as a system of features, one (or perhaps some hierarchic configu-
ration) corresponding to each of the dimensions that define the system of
paradigms." Feature geometry makes substantive proposals regarding the
"hierarchic configuration" of the paradigmatic dimension in phonology. But
it goes further, and shows what kind of syntagmatic structure such a
hierarchy supports. A new syntagm, a new paradigm.
7
The segment: primitive or derived?
JOHN J. OHALA
7.1 Introduction
The segmental or articulated character of speech has been one of the
cornerstones of phonology since its beginnings some two-and-a-half millen-
nia ago.* Even though segments were broken down into component features,
the temporal coordination of these features was still regarded as a given.
Other common characteristics of the segment, not always made explicit, are
that they have a roughly steady-state character (or that most of them do),
and that they are created out of the same relatively small set of features used
in various combinations.
Autosegmental phonology deviates somewhat from this by positing an
underlying representation of speech which includes autonomous features
(autosegments) uncoordinated with respect to each other or to a CV core or
"skeleton" which is characterized as "timing units."1 These autonomous
features can undergo a variety of phonological processes on their own.
Ultimately, of course, the various features become associated with given Cs
or Vs in the CV skeleton. These associations or linkages are supposed to be
governed by general principles, e.g. left-to-right mapping (Goldsmith 1976),
the obligatory contour principle (Leben 1978), the shared feature convention
(Steriade 1982). These principles of association are "general" in the sense
*I thank Bjorn Lindblom, Nick Clements, Larry Hyman, John Local, Maria-Josep Sole, and an
anonymous reviewer for helpful comments on earlier versions of this paper. The program which
computed the formant frequencies of the vocal-tract shapes in figure 7.2 was written by Ray
Weitzman, based on earlier programs constructed by Lloyd Rice and Peter Ladefoged. A grant
from the University of California Committee on Research enabled me to attend and present this
paper in Edinburgh.
1
As far as I have been able to tell, the terms "timing unit" or "timing slot" are just arbitrary
labels. There is no justification to impute a temporal character to these entities. Rather, they
are just "place holders" for the site of linkage that the autosegments eventually receive.
that they do not take into account the "intrinsic content" of the features
(Chomsky and Halle 1968: 400ff.); the linkage would be the same whether the
autosegments were [± nasal] or [ + strident]. Thus autosegmental phonology
preserves something of the traditional notion of segment in the CV-tier but
this (auto)segment at the underlying level is no longer determined by the
temporal coordination of various features. Rather, it is an abstract entity
(except insofar as it is predestined to receive linkages with features proper to
vowels or consonants).
Is the primitive or even half-primitive nature of the segment justified or
necessary? I suggest that the answer to this question is paradoxically both
"no" and "yes": "no" from an evolutionary point of view, but "yes" in every
case after speech became fully developed; this latter naturally includes the
mental grammars of all current speakers. I will argue that it is impossible to
have articulated speech, i.e., with "segments," without having linked, i.e.
temporally coordinated, features. However, it will be necessary to justify
separately the temporal linkage of features, the existence of steady-states,
and the use of a small set of basic features; it will turn out that these
characteristics do not all occur in precisely the same temporal domain or
"chunk" in the stream of speech. Thus the "segment" derived will not
correspond in all points to the traditional notion of segment.
For the evolutionary part of my story I am only able to offer arguments
based primarily on the plausibility of the expected outcome of a "gedanken"
simulation; an actual simulation of the evolution of speech using computer
models has not been done yet.2 However, Lindblom (1984, 1989) has
simulated and explored in detail some aspects of the scenario presented here.
Also relevant is a comparison of speech-sound sequences done by Kawasaki
(1982) and summarized by Ohala and Kawasaki (1984). These will be
discussed below. In any case, much of my argument consists of bringing well-
known phonetic principles to bear on the issue of how speech sounds can be
made different from each other - the essential function of the speech code.
natural physical and physiological principles and the constraints of the
ecological niche occupied by humans). We then assign the vocal tract the task
of creating a vocabulary of a few hundred different utterances (words) which
have the following properties:
1 They must be inherently robust acoustically, that is, easily differentiated
from the acoustic background and also sufficiently different from each other.
I will refer to both these properties as "distinctness". Usable measures of
acoustic distinctness exist which are applicable to all-voiced speech with no
discontinuities in its formant tracks; these have been applied to tasks
comparable to that specified here (Kawasaki 1982; Lindblom 1984; Ohala et
al. 1984). Of course, speech involves acoustic modulations in more than just
spectral pattern; there are also modulations in amplitude, degree of periodi-
city, rate of periodicity (fundamental frequency), and perhaps other para-
meters that characterize voice quality. Ultimately all such modulations have
to be taken into account.
2 Errors in reception are inevitable, so it would be desirable to have some
means of error correction or error reduction incorporated into the code.
3 The rate and magnitude of movements of the vocal tract must operate
within its own physical constraints and within the constraints of the ear to
detect acoustic modulations. What I have in mind here is, first, the obser-
vation that the speech organs, although having no constraint on how slowly
they can move, definitely have a constraint on how rapidly they can move.
Furthermore, as with any muscular system, there is a trade-off between
amplitude of movement and the speed of movement; the movements of
speech typically seem to operate at a speed faster than that which would
permit maximal amplitude of movement but much slower than the maximal
rate of movement (Ohala 1981b, 1989). (See McCroskey 1957; Lindblom
1983; Lindblom and Lubker 1985 on energy expenditure during speech.) On
the auditory side, there are limits to the magnitude of an optimal acoustic
modulation, i.e., any change in a sound. Thus, as we know from numerous
psychophysical studies, very slow changes are hardly noticeable and very
rapid changes present a largely indistinguishable "blur" to the ear. There is
some optimal range of rates of change in between these extremes (see
Licklider and Miller 1951; Bertsch et al. 1956). Similar constraints govern the
rate of modulations detectable by other sense modalities and show up in, e.g.,
the use of flashing lights to attract attention.
4 The words should be as short as possible (and we might also establish an
upper limit on the length of a word, say, 1 sec). This is designed to prevent a
vocabulary where one word is /ba/, another /baba/, another /bəbəbə/ etc.,
with the longest word consisting of a sequence of n /bə/s where n = the size
of the vocabulary.
[figure 7.1: vowel chart; axis-scale residue omitted]
Figure 7.1 Vowel space with five hypothetical vowels corresponding to the vocal-tract
configurations shown in figure 7.2. Abscissa: Formant 1; ordinate: Formant 2. For reference, the
average peripheral vowels produced by adult male speakers, as reported by Peterson and Barney
(1952), are shown by filled squares connected by solid lines
[figure 7.2: five area functions plotted from glottis (right) to lips (left)]
Figure 7.2 Five hypothetical vocal-tract shapes corresponding to the formant-frequency
positions in figure 7.1. Vertical axis: vocal-tract cross-sectional area; horizontal axis: vocal-
tract length from glottis (right) to lips (left). See text for further explanation
(Lehiste 1970: 28); thus there is interaction between glottal state and the
overall consonantal duration. The American English vowel [ɚ], which is
characterized by the lowest third formant of any human vowel, has three
constrictions: labial, mid-palatal, and pharyngeal (Uldall 1958; Delattre
1971; Ohala 1985b). These three locations are precisely the locations of the
three antinodes of the third standing wave (the third resonance) of the vocal
tract. In many languages the elevation of the soft palate in vowels is
correlated with vowel height or, what is probably more to the point, inversely
correlated with the first formant of the vowel (Lubker 1968; Fritzell 1969;
Ohala 1975). There is much cross-linguistic evidence that [u]-like vowels are
characterized not only by the obvious lip protrusion but also by a lowered
larynx (vis-a-vis the larynx position for a low vowel like [a]) (Ohala and
Eukel 1987). Presumably, this lengthening of the vocal tract helps to keep the
vowel resonances as low as possible and thus maximally distinct from other
vowels.
As alluded to above, it is well known in sensory physiology that modula-
tions of stimulus parameters elicit maximum response from the sensory
receptor systems only if they occur at some optimal rate (in time or space,
depending on the sense involved). A good prima facie case can be made that
the speech events which come closest to satisfying this requirement for the
auditory system are what are known as "transitions" or the boundaries
between traditional segments, e.g. bursts, rapid changes in formants and
amplitude, changes from silence to sound or from periodic to aperiodic
excitation and vice versa. So all that has been argued for so far is that
temporally coordinated gestures would evolve - including, perhaps, some
acoustic events consisting of continuous trajectories through the vowel space,
clockwise and counterclockwise loops, S-shaped loops, etc. These may not
fully satisfy all of our requirements for the notion of "segment," so other
factors, discussed below, must also come into play.
7.2.2.3 "Steady-state"
Regarding steady-state segments, several things need to be said. First of all,
from an articulatory point of view there are few if any true steady-state
postures adopted by the speech organs. However, due to the nonlinear
mapping from articulation to aerodynamics and to acoustics there do exist
near steady-states in these latter domains.3 In most cases the reason for this
3
Thus the claim, often encountered, that the speech signal is continuous, that is, shows few
discontinuities and nothing approximating steady-states in between (e.g. Schane 1973: 3;
Hyman 1975: 3), is exaggerated and misleading. The claim is largely true in the articulatory
domain (though not in the aerodynamic domain). And it is true that in the perceptual domain
the cues for separate segments or "phonemes" may overlap, but this by itself does not mean
that the perceptual signal has no discontinuities. The claim is patently false in the acoustic
domain as even a casual examination of spectrograms of speech reveals.
nonlinear relationship is not difficult to understand. Given the elasticity of
the tissue and the inertia of the articulators, during a consonantal closing
gesture the articulators continue to move even after complete closure is
attained. Nevertheless, for as long as the complete closure lasts it effectively
attenuates the output sound in a uniform way. Other parts of the vocal tract
can be moving and still there will be little or no acoustic output to reveal it.
Other nonlinearities govern the creation of steady-states or near-steady-
states for other types of speech events (Stevens 1972, 1989).
But there may be another reason why steady-states would be included in
the speech signal. Recall the task constraint that the code should include
some means for error correction or error reduction. Benoit Mandelbrot
(1954) has argued persuasively that any coded transmission subject to errors
could effect error reduction or at least error limitation by having "break-
points" in the transmission. Consider the consequences of the alternative,
where everything transmitted in between silence constituted the individual
cipher. An error affecting any part of that transmission would make the
entire transmission erroneous. Imagine, for example, a Morse-code type of
system which for each of the 16 million possible sentences that could be
conveyed had a unique string of twenty-four dots and dashes. An error on
even one of the dots and dashes would make the whole transmission fail. On
the other hand if the transmission had breakpoints often enough, that is,
places where what had been transmitted so far could be decoded, then any
error could be limited to that portion and it would not nullify the whole of
the transmission. Checksums and other devices in digital communications
are examples of this strategy. I think the steady-states that we find in speech,
from 50 to 200 msec or so in duration, constitute the necessary "dead"
intervals or breakpoints that clearly demarcate the chunks with high infor-
mation density. During these dead intervals the listener can decode these
chunks and then get ready for the subsequent chunks. What I am calling
"dead" intervals are, of course, not truly devoid of information but I would
maintain that they transmit information at a demonstrably lower rate than
that during the rapid acoustic modulations they separate. This, in fact, is the
interpretation I give to the experimental results of Öhman (1966b) and
Strange, Verbrugge, and Edman (1976).
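Mandelbrot's point can be illustrated with a toy transmission model (the message, chunk size, and loss measure are all arbitrary choices of mine, not anything from the chapter):

```python
# Illustrative sketch: with breakpoints, a single transmission error
# corrupts only one chunk; without them, the whole message is one
# cipher and any error invalidates all of it.

def corrupt_one_symbol(msg, pos, symbol="#"):
    """Flip one symbol of the message to simulate a channel error."""
    return msg[:pos] + symbol + msg[pos + 1:]

message = "thequickbrownfoxjumps"
garbled = corrupt_one_symbol(message, 5)

# Monolithic code: everything between silences is the cipher, so any
# error makes the entire transmission erroneous.
monolithic_loss = len(message) if garbled != message else 0

# Chunked code: decode chunk by chunk; only the chunk containing the
# error is lost, and the damage is localized.
chunk = 3
chunks_ok = [message[i:i + chunk] == garbled[i:i + chunk]
             for i in range(0, len(message), chunk)]
chunked_loss = chunk * chunks_ok.count(False)

print(monolithic_loss, chunked_loss)   # 21 3
```

Localizing the error in this way is also what makes correction by redundancy possible: a listener who knows the inventory can repair one bad chunk, as in the "skrawberry"/strawberry example below.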
It must be pointed out that if there is a high amount of redundancy in the
code, which is certainly true of any human language's vocabulary, then the
ability to localize an error of transmission allows error correction, too.
Hearing "skrawberry" and knowing that there is no such word while there is
a word strawberry allows us to correct a (probable) transmission error.
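The kind of lexically driven error correction described here can be illustrated with a toy sketch (the lexicon and the similarity cutoff are illustrative assumptions, not from the text): a nonword is mapped onto its nearest neighbor in the vocabulary, exploiting the redundancy of the lexicon to repair a localized transmission error.

```python
import difflib

# A toy lexicon; redundancy in the vocabulary means a localized
# error usually leaves only one plausible correction.
lexicon = ["strawberry", "blueberry", "raspberry", "street", "straw"]

def correct(heard):
    """Map a nonword to its closest lexical neighbor, if any."""
    matches = difflib.get_close_matches(heard, lexicon, n=1, cutoff=0.6)
    return matches[0] if matches else heard

print(correct("skrawberry"))
```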
I believe that these chunks or bursts of high-density information flow are
what we call "transitions" between phonemes. I would maintain that these
are the kind of units required by the constraints of the communication task.
173
Segment
These are what the speaker is intending to produce when coordinating the
movements of diverse articulators4 and these are what the listener attends
to.
Nevertheless, these are not equivalent to our traditional conception of the
"segment." The units arrived at up to this point contain information on a
sequential pair of traditional segments. Furthermore, the inventory of such
units is larger than the inventory of traditional segments by an order of
magnitude. Finally, what I have called the "dead interval" between these
units is equivalent to the traditional segment (the precise boundaries may be
somewhat ambiguous but that, in fact, corresponds to reality).
I think that our traditional conception of the segment arises from the fact
that adjacent pairs of novel segments, i.e. transitions, are generally corre-
lated. For example, the transition found in the sequence /ab/ is almost
invariably followed by one of a restricted set of transitions, those characteristic of /bi/, /be/, /bu/, etc., but not /gi/, /de/. As it happens, this correlation
between adjacent pairs of transitions arises because it is not so easy for our
vocal tract to produce uncorrelated transitions: the articulator that makes a
closure is usually the same one that breaks the closure. The traditional
segment, then, is an entity constructed by speakers-listeners; it has a
psychological reality based on the correlations that necessarily occur between
successive pairs of the units that emerge from the underlying articulatory
constraints.
The relationship between the acoustic signal, the transitions which require
the close temporal coordination between articulators, and the traditional
segments is represented schematically in figure 7.3.
7.2.2.4 Features
If an acoustically salient gesture is "discovered" by combining labial closure,
velic elevation, and glottal abduction, will the same velic elevation and
glottal abduction be "discovered" to work well with apical and dorsal
closures? Plausibly, the system should also be able to discover how to
"recycle" features, especially in the case of modulations made distinct by the
combination of different "valves" in the vocal tract. There are, after all, very
few options in this respect: glottis, velum, lips, and various actions of the
tongue (see also Fujimura 1989b). A further limitation exists in the options
available for modulating and controlling spectral pattern by virtue of the fact
that the standing wave patterns of the lowest resonances have nodes and
4
The gestures which produce these acoustic modulations may require not only temporal
coordination between articulators but also precision in the articulatory movements them-
selves. This may correspond to what Fujimura (1986) calls "icebergs": patterns of temporally
localized invariant articulatory gestures separated by periods where the gestures are more
variable.
7 John J. Ohala
[Figure 7.3 appears here: three aligned panels (a)-(c) over a time axis, illustrated with the sequence of transition units ep, pi, ik, ka.]
Figure 7.3 Relationship between acoustic speech signal (a), the units with high-rate-of-information transmission that require close temporal coordination between articulators (b), and the traditional segment (c)
antinodes at discrete and relatively few locations in the vocal tract (Chiba
and Kajiyama 1941; Fant 1960; Stevens 1972, 1989): an expansion of the
pharynx would serve to keep F1 as low as possible when accompanying a
palatal constriction (for an [i]) as well as when accompanying simultaneous
labial and uvular constrictions (for an [u]) due to the presence there of an
antinode in the pressure standing wave of the lowest resonance.5
Having said this, however, it would be well not to exaggerate (as
phonologists often do) the similarity in state or function of what is
considered to be the "same" feature when used with different segments. The
same velic coupling will work about as well with a labial closure as an apical
one to create [m] and [n] but as the closure gets further back the nasal
consonants that result get progressively less consonantal. This is because an
5
Pharyngeal expansion was not used in the implementation of the [u]-like vowel 3 in figure 7.1,
but if it had been it would have approached more closely the corner vowel [u] from the
Peterson and Barney study.
important element in the creation of a nasal consonant is the "cul-de-sac"
resonating cavity branching off the pharyngeal-nasal resonating cavity. This
"cul-de-sac" naturally gets shorter and acts less effectively as a separate
cavity the further back the oral closure is (Fujimura 1962; Ohala 1975, 1979a,
b; Ohala and Lorentz 1977). I believe this accounts for the lesser incidence or
more restricted distribution of [ŋ] in the sound systems of the languages of
the world.6 Similarly, although a stop burst is generally a highly salient
acoustic event, all stop bursts are not created equal. Velar and apical stop
bursts have the advantage of a resonating cavity downstream which serves to
reinforce their amplitude; this is missing in the case of labial stop bursts.
Accordingly, among stops that rely heavily on bursts, i.e. voiceless stops
(pulmonic or glottalic), the labial position is often unused, has a highly
restricted distribution, or simply occurs less often in running speech (Wang
and Crawford 1960; Gamkrelidze 1975; Maddieson 1984: ch. 2). The more
one digs into such matters, the more differences are found in the "same"
feature occurring in different segments: as mentioned above, Sawashima and
Hirose have found differences in the character of glottal state during
fricatives vis-a-vis cognate stops. The conclusion to draw from this is that
what matters most in speech communication is making sounds which differ
from each other; it is less important that these be made out of recombinations
of the same gestures used in other segments. The orderly grid-like systems of
oppositions among the sounds of a language which one finds especially in
Prague School writings (Trubetzkoy 1939 [1969]) are quite elusive when
examined phonetically. Instead, they usually exhibit subtle or occasionally
not-so-subtle asymmetries. Whether one can make a case for symmetry
phonologically is another matter but phonologists cannot simply assume
that the symmetry is self-evident in the phonetic data.
7.3 Interpretation
between features. The problem is that there exist many different types of
physical relationships between the various features. Insofar as a phonetic
basis has been considered in feature geometry, it is primarily only that of
spatial anatomical relations. But there are also aerodynamic and acoustic
relations, and feature geometry, as currently proposed, ignores these. These
latter domains link anatomically distant structures. Some examples (among
many that could be cited): simultaneous oral and velic closures inhibit vocal-
cord vibration; a lowered soft palate not only inhibits frication and trills in
oral obstruents (if articulated at or further forward of the uvula) but also
influences the F1 (height) of vowels; the glottal state of high airflow segments
(such as /s/, /p/), if assimilated onto adjacent vowels, creates a condition that
mimics nasalization and is apparently reinterpreted by listeners as nasalization; labial-velar segments like [w, kp] pattern with plain labials ([+anterior]) when they influence vowel quality or when frication or noise bursts are involved, but they frequently pattern like velars ([-anterior]) when nasal
consonants assimilate to them; such articulatorily distant and disjoint
secondary articulations as labialization, retroflexion, and pharyngealization
have similar effects on high vowels (they centralize [i] maximally and have
little effect on [u]) (Ohala 1976, 1978, 1983, 1985a, b; Beddor, Krakow, and
Goldstein 1986; Wright 1986).
I challenge the advocates of "feature geometry" to represent such criss-
crossing and occasionally bidirectional dependencies in terms of asymmetric,
transitive, relationships. In any case, the attempt to explain these and a host
of other dependencies other than by reference to phonetic principles will be
subject to the fundamental criticism: even if one can devise a formal
relabeling of what does happen in speech, one will not be able to show in
principle - that is, without ad hoc stipulations - why certain patterns do not
happen. For example, why should [+nasal] affect primarily the feature [high]
in vowels and not the feature [back]? Why should [-continuant] [-nasal]
inhibit [+voice] instead of [-voice]?
ments, but there is at least some anecdotal and experimental evidence that
can be cited and it is not all absolutely inconsistent with the autosegmental
position (though, I would maintain, it does not unambiguously support it
either). Systematic investigation of the issues is necessary, though, before any
confident conclusions may be drawn.
Even outside of linguistics, analyzers of song and poetry have for millennia
extracted metrical and prosodic structures from songs and poems. An
elaborate vocabulary exists to describe these extracted prosodies, e.g. in the
Western tradition the Greeks gave us terms and concepts such as iamb,
trochee, anapest, etc. Although worth further study, it is not clear what
implication this has for the psychological reality of autosegments. Linguisti-
cally naive (as well as linguistically sophisticated) speakers are liable to the
reification fallacy. Like Plato, they are prone to regard abstract concepts as
real entities. Fertility, war, learning, youth, and death are among the many
fundamental abstract concepts that people have often hypostatized, some-
times in the form of specific deities. Yet, as we all know, these concepts only
manifest themselves when linked with specific concrete people or objects.
They cannot "float" as independent entities from one object to another.
Though more prosaic than these (so to speak), is iamb any different? Are
autosegments any different?
But even if we admit that ordinary speakers are able to form concepts of
prosodic categories paralleling those in autosegmental phonology, no cul-
ture, to my knowledge, has shown an awareness of comparable concepts
involving, say, nasal (to consider one feature often treated autosegmentally).
That is, there is no vocabulary and no concept comparable to iamb and
trochee for the opposite patterns of values for [nasal] in words like dam vs.
mid or mountain vs. damp. The concepts and vocabulary that do exist in this
domain concerning the manipulation of nonprosodic entities are things like
rhyme, alliteration, and assonance, all of which involve the repetition of
whole segments.
Somewhat more to the point, psychologically, is evidence from speech
errors, word games and things like "tip of the tongue" (TOT) recall. Errors
of stress placement and intonation contour do occur (Fromkin 1976; Cutler
1980), but they are often somewhat difficult to interpret. Is the error of
ambiguty for the target ambiguity a grafting of the stress pattern from the
morphologically related word ambiguous (which would mean that stress is an
entity separable from the segments it sits on) or has the stem of this latter
word itself intruded? Regarding the shifting of other features, including
[nasal] and those for places of articulation, there is some controversy.
Fromkin (1971) claimed there was evidence of feature interchange, but
Shattuck-Hufnagel and Klatt (1979) say this is rare - usually whole bundles
of features, i.e. phonemes, are what shift. Hombert (1986) has demonstrated
using word games that the tone and vowel length of words can, in some cases
(but not all), be stripped off the segments they are normally realized on and
materialized in new places. In general, though, word games show that it is
almost invariably whole segments that are manipulated, not features. TOT
recall (recall of some aspects of the pronunciation of a word without full
retrieval of the word) frequently exhibits awareness of the prosodic character
of the target word (including the number of syllables, as it happens; Brown
and McNeill 1966; Browman 1978). Such evidence is suggestive but unfortu-
nately, even when knowledge of features is demonstrated, it does not provide
crucial evidence for differentiating between the psychological reality of
traditional (mutually linked) features and autonomous features, i.e., autoseg-
ments. There is as yet no hitching post for autosegmental theory in this data.
before palato-alveolars, e.g., spatial and special become homophones. (See
also Kawasaki [1986] regarding the perceptual "invisibility" of nasalization
near nasal consonants.)
Thus, from a phonetic point of view such spill-over of articulatory gestures
is well known (at least since the early physiological records of speech using
the kymograph) and it is a constant and universal feature of speech, even
before any sound change occurs which catches the attention of the linguist.
Many features thus come "prespread," so to speak; they do not start
unspread and then migrate to other segments. Such spill-over only affects the
phonological interpretation of neighboring elements if a sound change
occurs. I have presented evidence that sound change is a misapprehension or
reinterpretation on the part of the listener (Ohala 1974, 1975, 1981b, 1985a,
1987,1989). Along with this reinterpretation there may be some exaggeration
of aspects of the original pronunciation, e.g. the slight nasalization on a
vowel may now be heavy and longer. Under this view of sound change, no
person, neither the speaker nor the listener, has implemented a change in the
sense of having in their mental grammar a rule that states something like /e/
→ /ej/ / _ /ʃ/; rather, the listener parses the signal in a way that differs from the
way the speaker parses it. Similarly, if a reader misinterprets a carelessly
handwritten "n" as the letter "u," we would not attribute to the writer or the
reader the psychological act or intention characterized by the rule "n" →
"u". Such a rule would just be a description of the event from the vantage
point of an observer (a linguist?) outside the speaker's and listener's domains.
In the case of sound patterns of language, however, we are now able to go
beyond such external, "telescoped," descriptions of events and provide
realistic, detailed, accounts in terms of the underlying mechanisms.
The migration of features is therefore not evidence for autosegmental
representations and is not evidence capable of countering the claim that
features are nonautonomous. There is no mental representation requiring
unlinked or temporally uncoordinated features.
7.4 Conclusions
I have argued that features are so bound together due to physical principles
and task constraints that if we started out with uncoordinated features they
would have linked themselves of their own accord.10 Claims that features
can be unlinked have not been made with any evident awareness of the full
phonetic complexity of speech, including not only the anatomical but also
the aerodynamic and the acoustic-auditory principles governing it. Thus,
more than twenty years after the defect was first pointed out, phonological
10
Similar arguments are made under the heading of "feature enhancement" by Stevens, Keyser,
and Kawasaki (1986) and Stevens and Keyser (1989).
representations still fail to reflect the "intrinsic content" of speech (Chomsky
and Halle 1968: 400ff.). They also suffer from a failure to consider fully the
kind of diachronic scenario which could give rise to apparent "spreading" of
features, one of the principal motivations for unlinked features.
What has been demonstrated in the autosegmental literature is that it is
possible to represent speech-sound behavior using autosegments which
eventually become associated with the slots in the CV skeleton. It has not
been shown that it is necessary to do so. The same phonological phenomena
have been represented adequately (though still not explanatorily) without
autosegments. But there must be an infinity of possible ways to represent
speech (indeed, we have seen several in the past twenty-five years and will no
doubt see several more in the future); equally, it was possible to represent
apparent solar and planetary motion with the Ptolemaic epicycles and the
assumption of an earth-centered universe. But we do not have to relive
history to see that simply being able to "save the appearances" of pheno-
mena is not justification in itself for a theory. However, even more damaging
than the lack of a compelling motivation for the use of autosegments, is that
the concept of autosegments cannot explain the full range of phonological
phenomena which involve interactions between features, a very small sample
of which was discussed above. This includes the failure to account for what
does not occur in phonological processes or occurs much less commonly. On the other hand, I think phonological accounts which make
reference to the full range of articulatory, acoustic, and auditory factors,
supported by experiments, have a good track record in this regard (Ohala
1990a).
Comments on chapter 7
G. N. CLEMENTS
In his paper "The segment: primitive or derived?" Ohala constructs what he
calls a "plausibility argument" for the view that there is no level of
phonological representation in which features are not coordinated with each
other in a strict one-to-one fashion.* In contrast to many phoneticians who
have called attention to the high degree of overlap and slippage in speech
production, Ohala argues that the optimal condition for speech perception
requires an alternating sequence of precisely coordinated rapid transitions
and steady-states. From this observation, he concludes that the phonological
*Research for this paper was supported in part by grant no. INT-8807437 from the National
Science Foundation.
7 Comments
synchronic assimilation rules, though constraints on segment sequences may
reflect the reinterpretation (or "misanalysis") of phonetic processes operat-
ing at earlier historical periods. If true, this argument seriously undermines
the theory at its basis. But is it true? Do we have any criteria for determining
when a detectable linguistic generalization is a synchronic rule?
This issue has been raised elsewhere by Ohala himself. He has frequently
expressed the view that a grammar which aims at proposing a model of
speaker competence must distinguish between regularities which the speaker
is "aware" of, in the sense that they are used productively, and those that are
present only for historical reasons, and which do not form part of the
speaker's grammar (see e.g. Ohala 1974; Ohala and Jaeger 1986). In this
view, the mere existence of a detectable regularity is not by itself evidence
that it is incorporated into the mental grammar as a synchronic rule. If a
regularity has the status of a rule, we expect it to meet what we might call the
productivity standard: the rule should apply to new forms which the speaker
has not previously encountered, or which cannot have been plausibly
memorized.
Since its beginnings, autosegmental phonology has been based on the
study of productive rules in just this sense, and has based its major theoretical
findings on such rules. Thus, for example, in some of the earliest work, Leben
(1973) showed that when Bambara words combine into larger phrases, their
surface tone melody varies in regular ways. This result has been confirmed
and extended in more recent work showing that the surface tone pattern of
any word depends on tonal and grammatical information contributed by the
sentence as a whole (Rialland and Badjime 1989). Unless all such phrases are
memorized, we must assume that unlinked "tone melodies" constitute an
autonomous functional unit in the phonological composition of Bambara
words.
Many other studies in autosegmental phonology are based on productive
rules of this sort. The rules involved in the Igbo and Kikuyu tone systems, for
example, apply across word boundaries and, in the case of Kikuyu, can affect
multiple sequences of words at once. Unless we are willing to believe that
entire sentences are listed in the lexicon, we are forced to conclude that the
rules are productive, and part of the synchronic grammar.12 Such evidence is
not restricted to tonal phenomena. In Icelandic, preaspirated stops are
created by the deletion of the supralaryngeal features of the first member of a
geminate unaspirated stop and of the laryngeal features of the second. In his
study of this phenomenon, Thrainsson (1978) shows at considerable length
that it satisfies a variety of productivity criteria, and must be part of a
synchronic grammar. In Luganda, the rules whose operation brings to light
12
For Igbo see Goldsmith (1976), Clark (1990); for Kikuyu see Clements (1984) and references
therein.
the striking phenomenon of "mora stability" apply not only within morpho-
logically complex words but also across word boundaries. The independence
of the CV skeleton in this language is confirmed not only by regular
alternations, but also by the children's play language called Ludikya, in
which the segmental content of syllables is reversed while length and tone
remain constant (Clements 1986). In English, the rule of intrusive stop
formation which inserts a brief [t] in words like prince provides evidence for
treating the features characterizing oral occlusion as an autosegmental node
in hierarchical feature representation (see Clements 1987 for discussion); the
productivity of this rule has never been questioned, and is experimentally
demonstrated in Ohala (1981a). In sum, the argumentation upon which
autosegmental phonology is based has regularly met the productivity stan-
dard as Ohala and others have characterized it. We are a long way from the
days when to show that a regularity represented a synchronic rule, it was
considered sufficient just to note that it existed.
But even if we agree that autosegmental rules constitute a synchronic
reality, a fundamental question still remains: if, as Ohala argues, linear
coordination of features represents the optimal condition for speech percep-
tion, why do we find feature asynchrony at all? The reasons for this lie, at
least in part, in the fact that phonological structure involves not only
perceptually motivated constraints, but also articulatorily motivated con-
straints, as well as higher-order grammatical considerations. Phonology (in
the large sense, including much of what is traditionally viewed as phonetics)
is concerned with the mapping between abstract lexicosyntactic represen-
tations and their primary medium of expression, articulated speech. At one
end of this mapping we find linguistic structures whose formal organization
is hierarchical rather than linear, and at the other end we find complex
interactions of articulators involving various degrees of neural and muscular
synergy and inertia. Neither type of structure lends itself readily or insight-
fully to expression in terms of linear sequences of primitives (segments) or
stacks of primitives (feature bundles).
In many cases, we find that feature asynchrony is regularly characteristic
of phonological systems in which features and feature sets larger and smaller
than the segment have a grammatical or morphological function. For
instance, in languages where tone typically serves a grammatical function (as
in Bantu languages, or many West African languages), we find a greater
mismatch between underlying and surface representations than in languages
where its function is largely lexical (as in Chinese). In Bambara, to take an
example cited earlier, the floating low tone represents the definite article and
the floating high tone is a phrasal-boundary marker; while in Igbo, the
floating tone is the associative-construction marker. The well-known non-
linearities found in Semitic verb morphology are due to the fact that
consonants, vowels, and templates all play a separate grammatical role in the
make-up of the word (McCarthy 1981). In many further cases, autosegmen-
talized features have the status of morpheme-level features, rather than
segment-level features (to use terminology first suggested by Robert Vago).
Thus in many vowel-harmony systems, the harmonic feature (palatality,
ATR (Advanced Tongue Root), etc.) commutes at the level of the root or
morpheme, not the segment. In Japanese, as Pierrehumbert and Beckman
point out (1988), some tones characterize the morpheme while others
characterize the phrase, a fact which these authors represent by linking each
tone to the node it characterizes. What these and other examples suggest is
that nonlinearities tend to arise in a system to the extent that certain subsets
of features have a morphological or syntactic role independent of others.
Other types of asynchronies between features appear to have articulatory
motivation, reflecting the relative sluggishness of some articulators with
respect to others (cf. intrusive stop formation, nasal harmonies, etc.), while
others may have functional or perceptual motivation (the Icelandic pre-
aspiration rule preserves the distinction between underlying aspirated and
unaspirated geminate stops from the effects of a potentially neutralizing
deaspiration rule, but it translates this distinction into one between preaspir-
ated and unaspirated geminates). If all such asynchronies represent depar-
tures from the optimal or "default" state and in this way add to the formal
complexity of a phonological representation, then many of the rules and
principles of autosegmental phonology can be viewed as motivated by the
general, overriding principle: reduce complexity.
Ohala argues that "feature geometry," a model which uses evidence from
phonological rules to specify a hierarchical organization among features
(Clements 1985; Sagey 1986a; McCarthy 1988), does not capture all observ-
able phonetic dependencies among features, and is therefore incomplete.
However, feature geometry captures a number of significant cross-linguistic
generalizations that could not be captured in less structured feature systems,
such as the fact that the features defining place of articulation commonly
function as a unit in assimilation rules. True, it does not and cannot express
certain further dependencies, such as the fact that labiality combines less
optimally with stop production (as in [p]) than do apical or velar closures (as
in [t] or [k]). But not all such generalizations form part of phonological
grammars. Thus, phonologists have not discovered any tendency for rules to
refer to the set of all stops except [p] as a natural class. On the contrary, in
spite of its less efficient exploitation of vocal-tract mechanics, [p] consistently
patterns with [t] and [k] in rules referring to oral stops, reflecting the general
tendency of phonological systems to impose a symmetrical classification
upon speech sounds sharing linguistically significant properties. If feature
geometry attempted to derive all phonetic as well as phonological dependen-
cies from its formalism, it would fail to make correct predictions about cross-
linguistically favored rule types, in this and many other cases.
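The claim that place features function as a unit can be illustrated with a toy representation (the feature names, groupings, and data structure are my own illustrative assumptions, not a proposal from the feature-geometry literature): because the place features hang together under a single node, an assimilation rule can spread them all in one operation, as in nasal place assimilation.

```python
import copy

# Toy hierarchical feature representation: the "place" node groups
# the place features, so a rule can copy them as a single unit.
def segment(root_features, place_features):
    return {"root": dict(root_features), "place": dict(place_features)}

n = segment({"nasal": "+", "continuant": "-"}, {"coronal": "+", "anterior": "+"})
k = segment({"nasal": "-", "continuant": "-"}, {"dorsal": "+", "high": "+"})

def assimilate_place(target, trigger):
    """Spread the trigger's entire place node onto the target
    (e.g. /n/ + /k/ -> [ŋk]): one operation, not one rule per feature."""
    out = copy.deepcopy(target)
    out["place"] = copy.deepcopy(trigger["place"])
    return out

eng = assimilate_place(n, k)  # stays [+nasal] but takes on velar place
```

In a flat, unstructured feature bundle the same change would have to be stated feature by feature, missing the generalization that place assimilates as a block.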
Ohala rejects autosegmental phonology (and indeed all formal approaches
to phonology) on the grounds that its formal principles may have an ultimate
explanation in physics and psychoacoustics, and should therefore be super-
fluous. But this argument overlooks the fact that physics and psychology are
extremely complex sciences, which are subject to multiple (and often conflict-
ing) interpretations of phenomena in almost every area. Physics and psy-
chology in their present state can shed light on some aspects of phonological
systems, but, taken together, they are far from being able to offer the hard,
falsifiable predictions that Ohala's reductionist program requires if it is to
acquire the status of a predictive empirical theory. In particular, it is often
difficult to determine on a priori grounds whether articulatory, aerodynamic,
acoustic, or perceptual considerations play the predominant role in any given
case, and these different perspectives often lead to conflicting expectations. It
is just here that the necessity for formal models becomes apparent. The
advantage of using formal models in linguistics (and other sciences) is that
they can help us to formulate and test hypotheses within the domain of study
even when we do not yet know what their ultimate explanation might be. If
we abandoned our models on the grounds that we cannot yet explain and
interpret them in terms of higher-level principles, as Ohala's program
requires, we would make many types of discovery impossible. To take one
familiar example: Newton could not explain the Law of Gravity to the
satisfaction of his contemporaries, but he could give a mathematical state-
ment of it - and this statement proved to have considerable explanatory
power.
It is quite possible that, ultimately, all linguistic (and other cognitive)
phenomena will be shown to be grounded in physical, biological, and
psychological principles in the largest sense, and that what is specific to the
language faculty may itself, as some have argued, have an evolutionary
explanation. But this possibility does not commit us to a reductionist
philosophy of linguistics. Indeed, it is only by constructing explicit, predictive
formal or mathematical models that we can identify generalizations whose
relations to language-external phenomena (if they exist) may one day become
clear. This is the procedure of modern science, which has been described as
follows by one theoretical physicist (Hawking 1988: 10): "A theory is a good
theory if it satisfies two requirements: It must ultimately describe a large class
of observations on the basis of a model that contains only a few arbitrary
elements, and it must make definite predictions about the results of future
observations." This view is just as applicable to linguistics and phonetics as it
is to physics.
Ohala's paper contains many challenging and useful ideas, but it over-
states its case by a considerable margin. We can agree that temporal
coordination plays an important role in speech production and perception
without concluding that phonological representations are uniformly segmen-
tal. On the contrary, the evidence from a wide and typologically diverse
number of languages involving both "suprasegmental" and traditionally
segmental features demonstrates massively and convincingly that phonologi-
cal systems tolerate asynchronic relations among features at all levels of
representation. This fact of phonological structure has both linguistic and
phonetic motivation. Although phonetics and phonology follow partly
different methodologies and may (as in this case) generate different hypoth-
eses about the nature of phonological structure, the results of each approach
help to illuminate the other, and take us further towards our goal of
providing a complete theory of the relationship between discrete linguistic
structure and the biophysical continuum which serves as its medium.
8
Modeling assimilation in nonsegmental,
rule-free synthesis
JOHN LOCAL
8.1 Introduction
Only relatively recently have phonologists begun the task of seriously testing
and evaluating their claims in a rigorous fashion.1 In this paper, I attempt
to sustain this task by discussing a computationally explicit version of one
kind of structured phonology, based on the Firthian prosodic approach to
phonological interpretation, which is implemented as an intelligent know-
ledge-based "front-end" to a laboratory formant speech synthesizer (Klatt
1980). My purpose here is to report on how the superiority of a structured
monostratal approach to phonology over catenative segmental approaches
can be demonstrated in practice. The approach discussed here compels new
standards of formal explicitness in the phonological domain as well as a need
to pay serious attention to parametric and temporal detail in the phonetic
domain (see Browman and Goldstein 1985, 1986). The paper falls into two
parts: the first outlines the nonsegmental approach to phonological interpre-
tation and representation; the second gives an overview of the way "process"
phenomena are treated within this rule-free approach and presents an
analysis of some assimilatory phenomena in English and the implementation
of that analysis within the synthesis model. Although the treatment of
assimilation presented here is similar to some current proposals (e.g. Lodge
1984), it is, to the best of my knowledge, unique in having been given an
explicit computational implementation.
The approach to phonological analysis presented here is discussed at length in Kelly and
Local (1989), where a wide range and variety of languages are considered. The synthesis of
English which forms part of our work in phonological theory is supported by a grant from
British Telecom PLC. The work is collaborative and is being carried out by John Coleman
and myself. Though my name appears as the author of this paper I owe a great debt to John
Coleman, without whom this work could not have taken the shape it does.
8.3.1 Abstractness: phonology and phonetics demarcation
One of the central aspects of the Firthian approach to phonology,2 and one
that still distinguishes it from much current work, is the insistence on a strict
distinction between phonetics and phonology (see also Pierrehumbert and
Beckman 1988). This is a central commitment in our work. We take seriously
Trubetzkoy's dictum that:
The data for the study of the articulatory as well as the acoustic aspects of speech
sounds can only be gathered from concrete speech events. In contrast, the linguistic
values of sounds to be examined by phonology are abstract in nature. They are above
all relations, oppositions, etc., quite intangible things, which can be neither perceived
nor studied with the aid of the sense of hearing or touch. (Trubetzkoy 1939 [1969]: 13)
Like the Firthians and Trubetzkoy, we take phonology to be relational: it is a
study whose descriptions are constructed in terms of structural and systemic
contrast; in terms of distribution, alternation, opposition, and composition.
Our formal approach, then, treats phonology as abstract; this has a number
of important consequences. For example, this enables us (like the Firthians)
to employ a restrictive, monostratal phonological representation. There is
only one level of phonological representation and one level of phonetic
representation; there are no derivational steps. This means that for us it
would be incoherent to say such things as "a process that converts a high
tone into a rising tone following a low tone" (Kaye 1988: 1) or "a striking
feature of many Canadian dialects of English is the implementation of the
diphthongs [ay] and [aw] as [Ay] and [AW]" (Bromberger and Halle 1989: 58);
formulations such as these simply confuse phonological categories with their
phonetic exponents. Because phonological descriptions and representations
encode relational information they are abstract, algebraic objects appropri-
ately formulated in the domain of set theory.
In contrast, phonetic representations are descriptions of physical,
temporal events formulated in a physical domain. This being the case, it
makes no sense to talk of structure or systems in phonetics: there may be
differences between portions of utterance, but in the phonetics there can be
no "distinctions." The relationship between phonology and phonetics is
arbitrary (in the sense of Saussure) but systematic; I know of no evidence that
suggests otherwise (see also Lindau and Ladefoged 1986). The precise form
2 One important aspect of our approach which I will not discuss here (but see Kelly and Local
1989) is the way we conduct phonological interpretation. (The consequences of the kinds of
phonological interpretation we do can, to some extent, be discerned via our chosen mode of
representation.) Nor does space permit a proper discussion of how a declarative approach to
phonology deals with morphophonological alternations. However, declarative treatment of
such alternations poses no particular problems and requires no additional formal machinery.
Firthian prosodic phonology provides a nonprocess model of such alternations (see, e.g.,
Sprigg 1963).
phonological feature is to be interpreted in the context of the set of features
which accompany the particular feature under consideration along with its
place in structure. This means that a feature such as [height1] will not always
receive the same phonetic interpretation. Employing again the data from
Klatt (1980) we can provide a phonetic interpretation for the [height1]
feature for syllables such as see and sue thus:
{syllable([height1],T1,T2,T3,T4),(F1:330; B1:55),(F1:330; B1:55),
(F1:350; B1:60),(F1:350; B1:60)}
Ladefoged (1977) discusses such a structure-dependent interpretation of
phonological features. He writes: "the matrix [of synthesizer values (and
presumably of some foregoing acoustic analysis)] has different values for F2
(which correlates with place of articulation) for [s] and [t], although these
segments are both [ + alveolar]. The feature Alveolar has to be given a
different interpretation for fricatives and plosives" (1977: 231). We should
note, however, that the domain of these phonetic parameters - whether they
are formulated in articulatory or acoustic terms, say, has to do with the task
at hand - is implementation-specific; it has not, and cannot have, any
implications whatsoever for the phonological theory.
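The structure-dependent interpretation just discussed can be pictured as a lookup keyed on both the feature and its place in structure. The sketch below (in Python, purely for illustration; it is not the York system) reuses the Klatt-derived F1/B1 values quoted above; the table name, the function, and the fricative/plosive entries are assumptions of this sketch, not part of any published implementation.

```python
# Sketch: phonetic interpretation as a function of (feature, structural context),
# not of the feature alone. The numeric values echo the Klatt-style figures
# quoted in the text; everything else is an illustrative assumption.
EXPONENCY = {
    # (feature, context) -> list of (F1 Hz, B1 Hz) targets across the syllable
    ("height1", "syllable"): [(330, 55), (330, 55), (350, 60), (350, 60)],
    # Per Ladefoged's point: the "same" feature in a different structural
    # position receives a different interpretation.
    ("alveolar", "fricative"): "high-frequency frication locus",
    ("alveolar", "plosive"): "alveolar closure and transition locus",
}

def interpret(feature, context):
    """Return the phonetic exponents of a feature in a given structural context."""
    return EXPONENCY[(feature, context)]

print(interpret("height1", "syllable")[0])  # -> (330, 55)
```

The point carried by the lookup key is exactly the compositionality principle: interpretation is defined over feature-in-structure pairs, never over bare feature names.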
The position I have just outlined is not universally accepted. Many
linguists view phonology and phonetics as forming a continuum with
phonological descriptions presented in the same terms as phonetic descrip-
tions; these are often formulated in terms of a supposedly universal set of
distinctive phonetic properties. This can be seen in the way phonological
features are employed in most contemporary approaches. Phonological
categories are typically constructed from the names of phonetic (or quasi-
phonetic) categories, such as "close, back, vowel," or "voiced, noncoronal,
obstruent." By doing this the impression of a phonology-phonetics conti-
nuum is maintained. "Phonetic" representations in generative phonologies
are merely the end result of a process of mappings from strings to strings; the
phonological representations are constructed from features taking binary
values, the phonetic representations employing the same features usually
taking scalar values. Chomsky and Halle explicitly assert this phonetics-
phonology continuum when they write: "We take 'distinctive features' to be
the minimal elements of which phonetic, lexical, and phonological transcrip-
tions are composed, by combination and concatenation" (1968: 64). One
reason that this kind of position is possible at all, of course, is that in such
approaches there is nothing like an explicit phonetic representation.
Typically, all that is provided are a few feature names, or segment symbols;
rarely is there any indication of what would be involved in constructing the
algorithm that would allow us to test such claims. The impression of a
phonology-phonetics continuum is, of course, quite illusory. In large part,
the illusion is sustained by the entirely erroneous belief that the phonological
categories have some kind of implicit, naive phonetic denotation (this seems
to be part of what underlies the search for invariant phonetic correlates of
phonological categories and the obsession with feature names [Keating
1988b]). Despite some explicit claims to this effect (see, for example,
Chomsky and Halle 1968; Kaye, Lowenstamm, and Vernaud 1985; Brom-
berger and Halle 1989), phonological features do not have implicit deno-
tations and it is irresponsible of phonologists to behave as if they had when
they present uninterpreted notations in their work.
One of the advantages that accrues from having a parametric phonetic
interpretation distinguished from phonological representation is that the
arbitrary separation of "segmental" from "supra-" or "non-"segmental
features can be dispensed with. After all, the exponents of so-called "segmen-
tal" phonetic aspects are no different from those of "nonsegmental" ones -
they are all parameters having various extents. Indeed, any coherent account
of the exponents of "nonsegmental" components will find it necessary to
refer to "segmental" features: for instance, as pointed out by Adrian
Simpson (pers. comm.), lip-rounding and glottality are amongst the phonetic
exponents of accentuation (stress). Compare to him, and for him, for
example, in versions where to and for are either stressed or nonstressed.
Local (1990) also shows that the particular quality differences found in final-
open syllable vocoids in words like city and seedy are exponents of the
metrical structure of the words (see also the experimental findings of
Beckman 1986).
forwardly to the phonetics, that "processes" have been postulated. Syntag-
matic structure, which shows how smaller representations are built up into
larger ones, is represented by graphs. The graphs are familiar from the
syllable-tree notation ubiquitous in phonology (see Fudge 1987 for copious
references). The graphs we employ, however, are directed acyclical graphs
(DAGs) rather than trees, since we admit the possibility of multidominated
nodes ("reentrant" structures; see Shieber, 1986), for instance, in the
representation of ambisyllabicity and larger structures built with feature
sharing, e.g. coda-onset "assimilation":
(1) [syllable graph: syllable dominating onset and rime; rime dominating coda]

(3) [syllable graph: syllable bearing [± rnd], dominating onset and coda]
Onset and coda are in the domain of [ ± rnd] (though coda does not bear the
feature distinctively) by virtue of their occurrence as syllable constituents.
Once the structural domain of oppositions is established there is no need to
employ process modeling such as "copying" or "spreading." In a similar
fashion nonstructured phonologies will typically specify the value of the
contrast feature [± voice] for consonants, whereas vowels will typically be
specified (explicitly, or default-specified) as being [+ voi]. A structured
phonological representation, on the other hand, will explicitly recognize that
the opposition of voicing holds over onsets and rimes, not over consonants
and vowels (see Browman and Goldstein 1986: 227; Sprigg 1972). Thus
vowels will be left unspecified for voicing and the phonological voicing
distinction between bat ~ bad and bent ~ bend, rather than being assigned
to a coda domain, will be assigned to a rime domain:
(4) [syllable graph: syllable dominating onset and rime; rime, bearing [+ voi], dominating nucleus and coda]
If this is done, then the differences in voice quality and duration of the vowel
(and, in the case of bent ~ bend, the differences in the quality and duration of
nasality), the differences in the nature of the transitions into the closure and
the release of that closure can be treated in a coherent and unified fashion as
the exponents of rime-domain voicing opposition. (The similarity between
these kinds of claims and representations and the sorts of representations
found in prosodic analysis should be obvious.3)
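The claim that the voicing opposition holds over rimes rather than over consonants and vowels can be sketched as structure-sensitive lookup: a feature located at a constituent node is visible to everything that node dominates, so no "copying" or "spreading" is ever needed. The Python below is an illustrative assumption of this commentary, not the York notation.

```python
# Sketch: a syllable as a graph whose nodes carry features. The voicing
# opposition lives at the rime node; nucleus and coda "have" it only by
# being inside the rime's domain. Names and classes are illustrative.
class Node:
    def __init__(self, label, features=None, children=()):
        self.label = label
        self.features = dict(features or {})
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

def lookup(node, feature):
    """Find a feature's value by walking up to the smallest containing domain."""
    while node is not None:
        if feature in node.features:
            return node.features[feature]
        node = node.parent
    return None  # unspecified: no value in any containing domain

# A word like "bad": [+ voi] is stated once, at the rime.
nucleus = Node("nucleus")
coda = Node("coda")
rime = Node("rime", {"voi": "+"}, [nucleus, coda])
onset = Node("onset", {"voi": "+"})
syllable = Node("syllable", children=[onset, rime])

print(lookup(coda, "voi"))     # "+" - inherited from the rime, not copied
print(lookup(nucleus, "voi"))  # "+" - the same shared specification
```

Note that the coda itself carries no voicing feature at all; the rime-domain statement does the whole job, which is the declarative alternative to a spreading rule.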
The illustrative representations I have given have involved the use of rather
conventional phonological features. The feature names that we use in the
construction of phonological representations are, in the main, taken from the
Jakobson, Fant, and Halle (1952) set, though nothing of particular interest
hangs on this. However, they do differ from the "distinctive features"
described by Jakobson, Fant, and Halle in that, as I indicated earlier, they
are purely phonological; they have no implicit phonetic interpretation. They
differ from the Jakobson, Fant, and Halle features in two other respects.
First, when they are interpreted (because of the compositionality principle)
they do not have the same interpretation wherever they occur in structure.
3 For example, it is not uncommon (e.g. Albrow 1975) to find "formulae" like yw(CViCh/fi) as
partial representations of the systemic (syntagmatic and paradigmatic) contrasts instantiated
by words such as pit and put.
So, for instance, the feature [ + voi] at onset will not receive the same
interpretation as [+ voi] at rime. Second, employing hierarchical,
category-valued features, such as [cons] and [voc] as in (5) below, enables
the same feature names to be used within the same feature structure but with
different values. Here, in the partial feature structure relating to apicality
with open approximation and velarity at onset, [grv] with different values is
employed to encode appropriate aspects of the primary and secondary
articulation of the consonantal portion beginning a word such as red.
(5) [attribute-value matrix for the onset of red: category-valued features cons (with grv and cmp), voc (grv: +, height: 0, rnd: +), and src (nas: −); the two occurrences of [grv] bear different values]
8.4 Temporal interpretation: dealing with "processes"
Consider the following syllable graph:
(6) [syllable graph: syllable dominating onset and rime, with coda under rime, over C and V exponents]
coarticulation, deletion, and insertion can be given a nonprocess, declarative
representation:
(8) [box diagram: syllable exponents spanning rime exponents; rime exponents spanning nucleus exponents; onset and coda exponents overlaid at either edge, over a C V C string]
(The "box notation" is employed in a purely illustrative fashion and has no
formal status. So, for instance, the vertical lines indicating the ends of
exponents of constituents should not be taken to indicate an absolute cross-
parametric temporal synchrony.)
[box diagrams: velar (k) exponents overlaid on the vocalic exponents of keep, cart, and coot]
Notice that the temporal-overlay account (when interpreted parametrically)
will result in just the right kind of "vocalic" nucleus characteristics through-
out the initial occlusive part of the syllable even though onset and nucleus are
not sisters in the phonological structure. Notice, too, that we consider such
"coarticulation" as phonological and not simply some phonetic-mechanical
effect, since it is easy to demonstrate that it varies from language to language,
and, for that matter, within English, from dialect to dialect. A similar
interpretation can be given to the onsets of words such as split and sprit,
where the initial periods of friction are qualitatively different as are the
vocalic portions (this is particularly noticeable in those accents of English
which realize the period of friction in spr- words with noticeably lip-rounded
tongue-tip retracted palato-alveolarity). In these cases what we want to say is
that the liquid constituent dominates the initial cluster so its exponents are
coextensive with both the initial friction and occlusion and with the early
portion of the exponents of the nucleus. One aspect of this account that I
have not made explicit is the tantalizing possibility that only overlaying/
cocatenation need be postulated and that apparently concatenative pheno-
mena are simply a product of different temporal gluings; only permitting one
combinatorial operation is a step towards making the model genuinely more
restrictive.
In this context consider now the typical putative process phenomenon of
deletion. As an example consider versions of the words tyrannical and
torrential as produced by one (British English) speaker. Conventional
accounts (e.g. Gimson 1970; Lass 1985; Dalby 1986; Mohanan 1986) of the
tempo/stylistic reduced/elided pronunciations of the first, unstressed syll-
ables of tyrannical and torrential as
(10) [phonetic transcriptions of the reduced first syllables of tyrannical and torrential]
would argue simply that the vowel segment had been deleted. But where does
this deletion take place? In the phonology? Or the phonetics? Or both? The
notion of phonological features "changing" or being "deleted" is, as I
indicated earlier, highly problematical. However, if one observes carefully
the phonetic detail of such purportedly "elided" utterances, it is actually
difficult to find evidence that things have been deleted. The phonetic detail
suggests, rather, that the same material has simply been temporally redistri-
buted (i.e. these things are not phonologically different; they differ merely in
terms of their temporal phonetic interpretation). Even a cursory listening
reveals that the beginnings of the "elided" forms of tyrannical and torrential
do not sound the same. They differ, for instance, in the extent of their
liprounding, and in terms of their resonances. In the "elided" form of tyr the
liprounding is coincident with the period of friction, whereas in tor it is
observable from the beginning of the closure; tor has markedly back
resonance throughout compared with tyr, which has front of central reso-
nance throughout. (This case is not unusual: compare the "elided" forms of
suppose, secure, prepose, propose, and the do ~ dew and cologne ~ clone
cases discussed by Kelly and Local 1989: part 4.) A deletion-process account
of such material would appear to be a codification of a not particularly
attentive observation of the material.
By contrast, a cocatenation account of such phenomena obviates the need
to postulate destructive rules and allows us to take account of the observed
phonetics and argue that the phonological representation and ingredients of
elided and nonelided forms are the same; all that is different is the temporal
phonetic interpretation of the constituents. The "unreduced" forms have a
temporal organization schematically represented as follows:
(11) [box diagram, unreduced forms: the vowel (i / o) exponents of tyr and tor extending beyond the end of the consonantal (t, r) exponents]
while the reduced forms may have the exponents of nucleus of such a
duration that their end is coincident with the end of the onset exponents:
(12) [box diagram, reduced forms: the same exponents, with the vowel exponents ending coincident with the onset exponents]
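The contrast between the unreduced and reduced timings in (11) and (12) can be sketched as pure retiming of exponent intervals: the "reduced" form contains exactly the same constituents, with the nucleus exponents simply ending together with the onset exponents. The Python below, with invented millisecond values, is an illustrative sketch only.

```python
# Sketch: constituent exponents as temporal intervals (start, end) in ms.
# "Reduction" changes only the temporal interpretation, never the ingredients.
def overlay(spans):
    """Total temporal extent covered by a set of overlaid exponent spans."""
    return (min(s for s, _ in spans), max(e for _, e in spans))

# Unreduced: nucleus exponents extend well past the onset exponents.
unreduced = {"onset": (0, 80), "nucleus": (40, 220)}
# Reduced: the same exponents, but the nucleus now ends with the onset.
reduced = {"onset": (0, 80), "nucleus": (40, 80)}

print(overlay(unreduced.values()))  # -> (0, 220)
print(overlay(reduced.values()))    # -> (0, 80)

# Nothing has been deleted: both forms are built from the same constituents,
# differing merely in their temporal phonetic interpretation.
assert set(unreduced) == set(reduced)
```

On this picture the auditory differences between "elided" tyr and tor fall out naturally, since each retains its own exponents, merely compressed.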
8.4.2 Assimilation
Having provided a basic outline of some of the major features of a "non-
segmental," "nonprocess" approach to phonology, I will now examine the
ways such an approach can deal with the phenomenon of "assimilation."
Specifically, I want to consider those "assimilations" occurring between the
end of one word and the beginning of the next, usually described as involving
"alveolar consonants." The standard story about assimilation, so-called, in
English can be found in Sweet (1877), Jones (1940), and Gimson (1970),
often illustrated with the same examples. Roach (1983:14) provides a recent
formulation:
For example, the final consonant in "that" ðæt is alveolar t. In rapid casual speech the
t will become p before a bilabial consonant... Before a dental consonant, t changes
to a dental plosive... Before a velar consonant, the t will become k... In similar
contexts d would become b, d̪, g, respectively, and n would become m, n̪ and ŋ... s
becomes ʃ, and z becomes ʒ when followed by ʃ or j.
I want to suggest that this, and formulations like it (see below), give a
somewhat misleading description of the phenomenon under consideration.
The reasons for this are manifold: in part, the problem lies in not drawing a
clear distinction between phonology and phonetics; in part, a tacit assump-
tion that there are "phonological segments" and/or "phonetic segments"; in
part, not having very much of interest to say in the phonetic domain; in part,
perhaps, a lack of thoroughgoing phonetic observation. Consider the follow-
ing (reasonably arbitrary) selection of quotes and analyses centered on
"assimilation":
1 "the underlying consonant is /t/ and that we have rules to the effect:
t → k / __ [+ cons, − ant, − cor]"
2 "[+ nasal] → [α place] / __ [− sonorant, − continuant, α place]" (Mohanan 1986: 106)
3 "The change from /s/ to [š] . . . takes place in the postlexical module in I mi[š]
you" (Mohanan 1986: 7)
4 "The [mp] sequence derived from np [ten pounds] is phonetically identical to
the [mp] sequence derived from mp" (Mohanan 1986: 178)
5 "the alveolars . . . are particularly apt to undergo neutralization as redundant
oppositions in connected speech" (Gimson 1970: 295)
6 "Assimilation of voicing may take place when a word ends in a voiced
consonant before a word beginning with a voiceless consonant" (Barry 1984: 5)
7 "We need to discover whether or not the phonological processes discernible
in fast speech are fundamentally different from those of careful slow
speech . . . as you in rapid speech can be pronounced [əz jə] or [əʒə]" (Lodge 1984: 2)
8 [feature-matrix formulation of d → b before b, in which − son, − cnt, and + ant are
retained and + cor changes to − cor] (Nathan 1988: 311)
The flavor of these quotes and analyses should be familiar. I think they are a
fair reflection of the generally received approach to handling assimilation
phenomena in English. They are remarkably uniform not only in what they
say but in the ways they say it. Despite some differences in formulation and in
representation, they tell the same old story: when particular things come into
contact with each other one of those things changes into, or accommodates
in shape to, the other. However, this position does beg a number of
questions, for example: what are these "things"? (variously denoted: t ~ k ~
[+ cons, − ant, − cor] [1]; /s/ ~ [š] [3]; [mp] ~ np ~ [mp] ~ mp [4]; alveolars
[5]; voiced ~ voiceless consonant [6].) Where are these "things"? - in the
phonology or the phonetics or both or somewhere in between? Let us
examine some of the assumptions that underlie what I have been referring to
as "the standard story about assimilation." The following features represent
the commonalities revealed by the quotations above:
Concatenative-segmental: the accounts all assume strings of segments at
some level of organization (phonological and/or phonetic).
Punctual-contrastive: the accounts embody the notion that phonological
distinctions and relationships, as realized in their phonetic exponents,
are organized in terms of a single, unique cross-parametric, time slice.
Procedural-destructive: as construed in these accounts, assimilation involves
something which starts out as one thing, is changed/changes and ends
up being another. These accounts are typical in being systematically
equivocal with respect to the level(s) of description involved: no
serious attempt is made to distinguish the phonetic from the
phonological.
Homophony-producing: these accounts all make the claim (explicitly or
implicitly) that assimilated forms engender neutralization of opposit-
ions. That is, in assimilation contexts, we find the merger of a contrast
such that forms which in other contexts are distinct become phonet-
ically identical.
Figure 8.1 Spectrograms of (a) that case and (b) black case produced by speaker K and (c) that
case and (d) black case produced by speaker L
Figure 8.2 Electropalatographic records (frames 40-57) for the utterances that case and black
case produced by speaker L shown in figure 8.1
Examination of the EPG records (for L - see fig. 8.2) reveals that for these
same two pairs, although there is indeed velarity at the junction of that case
and black case, the nature and location of the contact of tongue to back of
the roof of the mouth is different. In black case we can see that the tongue
contact is restricted to the back two rows of electrodes on the palate. By
contrast, in that case the contact extends forward to include the third from
back row (frames 43-5, 49-52) and for three frames (46-8) the fourth from
back row; there is generally more contact, too, on the second from back row
(frames 45-9) through the holding of the closure. Put in general terms, there
is an overall fronter articulation for the junction of that case as compared
with that for black case. Such auditory, spectrographic, and EPG findings are
routine and typical for a range of speakers producing this and similar
utterances5 (though the precise frontness/backness relations differ [see
Kelly and Local 1986]).
Consider now the spectrograms in figure 8.3 of three speakers (A, L, W)
producing the continuous utterance This shop's a fish shop.6 The portions of
interest here are the periods of friction at the junction of this shop and fish
shop. A routine claim found in assimilation studies is that presented earlier
from Roach: "s becomes ʃ, and z becomes ʒ when followed by ʃ or j."
Auditory impressionistic observation of these three speakers' utterances
reveals that though there are indeed similarities between the friction at the
junction of the word pairs, and though in the versions of this shop produced
by these speakers it does not sound like canonical apico-alveolar friction, the
portions are not identical and a reasonably careful phonetician would feel
uneasy about employing the symbol ʃ for the observed "palatality" in the this
shop case. The spectrograms again reflect these observed differences. In each
case the overall center of gravity for the period of friction in this shop is
higher than that in fish shop (see Shattuck-Hufnagel, Zue, and Bernstein
1978; and Zue and Shattuck-Hufnagel 1980).
These three speakers could also be seen to be doing different things at the
junction of the word pairs in terms of the synchrony of local maxima of lip
rounding relative to other articulatory components. For this shop the onset
of lip rounding for all three speakers begins later in the period of friction than
in fish shop, and it gets progressively closer through to the beginning of shop.
In fish shop for these speakers lip rounding is noticeably present throughout
the period of final friction in shop. (For a number of speakers that we have
observed, the lip-rounding details discussed here are highlighted by the
noticeable lack of lip rounding in assimilated forms of this year.)
Notice too that, as with the that case and black case examples discussed
earlier, there are observable differences in the F1 and F3 frequencies in the
vocalic portions of this and fish. Most obviously, in this F1 is lower than in
fish and F3 is higher (these differences show consistency over a number of
productions of the same utterance by any given speaker). The vocalic portion
in this even with "assimilated" palatality is not that of a syllable where the
palatality is the exponent of a different (lexically relevant) systemic oppo-
sition. While the EPG records for one of these speakers, L (see fig. 8.4), offer
us no insight into the articulatory characteristics corresponding to the
impressionistic and acoustic differences of the vocalic portions they do show
5 Kelly and Local (1989: part 4.7) discuss a range of phenomena where different forms are said
to be homophonous and neutralization of contrast is said to take place. They show that
appropriate attention to parametric phonetic detail forces a retraction of this view. See also
Dinnsen (1983).
6 In these utterances 'this' and 'fish' were produced as accented syllables.
Figure 8.3 Spectrograms of the utterances This shop's a fish shop produced by speakers W, A,
and L
Figure 8.4 Electropalatographic records of the utterances this shop and fish shop produced by
speaker L shown in figure 8.3
that the tongue-palate relationships are different in the two cases. The
palatographic record of this shop shows that while there is tongue contact
around the sides of the palate front-to-back throughout the period corres-
ponding to the friction portion, the bulk of the tongue-palate contact is
oriented towards the front of the palate. Compare this with the equivalent
period taken from fish shop. The patterning of tongue-palate contacts here is
very different. Predominantly, we find that the main part of the contact is
oriented towards the back of the palate, with no contact occurring in the first
three rows. This is in sharp contrast to the EPG record for this shop where the
bulk of the frames corresponding to the period of friction show tongue-
palate contact which includes the front three rows. Moreover, where this shop
exhibits back contact it is different in kind from that seen for fish shop in
terms of the overall area involved.7 The difference here reflects the
difference in phonological status between the ("assimilatory") palatality at
the end of this, an exponent of a particular kind of word juncture, and that of
fish, which is an exponent of lexically relevant palatality.
What are we to make of these facts? First, they raise questions about the
appropriacy of the received phonetic descriptions of assimilation. Second,
they lead us to question the claims about "neutralization" of phonological
oppositions. (Where is this neutralization to be found? Why should vocalic
portions as exponents of "the same vowel" be different if the closures they
cooccur with were exponents of "the same consonant"?) Third, even if we
were to subscribe to the concatenative-segment view, these phonetic differ-
ences suggest that whatever is going on in assimilation, it is not simply the
replacement of one object by another.
cluster of phonetic features which characterize "assimilated consonants" do
not, in fact, appear to be found elsewhere. And just as with coarticulation, we
can give assimilation a nonprocedural interpretation. In order to recast the
canonical procedural account of assimilation in rather more "theory-
neutral" nonprocedural fashion we need minimally:
to distinguish phonology from phonetics - we do not want to talk of the
phonology changing. Rather we want to be able to talk about
different structure-dependent phonetic interpretations of particular
coda-onset relations;
a way of representing and accessing constituent-structure information about
domains over which contrasts operate: in assimilation, rime expo-
nents are maintained; what differences there are are to be related to
the phonetic interpretation of coda;
a way of interpreting parameters (in temporal and quality terms) for
particular feature values and constellations of feature values.
The approach to phonology and phonetics which I described earlier (section
8.3) has enabled us at York to produce a computer program which embodies
all these prerequisites. It generates the acoustic parameters of English
monosyllables (and some two-syllable structures) from structured
phonological representations. The system is implemented in Prolog - a
programming language particularly suitable for handling relational problems. This
allows successful integration of, for example, feature structures, operations
on graphs, and unification with specification of temporal relations obtaining
between the exponents of structural units. In the terms sketched earlier, the
program formally and explicitly represents a phonology which is
nonsegmental, abstract, structured, monostratal, and monotonic. Each statement
in the program reflects a commitment to a particular theoretical position. So-
called "process" phenomena such as "consonant-vowel" coarticulation are
modeled in exactly the way described earlier; exponents of onset constituents
are overlaid on exponents of syllable-rime-nucleus constituents. In order to
do this it is necessary to have a formal and explicit representation of
phonological relations and their parametric phonetic exponents. Having
such parametric phonetic representations means that (apart from not
operating with cross-parametric segmentation) it becomes possible to achieve
precise control over the interpretation of different feature structures and
phonetic generalizations across feature structures.
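Since no program text is given in the chapter, the following is only a hedged sketch - in Python rather than the Prolog of the York system - of the kind of feature-structure unification relied on here; the dict encoding, the function name unify, and the example features are illustrative assumptions of mine, not the implementation.

```python
# Minimal sketch of feature-structure unification (illustrative only).
# Dicts stand in for feature structures; None plays the role of an
# undefined value, written [feat: 0] in the chapter's notation.

FAIL = object()  # sentinel for unification failure

def unify(a, b):
    """Unify two feature structures; conflicting atomic values fail."""
    if a is None:                      # undefined unifies with anything
        return b
    if b is None:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for feat, val in b.items():
            merged = unify(out.get(feat), val)
            if merged is FAIL:
                return FAIL
            out[feat] = merged
        return out
    return a if a == b else FAIL

# Conflicting information cannot be unified:
assert unify({"place": "alv"}, {"place": "lab"}) is FAIL
# An undefined value unifies with anything, gaining specification:
assert unify({"place": None}, {"place": "lab"}) == {"place": "lab"}
```

The point of the sketch is the monotonicity: unification can only add information to a representation, never overwrite it, which is why conflicting values fail rather than being changed.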
As yet, I have had nothing to say about how we cope with the destructive-
process (feature-changing) orientation of the accounts I have considered so
far. I will deal with the "process" aspect first. A moment's consideration
reveals a very simple declarative, rather than procedural, solution. What we
want to say is something very akin to the overlaying in coarticulation
discussed above. Just as the onset and rime are deemed to share particular
8 John Local
vocalic characteristics so, in order to be able to do the appropriate phonetic
interpretation of assimilation, we want to say that the coda and onset share
particular features. Schematically we can picture the "sharing" in an
"assimilation context" such as that case as follows:
(13) [schematic, garbled in the scan: the exponents of syllable 1 and syllable 2, with the velarity exponents extending over the domain where the coda of the first syllable and the onset of the second are shared (coda1 = onset2)]
While this goes some way to removing the process orientation by stipulating
the appropriate structural domain over which velarity exponents operate,
there is still the problem of specifying the appropriate feature structures.
How can we achieve this sharing without destructively changing the feature
structure associated with coda1? At first glance, this looks like a problem, for
change the structure we must if we are going to share values. Pairs of feature
structures cannot be unified if they have conflicting information - this would
certainly be the case if coda1 has associated with it a feature structure with
information about (alveolar) place of articulation. However, if we consider
the phonological oppositions operating in the system at this point in structure we
can see that alveolarity is not relevantly statable. If we examine the phonetic
2 If [stop, 0place] then [alv]
Although Lodge (1984: 131) talks of "realization rules" (rule 2 here), the
formulation here, like the earlier formulations, does not distinguish between
the phonetic and the phonological (it again trades on a putative, naive
interpretation of the phonological features). "Realization" is not being
employed here in a "phonetic-interpretation" sense; rather, it is employed in
the sense of "feature-specification defaults" (see Chomsky and Halle 1968;
Gazdar et al. 1985).
The idea of not specifying some parts of phonological representations has
recently found renewed favor under the label of "underspecification theory"
(UT) (Archangeli 1988). However, the monolithic asystemic approach
subsumed under the UT label is at present too coarse an instrument to handle
the phenomena under consideration here (because of its across-the-board
principles, treating the codas of words like bit, bid, and bin as not being
specified for "place of articulation" would entail that all such "alveolars" be
so treated; and while, as I will show, the same principle of not specifying some
part of the phonological representation is appropriate for the codas of words
such as this and his, to account for the palatality assimilation described earlier,
it is not "place" of articulation that is involved in those cases).
Within the phonological model described earlier, we can give "un(der)-
specification of place" a straightforward interpretation. One important
characteristic of unification-based formalisms employing feature structures
of the kind I described earlier (and which are implemented in our model) is
their ability to share structure. That is, two or more features in a structure
may share one value (Shieber 1986; Pollard and Sag 1987). When two
features share one value, then any increment in information about that value
automatically provides further specification for both features. In the present
case, what we require is that coda-constituent feature structures for bit, bid,
and bin, for instance, should have "place" features undefined. A feature
whose value is undefined is denoted: [feat: 0]. Thus we state, in part, for coda
exponents under consideration:
(15)
grv: 0
cmp: 0
This, taken with the coda-onset constraint equation above, allows for the
appropriate sharing of information. We can illustrate this by means of a
partial feature structure corresponding to it cut:
(16) [partial feature structure, garbled in the scan: the [cons] values of coda and onset are coindexed ①, i.e. one shared token of the category]
The sharing of structure is indicated by using integer coindexing in feature-
value notation. In this feature structure ① indexes the category [grv: +,
cmp: +], and its occurrence in [coda: [cons: ①]] indicates that the same value
is here shared. Note that this coindexing, indicating sharing of category
values at different places in structure, is not equivalent to multiple
occurrences of the same values at different places in structure. Compare the
following feature structure:
(17) [feature structure of the same shape as (16), garbled in the scan, but with two separate, uncoindexed tokens of the category rather than one shared token]
This is not the same as the preceding feature structure. In the first case we
have one shared token of the same category [cons: [grv: +, cmp: +]],
whereas in the second there are two different tokens. While the first of
these partial descriptions corresponds to the coda-onset aspects of it cut,
the second relates to the coda-onset aspects of an utterance such as
black case. By sharing, then, we mean that the attributes have the same token
as their value rather than two tokens of the same type. This crucial
difference allows us to give an appropriate phonetic interpretation in the
two different cases (using the extant exponency data in the synthesis
program).
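The difference between one shared token and two tokens of the same type can be made concrete in a short sketch; the dict encoding and the feature names here are my own illustration, not the synthesis program's representation.

```python
# Illustrative only: aliasing one dict models one shared token of a
# category; two equal but distinct dicts model two tokens of the same type.

shared = {"grv": "+", "cmp": "+"}                 # a single token of the category
it_cut = {"coda": {"cons": shared},               # cf. structure (16): shared token
          "onset": {"cons": shared}}
black_case = {"coda": {"cons": {"grv": "+", "cmp": "+"}},   # cf. (17): two tokens
              "onset": {"cons": {"grv": "+", "cmp": "+"}}}

# An increment of information about the shared token appears at both places:
shared["voice"] = "-"                             # "voice" is a hypothetical feature
assert it_cut["coda"]["cons"] is it_cut["onset"]["cons"]
assert it_cut["onset"]["cons"]["voice"] == "-"

# Two tokens of the same type remain independent:
black_case["coda"]["cons"]["voice"] = "-"
assert "voice" not in black_case["onset"]["cons"]
```

This is just the "same token versus two tokens of the same type" distinction: further specification of a shared value automatically specifies every place in the structure that points at it.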
As the standard assimilation story reflects, the phonological coda
oppositions which have, in various places, t, d, n as their exponents share certain
commonalities: their exponents can exhibit other than alveolarity as place of
articulation; in traditional phonetic terms their exponents across different
assimilation contexts have manner, voicing, and source features in common.
Thus the exponents of the coda of ten as in ten peas, ten teas, ten keys all have
as part of their make-up voice, closure, and nasality; the exponents of the
coda of it in it tore, it bit, it cut all have voicelessness, closure, and non-
nasality as part of theirs. In addition, the "coarticulation" of the assimilated
coda exponents binds them to the exponents of the nucleus constituent in the
same syllable. At first glance, the situation with codas of words such as this
and his might appear to be the same as that just described. That is, we require
underspecification of "place of articulation." In conventional accounts (e.g.
Roach above and Gimson 1970) they are lumped for all practical purposes
with the "alveolars" as an "assimilation class." In some respects they are
similar, but in one important respect the exponents of these codas behave
differently. Crucially, we do not find the same place of articulation
phenomena here that we find with the bit, bid, and bin codas. Whereas in those cases
we find variants with alveolarity, labiality, velarity, and dentality, for
example, we do not find the same possibilities for this and his type of codas.
For instance, in our collection of material, we have not observed (to give
broad systemic representations) forms such as:
(18) [broad systemic transcriptions, garbled in the scan]
for this fish, this man, this then, and this thing. However, we do find
"assimilated" versions of words such as this and his with palatality (as
described above, e.g. this shop, this year). In these cases we do not want to say
that the values of the [cons] part of the coda feature-structure are undefined in
their entirety. Rather, we want to say that we have sharing of "palatality."
(Although the production of palatality as an exponent of the assimilated
coda constituent of this may appear to be a "place-of-articulation" matter, it
is not [see Lass 1976: 4.3, for a related, though segmental, analysis of
alveolarity with friction, on the one hand, and palato-alveolarity with
friction, on the other]). The allocation of one kind of "underspecification" to
the codas of bit, bid, and bin forms and another to the codas of this and his
forms represents part of an explicit implementation of the Firthian
polysystemic approach to phonological interpretation.
Given the way we have constructed our feature descriptions, the partial
representation for the relevant aspects of such feature structures will look
like this:
(19) [two partial feature structures, garbled in the scan, each of the form [cons: [grv: ..., cmp: ...]], joined by U]
U here denotes the unification of the two structures. This will give us the
large structure corresponding to the "assimilated" version of this shop:
(20) [unified feature structure for the "assimilated" this shop, garbled in the scan: the coda's [cons: [grv, cmp]] values are shared with the onset's [grv: +, cmp: +]]
Figure 8.5 Spectrograms of synthetic versions of (a) this, (b) fish, (c) this shop
(unassimilated), (d) this shop (assimilated), and (e) fish shop
while differing in many details from the natural versions, nonetheless has in it
just the same kinds of relevant similarities and differences.
8.6 Conclusion
I have briefly sketched an approach to phonological representation and
parametric phonetic interpretation which provides an explicit and
computationally tractable model for nonsegmental, nonprocedural, (rewrite-)rule-free
speech synthesis. I have tried to demonstrate that it is possible to model
"process" phenomena within this approach in a nonderivational way that
provides a consistent and coherent account of particular assimilatory aspects
of speech. The success of this approach in high-quality speech synthesis
suggests that this model of phonological organization will repay extensive
further examination.
Comments on chapter 8
KLAUS KOHLER
Summary of Local's position
The main points of Local's paper are:
1 His phonological approach does not admit of segmental entities at any level
of representation.
2 It distinguishes strictly between phonology and phonetics.
Phonology is abstract.
It is monostratal, i.e. there is only one level of phonological representation.
There are no derivational steps, therefore no processes (e.g. the
conversion of a high tone into a rising tone following a low tone).
Instead, phonological categories have phonetic exponents, which are
descriptions of physical, temporal events formulated in the physical
domain, i.e. phonetic representations in terms of component
parameters and their synchronization in time. The precise form of
phonetic representations has no bearing on the form of phonological
representations.
Phonological representations are structured; there is no time or sequence in
them, but only places in structure. Phonology deals with unordered
labeled graph-objects instead of linearly ordered strings of symbols.
Talking about sequence only makes sense in terms of temporal
phonetic interpretation.
Feature structures must be distinguished from parametric phonetic
representations in time. Deletion and insertion are treated as different
kinds of parameter synchronization in the phonetic interpretation of
the phonological representation. Phonological operations are
monotonic; i.e. the combinatorial operations with which phonological
structures are constructed can only add information to representations
and cannot remove anything.
3 As a corollary to the preceding, there are no rewrite rules. The assimilations
of, for example, alveolars to following labials and velars at word boundaries
in English are not treated as changes in phonology, because there are no
"alveolar consonants" in the phonology of English; "by positing such
entities we simply embroil ourselves in a procedural-destructive-type
account of assimilation because we have to change the 'alveolar consonant'
into something else" (p. 207). Quite apart from this there is no homophony
in, e.g., that case and black case or this shop and fish shop, so the empirical
facts are not reported correctly.
Critical comments
My reply to Local's paper follows the line of arguments set out above.
1 If phonological elements are nonsegmental and unordered, only struc-
tured, and if phonetic exponency shows ordering in time, then Local has to
demonstrate how the one is transformed into the other. He has not done this.
To say that component parameters are synchronized in time in phonetic
representations is not enough, because we require explicit statements as to
the points in sequence where this synchronization occurs, where certain
parameters take on certain values.
Local does not comprehensively state how the generation of acoustic
parameters from structured phonological representations is achieved. What
is, for instance, the input into the computer that activates these exponency
files, e.g. in black case and that case? Surely Local types in the sequence of
alphabetic symbols of English spelling, which at the same time reflects a
phonetic order: k and t come before c of case. This orthographic sequence is
then, presumably, transformed into Local's phonological structures, which,
therefore, implicitly contain segmental-order information because the
structures are derived from a sequential input. So the sequential information is
never lost, and, consequently, not introduced specially by the exponency files
activated in turn by the structural transforms. When the exponency files are
called upon to provide parametric values and time extensions the segmental
order is already there. Even if Local denies the existence of segments and
sequence in phonology, his application to synthesis by rule in his computer
program must implicitly rely on it.
2 If phonology is strictly separated from phonetics, and abstract, how can
timeless, unordered, abstract structures be turned into ordered, concrete time
courses of physical parameters? Features must bear very direct relations to
parameters, at least in a number of cases at focal points, and there must be
order in the phonological representations to indicate whether parameter
values occur sooner or later, before or after particular values in some other
parameter. Moreover, action theory (e.g. Fowler 1980) has demonstrated
very convincingly that temporal information must be incorporated in
phonological representations. This is also the assumption Browman and
Goldstein (1986, this volume) work on. And many papers in this volume (like
Firth himself, e.g. Firth 1957) want to take phonology into the laboratory
and incorporate phonetics into phonology. The precise form of phonetic
representations does indeed have a bearing on the form of phonological
representations (cf. Ohala 1983). For example, the question as to why
alveolars are assimilated to following labials and velars, not vice versa, and
why labials and velars are not assimilated to each other in English or
German, finds its answer in the phonetics of speech production and
perception, and the statement of this restriction of assimilation must be part
of phonology (Kohler 1990).
If the strict separation of phonetics and phonology, advocated by Local, is
given up, phonology cannot be monostratal, and representations have to be
changed. It is certainly not an empirical fact that information can only be
added to phonological representations, never removed. The examples that
case/black case and this shop/fish shop do not prove the generality of Local's
assertion. We can ask a number of questions with regard to them:
1 How general is the distinction? The fact that it can be maintained is no proof
that it has to be maintained. The anecdotal reference to these phrases is not
sufficient. We want statistical evaluations of a large data base, not a few
laboratory productions in constructed sentences by L and K, which may
even stand for John Local and John Kelly.
2 Even if we grant that the distinction is maintained in stressed that case vs.
stressed black case, what happens in It isn't in the bag, it's in that case?
3 What happens in the case of nasals, e.g. in You can get it or pancake?
4 What happens in other vowel contexts, e.g. in hot cooking?
5 In German mitkommen, mitgehen the traces of /t/ are definitely removable,
resulting in a coalescence with the clusters in zurückkommen, zurückgehen;
similarly, ankommen/langkommen, angehen/langgehen.
6 Even if these assimilations were such that there are always phonetic traces of
the unassimilated structures left, this cannot be upheld in all cases of
articulatory adjustments and reductions. For instance, German mit dem
Auto can be reduced to a greater or lesser extent in the function words mit
and dem. Two realizations at different ends of the reduction scale are:
[mit deːm ˈʔaoto:]
[mim ˈʔaoto:]
There is no sense in saying that the syllable onset and nucleus of [deːm] are
still contained in [m], because the second utterance has one syllable less. If,
however, the two utterances are related to the same lexical items and a
uniform phonological representation, which is certainly a sensible thing to
do, then there has to be derivation and change. And this derivation has to
explain phonologically and phonetically why it, rather than any other
derivation, occurs. Allowing derivations can give these insights, e.g. for the
set of German weak-form reductions [mit deːm], [mit dəm], [mitm], [mipm],
[mibm], [mimm], [mim] along a scale from least reduced and most formal to
most reduced and least formal, which can be accounted for by a set of
ordered rules explaining these changes with reference to general phonetic
principles, and excluding all others (Kohler 1979).
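Kohler's point about ordered rules can be illustrated schematically. The following chain of whole-form rewrites is my own rough reading of the reduction scale he cites, not the rule set of Kohler (1979); the step labels are guesses at the processes involved.

```python
# Illustrative only: each ordered step feeds the next, taking mit dem from
# its least reduced weak form to its most reduced one.

STEPS = [
    ("mit deːm", "mit dəm"),  # vowel reduction in the weak form
    ("mit dəm",  "mitm"),     # schwa (and word-boundary) deletion
    ("mitm",     "mipm"),     # place assimilation of the stop to the nasal
    ("mipm",     "mibm"),     # voicing of the assimilated stop
    ("mibm",     "mimm"),     # full nasal assimilation
    ("mimm",     "mim"),      # degemination
]

def derive(form, n_steps):
    """Run the first n_steps rules in order; each output feeds the next rule."""
    for source, target in STEPS[:n_steps]:
        if form == source:
            form = target
    return form
```

Stopping the derivation early yields the intermediate, less reduced forms on the scale; running all steps yields the most reduced form, which is the sense in which the rules are ordered and mutually feeding.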
3 Rewrite rules are thus not only inevitable, they also add to the explanatory
power of our phonological descriptions. This leads me to a further question.
What are phonologies of the type Local proposes useful for? He does not
deal with this issue, but I think it is fair to say that he is not basically
concerned with explanatory adequacy, i.e. with the question as to why things
are the way they are. His main concern is with descriptive adequacy, and it is
in this respect that the acuteness of phonetic observations on prosodic lines
can contribute a lot, and has definitely done so. This is the area where
prosodic analysis should continue to excel, at the expense of the theorizing
presented by Local.
Comments on chapter 8
MARIO ROSSI
The phonology-phonetics distinction
I agree with Local on the necessity of a clear distinction between the two
levels of matter (substance) and form. But it seems to me that his conception
is an old one derived from a misinterpretation of De Saussure. Many of the
structuralists thought that the concept of language in De Saussure was
defined as a pure form; but an accurate reading of De Saussure, whose ideas
were mostly influenced by Aristotle's Metaphysics, shows that the "langue"
concept is a compound of matter and form, and that the matter is organized
by the form. So it is the organization of acoustic cues in the matter, which is a
reflection of the form, that allows in some way the perception and decoding
of the linguistic form. Consequently, Local's assumption, "it makes no sense
to talk of structure or systems in phonetics" (p. 193), is a misconception of
the relationship between matter and form. Matter is not an "amorphous"
substance. The arbitrary relationship between matter and form (I prefer
"matter and form" to "phonetics and phonology") means that the para-
meters of the matter and the way in which they are organized are not
necessarily linked with the form; that is, the form imposes an organization on
the parameters of the matter according to the constraints and the specific
modes of the matter. So we are justified in looking for traces of form values in
the matter.
At the same time, we have to remember that the organization of matter is
not isomorphic with the structure of the form. In other words, the acoustic/
articulatory cues are not structured as linguistic features, as implied in
Jakobson, Fant, and Halle (1952). In that sense, Local is right when he says
"phonological features do not have implicit denotations" (p. 196). In reality,
the type of reasoning Local uses in the discussion of the assimilation process
in this shop and fish shop demonstrates that he is looking at the acoustic
parameters as organized parameters, parameters that reflect the formal
structure.
I agree in part with the assumption that the search for invariant phonetic
correlates of phonological categories is implied by the erroneous belief that
the phonological categories have some kind of implicit "naive phonetic
denotation." However, I think that the search for invariants by some
phoneticians is not "naive," but more complexly tied to two factors:
1 The structuralist conception of language as pure form, and the phonology
derived from this conception. In this conception a phonological unit does
not change: "once a phoneme always a phoneme."
2 The lack of a clear distinction in American structuralism between cues and
features, derived from the misconception that emphasizes the omnipotence
of the form embedded in the matter.
Finally, to say that "the relationship between phonology and phonetics is
arbitrary" (p. 193) is to overlook the theory of natural phonology. The
concept of the arbitrary relationship needs to be explained and clarified.
Consider Hooper's (1976) "likelihood condition"; if this condition did not
hold, the phonological unit would have phonetic exponents that would be
totally random.
Underspecification
How can an "unspecified" coda affect the onset exponents of the structure
(p. 215)? Local defines some codas as "underspecified"; he says that phono-
logical descriptions are algebraic objects and the precise form of phonetic
representation has no bearing on the form of the phonological represen-
tations. I see a contradiction in this reasoning: underspecification is posited
in order to account for phonetic assimilation processes. So phonetic
representations have bearing on form! I see no difference between under-
specification and the neutralization that Local wants to avoid.
9
Lexical processing and phonological representation
ADITI LAHIRI AND WILLIAM MARSLEN-WILSON
9.1 Introduction
In this paper, we are concerned with the mental representation of lexical
items and the way in which the acoustic signal is mapped onto these
representations during the process of recognition. We propose here a
psycholinguistic model of these processes, integrating a theory of processing
with a theory of representation. We take the cohort model of spoken word-
recognition (Marslen-Wilson 1984, 1987) as the basis for our assumptions
about the processing environment in which lexical processing takes place,
and we take fundamental phonological assumptions about abstractness as
the basis for our theory of representation. Specifically, we assume that the
abstract properties that phonological theory assigns to underlying
representations of lexical form correspond, in some significant way, to the listener's
mental representations of lexical form in the "recognition lexicon," and that
these representations have direct consequences for the way in which the
listener interprets the incoming acoustic-phonetic information, as the speech
signal is mapped into the lexicon.
The paper is organized as follows. We first lay out our basic assumptions
about the processing and representation of lexical form. We then turn to two
experimental investigations of the resulting psycholinguistic model: the first
involves the representation and the spreading of a melodic feature (the
feature [nasal]); and the second concerns the representation of quantity,
specifically geminate consonants. In each case we show that the listener's
performance is best understood in terms of very abstract perceptual represen-
tations, rather than representations which simply reflect the surface forms of
words in the language.
claims that we have made about the properties of the underlying
representations. In the first of these tests, to which we now turn, we investigate the
interpretation of the same surface feature as its underlying phonological
status varies across different languages. If it is the underlying representation
that controls performance, then the interpretation of the surface feature
should change as its underlying phonological status changes. In the second
test, presented in section 9.4, we investigate the processing and
representation of quantity.
[table 9.1, garbled in the scan: underlying and surface representations of Bengali CVC, CVN, and CṼC and of English CVC and CVN; [+nasal] is underlyingly on the final consonant of CVN in both languages and on the vowel of Bengali CṼC, with surface vowel nasalization in CVN arising by spreading]
seven oral vowels in the language has a corresponding nasal vowel, as in the
minimal pairs [pak] "slime" and [pãk] "cooking" (Ferguson and Chowdhury
1960). A postlexical process of regressive assimilation is an additional source
of surface nasalization, applying to both monomorphemic and
heteromorphemic words.2
3 This view is so taken for granted in current research on lexical access that it would be
invidious to single out any single exponent of it.
9 Aditi Lahiri and William Marslen-Wilson
about the word being heard. At each increment they are asked to say what
word they think they are hearing, and this enables the experimenter to
determine how the listener is interpreting the sensory information presented
up to the point at which the current gate terminates. Previous research
(Marslen-Wilson 1984, 1987) shows that performance in this task correlates
well with recognition performance under more normal listening conditions.
Other research (Warren and Marslen-Wilson 1987) shows that gating
responses are sensitive to the presence of phonetic cues such as vowel
nasalization, as they become available in the speech input.4
We will use the gating task to investigate listeners' interpretations of
phonetically oral and phonetically nasal vowels for three different stimulus
sets. Two sets reflect the structure laid out in table 9.1: a set of CVC, CVN,
and CṼC triplets in Bengali, and a set of CVC and CVN doublets in English.
To allow a more direct comparison between English and Bengali, we will also
include a set of Bengali doublets, consisting of CVN and CVC pairs, where
the lexicon of the language does not contain a CṼC beginning with the same
consonant and vowel as the CVN/CVC pair. This will place the Bengali
listeners, as they hear the CVN stimuli, in the same position, in principle, as
the English listeners exposed to an English CVN. In each case, the item is
lexically unambiguous, since there are no CṼC words lexically available.
As we will lay out in more detail in section 9.3.4 below, the
underlying-representation hypothesis predicts a quite different pattern of performance
than any view of representation which includes redundant information (such
as derived nasalization). In particular, it predicts that phonetically oral
vowels will be ambiguous between CVNs and CVCs for both Bengali and
English, whereas phonetically nasal vowels will be unambiguous for both
languages but in different ways - in Bengali vowel nasalization will be
interpreted as reflecting an underlying nasal vowel followed by an oral
consonant, while in English it will be interpreted as reflecting an underlying
oral vowel followed by a nasal consonant.
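These predictions can be summarized in a toy function; the encoding is mine, a schematic restatement of the paragraph above rather than anything in the authors' model.

```python
# Illustrative summary of the underlying-representation hypothesis:
# which word types should a listener entertain on hearing the vowel?

def predicted_candidates(language, vowel_sounds_nasalized):
    if not vowel_sounds_nasalized:
        # Phonetically oral vowels are predicted ambiguous in both languages.
        return {"CVC", "CVN"}
    if language == "Bengali":
        return {"CṼC"}   # nasalization signals an underlying nasal vowel
    return {"CVN"}       # English: nasalization signals an upcoming nasal consonant

assert predicted_candidates("Bengali", False) == {"CVC", "CVN"}
```

The asymmetry in the nasalized case is the crux: the same surface cue is referred to different underlying sources in the two languages.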
Notice that these predictions for nasalized vowels also follow directly from
the cohort model's claims about the immediate uptake of information in the
speech signal - as interpreted in the context of this specific set of claims about
the content of lexical-form representations. For the Bengali case, vowel
nasalization should immediately begin to be interpreted as information
about the vowel currently being heard. For the English case, where the
4 Ohala (this volume) raises the issue of "hysteresis" effects in gating: namely, that because the
stimulus is repeated several times in small increments, listeners become locked in to particular
perceptual hypotheses and are reluctant to give them up in the face of disconfirming
information. Previous research by Cotton and Grosjean (1984) and Salasoo and Pisoni (1985)
shows that such effects are negligible, and can be dismissed as a possible factor in the current
research.
presence of nasalization cannot, ex hypothesi, be interpreted as information
about vowels, it is interpreted as constraining the class of consonants that
can follow the vowel being heard.5 This is consistent with earlier research
(Warren and Marslen-Wilson 1987) showing that listeners will select from
the class of nasal consonants in making their responses, even before the place
of articulation of this consonant is known.
Turning to a surface-representation hypothesis, this makes the same
predictions for nasalized vowels in English, but diverges for all other cases. If
the representation of CVNs in the recognition lexicon codes the fact that the
vowel is nasalized, then phonetically oral vowels should be unambiguous
(listeners should never interpret CVCs as potential CVNs) in both Bengali
and English, while phonetically nasal vowels should now be ambiguous in
Bengali - ceteris paribus, listeners should be as willing to interpret vowel
nasalization as evidence for a CVN as for a CṼC.
9.3.3 Method
9.3.3.1 Materials and design
Two sets of materials were constructed, for the Bengali and for the English
parts of the study. We will describe first the Bengali stimuli.
The primary set of Bengali stimuli consisted of twenty-one triplets of
Bengali words, each containing a CVC, a CVN, and a CṼC, where each
member of the triplet shared the same initial oral consonant (or consonant
cluster) and the same vowel (oral or nasal) but differed in the final consonant
(which was either oral or nasal). An example set is the triplet /kap/, /kam/,
/kãp/. As far as possible the place of articulation of the word-final consonant
was kept constant. The vowels [a, o, ɔ, æ, e] and their nasal counterparts were
used.
We also attempted to match the members of each triplet for frequency of
occurrence in the language. Since there are no published frequency norms for
Bengali, it was necessary to rely on the subjective familiarity judgments of a
native speaker (the first author). Judgments of this type correlate well with
objective measures of frequency (e.g. Segui et al. 1982).
The second set of Bengali stimuli consisted of twenty doublets, containing
matched CVCs and CVNs, where there was no word in the language
beginning with the same consonant and a nasal version of the vowel. An
example is the pair /lom/, /lop/, where there is no lexical item in the
language beginning with the sequence /lõ/. The absence of lexical items with
the appropriate nasal vowels was checked in a standard Bengali dictionary
5 Ohala (this volume) unfortunately misses this point, which leads him to claim, quite wrongly,
that there is an incompatibility between the general claims of the cohort model and the
assumptions being made here about the processing of vowel nasalization in English.
(Dev 1973). As before, place of articulation of the final consonant in each
doublet was kept constant. We used the same vowels as for the triplets, with
the addition of [i] and [u].
Given the absence of nasal vowels in English, only one set of stimuli was
constructed. This was a set of twenty doublets, matched as closely as possible
to the Bengali doublets in phonetic structure. The pairs were matched for
frequency, using the Kucera and Francis (1967) norms, with a mean
frequency for the CVNs of 18.2 and for the CVCs of 23.4.
All of the stimuli were prepared in the same way for use in the gating task.
The Bengali and English stimuli were recorded by native speakers of the
respective languages. They were then digitized at a sampling rate of 20 kHz
for editing and manipulation in the Max-Planck speech laboratory.
Each gating sequence was organized as follows. All gates were set at a zero
crossing. We wanted to be able to look systematically at responses relative
both to vowel onset and to vowel offset. The first gate was therefore set, for
all stimuli, at the end of the fourth glottal pulse after vowel onset. This first
gate was variable in length. The gating sequence then continued through the
vowel in approximately 40 msec increments until the offset of the vowel was
encountered. A gate boundary was always set at vowel offset, with the result
that the last gate before vowel offset also varied in length for different stimuli.
If the interval between the end of the last preceding gate and the offset of the
vowel was less than 10 msec (i.e. not more than one glottal pulse), then this
last gate was simply increased in length by the necessary amount. If the
interval to vowel offset was more than 10 msec, then an extra gate of variable
length was inserted. After vowel offset the gating sequence then continued in
steady 40 msec increments until the end of the word. Figure 9.1 illustrates
the complete gating sequence computed for one of the English stimuli.
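The gate-placement procedure just described is effectively a small algorithm. The sketch below (Python; the function name and example times are my own, and the snapping of each boundary to a zero crossing is omitted) illustrates the rules: a first gate ending at the fourth glottal pulse after vowel onset, roughly 40 msec steps through the vowel, a forced boundary at vowel offset with remainders under 10 msec merged into the preceding gate, and steady 40 msec steps to the end of the word.

```python
def gate_boundaries(pulse_ends, vowel_onset, vowel_offset, word_end, step=0.040):
    """Compute gate endpoints (seconds) following the rules in the text."""
    # First gate: ends at the fourth glottal pulse after vowel onset.
    after_onset = [t for t in pulse_ends if t > vowel_onset]
    gates = [after_onset[3]]
    # Continue through the vowel in ~40 ms increments.
    t = gates[-1] + step
    while t < vowel_offset:
        gates.append(t)
        t += step
    # Force a boundary at vowel offset; a remainder of < 10 ms is merged
    # into the preceding gate rather than forming a gate of its own.
    if vowel_offset - gates[-1] < 0.010:
        gates[-1] = vowel_offset
    else:
        gates.append(vowel_offset)
    # Steady 40 ms increments from vowel offset to the end of the word.
    t = gates[-1] + step
    while t < word_end:
        gates.append(t)
        t += step
    gates.append(word_end)
    return gates
```

The merge rule guarantees that no gate shorter than about one glottal pulse is created just before the vowel offset.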
The location of the gates for the stimuli was determined using a high-
resolution visual display, assisted by auditory playback. When gates had
been assigned to all of the stimuli, seven different experimental tapes were
then constructed. Three of these were for the Bengali triplets, and each
consisted of three practice items followed by twenty-one test items. The tapes
were organized so that each tape contained an equal number of CVCs,
CVNs, and CṼCs, but only one item from each triplet, so that each subject
heard a given initial CV combination only once during the experiment. A
further two tapes were constructed for the Bengali doublets, again with three
practice items followed by twenty test items, with members of each doublet
assigned one to each tape. The final two tapes, for the English doublets,
followed in structure the Bengali doublet tapes.
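The triplet-tape construction amounts to a Latin-square rotation: each tape receives exactly one member of each triplet, and the stimulus types are balanced across tapes. A minimal sketch of one such rotation (my own illustration, not the authors' actual procedure):

```python
def assign_triplets_to_tapes(triplets):
    """Rotate the three members of each triplet across three tapes so that
    every tape carries exactly one item per triplet and, when the number of
    triplets is a multiple of three, equal numbers of each stimulus type."""
    tapes = [[], [], []]
    for i, triplet in enumerate(triplets):
        for j, item in enumerate(triplet):
            # Member j of triplet i goes to tape (i + j) mod 3.
            tapes[(i + j) % 3].append(item)
    return tapes
```

With the twenty-one triplets described above, each tape would end up with seven items of each of the three stimulus types.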
On each tape, the successive gates were recorded at six-second intervals. A
short warning tone preceded each gate, and a double tone marked the
beginning of a new gating sequence.
Figure 9.1 The complete gating sequence for the English word grade. Gate 0 marks the offset of the vowel

Figure 9.2 Bengali triplets: mean percentage of different types of response (CVC, CṼC, or CVN) to each type of stimulus, plotted across the five gates up to offset of the vowel (gate 0) and continuing for five gates into the consonant. The top panel gives the responses to CVN stimuli, the middle panel the responses to CṼC stimuli, and the bottom panel plots the responses to CVC stimuli
Table 9.2 Bengali triplets: percent responses up to vowel offset

Type of response
CVN and CṼC stimuli shows that the degree of perceived nasalization was
approximately equal for both types of stimulus.
These results are problematic for a surface-representation hypothesis. On
such an account, vowel nasalization is perceptually ambiguous, and responses should be more or less evenly divided between CVNs and CṼCs. To explain the imbalance in favor of CṼC responses, this account would have to
postulate an additional source of bias, operating postperceptually to push
the listener towards the nasal-vowel interpretation rather than the oral-
vowel/nasal-consonant reading. This becomes implausible as soon as we look
at the pattern of responses to CVC stimuli, where oral vowels are followed by
oral consonants.
Performance here is dominated by CVC responses. Already at gate -5 the
proportion of CVC responses is higher than for the CVN or CṼC stimuli,
and remains fairly steady, at around 80 percent, for the next five gates.
Consistent with this, there are essentially no CṼC responses at all. In
contrast, there is a relatively high frequency of CVN responses over the first
five gates. Listeners produce more than twice as many CVN responses, on
average, to CVC stimuli than they do to either CVN or CṼC stimuli.
This is difficult to explain on a surface-representation account. If CVNs
are represented in the recognition lexicon as containing a nasalized vowel
followed by a nasal consonant, then there should be no more reason to
produce CVNs as responses to CVCs than there is to produce CṼCs. And
certainly, there should be no reason to expect CVN responses to be
significantly more frequent to oral vowels than to nasalized vowels. On a
surface-representation hypothesis these responses are simply mistakes -
which leaves unexplained why listeners do not make the same mistake with
CṼC responses.
On the underlying-representation hypothesis, the pattern of results for
CVC stimuli follows directly. The recognition lexicon represents CṼCs as
having a nasal vowel. There is therefore no reason to make a CṼC response
when an oral, non-nasalized vowel is being heard. Both CVCs and CVNs,
however, are represented as having an oral vowel (followed in the one case by
an oral consonant and in the other by a nasal consonant). As far as the
listener's recognition lexicon is concerned, therefore, it is just as appropriate
to give CVNs as responses to oral vowels as it is to give CVCs. The exact
distribution of CVC and CVN responses to CVCs (a ratio of roughly 4 to 1)
presumably reflects the distributional facts of the language, with CVCs being
far more frequent than CVNs.
Figure 9.3 Bengali doublets: mean percentage of different types of response (CVC, CṼC, or CVN) plotted across gates, gate 0 marking the offset of the vowel. The upper panel gives responses to CVN stimuli and the lower panel the responses to CVC stimuli
Type of response

Stimulus    CVC     CṼC     CVN
CVC         82.6    0.0     14.7
CVN         64.7    17.0    15.6
Figure 9.4 Mean percentage of different types of responses (CVC or CVN) to the two English stimulus sets across gates, gate 0 marking the vowel offset. Responses to CVN stimuli are plotted on the upper panel and to CVC stimuli on the lower panel
interpreted as a cue to an underlyingly nasal vowel. For the doublets, this will
mean that the listener will not be able to find any perfect lexical match, since
the CVN being heard is represented in the recognition lexicon as underlyingly oral, and there is no lexically available CṼC. This predicts, as we
observed here, that there should not be a large increase in CVN responses
even when the CVN is lexically ambiguous. A CVC which diverges in some
other feature from the input will be just as good a match as the CVN, at least
until the nasal consonant is heard.
Table 9.4 English doublets: percent responses up to vowel offset

Type of response

Stimulus    CVC     CVN
CVC         83.4    16.6
CVN         59.3    40.7
Table 9.5 CVN responses to CVC stimuli: place effects across gates (percent response)

Gates              -5     -4     -3     -2     -1     0      +1     +2
Correct place      12.0   14.5   15.5   11.5   13.5   10.0   21.0   1.5
Incorrect place    9.5    5.0    6.0    5.0    2.5    1.5    0.0    0.0
Since orality is the universally unmarked case for vowels, oral vowels have no
specification along the oral/nasal dimension. The underlying representation
of the vowel in CVCs and CVNs is therefore blind to the fact that the vowel is
oral. The listener will only stop producing CVNs as responses when it
becomes clear that the following consonant is also oral.
An important aspect, finally, of the results for the English doublets is that
they provide evidence for the generality of the claims we are making here.
Despite the contrasting phonological status of nasality in the Bengali vowel
system as opposed to the English, both languages treat oral vowels in the
same way, and with similar consequences for the ways in which the speakers
of these languages are able to interpret the absence of nasality in a vowel.
Although vowel nasalization has a very different interpretation in Bengali
than in English, leading to exactly opposite perceptual consequences, the
presence of an oral vowel leads to very similar ambiguities for listeners in
both languages. In both Bengali and English, an oral vowel does not
discriminate between CVC and CVN responses.
marked status of the feature but on the listener's assessment of the segment
slots and therefore of the prosodic structure.
This means that information about duration will not be informative in the
same way as spectral information. For example, when the listener hears
nasal-murmur information during the closure of the geminate consonant in a
word like [kanna], the qualitative information that the consonant is a nasal
can be immediately inferred. But the quantity of the consonant (the fact that
it is geminated) will not be deduced - even after approximately 180 msec of
the nasal murmur - until after the point of release of the consonant, where
this also includes some information about the following vowel.
Since geminates in Bengali are intervocalic, duration can only function as a
cue when it is structurally (or prosodically) plausible - in other words, when
there is a following vowel. If, in contrast, length had been marked as a feature
on the consonant in the same way as standard melodic features, then
duration information ought to function as a cue just like nasality. Instead, if
quantity is independently represented, then double-consonant responses to
consonantal quantity will only emerge when there is positive evidence that
the appropriate structural environment is present.
9.4.1 Method
The stimuli consisted of sixteen matched bisyllabic geminate/nongeminate
pairs, differing only in their intervocalic consonant (e.g. [pala]
and [palla]). The contrasting consonants consisted of eight [l]s, seven [n]s,
and one [m]. The primary reason for choosing these sonorant consonants was
that the period of consonantal closure was not silent (as, for example, in
unvoiced stops), but contained a continuous voiced murmur, indicating to
the listener the duration of the closure at any gate preceding the consonantal
release at the end of the closure. The geminates in these pairs were all
underlying geminates. We attempted to match the members of each pair for
frequency, again relying on subjective familiarity judgments.
The stimuli were prepared for use in the gating task in the same way as in
the first experiment. The gating sequence was, however, organized differ-
ently. Each word had six gates (as illustrated in figure 9.5). The first gate
consisted of the CV plus 20 msec of the closure. The second gate included an
extra 40 msec of consonantal-closure information. The third gate was set at the
end of the closure, before any release information was present. This gate
differed in length for geminates and nongeminates, since the closure was more
than twice as long for the geminates. It is at this gate, where listeners have
heard a closure whose duration far exceeds any closure that could be
associated with a nongeminated consonant, that we would expect geminate
responses if listeners were sensitive to duration alone. The fourth gate
Figure 9.5 The complete gating sequences for the Bengali pair kana and kanna, with gates 1-6 marked on each waveform
included the release plus two glottal pulses - enough information to indicate
that there was a vowel even though the vowel quality was not yet clear. The
fifth gate contained another four glottal pulses - making the identity of the
vowel quite clear. The sixth and last gate included the whole word.
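The six-gate scheme can be summarized in a few lines; times are in milliseconds, and the function name and example values are hypothetical. Gate 3 is the boundary whose position directly reflects closure duration, which is why it differs most between geminates and nongeminates.

```python
def geminate_gates(cv_end_ms, closure_end_ms, release_pulse_ends_ms, word_end_ms):
    """Six gate endpoints, following the description in the text:
    (1) CV plus 20 ms of closure; (2) 40 ms more of closure; (3) end of
    closure, before any release information; (4) release plus two glottal
    pulses; (5) four further pulses; (6) the whole word."""
    return [
        cv_end_ms + 20,               # gate 1
        cv_end_ms + 60,               # gate 2
        closure_end_ms,               # gate 3: varies with closure duration
        release_pulse_ends_ms[1],     # gate 4: release + two pulses
        release_pulse_ends_ms[5],     # gate 5: + four more pulses
        word_end_ms,                  # gate 6
    ]
```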
Two tapes were constructed, with each tape containing one member of
each geminate/nongeminate pair, for a total of eight geminates and eight
nongeminates. Three practice items preceded the test items. A total of
twenty-eight subjects were tested, fourteen for each tape. No subject heard
more than one tape. The subjects were literate native speakers of Bengali,
tested in Calcutta. The testing procedure was the same as before (see section
9.3.3), except for the fact that here the subjects were instructed to respond
with bisyllabic words. Bisyllabic responses allowed us to determine whether
Figure 9.6 Mean percentage of geminate responses for the geminate and nongeminate stimuli plotted across the six gates
9.5 Conclusions
In this paper we have sketched the outline of a psycholinguistic model of the
representation and processing of lexical form, combining the cohort model of
lexical access and selection with a set of assumptions about the contents of
the recognition lexicon that derive from current phonological theory. The
joint predictions of this model, specifying how information in the speech
signal should be interpreted relative to abstract, multilevel, underspecified,
lexical representations, were evaluated in two experiments. The first of these
investigated the processing of a melodic feature, under conditions where its
phonological status varied cross-linguistically. The second investigated the
interpretation of quantity information in the signal, as a function of cues to
the structural organization of the word being heard. In each case, the
responses patterned in a way that was predicted by the model, suggesting that
the perceptually relevant representations of lexical form are indeed functionally isomorphic to the kinds of representations specified in phonological
theory.
If this conclusion is correct, then there are two major sets of implications.
Where psycholinguistic models are concerned, this provides, perhaps for the
first time, the possibility of a coherent basis for processing models of lexical
access and selection. These models cannot be made sufficiently precise
without a proper specification of the perceptual targets of these processes,
where these targets are the mental representations of lexical form. Our
research suggests that phonological concepts can provide the basis for a
solution to this problem.
The second set of implications is for phonological theory. If it is indeed
the case that the contents of the mental-recognition lexicon can, in broad
outline, be characterized in phonological terms, then this suggests a much
closer link between phonological theory and experimental research into
language processing than has normally been considered by phonologists (or
indeed by psychologists). Certainly, if a phonological theory has as its goal
the making of statements about mental representations of linguistic knowledge, then the evaluation of its claims about the properties of these
representations will have to take into account experimental research into the
properties of the recognition lexicon.
Comments on chapter 9
JOHN J. OHALA
I applaud Lahiri and Marslen-Wilson for putting underspecification and
other phonological hypotheses into the empirical arena.* I share their view
that phonological hypotheses that purport to explain how language is
represented in the brain of the speaker have to be evaluated via psycho-
linguistic tests. It is a reasonable proposal that the way speakers recognize
heard words should be influenced by the form of their lexical representation,
about which underspecification has made some very specific claims.
Still, this paper leaves me very confused. Among its working hypotheses
are two that seem to me to be absolutely contradictory. The first of these is
that word recognition will be performed by making matches between the
redundant input form and an underlying lexical form which contains no
predictable information. The second is that the match is made between the
input form and a derived form which includes surface redundancies. Previous
work by Marslen-Wilson provides convincing support for the second
hypothesis. He showed that as soon as a lexically distinguishing feature
appeared in a word, listeners would be able to make use of it in a word-
recognition task; e.g., the nasalization of the vowel in drown could be used to
identify the word well before the actual nasal consonant appeared. Similar
results had earlier been obtained by Ali et al. (1971). Such vowel nasalization
is regarded by Lahiri and Marslen-Wilson as predictable and therefore not
present in the underlying representation.
In the Bengali triplet experiment the higher number of CVN responses
(than CṼC responses) to CVC stimuli is attributed to the vowel in CVN's
underlying representation being unspecified for [nasal], i.e., that to the
listener the oral vowel in CVC could be confused with the underlying "oral"
vowel in CVN. A similar account is given for the results of the Bengali
doublet experiment (the results of which are odd since they include responses
for CṼC even though there were supposedly no words of this sort in the
language). On the other hand, the progressive increase in the CVN responses
to English CVN stimuli is interpreted as due to listeners being able to take
advantage of the unambiguous cue which the redundant vowel nasalization
offers to the nature of the postvocalic consonant. Whatever the results, it
seems, one could invoke the listener having or not having access to
predictable surface features of words to explain them.
Actually, given the earlier results of Ali et al. that a majority of listeners,
hearing only the first half of a CVC or CVN syllable, could nevertheless
discriminate them, Lahiri and Marslen-Wilson's results in the English
experiment (summarized in fig. 9.4) are puzzling. Why did CVN responses
exceed the CVC responses only at the truncation point made at the VN
junction? One would have expected listeners to be able to tell the vowel was
nasalized well before this point. I suspect that part of the explanation lies in
the way the stimuli were presented, i.e. the shortest stimulus followed by the
next shortest, and so on through to the longest and least truncated one. What
one is likely to get in such a case is a kind of "hysteresis" effect, where
subjects' judgments on a given stimulus in the series are influenced by their
judgment on the preceding one. This effect has been studied by Frederiksen
(1967), who also reviews prior work. In the present case, since a vowel in a
CVN syllable is least nasalized at the beginning of the vowel, listeners' initial
judgments that the syllable is not one with a nasalized vowel are reasonable,
but then the hysteresis effect would make them retain that judgment even
though subsequent stimuli present more auditory evidence to the contrary.
Whether this is the explanation for the relatively low CVN responses is easily
checked by doing this experiment with the stimuli randomized and
unblocked, although this would require other changes in the test design so
that presentation of the full stimulus would not bias judgments on the
truncated stimuli. This could be done in a number of ways, e.g. by presenting
stimuli with different truncations to different sets of listeners.
Lahiri and Marslen-Wilson remark that the greater number of CṼC over
CVN responses to the CVN stimuli in the Bengali triplet study
are problematic for a surface-representation hypothesis. On such an account, vowel
nasalization is perceptually ambiguous, and responses should be more or less evenly
divided between CVNs and CṼCs. To explain the imbalance in favor of CṼC
responses, this account would have to postulate an additional source of bias,
operating postperceptually.
Such a perceptual bias has been previously discussed: Ohala (1981b, 1985a,
1986, forthcoming) and Ohala and Feder (1987) have presented phonological
and phonetic evidence that listeners are aware of predictable cooccurrences
of phonetic events in speech and react differently to predictable events than
unpredictable ones. If nasalization is a predictable feature of vowels adjacent
to nasal consonants, it is discounted in that it is noticed less than that on
vowels not adjacent to nasal consonants. Kawasaki (1986) has presented
experimental evidence supporting this. With a nasal consonant present at the
end of a word, the nasalization on the vowel is camouflaged. When Lahiri
and Marslen-Wilson's subjects heard the CVN stimuli without the final N
they heard nasalized vowels uncamouflaged and would thus find the nasali-
zation more salient and distinctive (since the vowel nasalization cannot be
"blamed" on any nearby nasal). They would therefore think of CVC stimuli
first. This constitutes an extension to a very fine-grained phonetic level of the
same kind of predictability that Warren (1970) demonstrated at lexical,
syntactic, and pragmatic levels. Even though listeners may show some bias in
interpreting incoming signals depending on the redundancies that exist
between various elements of the signal, it does not seem warranted to jump to
the conclusion that all predictable elements are unrepresented at the deepest
levels.
The result in the geminate experiment, that listeners favored responses
with underlying geminates over derived geminates, was interpreted by Lahiri
and Marslen-Wilson as due to listeners opting preferentially for a word with
a lexical specification of gemination (and thus supporting the notion that derived geminates, those that arose from original transmorphemic heterorganic clusters, e.g. -rl- > -ll-, have a different underlying representation).
However, there is another possible interpretation for these data that relies
more on morphology than phonology. Taft (1978, 1979) found that upon
hearing [deiz] subjects tended to identify the word as daze rather than days,
even though the latter was a far more common word. He suggested that
uninflected words are the preferred candidates for word identification when
there is some ambiguity between inflected and uninflected choices. Günther
(1988) later demonstrated that, rather than being simply a matter of inflected
vs. uninflected, it was more that base forms are preferred over oblique forms.
By either interpretation, a Bengali word such as /palla/ "scale" would be the
preferred response over the morphologically complex or oblique form /pallo/
"to be able to (third-person past)."
At present, then, I think the claims made by underspecification theory and
metrical phonology about the lexical representation of words are unproven.
Commendably, Lahiri and Marslen-Wilson have shown us in a general way
how perceptual studies can be brought to bear on such phonological issues;
with refinements, I am sure that the evidential value of such experiments can
be improved.
Comments on chapter 9
CATHERINE P. BROWMAN
Lahiri and Marslen-Wilson's study explores whether the processes of lexical
access and selection proceed with respect to underlying abstract phonological
representations or with respect to surface forms. Using a gating task, two
contrasts are explored: the oral/nasal contrast in English and Bengali, and
the single/geminate contrast in Bengali. Lahiri and Marslen-Wilson argue
that for both contrasts, it is the underlying form (rather than the surface
form) that is important in explaining the results of the gating task, where the
underlying forms are assumed to differ for the two contrasts, as follows
(replacing their CV notation with X notation):
(1) Bengali oral/nasal contrast:

    V N        Ṽ C        V C
    X X        X X        X X
      |        |
    [+nas]     [+nas]

(2) Bengali singleton/geminate contrast:

    C:         C
    X X        X
The English contrast is like the Bengali, but is only two-way (there are no
lexically distinctive nasalized vowels in English). On the surface, the vowel
preceding a nasal consonant is nasalized in both Bengali and English.
The underlying relation between the x-tier and the featural tiers is the same
for the nasalized vowel and the singleton consonant - in both cases, the
relevant featural information is associated with a single x (timing unit). This
relation differs for the nasal consonant and the geminate consonant - the
relevant featural information ([nas]) for the nasal consonant is associated
with a single x, but the featural information for the geminate is associated
with two timing units. However, this latter assumption about underlying
structure misses an important generalization about similarities between the
behaviour, on the gating task, of the nasals and the geminates. These
similarities can be captured if the nasalization for the nasal consonant is
assumed to be underlyingly associated with two timing units, as in (3), rather
than with a single timing unit, as in (1).
(3) Proposed Bengali oral/nasal contrast:

    V N        Ṽ C        V C
    X X        X X        X X
    \ /        |
    [+nas]     [+nas]
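The contrast between the two analyses can be made concrete with a toy data structure in which each feature records the timing slots it is linked to; this encoding is my own illustration, not Browman's formalism. On the proposed analysis, [+nas] in a VN form has the same two-slot association that defines an oral geminate, which is what licenses treating the two alike on the gating task.

```python
# Minimal sketch: a form is a list of timing slots (x-tier), and each
# feature records which slot indices it is associated with.
class Form:
    def __init__(self, slots, links):
        self.slots = slots          # e.g. ["V", "N"]
        self.links = links          # feature -> tuple of slot indices

    def is_geminate_link(self, feature):
        """True when a feature is associated with two timing units - the
        configuration whose structural description is only met once both
        slots have acoustic support."""
        return len(self.links.get(feature, ())) == 2

# (1) standard analysis: [+nas] linked to the N slot only
vn_standard = Form(["V", "N"], {"+nas": (1,)})
# (3) proposed analysis: [+nas] linked to both the V and N slots
vn_proposed = Form(["V", "N"], {"+nas": (0, 1)})
# oral geminate: consonantal melody linked to two slots
geminate = Form(["C", "C"], {"cons_melody": (0, 1)})
```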
Figure 9.7 Schematic relations among articulations and speech envelope, assuming a consonant that employs the tongue: (a) nasal consonant preceded by nasalized vowel, with traces for velum, tongue, and speech envelope (schematic tongue articulation for consonant only); (b) geminate consonant, with traces for tongue and speech envelope
strates that, in English, the velopharyngeal port opens as rapidly for syllable-
final nasals as for syllable-initial nasals, but is held open much longer
(throughout the preceding vowel). This is analogous to the articulatory
difference between Finnish singleton and geminate labial consonants, cur-
rently being investigated at Yale by Margaret Dunn.
If the nasal in VN sequences is considered to be an underlying geminate,
then the similar behavior of the nasals and oral geminates on the gating task
can be explained in the same way: lexical geminates are not accessed until
their complete structural description is met. The difference between nasal
(VN) and oral (C:) geminates then lies in when the structural description is
met, rather than in their underlying representation. In both cases, it is the
next acoustic event that is important. As can be seen in figure 9.7, this event
occurs earlier for VN than for C:. For VN, it is clear that the nasal is a
geminate when the acoustic signal changes from the nasalized vowel to the
nasal consonant, whereas for C:, it is not until the following vowel that the
acoustic signal changes.
The preference for oral responses during the vowel of the VN in the
doublets experiment would follow from the structural description for the
nasal geminate not yet being met, combined with a possible tendency for
"oral" vowels to be slightly nasalized (Henderson 1984, for Hindi and
English). This interpretation would also predict that the vowel quality should
differ in oral responses to VN stimuli, since the acoustic information
associated with the nasalization should be (partially) interpreted as vowel-
quality information rather than as nasalization.
10
FRANCIS NOLAN
10.1 Introduction
Millennia of alphabetic writing can leave little doubt as to the utility of
phoneme-sized segments in linguistic description.* Western thought is so
dominated by this so successful way of representing language visually that
the linguistic sciences have tended to incorporate the phoneme-sized segment
(henceforth "segment") axiomatically. But, as participants in a laboratory-
phonology conference will be the last to need reminding, the descriptive
domain of the segment is limited.
All work examining the detail of speech performance has worried about
the relation between the discreteness of a segmental representation, on the
one hand, and, on the other, the physical speech event, which is more nearly
continuous and where such discrete events as can be discerned may correspond poorly with segments. One response has been to seek principles
governing a process of translation1 between a presumed symbolic, segmental representation as input to the speech-production mechanism, and the
overlapping or blended activities observable in the speech event.
In phonology, too, the recognition of patternings which involve domains
other than the segment, and the apparent potential for a phonetic component
to behave autonomously from the segment(s) over which it stretches, have
been part of the motivation for sporadic attempts to free phonological
description from a purely segmental cast. Harris (1944), and Firth (1948),
*The work on place assimilation reported here, with the exception of that done by Martin Barry,
was funded as part of grants C00232227 and R000231056 from the Economic and Social
Research Council, and carried out by Paul Kerswill, Susan Wright, and Howard Cobb,
successively. I am very grateful to the above-named for their work and ideas; they may not, of
course, be fully in agreement with the interpretations given in this paper.
1 The term "translation" was popularized by Fowler (e.g. Fowler et al. 1980). More traditionally, the process is called "(phonetic) implementation."
may be seen as forerunners to the current very extensive exploration of the
effects of loosening the segmental constraints on phonological description
under the general heading of autosegmental phonology. Indeed, so successful
has this current phonological paradigm been at providing insights into
certain phonological patterns that its proponents might jib at its being called
a "sporadic attempt."
Assimilation is one aspect of segmental phonology which has been
revealingly represented within autosegmental notation. Assimilation is taken
to be where two distinct underlying segments abut, and one "adopts"
characteristics of the other to become more similar, or even identical, to it, as
in cases such as [griːm peint] green paint, [reg kɑː] red car, [bæd̪ θɔːts] bad
thoughts. A purely segmental model would have to treat this as a substitution
of a complete segment. Most modern phonology, of course, would treat the
process in terms of features. Conventionally this would mean an assimilation
rule of the following type:2
Such a notation fails to show why certain subsets of features, and not other
subsets, seem to operate in unison in such assimilations, thus in this example
failing to capture the traditional insight that these changes involve assimilation of place of articulation. The notation also implies an active matching
of feature values, and (to the extent that such phonological representations
can be thought of as having relevance to production) a repeated issuing of
identical articulatory "commands."
A more attractive representation of assimilation can be offered within an
essentially autosegmental framework. Figure 10.1, adapted from Clements
(1985), shows that if features are represented as being hierarchically
organized on the basis of functional groupings, and if each segmental "slot"
in the time course of an utterance is associated with nodes at the different
levels of the hierarchy, then an assimilation can be represented as the deletion
of an association to one or more lower nodes and a reassociation to the
equivalent node of the following time slot. The hierarchical organization of
the features captures notions such as "place of articulation"; and the
autosegmental mechanism of deletion and reassociation seems more in tune
with an intuitive conception of assimilation as a kind of programmed "short-cut" in the phonetic plan to save the articulators the bother of making one part of a complex gesture.

2 This formulation is for exemplification only, and ignores the question of the degree of optionality of place assimilation in different contexts.

Figure 10.1 (association structure, after Clements 1985): timing tier, root tier, laryngeal tier, supralaryngeal tier, place tier
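The delinking-and-reassociation analysis can be caricatured in code as a discrete relink of one segment's place node to that of its neighbor (toy structures of my own, not Clements's notation); it makes explicit the all-or-none switch that the experiments reported below call into question.

```python
# Sketch of place assimilation as autosegmental delinking/relinking:
# each segment is a dict whose "place" value stands in for its place node.
def assimilate_place(segments, i):
    """Delink segment i's place node and reassociate it to the place node
    of the following segment, e.g. the final nasal of green before paint."""
    out = [dict(s) for s in segments]        # leave the input unmodified
    out[i]["place"] = out[i + 1]["place"]    # relink to the following place node
    return out

green_paint = [{"seg": "n", "place": "alveolar"},
               {"seg": "p", "place": "labial"}]
assimilated = assimilate_place(green_paint, 0)
# the nasal now shares the labial place node, i.e. it surfaces as [m]
```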
But even this notation, although it breaks away from a strict linear
sequence of segments, bears the hallmark of the segmental tradition. In
particular, it still portrays assimilation as a discrete switch from one subset of
segment values to another. How well does this fit the facts of assimilation?
Unfortunately, it is not clear how much reliance we can place on the "facts"
as known, since much of what is assumed is based on a framework of
phonetic description which itself is dominated by the discrete segment.
This paper presents findings from work carried out in Cambridge aimed at
investigating assimilation experimentally. In particular, the experiments
address the following questions:
1 Does articulation mirror the discrete change implied by phonetic and
phonological representations of assimilation?
2 If assimilation turns out to be a gradual process, how is the articulatory
continuum of forms responded to perceptually?
In the light of the experimental findings, the status of the segment in phonetic
description is reconsidered, and the nature of the representation that serves as
input to the production mechanism is discussed.
Segment
of the palate with areas of contact marked. Each frame shows the pattern of
contact for a 1/100 second interval. For more details of the technique see
Hardcastle (1972).
The experimental method in the early EPG work on assimilation, as
reported for instance in Kerswill (1985) and Barry (1985), exploited (near)
minimal pairs such as . . . maid couldn't... and . . . Craig couldn't... In these
the test item (maid couldn't) contains, lexically, an alveolar at a potential
place-of-articulation assimilation site, that is, before a velar or labial, and the
control item contains lexically the relevant all-velar (or all-labial) sequence.
It immediately became obvious from the EPG data that the answer to
question 1 above, whether articulation mirrors the discrete change implied by
phonetic and phonological representations of assimilation, is "no." For
tokens with lexical alveolars, a continuum of contact patterns was found,
ranging from complete occlusion at the alveolar ridge to patterns which were
indistinguishable from those of the relevant lexical nonalveolar sequence.
Figure 10.2 shows, for the utterance . . . late calls... spoken by subject WJ,
a range of degrees of accomplishment of the alveolar occlusion, and for
comparison a token by WJ of the control utterance . . . make calls . . . In all
cases just the medial consonant sequence and a short part of the abutting
vowels are shown. In each numbered frame of EPG data the alveolar ridge is
at the top of the schematic plan of the palate, and the bottom row of
electrodes corresponds approximately to the back of the hard palate. Panel
(a) shows a complete alveolar occlusion (frames 0159 to 0163). Panels (b) and
(c) show tokens where tongue contact extends well forward along the sides of
the palate, but closure across the alveolar ridge is lacking. Panel (d) shows the
lexical all-velar sequence in make calls.
Figure 10.3 presents tokens spoken by KG of . . . boat covered... and
. . . oak cupboard... In (a) there is a complete alveolar closure; almost
certainly the lack of contact at the left side of the palate just means that the
stop is sealed by this speaker, in this vowel context, rather below the leftmost
line of electrodes, perhaps against the teeth. The pattern in (b) is very similar,
except that the gesture towards the alveolar ridge has not completed the
closure. In (c), however, although it is a third token of boat covered, the
pattern is almost identical to that for oak cupboard in (d). In particular, in
both (c) and (d), at no point in the stop sequence is there contact further
forward than row 4.
Both Barry (1985) and Kerswill (1985) were interested in the effect of
speaking rate on connected-speech processes (CSPs) such as assimilation,
and their subjects were asked to produce their tokens at a variety of rates. In
general, there was a tendency to make less alveolar contact in faster tokens,
but some evidence also emerged that when asked to speak fast but "care-
fully," speakers could override this tendency. However, the main point to be
(a) ...late calls...
(b) ...late calls...
(c) ...late calls...
[Figure 10.2: EPG contact-pattern frames for tokens of ...late calls... (subject WJ); diagrams not reproduced]
(a) ...boat covered...
(b) ...boat covered...
(c) ...boat covered...
(d) ...oak cupboard...
Figure 10.3 EPG patterns for consonant sequences (subject KG) [EPG frame diagrams not reproduced]
Table 10.1 EPG analysis of place assimilation: number of occurrences of
articulation types in different speaking styles. On the left, the average of three
speakers (Barry 1985: table 2) speaking at normal conversational speed and
fast; and on the right, speaker ML (Kerswill 1985: table 2) speaking slowly and
carefully, normally, fast but carefully, and fast

                    Barry (3 speakers)   Kerswill (speaker ML)
                    Normal   Fast        Slow and careful   Normal   Fast and careful   Fast
Full alveolar         23      15                10             2            3             0
Residual alveolar     14      15                 5             8            3             2
Zero alveolar         11      18                 5            10           14            18
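The rate effect that Barry and Kerswill report is easier to see if the counts in table 10.1 are read as proportions. A minimal sketch, using only the counts from the table (the style labels are abbreviated by me):

```python
# Share of zero-alveolar tokens per speaking style, computed from the
# counts in table 10.1 (Barry 1985; Kerswill 1985).
# Tuples are (full alveolar, residual alveolar, zero alveolar).
counts = {
    "Barry normal":           (23, 14, 11),
    "Barry fast":             (15, 15, 18),
    "Kerswill slow, careful": (10,  5,  5),
    "Kerswill normal":        ( 2,  8, 10),
    "Kerswill fast, careful": ( 3,  3, 14),
    "Kerswill fast":          ( 0,  2, 18),
}
for style, (full, residual, zero) in counts.items():
    share = zero / (full + residual + zero)
    print(f"{style}: {share:.0%} zero alveolar")
```

The proportions rise with rate within each data set, while the "fast but careful" column sits between "normal" and "fast", matching the observation that careful speakers can partly override the tendency.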
These techniques are outside the scope of the present projects; but the
possibility of alveolar traces remaining in zero alveolars is approached via an
alternative, perceptual, route in section 10.3.
Despite these limitations, these experiments make clear that the answer to
question 1 is that assimilation of place of articulation is, at least in part, a
gradual process rather than a discrete change.
Table 10.2 Sentence pairs used in the identification
and analysis tests [table contents not reproduced]
Figure 10.4 Identification test: percentage of /d/ responses (by phoneticians and students
separately) broken down by articulation type. The articulation type of a token was determined
by EPG. The articulation types are (1) full alveolar, (2) residual alveolar, (3) zero alveolar, (4)
nonalveolar
[Figure panels (Students, Phoneticians): diagrams not reproduced]
[EPG frames: Nonalveolar (leg) and Nonalveolar (beg); diagrams not reproduced]
Figure 10.6 "Comparison" identification test: percentage correct identification of target alveo-
lar/nonalveolar words from minimal-pair sentences. Each nonalveolar sentence token was
paired with tokens classified as (1) full alveolar, (2) residual alveolar, (3) zero alveolar
In all, ten sentence pairs were used (the seven in table 10.2 plus three with
final nasals), and with each of three "alveolar" articulation-types paired with
the nonalveolar control and presented in four conditions (to achieve balance
as described above), the test tape contained 120 trials.
The results, for twenty naive listeners, are summarized in figure 10.6. The
percentage of correct identifications as either an alveolar or a nonalveolar
word is shown for stimuli consisting of, respectively, a full-alveolar token (1),
a residual-alveolar token (2), and a zero-alveolar token (3) each paired with
the relevant nonalveolar control. It can be seen that even in the case of pairs
containing zero alveolars, correct identification, at 66 percent, is well above
chance (50 percent) - significantly so (p< 0.0001) according to preliminary
statistical analysis using a chi-squared test. Inspection of individual lexical
items reveals, as found in the phoneticians' "considered" identifications, that
some pairs are harder to identify than others.
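The reported significance level is easy to reconstruct. A minimal sketch of the chi-squared computation, under the assumption that the zero-alveolar condition contributed 800 judgments (twenty listeners by forty zero-alveolar trials, as the test design suggests) at 66 percent correct; the exact raw counts are not given in the text:

```python
# One-degree-of-freedom chi-squared test against 50% chance, of the kind
# reported for the zero-alveolar pairs. The 66% correct figure is from the
# text; the trial count (20 listeners x 40 zero-alveolar trials = 800) is
# inferred from the test design, so the raw counts are illustrative.

def chi_squared_vs_chance(n_correct, n_trials):
    """Goodness-of-fit statistic for correct/incorrect counts versus chance."""
    expected = n_trials / 2                 # chance expectation per outcome
    n_incorrect = n_trials - n_correct
    return ((n_correct - expected) ** 2 / expected
            + (n_incorrect - expected) ** 2 / expected)

chi2 = chi_squared_vs_chance(528, 800)      # 66 percent of 800 judgments
# With df = 1, the critical value for p = 0.0001 is about 15.1, so a
# statistic of this size is significant at well beyond that level.
print(round(chi2, 2))
```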
The finding of better-than-chance performance in zero-alveolar pairs is
hard to explain unless the lexical distinction of alveolar versus nonalveolar
does leave traces in the articulatory gesture, even when there is no EPG
evidence of an alveolar gesture. More generally, the finding is in accord with
the hypothesis that differences in lexical phonological form will always result
in distinct articulatory gestures. Notice, furthermore, that listeners can not
only discriminate the zero-alveolar/nonalveolar pairs, but are able to relate
the nature of the difference to the intended lexical form. This does not, of course,
conflict with the apparent inability of listeners to identify zero alveolars
"correctly" in isolated one-pass listening; the active use of such fine phonetic
detail may only be possible at all when "anchoring" within the speaker's
system is available.
alveolar/nonalveolar contrast, showing both that residual cues to the place
distinction survive even here, and that listeners are at some level aware of the
nature of these cues.
How, then, should place assimilation be modeled? It is certain that the
phonetic facts are more complicated than the discrete switch in node
association implied by figure 10.1. This would only, presumably, account for
any cases where the realization becomes identical with the equivalent
underlying nonalveolar.
One straightforward solution presents itself within the notational frame-
work of autosegmental phonology. The node on the supralaryngeal tier for
the first consonant could be associated to the place node of the second
consonant without losing its association to its "own" place features. This
could be interpreted as meaning, for the first consonant, "achieve the place
target for the second segment, without entirely losing the original features of
the first," thus giving rise to residual alveolars (stage [b] below). A further
process would complete the place assimilation (stage [c]):
[Diagram: supralaryngeal and place tiers at stages (a), (b), and (c); not reproduced]
"co-production" (e.g. Fowler 1985: 254ff.). This conceptualization of co-
articulation sees a segment as having constituent gestures associated with it
which extend over a given timespan. The constituent gestures do not
necessarily all start and stop at the same time, and they will certainly overlap
with gestures for abutting segments. Typically, vowels are spoken of as being
coproduced with consonants, but presumably in a similar way adjacent
consonants may be coproduced. The dentality of the (normally alveolar)
lateral, and the velarization of the dental fricative, in a word such as filth,
might be seen as the result of coproduction, presumably as a result of those
characteristics of tongue-tip, and tongue-body, gestures having respectively a
more extensive domain.
Problems with such an account, based on characteristics of the vocal
mechanism and its functioning, soon emerge when it is applied to what is
known about place assimilation. For one thing, the same question arises as
with the phonological solution rejected above: namely, that true coproduc-
tion would lead not to assimilation, but to the simultaneous achievement of
both place targets - double stops again. To predict the occurrence of
residual-alveolar and zero-alveolar forms it might be possible to enhance the
coproduction model with distinctions between syllable-final and syllable-
initial consonants, and a convention that targets for the former are given
lower priority; but this seems to be creeping away from the spirit of an
account based solely in the vocal mechanism. Even then, unless it were found
that the degree of loss of the first target were rigidly correlated with rate of
speech, it is not clear how the continuum of articulation types observed could
be explained mechanically.
More serious problems arise from the fact that place-assimilation behavior
is far from universal. This observation runs counter to what would be
expected if the behavior were the result of the vocal mechanism. Evidence of
variation in how adjacent stops of different places of articulation are treated
is far from extensive, but then it has probably not been widely sought up to
now. However, there are already indications of such variation. Kerswill
(1987: 42, 44) notes, on the basis of an auditory study, an absence or near
absence in Durham English of place assimilation where it would be expected
in many varieties of English. And in Russian, it has been traditionally noted
that dental/alveolar to velar place assimilation is much less extensive than in
many other languages, an observation which has been provisionally con-
firmed by Barry (1988) using electropalatography.
If it becomes firmly established that place assimilation is variable across
languages, it will mean that it is a phenomenon over which speakers have
control. This will provide further evidence that a greater amount of phonetic
detail is specified in the speaker's phonetic representation or phonetic plan than
is often assumed. Compare, for instance, similar arguments recently used in
connection with stop epenthesis (Fourakis and Port 1986) and the micro-
timing of voicing in obstruents (Docherty 1989).
It may, or may not, have been striking that so far this discussion has tacitly
accepted a view which has been commonplace among phonologists particu-
larly since the "generative" era: namely, that the "performance" domain
relevant to phonology is production. This, of course, has not always been the
case. Jakobson, Fant, and Halle (1952: 12) argued very cogently that the
perceptual domain is the one most relevant to phonology:
The closer we are in our investigation to the destination of the message (i.e. its
perception by the receiver), the more accurately can we gage the information
conveyed by its sound shape ... Each of the consecutive stages, from articulation to
perception, may be predicted from the preceding stage. Since with each subsequent
stage the selectivity increases, this predictability is irreversible and some variables of
any antecedent stage are irrelevant for the subsequent stage.
Does this then provide the key to the phonological treatment of assimilation?
Suppose for a moment that residual-alveolar forms were, like zero-alveolar
forms, indistinguishable from lexical nonalveolars in one-pass listening.
Would it then, on the "perceptual-primacy" view of the relation of phono-
logy to performance, be legitimate to revert to the treatment of assimilation
shown in figure 10.1 - saying in effect that as far as the domain most relevant
to the sound patterning of language is concerned, assimilation happens quite
discretely?
This is an intriguing possibility for that hypothetical state of affairs, but
out of step with the actual findings of the identification experiment, in which
residual-alveolar forms allowed a substantial degree of correct lexical identi-
fication. For a given stimulus, of course, the structure of the experiment
forces a discrete response (one word or the other); but the overall picture is
one in which up to a certain point perception is able to make use of partial
cues to alveolarity.
Whatever the correct answer may be in this case, the facts of assimilation
highlight a problem which will become more and more acute as phonologists
penetrate further towards the level of phonetic detail: that is, the problem of
what precisely a phonology is modeling.
One type of difficulty emerged from the fact that the level of phonetic
detail constitutes the interface between linguistic structure and speech
performance, and therefore between a discrete symbol system and an
essentially continuous event. It is often difficult to tell, from speech-
performance data itself (such as that presented here on assimilation), which
effects are appropriately modeled symbolically and which treated as
continuous, more-or-less effects.
A further difficulty emerged from the fact that phonetic detail does not
present itself for analysis directly and unambiguously. Phonetic detail can
only be gleaned by examining speech performance, and speech performance
has different facets: production, the acoustic signal, and perception, at least.
Perhaps surprisingly, these facets are not always isomorphic. For instance,
experimental investigation of phonetic detail appears to throw up cases
where produced distinctions are not perceived (see work on sound changes in
progress, such as the experiment of Costa and Mattingly [1981] on New
England [USA] English, which revealed a surviving measurable vowel-
duration difference in otherwise minimal pairs such as cod-card, which their
listeners were unable to exploit perceptually). In such cases, what is the
reality of the linguistic structure the phonologist is trying to model?
10.5 Conclusions
This paper has aimed to show that place assimilation is a fruitful topic of
study at the interface of phonology and experimental phonetics. It has been
found that place assimilation happens gradually, rather than discretely, in
production; and that residual cues to alveolars can be exploited with some
degree of success in perception.
It is argued that the facts of place assimilation can be neither modeled
adequately at a symbolic, phonological level, nor left to be accounted for by
the mechanics of the speech mechanism. Instead, they must be treated as one
of those areas of subcontrastive phonetic detail over which speakers have
control. The representation of such phenomena is likely to require a more
radical break from traditional segmental notions than witnessed in recent
phonological developments.
Clearly, much remains to be done, both in terms of better establishing the
facts of assimilation of place of articulation, and, of course, of other aspects
of production, and in terms of modeling them. It is hoped that this paper will
provoke others to consider applying their own techniques and talents to this
enterprise.
Comments on chapter 10
BRUCE HAYES
The research reported by Nolan is of potential importance for both phonetic
and phonological theory. The central claim is as follows. The rule in (3),
which is often taught to beginning phonology students, derives incorrect
outputs. In fluent speech, the /t/ of late calls usually does not become a /k/,
but rather becomes a doubly articulated stop, with both a velar and an
alveolar closure: [lei{t}kɔ:lz]. The alveolar closure varies greatly in its
strength, from full to undetectable.
(3) [alveolar stop] → [α place] / ___ [C, α place]
Nolan takes care to point out that this phenomenon is linguistic and not
physiological - other dialects of English, and other languages, do not show
the same behavior. This means that a full account of English phonology and
phonetics must provide an explicit description of what is going on.
Nolan also argues that current views of phonological structure are
inadequate to account for the data, suggesting that "The representation of
such phenomena is likely to require a more radical break from traditional
segmental notions than witnessed in recent phonological developments."
As a phonologist, I would like to begin to take up this challenge: to suggest
formal mechanisms by which Nolan's observations can be described with
explicit phonological and phonetic derivations. In fact, I think that ideas
already in the literature can bring us a fair distance towards an explicit
account. In particular, I want to show first that an improved phonological
analysis can bring us closer to the phonetic facts; and second, by adopting an
explicit phonetic representation, we can arrive at least at a tentative account
of Nolan's data.
Consider first Nolan's reasons for rejecting phonological accounts of the
facts. In his paper, he assumes the model of segment structure due to
Clements (1985), in which features are grouped within the segment in a
hierarchical structure. For Clements, the place features are grouped together
under a single PLACE node, as in (4a). Regressive place assimilation would be
expressed by spreading the PLACE node leftward, as in (4b).
(4) a. [Diagram: the place features grouped under a single PLACE node]
    b. [Diagram: regressive place assimilation as leftward spreading of the PLACE node across C slots]
A difficulty with this account, as Nolan points out, is that it fails to indicate
that the articulation of the coronal segment is weak and variable, whereas
that of the following noncoronal is robust. However, it should be remem-
bered that (4b) is meant as a phonological representation. There are good
reasons why such representations should not contain quantitative inform-
ation. The proper level at which to describe variability of closure is actually
the phonetic level.
I think that there is something more fundamentally wrong with the rule in
(4b): it derives outputs that are qualitatively incorrect. If we follow standard
phonological assumptions, (4b) would not derive a double articulated
segment, but rather a contour segment. The rule is completely analogous to
the tonal rule in (5), which derives a contour falling tone from a High tone by
spreading.
(5) [Diagram: a High tone with a spread Low tone, realized as a falling contour tone]
Following this analogy, the output of rule (4b) would be a contour segment,
which would shift rapidly from one place of articulation to another.
If we are going to develop an adequate formal account of Nolan's findings,
we will need phonological and phonetic representations that can depict
articulation in greater detail. In fact, just such representations have been
proposed in work by Sagey (1986a), Ladefoged and Maddieson (1989), and
others. The crucial idea is shown in (6): rather than simply dominating a set
of features, the PLACE node dominates intermediate nodes, corresponding to
the three main oral articulators: LABIAL for the lips, CORONAL for the tongue
blade, and DORSAL for the tongue body.
(6) [Diagram: the PLACE node dominating the articulator nodes LABIAL, CORONAL, and DORSAL; in a complex segment, PLACE dominates two articulator nodes (e.g. LABIAL and DORSAL) at once]
Note that the LABIAL and DORSAL nodes are intended to be simultaneous, not
sequenced.
Representations like these are obviously relevant to Nolan's findings,
because he shows that at the surface, English has complex segments: for
example, the boldface segment in late calls [lei{t}kɔ:lz] is a coronovelar, and
the boldface segment in good batch [gʊ{d}bætʃ] is a labiocoronal.
A rule to derive the complex segments of English is stated in (8). The rule
says that if a syllable-final coronal stop is followed by an obstruent, then the
articulator node of the following obstruent is spread leftward, sharing the
PLACE node with the CORONAL node.
(8) [Diagram: two C slots; the DORSAL node of the second consonant spreads leftward, sharing the first consonant's PLACE node with its CORONAL node]
There is an additional issue involved in the expression of (8): the rule must generalize over the
class of articulator nodes without actually spreading the PLACE node that dominates them.
Choi (1989), based on evidence from Kabardian, suggests that this may in fact be the normal
way in which class nodes operate: they define sets of terminal nodes that may spread, but do
not actually spread themselves.
The form of rules in the phonetic component is an almost completely
unsettled issue. For this reason, and for lack of data, I have stated the
phonetic rule of Alveolar Weakening schematically as in (10):
(10) Alveolar Weakening
Depending on rate and casualness of speech, lessen the degree of closure for
a COR autosegment, if it is [ — continuant] and syllable-final.
In (11) is a sketchy derivation showing how the rule would apply. We start
with the output of the phonology applying to underlying /kt/, taken from (9).
Next, the phonetic component assigns degree-of-closure targets to the
CORONAL and DORSAL autosegments. Notice that the target for the DORSAL
autosegment extends across two C positions, since this autosegment has
undergone spreading. Finally, the rule of Alveolar Weakening lessens the
degree of closure for the CORONAL target. It applies variably, depending on
speech style and rate, but I have shown just one possible output.
(11) [Derivation diagram: C slots with CORONAL and DORSAL autosegments; degree-of-closure targets are assigned, the DORSAL target extending across two C positions, and Alveolar Weakening lessens the CORONAL closure value]
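Hayes's division of labor, a discrete phonological spreading rule feeding a gradient phonetic weakening rule, can be sketched as a two-stage pipeline. This is a toy illustration under my own assumptions: the dictionary representation, the 0-to-1 closure scale, and the casualness parameter are stand-ins, not Hayes's formalism, and the weakening rule here ignores his syllable-position and [-continuant] conditions.

```python
# Toy two-stage derivation: discrete Place Assimilation (phonological),
# then gradient Alveolar Weakening (phonetic). Closure targets run from
# 0.0 (no constriction) to 1.0 (full closure); all numbers illustrative.

def place_assimilation(c1, c2):
    """Discrete rule: spread the second stop's articulator node leftward,
    so the first C carries both its own CORONAL node and the spread node."""
    c1["articulators"] = c1["articulators"] | c2["articulators"]

def alveolar_weakening(c, casualness):
    """Gradient rule: lessen the coronal closure target by an amount that
    depends on rate/style (0 = maximally careful, 1 = maximally casual)."""
    if "CORONAL" in c["articulators"]:
        c["closure"]["CORONAL"] *= (1 - casualness)

# /t/ + /k/ as in "late calls"
t = {"articulators": {"CORONAL"}, "closure": {"CORONAL": 1.0}}
k = {"articulators": {"DORSAL"},  "closure": {"DORSAL": 1.0}}
place_assimilation(t, k)            # t is now a coronovelar complex segment
alveolar_weakening(t, casualness=0.7)
print(sorted(t["articulators"]), t["closure"]["CORONAL"])
```

Varying the casualness value yields the observed continuum: low values give residual alveolars, and at the maximum the output is a zero alveolar, phonologically assimilated but with no coronal closure.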
This analysis is surely incomplete, and indeed it may turn out to be entirely
wrong. But it does have the virtue of leading us to questions for further
research, especially along the lines of how the rules might be generalized to
other contexts.
To give one example, I have split up what Nolan treats as a single rule into
two distinct processes: a phonological spreading rule, Place Assimilation (8);
plus a phonetic rule, Alveolar Weakening (10). This predicts that in principle,
one rule might apply in the absence of the other. I believe that this is in fact
true. For example, the segment /t/ is often weakened in its articulation even
when no other segment follows. In such cases, the weakened /t/ is usually
"covered" with a simultaneous glottal closure. Just as in Nolan's data, the
degree of weakening is variable, so that with increasingly casual speech we
can get a continuum like the following for what: [wʌt], [wʌtˀ], [wʌʔᵗ], [wʌʔ].
It is not clear yet how Alveolar Weakening should be stated in its full
generality. One possibility is that the alveolar closures that can be weakened
are those that are "covered" by another articulation. A full, accurate
formulation of Alveolar Weakening would require a systematic investigation
of the behavior of syllable-final alveolars in all contexts.
My analysis also raises a question about whether Nolan is right in claiming
that Place Assimilation is a "gradual process." It is clear from his work that
the part of the process I have called Alveolar Weakening is gradual. But what
about the other part, which we might call "Place Assimilation Proper"? In
my analysis, Place Assimilation Proper is predicted to be discrete, since it is
carried out by a phonological rule. The tokens that appear in Nolan's paper
appear to confirm this prediction, which I think would be worth checking
systematically. If the prediction is not confirmed, it is clear that the theory of
phonetic representation will need to be enriched in ways I have not touched
on here.
Another area that deserves investigation is what happens when the
segment that triggers Place Assimilation is itself coronal, as in the dental
fricatives in get Thelma, said three, and ten things. Here, it is impossible to
form a complex segment, since the trigger and the target are on the same tier.
According to my analysis, there are two possible outcomes. If the CORONAL
autosegment on the left is deleted, then the output would be a static dental
target, extending over both segments, as in (12b). But if there is no delinking,
as in (12c), then we would expect the /t/ to become a contour segment, with
the tongue blade sliding from alveolar to dental position.
Comments on chapter 10
JOHN J. OHALA
Nolan's electropalatographic study of lingual assimilation has given us new
insight into the complexities of assimilation, a process which phonologists
thought they knew well but which, the more one delves into it, turns out to
have completely unexpected aspects.
To understand fully place assimilation in heterorganic medial clusters I
think it is necessary to be clear about what kind of process assimilation is. I
presume all phonologists would acknowledge that variation appears in
languages due to "on-line" phonetic processes and due to sound changes. The
former may at one extreme be purely the result of vocal-tract constraints; for
example, Lindblom (1963) made a convincing case for certain vowel-reduction
effects being due to inertial constraints of the vocal mechanism. Sound change,
at the other extreme, may leave purely fossilized variant pronunciations in the
language, e.g., cow and bovine, both from Proto-Indo-European *gwous.
Phonetically caused variation may be continuous and not represent a change
in the instructions for pronunciation driving the vocal tract. Sound change, on
the other hand, yields discrete variants which appear in speech due to one or
the other variant form having different instructions for pronunciation. There
are at least two complicating factors which obscure the picture, however. First,
it is clear that most sound changes develop out of low-level phonetic variation
(Ohala 1974, 1983), so it may be difficult in many cases to differentiate
continuous phonetic variation from discrete variation due to sound change.
Second, although sound changes may no longer be completely active, they can
exhibit varying degrees of productivity if they are extrapolated by speakers to
novel lexical items, derivations, phrases, etc. It was a rather ancient sound
change which gave us the k > s change evident in skeptic ~ skepticism but this
does not prevent some speakers from extending it in novel derivations like
domesticism (with a stem-final [s]) (Ohala 1974).
I think place assimilation of medial heterorganic clusters in English may
very well be present in the language due to a sound change, but one which is
potentially much more productive than velar softening. Nevertheless, its full
implementation could still be discrete. In other words, I think there may be a
huge gap between the faintest version of an alveolar stop in red car and the
fully assimilated version [reg ka:].
Naturally, the same phonetic processes which originally gave rise to the
sound change can still be found in the language, i.e. imperfect articulation of
C1 and thus the weakening of the place cues for that consonant vis-a-vis the
place cues for C2. In the first Laboratory Phonology Conference (Ohala
1990a) I presented results of an experiment that showed that artificial
heterorganic stop + stop and nasal + stop clusters may be heard as
homorganic if the duration of the closures is less than a certain threshold
value; in these cases it was the place of C2 which dominated the percept. In
that case there was no question of the C1's being imperfectly articulated:
rather simply that the place cues for C2 overshadowed those of C1. The
misinterpretation of a heterorganic cluster C1C2 as a homorganic one was
discrete and, I argued, a duplication of the kind of phonetic event which led
to such sound changes as Late Latin okto > Italian otto. I think a similar
reading can be given to the results of Nolan's perceptual test (summarized in
his fig. 10.4). For the residual-alveolar tokens, slightly less than half of all
listeners (linguists and nonlinguists combined) identified the word as having
an alveolar C1. I am not saying that this discrete identification change is the
discrete phonological process underlying English place assimilation, just that
it mirrors the process that gave rise to it.
Another example may clarify my point. No doubt, most would agree that
the alternation between /t/ and /tʃ/ as in act and actual ([ækt], [æktʃuəl]) is due to
a sound change. (That most speakers do not "derive" actual from act is
suggested by their surprise or amusement on being told that the two are
historically related.) Nevertheless, we can still find the purely phonetic
processes which gave rise to such a change by an examination of the acoustic
properties of [t]s released before palatal glides or even palatal vowels: in
comparison to the releases before other, more open, vowels, these releases are
intense and noisy in a way that mimics a sibilant fricative. No doubt the t→tʃ
sound change arose due to listeners misinterpreting and thus rearticulating
these noisily released [t]s as sibilant affricates (Ohala 1989). What we find in a
synchronic phonetic analysis may very well be the "seeds" of sound change but
it was the past "germination" of such seeds which gave rise to the current
discrete alternations in the language; the presence of the affricate in actual is
not there due to any change per se being perpetrated by today's speakers or
listeners.
Comments on chapter 10
CATHERINE P. BROWMAN
Nolan's study presents articulatory evidence indicating that assimilation is
not an all-or-none process. It also presents perceptual evidence indicating
that lexical items containing alveolars can be distinguished from control
[Figure: gestural-score diagrams with tongue-tip (TT) and tongue-body (TB) gestures, labeled Partial and Complete, panels (a), (b), (c); not reproduced]
11
Psychology and the segment
ANNE CUTLER
Something very like the segment must be involved in the mental operations
by which human language users speak and understand.* Both processes -
production and perception - involve translation between stored mental
representations and peripheral processes. The stored representations must be
both abstract and discrete.
The necessity for abstractness arises from the extreme variability to which
speech signals are subject, combined with the finite storage capacity of
human memory systems. The problem is perhaps worst on the perceiver's
side; it is no exaggeration to say that even two productions of the same
utterance by the same speaker speaking on the same occasion at the same
rate will not be completely identical. And within-speaker variability is tiny
compared to the enormous variability across speakers and across situations.
Speakers differ widely in the length and shape of their vocal tracts, as a
function of age, sex, and other physical characteristics; productions of a
given sound by a large adult male and by a small child have little in common.
Situation-specific variations include the speaker's current physiological state;
the voice can change when the speaker is tired, for instance, or as a result of
temporary changes in vocal-tract shape such as a swollen or anaesthetized
mouth, a pipe clenched between the teeth, or a mouthful of food. Other
situational variables include distance between speaker and hearer, interven-
ing barriers, and background noise. On top of this there is also the variability
due to speakers' accents or dialects; and finally, yet more variability arises
due to speech style, or register, and (often related to this) speech rate.
But the variability problem also exists in speech production; we all vary
our speech style and rate, we can choose to whisper or to shout, and the
*This paper was prepared as an overall commentary on the contributions dealing with segmental
representation and assimilation, and was presented in that form at the conference.
accomplished actors among us can mimic accents and dialects and even
vocal-tract parameters which are not our own. All such variation means that
the peripheral processes of articulation stand in a many-to-one relationship
to what is uttered in just the same way as the peripheral processes of
perception do.
If the lexicon were to store an exact acoustic and articulatory represen-
tation for every possible form in which a given lexical unit might be heard or
spoken, it would need infinite storage capacity. But our brains simply do not
have infinite storage capacity. It is clear, therefore, that the memory
representations of language which we engage when we hear or produce
speech must be in a relatively abstract (or normalized) form.
The necessity for discreteness also arises from the finite storage capacity of
our processing systems. Quite apart from the infinite range of situational and
speaker-related variables affecting how an utterance is spoken, the set of
potential complete utterances themselves is also infinite. A lexicon - that is,
the stored set of meaning representations - just cannot include every
utterance a language user might some day speak or hear; what is in the lexicon
must be discrete units which are smaller than whole utterances. Roughly, but
not necessarily exactly, lexical representations will be equivalent to words.
Speech production and perception involve a process of translation between
these lexical units and the peripheral input and output representations.
Whether this process of translation in turn involves a level of representation
in terms of discrete sublexical units is an issue which psycholinguists have
long debated.
Arguments in favor of sublexical representations have been made on the
basis of evidence both from perception and from production. In speech
perception, it is primarily the problem of segmentation which has motivated
the argument that prelexical classification of speech signals into some sub-
word-level representation would be advantageous. Understanding a spoken
utterance requires locating in the lexicon the individual discrete lexical units
which make up the utterance, but the boundaries between such units - i.e. the
boundaries between words - are not reliably signaled in most utterances;
continuous speech is just that - continuous. There is no doubt that a
sublexical representation would help with this problem, because, instead of
being faced with an infinity of points at which a new word might potentially
commence, a recognizer can deal with a string of discrete units which offer
the possibility of a new word beginning only at those points where a new
member of this set of sublexical units begins.
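The segmentation advantage can be made concrete with a toy parser: once the signal is classified into discrete sublexical units, a candidate word can begin only at a unit boundary, so the search space is finite. A minimal sketch (the `parses` function and the mini-lexicon of phoneme tuples are invented for illustration):

```python
def parses(phones, lexicon):
    """Return every way of covering `phones` with lexicon entries.

    Words may start only at unit boundaries, so the recognizer faces a
    finite set of hypotheses -- unlike a raw continuous signal, where
    any instant could in principle begin a word.
    """
    if not phones:
        return [[]]
    out = []
    for i in range(1, len(phones) + 1):
        head = tuple(phones[:i])
        if head in lexicon:
            for rest in parses(phones[i:], lexicon):
                out.append([head] + rest)
    return out

# Invented mini-lexicon: "cat", "s", "ca", "ts" as phoneme tuples.
lexicon = {("k", "ae", "t"), ("s",), ("k", "ae"), ("t", "s")}
result = parses(["k", "ae", "t", "s"], lexicon)
# two competing segmentations survive: cat+s and ca+ts
```

The point of the sketch is only that the ambiguity is enumerable; choosing among the surviving segmentations is the separate problem of lexical competition.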
Secondly, arguments from speech perception have pointed out that the
greatest advantage of a sublexical representation is that the set of potential
units can be very much smaller than the set of units in the lexicon. However
large and heterogeneous the lexical stock (and adult vocabularies run into
12
Trading relations in the perception of stops and their
implications for a phonological theory
LIESELOTTE SCHIEFER
12.1 Introduction
All feature sets used in the description of various languages are either
phonetically or phonemically motivated. The phonetically motivated
features are based on acoustic, articulatory, or physiological facts, whereas
phonemically motivated features take into account, for example, the
comparability of speech sounds with respect to certain phonemic and/or phonotactic
rules. But even if the authors of feature sets agree on the importance of
having phonetically adequate features, they disagree on the selection of
features to be used in the description of a given speech sample.
This is the situation with the description of Hindi stops. Hindi has a
complicated system of four stop classes (voiceless unaspirated, voiceless
aspirated, voiced, and breathy voiced, traditionally called voiced aspirated)
in four places of articulation (labial, dental, retroflex, and velar),
plus a full set of four affricates in the palatal region. Since Chomsky
and Halle (1968), several feature sets have been put forward in order
to account for the complexity of the Hindi stop system (Halle and
Stevens 1971; Ladefoged 1971; M. Ohala 1979; Schiefer 1984). The feature
sets proposed by these authors have in common that they make use only of
physiologically motivated features such as "slack vocal cords" (see table
12.1).
In what follows, we concentrate on the features proposed by Ladefoged,
Ohala, and Schiefer. These authors differ not only according to their feature
sets but also with respect to the way these features are applied to the stop
classes. Ladefoged groups together (a) the voiceless unaspirated and
voiceless aspirated stops by assigning them the value "0" of the feature "glottal
stricture," and (b) the breathy voiced and voiceless aspirated stops by giving
them the value "2" of "voice-onset time." But he does not consider voiced
Table 12.1 Feature sets proposed by Chomsky and Halle (1968), Halle and
Stevens (1971), Ladefoged (1971), M. Ohala (1979), and Schiefer (1984)

                            p    ph   b    bh
1 Chomsky and Halle (1968)  [rows illegible in source]

2 Halle and Stevens (1971)
  spread glottis            -    +    -    +
  constricted glottis       -    -    -    -
  stiff vocal cords         +    +    -    -
  slack vocal cords         -    -    +    +

3 Ladefoged (1971)
  glottal stricture         0    0    2    1
  voice-onset time          1    2    0    2

4 M. Ohala (1979)           p    ph   b    bh   c
  distinctive release       -    +    -    +    +
  delayed release           -    -    -    -    +
  voice-onset time          1    2    0    0
  glottal stricture         0    0    2    1
  vocal-cord tension        -    -    -    +

5 Schiefer (1984)
  distinctive release       -    +    -    +
  delayed release           -    -    -    -
  voice-onset time          1    2    0    2
  vocal-cord tension        -    -    -    +
and breathy voiced stops to form a natural class, since he assigns "0" for the
feature "voice-onset time" to the voiced stop.
Ohala uses four features (plus "delayed release"). The feature "distinctive
release" is used to distinguish the "nonaspirates," voiced and voiceless
unaspirated, from voiceless aspirated, breathy voiced, and the affricates. The
feature "voice-onset time" is used to build a natural class with the voiced and
breathy voiced stops, whereas "glottal stricture" allows her both to build a
class with voiceless unaspirated and voiceless aspirated stops, and to account
for the difference in the mode of vibration between voiced and breathy voiced
stops. Finally, she needs the feature "vocal-cord tension" "to distinguish the
voiced aspirates from all other stop types also," as "It is obvious that where
the voiced aspirate stop in Punjabi has become deaspirated, a low rising tone
on the following vowel has developed" (1979: 80).
Schiefer (1984) used the features "distinctive release," "delayed release,"
and "vocal-cord tension" in the same way as Ohala did, but followed
Ladefoged in the use of the feature "voice-onset time." She rejected the
feature "glottal stricture" on grounds which will be discussed later.
It is apparent that Ohala's analysis is based on phonetic as well as
phonemic arguments. She uses phonetic arguments in order to reject the
feature "heightened subglottal air pressure" of Chomsky and Halle (1968)
and to favor the features "glottal stricture" and "voice-onset time," which
she takes over from Ladefoged (1971). On the other hand, she makes use of a
phonemic argument in favor of the feature "delayed release," as "in Maithili
there is a rule which involves the de-aspiration of aspirates when they occur
in non-utterance-initial syllables, followed by a short vowel and either a
voiceless aspirate or a voiceless fricative" (1979: 80). Moreover, in contrast to
Ladefoged and Schiefer, Ohala assigns "onset of voicing" to the feature
"voice-onset time." This implies that the feature "voice-onset time" is
applied to two different acoustic (and physiological) portions of the breathy
voiced stop. That is, since Ohala treats the breathy voiced stop in the same
way as the voiced one, her feature applies to the prevoicing of the stop,
whereas Ladefoged and Schiefer apply the feature to the release of the stop.
As already mentioned, Ladefoged, Ohala, and Schiefer all use physiologi-
cal features, and their investigations are based on relatively small amounts of
physiological or acoustic data. None of these authors relies on perceptual
results - something which is characteristic of other phonological work as well
(see Anderson 1978). The aim of the present paper is therefore to use both an
acoustic analysis (based on the productions of four informants) and results
from perceptual tests as a source of evidence for specific features. Since I lack
physiological data of my own, I will concentrate especially on the feature
"voice-onset time." In doing so, I start from the following considerations. (a)
If Ohala is right in grouping the voiced and breathy voiced stops together
into one natural class by the feature "voice-onset time," then this phonetic
feature, namely prevoicing, is a necessary feature in the production as well as
in the perception of these stops. Otherwise the function of prevoicing differs
between the two stop classes. (b) If the main acoustic (as well as perceptual)
information about a breathy voiced stop is located in the release portion, the
influence of prevoicing in the perception of this stop should be less
important.
12.2.2 Results
Only those results which are relevant for this paper will be presented here;
further details can be found in Schiefer (1986, 1987, 1988, 1989).
Hindi breathy voiced stops can be defined as consisting of three acoustic
portions: the prevoicing during the stop closure ("voicing lead"), the release
of the stop or the burst, and a breathy voiced vowel following the burst,
traditionally called "voiced aspiration." This pattern can be modified under
various conditions: the voicing lead can be missing, and/or the breathy
voiced vowel can be replaced by a voiceless one (or voiceless aspiration).
From this it follows that four different realizations of Hindi breathy voiced
stops occur: the "lead" type, which is the regular one, having voicing during
the stop closure; and the "lag" type, which lacks the voicing lead. Both types
have two different subtypes: the burst can be followed either by a breathy
voiced vowel or by voiceless aspiration, which might be called the "voiceless"
type of breathy voiced stop.
Figure 12.1 displays three typical examples from R.P.J. Figures 12.1a
Figure 12.1 Oscillograms of three realizations of breathy voiced stops from R.P.J.: (a) /bhogna/,
(b) /bhav/, (c) /ghiya/
(/bhogna/) and 12.1b (/bhav/) represent the "lead" type of stop having regular
prevoicing throughout the stop closure, a more (fig. 12.1b) or less
(fig. 12.1a) pronounced burst, and a breathy voiced vowel portion following
the burst. Note that, notwithstanding the difference in the degree of
"aspiration" (to borrow the term preferred by Dixit 1987) between both
examples, the vocal cords remain vibrating throughout. Figure 12.1c (/ghiya/)
gives an example of the "voiceless" type. Here again the vocal cords are
vibrating throughout the stop closure. But unlike the regular "lead" type, the
burst is followed by a period of voiceless instead of voiced aspiration.¹
The actual realization of the stop depends on the speaker and on
articulatory facts. The "lag" type of stop is both speaker-dependent and
articulatorily motivated, whereas the "voiceless" type is solely articulatorily
determined. My informants differed extremely with respect to the voicing
¹ The results just mentioned caused me in a previous paper (Schiefer 1984) to reject the feature
"glottal stricture," as it is applicable to the regular type of the breathy voiced stop only.
Table 12.2 Occurrence of the voicing lead in breathy
voiced stops (in percent)

             a    e    i    o    u    x̄
(a) P.U.N.
Labial      100   93   87   82   71   87
Dental       85   —    93  100   94   93
Retroflex   100  100   80  100  100   96
Velar       100  100  100   88   92   96
x̄            96   97   90   92   89

(b) S.W.A.
Labial       60   78   78   96   70   76
Dental       85   —   100  100   95   95
Retroflex    91  100   93  100  100   96
Velar        93  100  100   90   90   94
x̄            82   92   93   96   89
lead. Out of four informants only two had overwhelmingly "lead" realiza-
tions: M.A.N., who produced "lead" stops throughout, and R.P.J., in whose
productions the voicing lead was absent only twice. The data for the other
two informants, P.U.N. and S.W.A., are presented in table 12.2 for the
different stop-vowel sequences, as well as the mean values for the places of
articulation and the vowels. The values are rounded to the nearest integer.
P.U.N. and especially S.W.A. show a severe influence of place of articulation
and vowel on the voicing lead, which is omitted especially in the labial place
of articulation by both of them and when followed by the vowel /a/ in
S.W.A.'s productions. This is interesting, as it is usually believed that
prevoicing is most easily sustained in these two phonetic conditions (Ohala
and Riordan 1980). The "voiceless" type usually occurs in the velar place of
articulation and/or before high vowels, especially before /i/ (see for detail
Schiefer 1984).
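The marginal means in table 12.2 can be checked directly from the cell values. Since the printed cells are already rounded to the nearest integer, recomputed column means can drift a point from those printed, so the sketch below verifies only the row (place-of-articulation) means; the `pun` dictionary simply transcribes P.U.N.'s cells, with the missing dental/e cell left out:

```python
# P.U.N.'s voicing-lead percentages from table 12.2(a);
# the dental /e/ cell is absent in the original and is omitted.
pun = {
    "Labial":    [100, 93, 87, 82, 71],
    "Dental":    [85, 93, 100, 94],
    "Retroflex": [100, 100, 80, 100, 100],
    "Velar":     [100, 100, 100, 88, 92],
}

# Row means, rounded to the nearest integer as in the table.
place_means = {place: round(sum(v) / len(v)) for place, v in pun.items()}
```

Running this reproduces the printed marginals: 87 (labial), 93 (dental), 96 (retroflex), and 96 (velar).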
The average duration values for the acoustic portions for all informants
are given in figure 12.2 for the different places of articulation. It is interesting
to note that, notwithstanding the articulatorily conditioned differences
within the single acoustic portions, the duration of all portions together (i.e.
voicing lead + burst + breathy voiced vowel) differs only minimally, except
the values for /bh/ in S.W.A.'s productions. This points to a tendency in all
subjects to keep the overall duration of all portions the same or nearly the
same in all environments (see for detail Schiefer 1989). As the words were
read in citation form and not within a sentence frame, it is uncertain whether
these results mirror a general articulatory behavior of the subjects or whether
they are artifacts of the recording procedure.
Figure 12.2 Durations of voicing lead (VLD), burst, and breathy voiced vowel (BRED)
Figure 12.3 (a) displays the spectrogram of /bhalu/ - the beginning of the breathy
voiced vowel portion is marked by a cursor; (b) gives the oscillogram for the vowel immediately
following the cursor position in (a); and (c) shows the power spectrum, calculated over
approximately 67.5 msec. from cursor position to the right
Figure 12.4 The beginning of the steady vowel is marked by a cursor in (a); (b): oscillogram
from cursor to the right; (c): power spectrum calculated over 67.5 msec. of the clear vowel
Table 12.3 Scheme for the reduction of the breathy voiced portion ("PP" =
pitch period; "—" = pitch periods deleted; "+" = remaining pitch periods)
voiced vowel portion (see fig. 12.3c) compared with the steady one (see fig.
12.4c), which is one of the most efficient acoustic features in the perception of
breathy voiced phonation (Bickley 1982; Ladefoged and Antonanzas-Baroso
1985; Schiefer 1988).
The method used for manipulation was that of speech-editing. The first
syllable was separated from the rest of the word in order to avoid
uncontrollable influences from context. The syllable was delimited before the transition
into /I/ was audible, and cut into single pitch periods. The first pitch period,
which showed clear frication, was defined as "burst" and was not subjected
to manipulation. The breathy voiced portion of the vowel was separated
from the clear vowel portion by inspection of the oscillogram combined with
an auditory check. The boundary between both portions is marked in figure
12.4a. Note that the difference between the amplitude of the fundamental
and that of the second harmonic in the clear vowel is small and thus
resembles that of modal voice (fig. 12.4c). The following portion of eighteen
pitch periods was divided into three parts containing six pitch periods each.
A so-called "basic-continuum," consisting of seven stimuli, was generated by
reducing the duration of the breathy voiced vowel portion in steps of three
pitch periods (approximately 15 msec). The pitch periods were chosen from
the three subportions of the breathy voiced vowel by applying the scheme
shown in table 12.3.
From this basic continuum eight tests were derived, where, in addition to
the manipulation of the breathy voiced portion, the duration of the voicing
lead was reduced by approximately 10 msec. (two pitch periods) each. Tests
1-8 thus covered the range of 79.55 to 0 msec. voicing lead.
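The stimulus grid just described can be sketched as follows. The pitch-period step (~5 msec. per period, three periods per continuum step) and the eight voicing-lead values are taken from the text and the figure legends; the variable names are mine:

```python
# Basic continuum: the 18-pitch-period breathy voiced portion is
# shortened in steps of 3 pitch periods (~15 msec.), giving 7 stimuli.
PERIODS = 18
STEP = 3
basic = [PERIODS - STEP * i for i in range(7)]   # pitch periods kept: 18 .. 0

# Eight tests, each pairing the full basic continuum with one
# voicing-lead duration (msec.), as reported for tests 1-8.
vld_values = [79.55, 58.65, 50.80, 41.9, 32.9, 22.5, 10.4, 0]

# Every (voicing lead, remaining breathy portion) stimulus presented.
stimuli = [(vld, kept) for vld in vld_values for kept in basic]
```

The grid makes the design explicit: 8 tests × 7 stimuli = 56 distinct stimulus types, each judged in a four-way forced-choice identification task.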
The continua were tested in an identification task in which every stimulus
was repeated five times and presented in randomized order. All stimuli were
followed by a pause of 3.5 sec. while blocks of ten stimuli were separated by
10 sec. Answer sheets (in ordinary Hindi script) required listeners to choose
whether each stimulus was voiceless unaspirated, voiceless aspirated, voiced,
or breathy voiced (forced choice).
All tests were carried out in the Telefunken language laboratory of the
Centre of German Studies, School of Languages, of the Jawaharlal Nehru
University in New Delhi and were presented over headphones at a comfor-
table level. The twelve subjects were either students or staff members of the
Centre. They were paid for their participation (see for details Schiefer 1986).
12.3.1.2 Results
The results are plotted separately for the different response categories in
figures 12.5-12.8. The ordinate displays the identification ratios in percent,
the abscissa the duration of burst and breathy voiced vowel portion in
milliseconds for the single stimuli. Figure 12.5 shows that stimuli 1-4 elicit
breathy voiced responses in tests 1-5 (voicing lead (VLD) = 79.55 msec. to
32.9 msec.). In test 6 (VLD = 22.5 msec.) only the first three stimuli of the
continuum elicited breathy voiced answers, whereas in tests 7 (VLD = 10.4
msec.) and 8 (no VLD) none of the stimuli was identified unambiguously as
breathy voiced. Thus the shortening of the voicing lead does not affect the
breathy voiced responses until the duration of this portion drops below
20 msec. in duration.
Stimuli 5-7 of tests 1-4 (VLD = 79.55 msec. to 41.9 msec.) were judged as
voiced (see fig. 12.6). In test 5 (VLD = 32.9 msec.), on the other hand, only
stimulus 6 was unambiguously perceived as voiced, the responses to stimulus
7 were at chance level. In tests 6-8 no stimulus was perceived as voiced. This
means that the lower limit for the perception of a voiced stop lies at 32.9
msec. voicing lead. If the duration of the voicing lead drops below that value,
voiceless unaspirated responses are given, as shown in figure 12.7.
In comparing the perceptual results for the voiced and breathy voiced
category it is obvious that voiced responses require a longer prevoicing
than breathy voiced ones. The shortening of both portions, voicing lead and
breathy voiced, leads to the perception of a voiceless aspirated stop (fig.
12.8).
The perception of voiceless aspirated stops is the most interesting outcome
of this experiment. One may argue that the perception of stop categories (like
voiced and aspirated) simply depends on the perceptibility of the voicing lead
and the amount of frication noise following the release. If this were true,
voiceless aspirated responses should have been given to stimuli 1-4 in tests
6-8, since in these tests the stimuli with a short breathy voiced portion,
stimuli 5-7, were judged as voiceless unaspirated, which implies that the
voicing lead was not perceptible. But it is obvious that at least in test 6
breathy voiced instead of voiceless aspirated responses were elicited. In all
tests, the voiceless aspirated function reaches its maximum in the center of
the continuum (stimulus 4), not at the beginning.
On the other hand it seems that the perception of a voiceless aspirated stop
[Figures 12.5-12.8 Identification functions for the breathy voiced, voiced, voiceless unaspirated, and voiceless aspirated response categories in tests 1-8 (VLD = 79.55 to 0 msec.)]
12.3.2 Experiment 2
In a second experiment we tried to replicate the results of the first one by
using a rather irregular example of a bilabial breathy voiced stop for
manipulation. The original stimulus (/bhola/) consisted of voicing lead (92.75
msec.), a burst (11.1 msec.), a period of voiceless aspiration (21.9 msec.), and
a breathy voiced portion (119.9 msec.) followed by the clear vowel. It should
be mentioned that the degree of aspiration in the breathy voiced portion was
less than in the first experiment. The test stimuli were derived from the
original one by deleting the period of voiceless aspiration for all stimuli and
otherwise following the same procedure as described in experiment 1. Thus, 8
test stimuli were obtained, forming the basic continuum. A series of five tests
was generated from the basic continuum by reducing the duration of the
voicing lead from 92.75 msec (test 1) to 37.4 msec (test 2), and then
eliminating two pitch periods each for the remaining three tests. The same
testing procedure was applied as in experiment 1.
The results for tests 1-3 resemble those of experiment 1. The continuum is
divided into two categories: breathy voiced responses are given to stimuli 1-5,
and voiced ones to stimuli 6-8 (see fig. 12.9). Only stimulus 8 cannot be
unambiguously assigned to the voiced or voiceless unaspirated category in test
3. Tests 4 and 5 produced comparable results to tests 7 and 8 of experiment 1:
there is an increase in voiceless aspirated responses for stimuli 4-6 (see fig.
12.10).
In comparing the results from the two experiments, it is obvious that the
main difference between them concerns stimuli 1-3 in those tests in which the
duration of the voicing lead drops below about 20 msec. (tests 7 and 8 in
experiment 1 and tests 4 and 5 in experiment 2). Whereas in experiment 1 these
stimuli are ambiguous, they are clearly identified as breathy voiced in
experiment 2. This result is directly connected with the "acoustic content" of
the stimuli, i.e. the greater amount of "aspiration" in experiment 1 and a lesser
one in experiment 2. On the other hand the experiments are comparable as to
the rise of voiceless aspirated answers in the center of the continuum, which
cannot be explained by the acoustic content, as the degree of aspiration
degrades with the reduction of the breathy voiced portion. This result can be
explained when the duration of the unshortened breathy voiced portion is
taken into account: it appears that it exceeds that of the first experiment by
about 40 msec.
12.4 Discussion
[Figures 12.9 and 12.10 Identification results for experiment 2 (VLD = 92.75, 37.4, 22.15, and 0 msec.)]
results neither support nor refute M. Ohala's (1979) view that prevoicing is a
relevant feature of Hindi breathy voiced stops.
The results from the perception tests are even more difficult to interpret.
Several outcomes have to be discussed in detail.
There is a clear tendency to divide the continua into two stop classes,
breathy voiced and voiced, if the prevoicing is of sufficient duration. If the
duration drops below 32.9 msec. (experiment 1) and the breathy voiced
portion is reduced to about 40 msec., a voiceless unaspirated instead of a
voiced stop is perceived. On the other hand, a breathy voiced stop is heard
as long as the duration of the voicing lead does not drop below 22.5 msec.,
and that of the breathy voiced vowel does not become shorter than about
30-40 msec. Otherwise, a voiceless aspirated stop is perceived. These results
point to some main differences in the perception of voiced and breathy
voiced stops: (a) Hindi stops are perceived as voiced only if they are pro-
duced with voicing lead of a sufficient duration (about 30 msec). There are
no trading relations between the voicing lead and the duration of the
breathy voiced vowel portion. (b) Breathy voiced stops are perceived even if
the duration of the voicing lead approaches the threshold of perceptibility,
as can be concluded from the perception of either voiced or voiceless
aspirated stops. If the voicing lead is eliminated totally the responses
to the stimuli depend on the duration of the breathy voiced portion: if it
is of moderate duration only (79.55 msec., experiment 1) the first two
stimuli of the continuum cannot be unambiguously identified; if it is long,
as in experiment 2 (131 msec), the stimuli are judged as breathy voiced.
When both the breathy voiced vowel and the voicing lead are short, we
find voiceless aspirated responses. These results provide evidence that
the voicing lead is less important in the perception of breathy voiced
stops than of voiced ones, and that trading relations exist between
the duration of the voicing lead and that of the breathy voiced vowel
portion.
Thus the perception of a breathy voiced stop does not depend solely
either on the perceptibility of the voicing lead or on the duration of the
breathy voiced vowel portion. It is rather subject to the overall duration
of voicing lead + burst + breathy voiced vowel portion. If the duration
of this portion drops below a given value the perceived stop category
changes.
From these puzzling results we can conclude that neither the perception of
breathy voiced stops nor of voiceless aspirated stops can be explained solely
by the acoustic content of the stimuli. Listeners seem to perceive the
underlying glottal gesture of the stimuli, which they clearly separate from a
devoicing gesture: they respond with breathy voiced if the gesture exceeds
about 60 msec., whereas they respond with voiceless aspirated if the duration
drops below that limit. This hypothesis is directly supported by results from
physiological investigations, where it was shown that the duration of the
glottal gesture for a breathy voiced stop is about double that of a voiceless
aspirated one (Benguerel and Bhatia 1980).
Finally, let us turn to the interpretation of the feature "voice-onset time." It
must be stated that our results support neither Ohala's concept nor that of
Ladefoged and Schiefer, since what is important in the perception of breathy
voiced and voiceless aspirated stops is not the voicing lead by itself but the
trading relation with the breathy voiced portion. On the other hand, from our
results it can be concluded that the Hindi stops form three natural classes. (a)
The voiced and voiceless unaspirated stops form one class, as they are
perceived only if the duration of burst and breathy voiced portion is shorter
than 30-40 msec., i.e. if the burst is immediately followed by a regularly voiced
vowel. This result is comparable with those from voice-onset-time exper-
iments, which showed that the duration of the voicing lag in voiceless
unaspirated stops rarely exceeds 30 msec. (b) Breathy voiced and voiceless
aspirated stops form one class according to the release portion of the stop,
which is either a breathy voiced or voiceless vowel, whose duration has to be
longer than about 44 msec. in breathy voiced stops and has to be 32-65 msec. in
voiceless aspirated stops. Both stops show trading relations between the
voicing lead and the breathy voiced portion. (c) Voiceless unaspirated and
voiceless aspirated stops can be grouped together with regard to their voicing
lead duration which has to be shorter than 32.9 msec. in both stops. (d)
Obviously, voiced and breathy voiced stops do not form a natural class: voiced
stops need a longer voicing lead than breathy voiced ones (32.9 vs. 22.5 msec.)
and voiced stops do not show any trading relations between voicing lead and
the breathy voiced portion, whereas breathy voiced stops do.
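The thresholds and the trading relation summarized above can be collected into a toy decision rule. This is only a rough approximation of the reported results, not a model the paper implements: the function name is mine, the cutoffs are rounded from the perception data, and the 60 msec. overall-gesture criterion is the one proposed in the discussion.

```python
def classify(vld_ms, release_ms):
    """Toy four-way stop classification from two durations (msec.).

    vld_ms     -- voicing lead (prevoicing) duration
    release_ms -- burst + breathy voiced vowel portion

    A short (plain) release splits voiced vs. voiceless unaspirated on
    the ~33 msec. lead criterion alone (no trading relation); a long
    (aspirated) release splits breathy voiced vs. voiceless aspirated
    on the overall duration of the glottal gesture (~60 msec.), which
    captures the trading relation between lead and breathy portion.
    """
    if release_ms < 35:
        return "voiced" if vld_ms >= 33 else "voiceless unaspirated"
    return "breathy voiced" if vld_ms + release_ms >= 60 else "voiceless aspirated"
```

On these approximate rules, a stimulus with no voicing lead but a long breathy portion (as in experiment 2) still comes out breathy voiced, while shortening both portions yields voiceless aspirated, in line with the pattern described above.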
All the results of the experiments can be summarized in two main points:
we must distinguish between two acoustic portions, the stop closure and the
stop release; and the perception of stops depends on the duration of the
whole glottal gesture underlying the production of the stop.
                          p    ph   b    bh
Lead onset time           0    0    2    1
Onset of regular voicing  0    2    0    2
breathy voiced stop. In assigning "1" to the breathy voiced stop we take
account of the fact that the voicing lead of breathy voiced stops is shorter
than that of voiced ones (cf. Schiefer 1988) or may even be missing altogether
and that the voicing lead is less important in the perception of breathy voiced
stops than in voiced ones. The way we apply the feature "onset of regular
voicing" to the stops differs from Ladefoged, Ohala, or Schiefer (1984), as we
now group together the voiceless unaspirated and the voiced stops by
assigning them the same value, namely "0". In doing this we account for the
similarity in the perception of both stops: they are perceived if the duration of
the breathy voiced portion is shorter than about 30 msec. In assigning "2" of
the feature "onset of regular voicing" to the voiceless aspirated and breathy
voiced stops we take into consideration the similarity of these stops in
the perception experiments. On the other hand, we do not specify the
acoustic nature of this portion. Thus, this portion may be characterized
by either a voiceless or a breathy voiced vowel portion in the case of the
breathy voiced stop, i.e. may represent a regular or a "voiceless" type of the
breathy voiced stops. Therefore, our feature specification is not restricted
to the representation of the regular type of the stop but is applicable to
all stop types mentioned above. The present feature set allows us to
group the stops together in natural classes by (a) assigning the same value
of the feature "lead onset time," "0," to the voiceless unaspirated and
voiceless aspirated stops; (b) assigning the same value, "0," to the
voiceless unaspirated and voiced stops; and (c) assigning "2" of the feature
"onset of regular voicing" to the voiceless aspirated and breathy voiced
stops.
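The proposed two-feature specification and the natural classes it yields can be written out as a small feature matrix. The dictionary-and-set representation below is mine; the values are those of the table above:

```python
# Proposed feature values for the Hindi stop series.
FEATURES = {
    "p":  {"lead onset time": 0, "onset of regular voicing": 0},
    "ph": {"lead onset time": 0, "onset of regular voicing": 2},
    "b":  {"lead onset time": 2, "onset of regular voicing": 0},
    "bh": {"lead onset time": 1, "onset of regular voicing": 2},
}

def natural_class(feature, value):
    """All stops sharing `value` for `feature`."""
    return {s for s, f in FEATURES.items() if f[feature] == value}
```

Querying the matrix recovers exactly the classes (a)-(c): "lead onset time" = 0 groups /p/ with /ph/, "onset of regular voicing" = 0 groups /p/ with /b/, and "onset of regular voicing" = 2 groups /ph/ with /bh/; no shared value groups /b/ with /bh/, matching point (d).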
In summary: we have used the results of acoustical and perceptual
analysis, as well as the comparison of these results, to set up a feature
specification for the Hindi stops which is based on purely phonetic grounds.
Thus we are able to avoid a mixture of phonetic and phonemic arguments or
to rely on evidence from other languages. In particular, we have shown that
the addition of perceptual results helps us to form natural stop classes which
seem to be psychologically real. It is to be hoped that these results will
encourage phonologists not only to rely on articulatory, acoustic, and
physiological facts and experiments but to take into account perceptual
results as well.
Comments on chapter 12
ELISABETH SELKIRK
The Hindi stop series includes voiceless and voiced stops, e.g. /p/, /b/, and
what have been typically described (see e.g. Bloomfield 1933) as the aspirated
counterparts of these, /ph/ and /bh/. The transcription of the four-way contrast
in the Indo-Aryan stop system, as well as the descriptive terminology
accompanying it, amounts to a "naive" theory of the phonological features
involved in representing the distinctions, and hence of the natural class into
which the sounds fall. The presence/absence of the diacritic h can be taken to
stand for the presence/absence of a feature of "aspiration," and the b/p
contrast taken to indicate the presence/absence of a feature of "voice." In
other words, according to the naive theory there are two features
which cross-classify and together represent the distinctions among the Hindi
stops, as shown in (1).
(1) Naive theory of the Hindi stop features
              /p/   /ph/   /b/   /bh/
"Aspiration"   -     +      -     +
"Voice"        -     -      +     +
For clarity's sake I have used "+" to indicate a positive specification for
the features and "-" to indicate the absence of a positive specification,
though the naive theory presumably makes no commitment on the means of
representing the binary contrast for each feature.
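The cross-classification in (1) can be made concrete with a small sketch; the dict encoding and function name below are my own illustration, with feature names taken from the chart:

```python
# The "naive theory" in (1): two binary features cross-classifying the
# four Hindi stops. The encoding is illustrative, not from the text.
NAIVE = {
    "p":  {"aspiration": False, "voice": False},
    "ph": {"aspiration": True,  "voice": False},
    "b":  {"aspiration": False, "voice": True},
    "bh": {"aspiration": True,  "voice": True},
}

def natural_class(feature, value):
    """All stops sharing the given value of the given feature."""
    return sorted(s for s, f in NAIVE.items() if f[feature] == value)

# Each feature value picks out a two-member natural class.
assert natural_class("aspiration", True) == ["bh", "ph"]
assert natural_class("voice", True) == ["b", "bh"]
```

The point of the encoding is simply that two cross-classifying binary features yield exactly the four stops and their two-by-two natural classes.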
The object of Schiefer's study is to provide acoustic and perceptual
evidence bearing on the featural classification of the Hindi stops. Schiefer
interprets "aspiration" as a delay in the onset of normal voicing in vowels,
joining Catford (1977) in this characterization of the feature. "Voice" for
Schiefer corresponds to a glottal gesture (see Browman and Goldstein 1986)
producing voicing which typically precedes the release of the stop, and which
may carry over into the postrelease "aspirated" portion in the case of voiced
aspirated stops. The name given to the feature - "lead onset time"-is
perhaps unfortunate in failing to indicate that what is at issue is the phasing
of the "voice" gesture before, through, and after the release of the stop.
Indeed, the most interesting result of Schiefer's study concerns the phasing of
this voicing gesture. In plain unaspirated voiced stops voicing always
precedes the stop release. In aspirated voiced stops, by contrast, there is a
considerable variability amongst speakers in whether voicing precedes the
release at all, and by how much, and variability too in whether voicing
continues through to the end of the postrelease aspirated portion of the
sound. Schiefer observes a tendency in all subjects to keep constant the
overall duration of the voicing (prevoicing, burst, and postvoicing) in the
voiced aspirates, while the point of onset of the voicing before the stop burst
might vary. This trading relation between prevoicing and postvoicing plays a
role in perception as well: the overall length of the voiced period, indepen-
dent of the precise timing of its realization with respect to the burst, appears
to be the relevant factor in the perception of voiced aspirate stops. Schiefer
proposes, therefore, that prevoicing and postvoicing are part of the same
gesture, in the Browman and Goldstein sense. Thus, while both the voiced
aspirated and plain voiced (unaspirated) stops have in common the presence
of this voicing gesture, named by the feature "lead onset time," they are
distinguished in the details of the timing, or phasing, of the gesture with
respect to the release.
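The trading relation can be caricatured in a couple of lines; the threshold and durations below are invented purely for illustration, on the assumption that only the total voiced period matters:

```python
# Toy model of the trading relation between prevoicing and postvoicing:
# classification depends on the total voiced duration, not on its phasing
# relative to the burst. Threshold and durations are hypothetical.
def perceived_as_voiced_aspirate(prevoice_ms, postvoice_ms, threshold_ms=60):
    return (prevoice_ms + postvoice_ms) >= threshold_ms

# Very different phasings with the same total are classified identically.
assert perceived_as_voiced_aspirate(50, 20) == perceived_as_voiced_aspirate(10, 60)
```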
Ladefoged (1971: 13) took issue with the naive theory of Hindi stops:
"when one uses a term such as voiced aspirated, one is using neither the term
voiced nor the term aspirated in the same way as in the descriptions of the
other stops." For Ladefoged "murmured [voiced aspirated] stops represent a
third possible state of the vocal cords." Schiefer's study can be taken as a
refutation of Ladefoged, confirming the naive theory's assumption that
just two features, perfectly cross-classifying, characterize the Hindi stops.
For Schiefer the "different mode of vibration" Ladefoged claims for the
voiced aspirate stops would simply be the consequence of the different
possibilities of phasing of the voicing gesture in the voiced aspirates. The
different phasing results in (a) shorter or nonexistent prevoicing times
(thereby creating a phonetic "contrast" with plain voiced stops), and (b)
the penetration of voicing into the postrelease aspirated period (there-
by creating a phonetic "contrast" with the aspiration of voiceless un-
aspirated stops). These phonetic "contrasts" are a matter of detail in the
phonetic implementation of the voicing gesture, and do not motivate
postulating yet another feature to represent the distinctions among these
sounds.
The chart in (2) gives the Schiefer theory of the featural classification of
Hindi stops:
The different values for "lead onset time" given by Schiefer to /b/ and /bh/ -
"2" and "1," respectively - are justified by her on the basis of (a) the phasing
difference in the realization of the voicing gesture in the two cases, and (b) the
fact that voicing lead is less important in the perception of voiced aspirates
than it is with plain voiced stops. Schiefer seems to imply that the represen-
tation in (2) is appropriate as a phonological representation of the contrasts
in the Hindi stop series. This is arguably a mistake. I would like to suggest
that (2) should be construed as the representation of the phonetic realization
of the contrasts in the Hindi stop series, and that the chart in (3), a revision of
the naive theory (1) in accordance with Schiefer's characterization of the
features "aspiration" and "voice," provides the appropriate phonological
representation of the contrasts:
that there are just two features involved in characterizing the four-way
contrast /p, ph, b, bh/ in Hindi and the other Indo-Aryan languages. But
phonological feature theory has not looked to phonetics for an answer to the
question of whether features are monovalent, binary, or n-ary. This is a
question that cannot be answered without examining the workings of
phonological systems, along the lines that have been suggested above. The
phonetic dimensions corresponding to phonological features are typically
gradient, not quantal. It is in phonology that one finds evidence of the
manner in which these dimensions are quantized, and so it is phonology that
must tell us how to assign values to phonological features.
Section C
Prosody
13
An introduction to intonational phonology
D. ROBERT LADD
13.1 Introduction
The assumption of phonological structure is so deeply embedded in instru-
mental phonetics that it is easy to overlook the ways in which it directs our
investigations. Imagine a study of "acoustic cues to negation" in which it is
concluded, by a comparison of spectrographic analyses of negative and
corresponding affirmative utterances, that the occurrence of nasalized for-
mants shows a statistically significant association with the expression of
negation in many European languages. It is quite conceivable that such data
could be extracted from an instrumental study, but it is most unlikely that
anyone's interpretation of such a study would resemble the summary
statement just given. Nasalized formants are acoustic cues to nasal seg-
ments-such as those that happen to occur in negative words like not,
nothing, never (or non, niente, mai or n'e, n'ikto, n'ikogda, etc.)-rather than
direct signals of meanings like "negation." The relevance of a phonological
level of description - an abstraction that mediates between meaningful units
and acoustic/articulatory parameters - is taken for granted in any useful
interpretation of instrumental findings.
Until recently, the same sort of abstraction has been all but absent from
instrumental work on intonation. Studies directly analogous to the hypothe-
tical example have dominated the experimental literature, and the expression
"intonational phonology" is likely to strike many readers as perverse or
contradictory. In the last fifteen years or so, however, a body of theoretical
work has developed, empirically grounded (in however preliminary a fash-
ion) on instrumental data, that gives a clear meaning to this term. That is the
work reviewed here.1
1 The seminal work in the tradition reviewed here is Pierrehumbert (1980), though it has
important antecedents in Liberman (1975) and especially Bruce (1977). Relevant work since
1980 includes Ladd (1983a), Liberman and Pierrehumbert (1984), Gussenhoven (1984),
Pierrehumbert and Beckman (1988), and Ladd (1990).
This is not to suggest that a target-and-transition model is necessarily the ideal phonetic
model of either Fo or spectral properties, but only that, as a first approximation, it is equally
well suited to both.
Figure 13.1 Speech wave and Fo contour for the utterance Her mother's a lawyer, spoken with an unemphatic declarative intonation. The peaks of the two accents (arrows) are aligned near the end of the stressed syllables mo- and law-
Figure 13.2 Speech wave and Fo contour for the sentence The bathroom?! (see text for detail). The valley of the low(-rising) pitch accent is aligned near the end of the syllable bath-
13.2.1.2 Association of tunes to texts
The basic phonological analysis of a pitch contour is thus a string of one or
more pitch accents together with relevant boundary tones. Treating this
description as an abstract formula, we can then speak of contour types or
tunes, and understand how the same contour type is applied to utterances
with different numbers of syllables. For example, consider the tune that can
be used in English for a mildly challenging or contradicting echo question, as
in the following exchange.3
On the monosyllabic utterance Sue this contour rises and falls and rises
again. However, we are not dealing with a global rise-fall-rise shape that
applies to whole utterances or to individual syllables, as can be seen when we
apply the same contour to a longer utterance:
found that the phonetic property of the contour that remained most
invariant under range variation-i.e. the thing that most reliably character-
ized the contour types-was the relationship in Fo level between the two
accent peaks of the contours; other measures (e.g. size of pitch excursion)
were substantially more variable. Finally, it has been shown repeatedly that
the endpoints of utterance-final Fo falls are quite constant for a given speaker
in a given situation (e.g. Maeda 1976; Menn and Boyce 1982; Liberman and
Pierrehumbert 1984; Ladd 1988 for English; Ladd et al. 1985 for German;
van den Berg, Gussenhoven, and Rietveld [this volume] for Dutch; Connell
and Ladd 1990 for Yoruba). It has been suggested that this constant
endpoint is (or at least, reflects) some sort of "baseline" or reference value for
the speaker's Fo range.4
4 It has also been suggested (e.g. Pierrehumbert and Beckman, this volume) that the invariance
of contour endpoints has been exaggerated and/or is an artifact of Fo extraction methods, and
that the scaling of contour endpoints can be manipulated to signal discourse organization.
Whether or not this is the case, it does not affect the claim that target levels are linguistically
significant - in fact, if Pierrehumbert and Beckman are right it would in some sense strengthen
the argument.
written with the diacritic % (H% or L%). Between the last pitch accent and
the boundary tone of any phrase, in Pierrehumbert's original analysis, there
is another tone she called the "phrase accent." Since this tone does not seem
to be associated with a specific syllable but rather trails the pitch accent by a
certain interval of time, Pierrehumbert considered this to be an unstarred
tone like those that can be part of pitch accents, and therefore wrote it T-.
For example, the rising-falling-rising tune illustrated earlier would be
written L*+H- L- H%, with a low-rising pitch accent, a low phrase accent,
and a high final boundary tone, as in (3):
(3) L*+H- L- H%
a computer programmer
extent over some domain (syllable, word, phrase, utterance), and that Fo
contours are built up by superposing smaller-domain features on larger-
domain ones, in a manner reminiscent of Fourier description of complex
waveforms. The intonational models of e.g. O'Shaughnessy and Allen (1983)
or Gronnum (this volume; see also Thorsen 1980a, 1985) are based on this
approach.
This view was convincingly challenged by Bruce (1977). Bruce showed that
in Swedish, the phonetic manifestations of at least certain phrase-level
intonational features are discrete events that can be localized in the Fo contour
(Bruce's sentence accents). That is, the relationship between lexically speci-
fied Fo features (the Swedish word accents) and intonationally specified ones
is not necessarily a matter of superposing local shapes on global ones, but
involves a simple succession of Fo events in time.
In the most restrictive versions of current intonational phonology, it is
explicitly assumed that independently chosen global shapes-e.g. a "declina-
tion component" - are not needed anywhere in the phonological description.
(This is discussed further in 13.3.2 below.) In effect, the restrictive linear view
says that all languages have tonal strings; the main difference between
languages with and without lexical tone is simply a matter of where the tonal
specifications come from. In some languages ("intonation languages") the
elements of the tonal string are chosen, as it were in their own right, to
convey pragmatic meanings, while in others ("tone languages") the phonolo-
gical form of morphemes often or always includes some tonal element, so
that the tonal string in any given utterance is largely a consequence of the
choice of lexical items.
In this view, the only tonal elements that are free to serve pragmatic
functions in a tone language are boundary tones, i.e. additional tonal
specifications added on to the lexically determined tonal string at the edge of
a phrase or utterance. This is how the theory outlined here would account for
the common observation that the final syllable of a phrase or utterance can
have its lexical tone "modified" for intonational reasons (see e.g. Chang
1958). Additionally, some tone languages also seem to modify pitch range,
either globally or locally, to express pragmatic meanings like "interrog-
ation." The possibilities open to lexical tone languages for the tonal
expression of pragmatic meanings are extensively discussed by Lindsey
(1985).
The principal phonetic difference between tone languages and intonation
languages is simply a further consequence of the functional difference. More
"happens" in Fo contours in a tone language, because the tonal specifications
occur on nearly every syllable and the transitions span only milliseconds,
whereas in a language like English most of the tonal specifications occur only
on prominent words and the transitions may span several syllables. But
the specifications are the same kind of phonological entity regardless of
their function, and transitions are the same kind of phonetic pheno-
menon irrespective of their length. There is no basis for assuming that
lexical tone involves a fundamentally different layer in the analysis of Fo
contours.
"Tonal space" is an ad hoc term. In various contexts the same general abstraction has been
referred to as tone-level frame (Clements 1979), grid (Garding and her co-workers, e.g.
Garding 1983), transform space (Pierrehumbert and Beckman [1988], but note that this is not
really intended as a technical term), and register (e.g. Connell and Ladd 1990). The lack of any
accepted term for this concept is indicative of uncertainty about whether it is really a construct
in its own right or simply a consequence of the interaction of various model parameters; only
for Garding does the "grid" clearly have a life of its own. See further section 13.3.2.
[schematic Fo contour: register shift, tonal space, baseline]
Figure 13.3 Idealization of the two-accent Fo contour from figure 13.1, illustrating
one way in which the accents might be modeled using the concept of "tonal space"
modeled phonetically as a fall from the top of the tonal space to the bottom,
ending some distance above the baseline. This is shown in Figure 13.3.
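One way to read figure 13.3 numerically, as a rough sketch with invented parameter values: tone levels are scaled within a tonal space above a fixed baseline, and the register shift lowers the top of the space for the later accent.

```python
# Hypothetical numbers illustrating the "tonal space" idea of figure 13.3.
BASELINE = 100.0  # Hz; speaker reference value (invented)

def scale_tone(level, bottom, top):
    """Map an abstract tone level in [0, 1] (L = 0, H = 1) to Fo in Hz."""
    return bottom + level * (top - bottom)

space = (150.0, 300.0)           # bottom and top of the tonal space
h1 = scale_tone(1.0, *space)     # first accent peak

shifted = (150.0, 240.0)         # register shift: top of the space lowered
h2 = scale_tone(1.0, *shifted)   # second accent peak, lower than the first

# The final fall reaches the bottom of the space, still above the baseline.
fall_end = scale_tone(0.0, *shifted)
assert h2 < h1 and fall_end > BASELINE
```

The sketch leaves open exactly the questions raised below: whether pitch-range phenomena raise, widen, or shift the space relative to the baseline.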
The mathematical details vary rather considerably from one model to
another, but this basic approach can be seen in the models of e.g. Bruce
(1977), Pierrehumbert (1980; Liberman and Pierrehumbert 1984; Pierrehum-
bert and Beckman 1988), Clements (1979, 1990), Ladd (1990; Ladd et al.
1985; Connell and Ladd 1990), and van den Berg, Gussenhoven, and
Rietveld (this volume); to a considerable extent it is also part of the models
used by 't Hart and his colleagues and by Garding and her colleagues (see
note 5).
Mathematical details aside, the biggest point of difference among these
various models lies in the way they deal with differences of what is loosely
called "pitch range." Pretheoretically, it can readily be observed that some
speakers have wide ranges and some have narrow ones; that pitch range at
the beginning of paragraphs is generally wide and then gets narrower; that
pitch range is widened for emphasis or interest and narrowed when the topic
is familiar or the speaker is bored or depressed. But we lack the data to decide
how these phenomena should be expressed in terms of the parameters of
phonetic models like those described here. For example, many instrumental
studies (e.g. Williams and Stevens 1972) have demonstrated that emotional
arousal - anger, surprise, etc.-is often accompanied by higher Fo level and
wider Fo range. But level is generally defined in these studies in terms of mean
Fo (sampling Fo every 10-30 msec, and thus giving equal weight to targets
and transitions); range is usually viewed statistically in terms of the variance
around the Fo mean. We do not know, in terms of a phonetic model like the
ones under consideration here, whether these crude data reductions reflect a
raising of the tonal space relative to the baseline, a widening of the tonal
space, a raising of everything including the baseline, or any of a number of
other logical possibilities. A great deal of empirical work is needed to settle
questions like these. In the meantime, different models have taken rather
different approaches to these questions.
13.3.2 Downtrends
Perhaps the best illustration of such differences is the treatment of overall Fo
downtrends ("declination" and the like). Since the work of Pierrehumbert
(1980), it is widely accepted that at least some of what has earlier been treated
as global declination is in fact downstep - a stepwise lowering of high Fo
targets at well defined points in the utterance. However, this leaves open a
large number of fairly specific questions. Does downstep lower high targets
by narrowing the tonal space or by lowering it? Is it related to the "resetting
of pitch range" that often follows prosodic boundaries (see the papers by
Kubozono, and van den Berg, Gussenhoven, and Rietveld, this volume)? Are
there residual downtrends - true declination - that downstep cannot explain?
If there are, do we model such declination as a gradual lowering of the
baseline, as a gradual lowering of the tonal space relative to the baseline, or
in some other way? Appropriately designed studies have made some progress
towards answering these questions; for example, the coexistence of downstep
and declination is rather nicely demonstrated in Pierrehumbert and Beck-
man's work on Japanese (1988: ch. 3). But we are a long way from
understanding all the interactions involved here.
A further theoretical issue regarding downtrends - and in some sense a
more basic one - is whether the shape and direction of the tonal space can be
chosen as an independent variable, or whether it is conditioned by other
features of phonological structure and the tonal string. In the models
proposed by Garding and her co-workers (e.g. Garding 1983; Touati 1987)
and by 't Hart and his colleagues (e.g. Cohen and 't Hart 1967; 't Hart and
Collier 1975; Cohen, Collier, and 't Hart 1982), although there is a clear
notion of describing the contour as a linear string of elements, the tonal space
can also be modified globally in a way that affects the scaling of the elements
in the string. This recalls the global intonation component in traditional
models of tone and intonation (see 13.2.3 above). In models more directly
inspired by Pierrehumbert's work, on the other hand, the tonal space has no
real life of its own; any changes to the tonal space are either due to
paralinguistic modification of range, or are - like downstep - triggered by
phonological choices in the tonal string and/or in prosodic structure.
(4)
(w = weak, s = strong)
has been shown by both Huss (1978) and Nakatani and Schaffer (1978) that
certain putative differences of stress can be directly reflected in syllable
duration without any difference of pitch contour. Second, Beckman (1986)
has demonstrated clear perceptual differences between Japanese and English
with respect to accent cues: in Japanese accent really does seem to be signaled
only by pitch change, whereas in English there are effects of duration,
intensity, and vowel quality that play an important perceptual role even
when pitch change is present.6 Third, there are clear cases where pitch cues
are dependent on or orchestrated by a prominent syllable without being part
of the syllable itself. In Welsh declarative intonation, for example, a marked
pitch excursion up and down occurs on the unstressed syllable following the
major stressed syllable; the stressed syllable may be (but need not be)
distinguished durationally by the length of the consonant that closes it
(Williams 1985). This makes it clear that the pitch movement is an intona-
tional element whose distribution is governed by the (independent) occur-
rence of prominence, and suggests that pitch movement cues prominence in a
rather less direct way than that imagined by Fry and other earlier
investigators.
(5)
Data on the phonetics of prominence in Tamil, reported by Soundararaj (1986), suggest that
Tamil is like Japanese in using only pitch to cue the location of the accented syllable;
Soundararaj found no evidence of differences of duration, intensity, or vowel quality (see also
the statements about Bengali in Hayes and Lahiri 1991). Interestingly, however, the function
of pitch in Tamil (or Bengali) is rather more like the function of pitch in English or Dutch, i.e.
purely intonational, not lexically specified as in Japanese. That is, phonetically speaking,
Tamil is (in Beckman's sense) a "pitch-accent" language like Japanese, but functionally
speaking it uses pitch like the "stress-accent" languages of Europe. If this is true, it provides
further evidence for excluding function from consideration in modeling F o in different
languages (see 13.2.3 above).
Related findings are reported by e.g. Cooper and Paccia-Cooper (1980) and
by Thorsen (1985, 1986).
The principal issue that these findings raise for intonational phonology
involves the nature of the hierarchical structures and their relationship to
phonetic models of Fo. Are we dealing directly with syntactic constituent
structure, as assumed for example by Cooper and Paccia-Cooper? Are we
dealing with some sort of discourse organization, in which boundary strength
is a measure of the "newness" of a discourse topic? (This has been suggested
by Beckman and Pierrehumbert [1986], following Hirschberg and Pierrehum-
bert [1986].) Or are we dealing with explicitly phonological or prosodic
constituent structure, of the sort that has been discussed within metrical
phonology by, e.g., Nespor and Vogel (1986)? This latter possibility has been
suggested by Ladd, who sees "relative height" relations like
(6)
(h = high; l = low)
14
Downstep in Dutch: implications for a model
ROB VAN DEN BERG, CARLOS GUSSENHOVEN, AND TONI RIETVELD
14.0 Introduction
In this paper we attempt to identify the main parameters which must be
included in a realistic implementation model for the intonation of Dutch.*
The inspiration for our research was provided by Liberman and Pierrehum-
bert (1984) and Ladd (1987a). A central concern in those publications is the
phonological representation and the phonetic implementation of descending
intonation contours. The emphasis in our research so far has likewise been
on these issues. More specifically, we addressed the issue of how the
interruption of downstep, henceforth referred to as reset (Maeda 1974;
Cooper and Sorensen 1981: 101) should be represented. Reset has been
viewed as (a) an upward register shift relative to the register of the preceding
accent, and (b) as a local boost which interrupts an otherwise regular
downward trend. We will argue that in Dutch, reset should indeed be modeled
as a register shift: not an upward shift relative to the preceding accent, but
a downward one relative to the preceding phrase. Accordingly, we propose
that a distinction should be made between accentual downstep (which applies
to H* relative to a preceding H* inside a phrase), and phrasal downstep,
which reduces the range of a phrase relative to a preceding phrase, and
creates the effect of reset, because the first accent of the downstepped phrase
will be higher than the last downstepped accent of the preceding phrase.
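The distinction can be sketched numerically (all factors and frequencies below are invented): accentual downstep applies a constant factor to each noninitial H* within a phrase, while phrasal downstep shrinks the range of each successive phrase; the reset effect then falls out of the arithmetic.

```python
def phrase_targets(n_phrases, n_accents, top=300.0, baseline=100.0,
                   d_acc=0.8, d_phr=0.7):
    """Toy H* targets (Hz) under accentual and phrasal downstep.

    d_acc lowers each noninitial H* within a phrase; d_phr shrinks the
    range of every successive phrase. All parameter values are invented.
    """
    phrases = []
    for p in range(n_phrases):
        phrase_range = (top - baseline) * d_phr ** p
        phrases.append([baseline + phrase_range * d_acc ** a
                        for a in range(n_accents)])
    return phrases

t = phrase_targets(2, 3)
# "Reset": the second phrase starts above the last accent of the first...
assert t[1][0] > t[0][-1]
# ...but below the first phrase's initial accent: a downward register shift.
assert t[1][0] < t[0][0]
```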
We will present the results of two fitting experiments, one dealing with
accentual downstep, the other with phrasal downstep. Both address the
question whether the two downstep factors (accentual and phrasal) are
independent of speaker (cf. men vs. women) and prominence (excursion size
of the contour). In addition, the first experiment addressed the issue whether
* This paper has benefited from the comments made by an anonymous reviewer.
the accentual downstep factor depends on the number of downstepped
accents in the phrase.
Before reporting on these data, we give a partial characterization of the
intonation of Dutch in section 14.1. This will enable us to place the
phenomenon of (accentual and phrasal) downstep in a larger phonological
context, and will also provide us with a more complete picture of the
parameters to be included in an implementation model. This model, given in
section 14.2, will be seen to amount to a reorganization of a model proposed
in Ladd (1987b). Section 14.3.1 reports on the experiment on accentual
downstep and discusses the data fitting procedure in detail. The experiment
on phrasal downstep is presented in section 14.3.2.
Figure 14.1 Contours (H*L L%)AD and (L*H H%)AD on the sentence Lee*uwarden wil meer mannen
considered the same unit, while the same goes for the (b) examples. Our claim
is that the explanation for the difference between the contours in figure 14.2
and those in figure 14.3 is to be found in a difference in phrasing, rather than
in a different choice of pitch accent. In a slow, explicit type of pronunciation,
as illustrated by the contours in figure 14.2, the tone segments of a pitch
accent are confined to the highest constituent that dominates it, without
dominating another pitch accent. This constituent, which we refer to as the
association domain (AD), is Leeuwarden in the case of the first pitch accent in
the contours in figure 14.2, because the next higher node also dominates the
following H*L. The AD for this H*L, obviously, is wil meer mannen. Unless
the first syllable of the AD is the accented syllable, there will be an
unaccented stretch in the AD (none in Leeuwarden, and wil meer in wil meer
mannen), which we refer to as the onset of the AD. By default, onsets are low-
pitched in Dutch (but they may also be high-pitched, to give greater
expressiveness to the contour). Turning now to the contours in figure 14.3,
we observe that the AD-boundary after Leeuwarden is lacking, which we
Figure 14.2 Contours (H*L)AD, (H*L L%)AD and (L*H)AD (H*L L%)AD on the sentence Lee*uwarden wil meer ma*nnen
suggest is the result of the restructuring of the two ADs to a single AD'. One
obvious consequence is that wil meer is no longer an onset. More specifically,
the consequence for the first pitch accent is that the spreading rule applying
to the second tone segment no longer has a right-hand boundary it can refer
to. What happens in such cases is that this segment associates with a point in
time just before the following accented syllable. The pitch of the interaccen-
tual stretch -warden wil meer is an interpolation between the shifted tone
segment and the tone segment to its left.2 The association domain is thus an
intonational domain, constructed on the basis of the independently existing
(prosodic) structure of the utterance, not as a constituent of the prosodic tree
itself. Restructuring of ADs is more likely as the prosodic boundary that
separates them is lower in rank (see Gussenhoven, forthcoming).
2 The analysis follows that given for English in Gussenhoven (1983). Note that the introduction
of a pitch accent L + H* for the second accent in the contour in figure 14.3a, as in
Pierrehumbert (1980), would unnecessarily add a term to the inventory of pitch accents.
Moreover, it would force one to give up the generalization that contours like those in figure
14.2 are more carefully pronounced versions of those in figure 14.3.
Figure 14.3 Contours (H*L H*L L%)AD' and (L*H H*L L%)AD' on the sentence Lee*uwarden wil meer ma*nnen
To relate the contours in figure 14.2 to those in figure 14.1, we assume that
the lexical representations of the two pitch accents are H*L and L*H.
Boundary Tone Assignment (see (1) below) provides the bitonal H*L and
L*H with a boundary tone which is a copy of their last tone segment. We
have no account for when this boundary tone segment appears, and
tentatively assume it is inserted when the AD ends at an intonational phrase
boundary.3
(1) Boundary Tone Assignment: ∅ → αT / αT __ )IP
Rule (1) should be seen as a default rule that can be preempted by other tonal processes of
Dutch. These include the modifications which H*L and L*H can undergo, and the stylistic
rule NARRATION, by which the starred tone in any L*H and H*L can spread, causing the
second tone segment to be a boundary segment. In addition, there is a third contour H*L H%.
These have been described in Gussenhoven (1988), which is a critique and reanalysis of the
well-known description of the intonation of Dutch by Cohen and 't Hart (1967), 't Hart and
Collier (1975), Collier and 't Hart (1981).
Figure 14.4 Tonal structure of the contours in figures 14.1a, 14.2a, and 14.3a
In summary, we find:
  End of IP:                H*L L%   L*H H%   (second tone segment spreads)
  End of AD, but inside IP: H*L      L*H      (second tone segment spreads)
  Inside AD':               H*L      L*H      (interpolation between T* and
                                              the following tone segment spans
                                              the stretch between accents)
Figure 14.4 gives schematic representations of the tonal structures of the
contours in figures 14.1a, 14.2a, and 14.3a.
Two of the forms that downstepped contours may take are illustrated in
figure 14.5. Both are instances of the sequence of placenames Haa*rlem,
Rommelda*m, Harderwij*k en Den He*lder. In figure 14.5a, the H*s after a
H*L have been lowered (a lowered H* is symbolized !H*), and the H*s
before a H*L have spread. In the contour in figure 14.5b, the spreading of the
first and second H*s stopped short at the syllable before the accent. In fact,
the spreading of these H*s may be further restrained. Maximal spreading of
H* will cause the L to be squeezed between it and the next !H*, such that it is
no longer clear whether it has a phonetic realization (i.e. whether it forms
part of the slope down to !H*) or is in fact deleted. In our phonological
representations, we will include this L. It is to be noted that the inclusion of
dips before nonfinal !H* in artificially produced downstepped contours
markedly improves their naturalness, which would seem to suggest that the
segment is not in fact deleted. Lastly, a final !H* merges with the preceding L
to produce what is in effect a L* target for the final accented syllable. We will
not include this detail in (2) below.
The rule for downstep, then, contains two parts: the obligatory downstep-
ping of noninitial H*s, and the variable spreading of nonfinal H*s. The
domain for downstep is the AD': observe that the L does not respect the
boundary between Haarlem and Rommeldam, Rommeldam and Harderwijk,
etc., as shown in the contours in figure 14.5.
(2) Downstep    a. H* → !H* / T*T __  (obligatory)
                b. Spread H* / __ L H*L  (variable)    Domain: AD'
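Stated procedurally, the obligatory part of rule (2) amounts to prefixing ! to every noninitial H* within an AD' marked [downstep]. A minimal Python sketch; the list-of-strings representation is ours for illustration, not the authors' formalism, and the variable spreading rule (2b) is left out:

```python
def apply_downstep(accents, downstep=True):
    """Rule (2a) within one AD': noninitial H*s are obligatorily
    downstepped to !H*.  The variable spreading rule (2b) is omitted."""
    if not downstep:            # the AD' carries no [downstep] morpheme
        return list(accents)
    out = []
    for i, acc in enumerate(accents):
        if i > 0 and acc.startswith("H*"):
            out.append("!" + acc)   # H*L -> !H*L
        else:
            out.append(acc)
    return out

# A sequence of four H*L accents, as in figure 14.5:
print(apply_downstep(["H*L", "H*L", "H*L", "H*L"]))
# -> ['H*L', '!H*L', '!H*L', '!H*L']
```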
Following widely attested tonally triggered downstep phenomena in lan-
guages with lexical tone, Pierrehumbert (1980) and Beckman and Pierrehum-
bert (1986) propose that in English, too, downstep is tonally triggered. The
reason why we cannot make the same assumption for Dutch is that contours
like those in figure 14.3a exist. Here, no downstep has applied; yet we seem to
have a sequence of H*L pitch accents, between which, moreover, no AD
boundary occurs. Since the tonal specification as well as the phrasing
corresponds with that of the contours in figure 14.5, we must assume that
instead of being tonally triggered, downstep is morphologically triggered.
That is, the rule in (2) is the phonological instantiation of the morpheme
[downstep], which can be affixed to AD'-domains, as an annotation on the
node of the constituent within which downstep takes place.
The characterization of downstep as a morpheme has the advantage that
all downstepped contours can be characterized as a natural class. That is,
regardless of the extent to which H* is allowed to spread, and regardless of
whether a modification like DELAY is applied to H*Ls (see Gussenhoven
1988), downstepped contours are all characterized as having undergone rule
(2). This advantage over an analysis in which downstep is viewed as a
Figure 14.5 Downstepped contours H*L !H*L !H*L !H*L on the sentence Haa*rlem,
Rommelda*m, Harderwij*k en Den He*lder
mid pitch of the vocative chant, for example, a contour with a very different
meaning from the downstepped contours illustrated in figure 14.5, is
obtained by the same implementation rule that produces downstepped
accents in Pierrehumbert (1980), Pierrehumbert and Beckman (1986). A
disadvantage of our solution is that it is not immediately clear what a
nondiacritic representation of our morpheme would look like. We will not
pursue this question here.
14.1.3 Reset
As in other languages, non-final AD's in which downstepping occurs, may be
followed by a new AD' which begins with an accent peak which is higher
than the last downstepped H* of the preceding AD'. In theory, there are a
number of ways in which this phenomenon, henceforth referred to as reset,
could be effected. First, reset could be local or global: that is, it could consist
of a raising of the pitch of the accent after the boundary, or it could be a
raising of the register for the entire following phrase, such that the pitch of all
following tone segments would be affected.
Reset as local boost. The idea of a local boost is tentatively entertained by
Clements (1990). In this option, the downward count goes on from left to
right disregarding boundaries, while an Fo boost raises the first part or the
first H* in the new phrase, without affecting following H*s. Although local
boosting may well be possible in certain intonational contexts, the contour in
figure 14.6 shows that reset involves a register shift. It shows a sequence of
two AD's, the first with four proper names and the second with five, all of
them having H*L. The fifth proper name, Nelie, which starts the new AD', is
higher than the fourth, Remy. The second and third accent peaks of the
second AD', however, are also higher in pitch than the final accent of the
preceding AD'. Clearly, the scaling of more than just the first H* of the new
AD' has been affected by the Reset.
Reset as register shift using accentual downstep factor. If reset does indeed
involve a register shift, what is the mechanism that effects it? This issue is
closely related to that of the size of the shift. Adapting a proposal for the
description of tonally triggered downstep in African languages made by
Clements (1981), Ladd (1990) proposes that pitch accents are the terminal
nodes of a binary tree, whose branches are labeled [h-l] or [l-h], and whose
constituency is derived from syntax in a way parallel to the syntax-
phonology mapping in prosodic/metrical phonology. Every terminal node is
a potential location for a register shift. Whether the register is shifted or not,
and if so, whether it is shifted upward or downward, is determined by the
Relative Height Projection Rule, reproduced in (3).
Figure 14.6 Contour (H*L !H*L !H*L !H*L) !(H*L !H*L !H*L !H*L !H*L) on the sentence
(Merel, Nora, Leo, Remy), en (Nelie, Mary, Leendert, Mona en Lorna)
(3) Relative Height Projection Rule: In any metrical tree or constituent, the
highest terminal element (HTE) of the subconstituent dominated by l is one
register step lower than the HTE of the subconstituent dominated by h, iff
the l is on a right branch.
(4) [Tree diagrams: (a) terminals labeled h l h l, with register steps 0 1 1 2; (b) h h l h l, with steps 0 1 2 1 2; (c) h h h l h l, with steps 0 1 2 3 1 2]
there is reset, whose size depends on the number of accents that precede in
the left-hand constituent (see (4b, c)).
We doubt whether [h-l]-labeled trees provide the appropriate representation
for handling downstep. For one thing, within the AD', H*Ls are
downstepped regardless of the configuration of the constituents in which
they appear (see also Pierrehumbert and Beckman 1988: 168). What we are
concerned with at this point, however, is the claim implicit in Ladd's
proposal that the size of the reset is the same as, or a multiple of, the size of
downstep inside an AD'. The contour in figure 14.7a suggests they are not
equal in size. In this contour, the third H*, the one beginning the second AD',
does not go up from the second H* by the same distance that the second H*
was lowered from the first. Neither is the third H* scaled at the same pitch as
the second. It in fact falls somewhere between the first and the second, a
situation which does not obviously follow from the algorithm in (3).
Figure 14.7 Scaling domain of three ADs: (a) with phrasal downstep on the sentence
(Merel, Nora, Leo), (Remy, Nelie, Mary), (en Mona en Lorna); (b) without phrasal
downstep on the sentence de mooiste kleren, de duurste schoenen, de beste school...
("the finest clothes, the fanciest shoes, the best school . . . " )
14.2 An implementation model for Dutch intonation
On the basis of the above discussion, we conclude that, minimally, a model
for implementing Dutch intonation contours will need to include the
following five parameters:
1 one parameter for speaker-dependent differences in general pitch range;
2 one parameter to model the effect of overall (contour-wide) prominence;
3 one to control the distance between the targets for H* and L*;
4 one to model the effect of accentual downstep within the phrase;
5 one to model the effect of phrasal downstep.
Our work on implementation was informed by Ladd (1987a), which sets out
the issues in F0-implementation and proposes a sophisticated model of target
scaling. His model, inspired by three sources (Pierrehumbert 1980; Garding
1983; Fujisaki and Hirose 1984) is given in (5).
(5) F0(n) = Fr * N^(f(Pn) * f(A))
F0(n) is the pitch value of the nth accent in Hz;
Fr is the reference line at the bottom of the speaker's range (lowest pitch
reached);
N defines the current range (N > 1.0);
f(Pn) is the phrase function, with f(Pn) = f(Pn-1) * d^s and f(P1) = 1; this
phrase function scales the register down or up;
d is the downstep factor (0 < d < 1);
s is +1 for downstep or -1 for upstep;
f(A) is the accent function, of the form W^(E*T), which scales H* and L* targets;
W dictates register width, i.e. the distance between H* and L*
targets;
T represents the linguistic tone (T = +1 for H*, T = -1 for L*);
E is an emphasis factor; its normal value is 1.0; values > 1.0 result
in more emphasis, i.e. the higher scaling of H*s and lower
scaling of L*s.
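To make the arithmetic of (5) concrete, here is a small Python transcription. The (T, s) encoding of the accent sequence and the parameter values in the usage line are our own illustrative assumptions, not Ladd's notation:

```python
def ladd_f0(accents, Fr, N, W, d, E=1.0):
    """Ladd's target-scaling model (5): F0(n) = Fr * N**(f(Pn) * f(A)).
    `accents` is a list of (T, s) pairs: T = +1 for H*, -1 for L*;
    s = +1 steps the register down, -1 steps it up (s of the first
    accent is unused, since f(P1) = 1)."""
    targets, f_P = [], 1.0
    for i, (T, s) in enumerate(accents):
        if i > 0:                  # f(Pn) = f(Pn-1) * d**s
            f_P *= d ** s
        f_A = W ** (E * T)         # accent function W**(E*T)
        targets.append(Fr * N ** (f_P * f_A))
    return targets

# Four downstepping H* accents; toy values Fr=100 Hz, N=1.8, W=1.5, d=0.8:
print([round(x, 1) for x in ladd_f0([(1, 1)] * 4, 100, 1.8, 1.5, 0.8)])
```

Each successive downstep multiplies the exponent, not the frequency, so the targets decay toward (but never below) the reference line Fr.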
We distinguish the actual scaling parameters Fr, N, W, d, and E from the
binary choice variables T and s, which latter two are specified as +1 or -1
by the phonological representation. (In fact, s may be -2, -3, etc., to effect
double, treble, etc. upsteps: cf. (4).) Of the scaling parameters, Fr is assumed
to be a speaker constant, while N, W, and E are situation-dependent. Notice
that increasing W has the same effect as increasing E.4 Figure 14.8 gives the
model in pictorial form. It should be emphasized that in the representation
given here N, W, and d do not represent absolute differences in F0-height.
When comparing this model with our list of five parameters, we see that Fr,
N, and W correspond to parameters 1, 2, and 3, respectively. Ladd's d is used
4
Parameter E is no longer included in the version of the model presented in Ladd (1990). The
difference was apparently intended to be global vs. local: E can be used to scale locally defined
prominence, as it is freely specifiable for each target.
[Figure 14.8: the model in (5) in pictorial form, showing the reference line Fr, the width W, the downstep factor d, and H, !H, and L targets]
for both downward and upward scaling, which, in our model, is taken care of
by the two downstep parameters 4 and 5. However, an important aspect of
Ladd's formula is that it allows for the independent modeling of intraphrasal
and interphrasal effects. Although our conception of downstep and reset
differs from that of Ladd, we can retain his formula in a modified form. We
include two downstep factors, which makes s superfluous. We also exclude E
(see note 4). The model we propose scales all H* and L* targets within a
scaling domain, i.e. the domain over which N and W remain constant. It uses
the same mathematical formula to scale targets in scaling domains with and
without phrasal downstep, as well as targets in AD's with and without
accentual downstep.
(6) F0(m,n) = Fr * N^(f(Pm) * f(An))
F0(m,n) is the pitch value for the nth accent in the mth phrase;
Fr is the reference frequency determining general pitch range;
N defines the current range (N > 1.0);
f(Pm) = dp^(Sp*(m-1)), the phrase function, scales the phrase reference lines for
the mth phrase;
dp is the phrasal downstep factor (0 < dp < 1);
Sp indicates whether phrasal downstep takes place or not; Sp =
+1 if it does and Sp = 0 if it does not.
f(An) = W^T * da^((1/2)*Sa*(1+T)*(n-1)), the accent function, scales H* and L* targets;
W determines register width, the distance between H* and L*
targets;
da is the downstep factor (0 < da < 1) for downstepping H*
targets within the phrase;
T represents the linguistic tone (T = +1 for H*, T = -1 for L*);
Sa indicates whether accentual downstep takes place in that AD';
Sa = +1 if it does and Sa = 0 if it does not;
the inclusion of the weighting factor 1/2 in the accent function
ensures that the exponent on da is 1 when n = 2, 2
when n = 3, and so on.
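As a check on the arithmetic, formula (6) transcribes directly into code. This is a sketch under our reading of the formula; all parameter values below are invented for illustration, not fitted values from the experiments reported later:

```python
def f0_target(m, n, T, Fr, N, W, da, dp, Sa=1, Sp=1):
    """Formula (6): F0(m,n) = Fr * N**(f(Pm) * f(An))."""
    f_P = dp ** (Sp * (m - 1))                             # phrase function
    f_A = (W ** T) * da ** (0.5 * Sa * (1 + T) * (n - 1))  # accent function
    return Fr * N ** (f_P * f_A)

# Toy values (not fitted): Fr=100 Hz, N=1.8, W=1.5, da=0.8, dp=0.9.
# Three downstepped H* targets in the first phrase:
for n in (1, 2, 3):
    print(round(f0_target(1, n, 1, Fr=100, N=1.8, W=1.5, da=0.8, dp=0.9), 1))
```

With these toy values the series reproduces the qualitative pattern discussed above: targets within an AD' step down toward the reference line; the first H* of a new phrase resets upward relative to the last downstepped accent, yet starts lower than the preceding phrase did.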
[Figure 14.9: pictorial representation of the model in (6), showing the reference line Fr, the width W, and a series of !H targets each stepped down by da]
Again we distinguish the actual scaling parameters, Fr, N, W, da, and dp,
corresponding to the parameters mentioned under 1 to 5 above, respectively,
from the binary choice variables, T, Sa, and Sp. The latter serve to
distinguish between H* and L* targets, between AD's with and without
accentual downstep, and between scaling domains with and without phrasal
downstep, respectively. A pictorial representation of this model is given in
figure 14.9. Again, we emphasize that N, W, da, and dp should not be taken to
represent absolute F0 differences.
In figure 14.9, the high reference line represents the F0 for the H* targets
not subject to accentual downstep, while the low reference line represents the
F0 for all L* targets. If phrasal downstep is in force (Sp = 1), we assume, at
least for the time being, that its effect amounts to a downward scaling of both
reference lines. If it is not (Sp = 0), both reference lines are scaled equally high
for all AD's within the scaling domain. The phrase function we propose
accomplishes both: f(Pm) = dp^(m-1) if phrasal downstep takes place, and
f(Pm) = 1 if it does not.
The scaling of the targets within the AD' is taken care of by the accent
function given above. For an AD' without accentual downstep (Sa = 0), the
accent function reduces to f(An) = W^T. Consequently, all H* targets
(T = +1) are scaled on the phrase high-line (7) and all L* targets (T = -1) on
the phrase low-line (8).
(7) Fr * N^(f(Pm) * W)
(8) Fr * N^(f(Pm) * W^(-1))
For an AD' with accentual downstep (Sa = 1), the accent function is
f(An) = W^T * da^((1/2)*(1+T)*(n-1)). The pitch value for the nth H* target (T = +1)
is given by (9), which scales the first H* target (n = 1) on the high-line. L*
targets (T = -1) are scaled on the low-line (10). Note that the scaling of L* is
not affected by accentual downstep, but that it is by phrasal downstep (see
14.1.4).
(9) Fr * N^(f(Pm) * W * da^(n-1))
(10) Fr * N^(f(Pm) * W^(-1))
The model assumes that the parameters Fr, da, and dp are constants, with
(speaker-specific) values that hold across utterances. In fact, the
downstep factors dp and da may also be constant across utterances and
speakers. (This is one of the questions to be dealt with below.) If these
assumptions hold, all target values in a scaling domain are fully determined
by the speaker's choice of N and W for that particular scaling
domain.
times 4 prominence levels) were recorded in two sessions, which were one
week apart. Subsequently, some utterances were judged to be inadequate,
and discarded. In order to obtain an equal number of utterances for each of
the four speakers, we randomly discarded as many utterances as necessary to
arrive at a total of thirty-two sentences per length for each subject. The
speech material was digitized with a sampling rate of 10 kHz. The resulting
speech files were analyzed with the LPC procedures supplied by the LVS
package (Vogten 1985) (analysis window 25 msec, prediction order 10 and
pre-emphasis 0.95). For all nonfinal H* targets we measured the highest pitch
reached in the accented vowel. This procedure resulted in Fo values which in
our opinion form a fair representation of the height of the plateau that could
visually be distinguished in the Fo contour. It is somewhat more problematic
to establish an intuitively acceptable norm for measuring the final H* target.
Visual inspection showed that the Fo contour sloped down from the plateau
for the prefinal target to a lower plateau. The slope between them was
generally situated in the first half of the final accented syllable. Since the
contour did not offer clear cues as to where the last target should be
measured, we arbitrarily decided to take a measurement at the vowel
midpoint. Where the pitch-extraction program failed because vocal fry had
set in, we measured the pitch period at the vowel midpoint and derived an F0
value from it. Because our model does not (yet) include a final-lowering
factor, these final values will be excluded from the fitting experiment.
However, we will use them to obtain an estimate for the speaker's general
pitch range.
all the speaker's accent series and four da values, one for each length. This is
indicated in the subscript, dak. The second model incorporates one Fr value
and one da value, irrespective of series length, forcing the four daks to have
the same value. With this provision, the same mathematical formula can be
used for both models. Applying the above to our model's general formula,
the predicted value for the nth accent in a series of k accent values thus
becomes (11).
(11) Pksn = Fr * Rks^(dak^(n-1))
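The fitting procedure can be sketched as a one-dimensional grid search. The prediction below follows the high-line scaling for a single phrase, with the range parameter R = N^W; the RMS error is a stand-in for the paper's distance measure (which this excerpt does not define), and the peak values are invented:

```python
import math

def predict(n, Fr, R, da):
    """Predicted peak for the nth accent of one series: Fr * R**(da**(n-1))."""
    return Fr * R ** (da ** (n - 1))

def fit_da(series, Fr, R):
    """Grid-search the accentual downstep factor da for one series of
    measured H* peaks (Hz), minimizing RMS error (our stand-in for the
    paper's distance measure)."""
    best = (None, math.inf)
    for i in range(1, 100):
        da = i / 100
        err = math.sqrt(sum((predict(n + 1, Fr, R, da) - f) ** 2
                            for n, f in enumerate(series)) / len(series))
        if err < best[1]:
            best = (da, err)
    return best[0]

# Invented measurements roughly following da = 0.8, with Fr = 100 Hz
# and R estimated from the first peak (180/100 = 1.8):
peaks = [180.0, 160.0, 145.7, 135.1]
print(fit_da(peaks, Fr=100.0, R=1.8))
```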
14.3.1.3 Results
To obtain an idea of the best possible fit the model allows, we first ran, for
each speaker separately, optimizing analyses for a wide range of Fr values.
(Of course, different Fr values lead to different da values because of the
[Tables 14.1 and 14.2: Fr and da values for the length-dependent and length-independent models; bodies not preserved]
humbert (1984: 220) call "soft" preplanning, i.e. behavioral common sense,
as opposed to "hard" preplanning, which would involve right-to-left compu-
tation of the contour. From the dak values for the length-dependent-da model
it is clear that all speakers do to some extent have higher values for da with
more targets. However, as was also the case in the data collected by
Liberman and Pierrehumbert (1984), this trend is small. Indeed, a compari-
son of the indexes in table 14.1 shows that the inclusion of a length-
dependent da in the model results in only a limited gain. To test whether
some speakers adjusted the pitch height of the first accent to the number of
accents, we subjected the Fo values for the initial accents to an analysis of
variance. The effect of the factor series length was not significant, F(3,496) =
1.36 (p > 0.10).
While the values for Fr in table 14.1 are optimal, they are in no way theory-
based. In order to assess how realistic it would be to assume a language-
specific da, we ran optimizing procedures for both models with an externally
motivated estimate of the general pitch-range parameter. To this end, we
adopted Ladd's operational definition of Fr as the average of endpoint values
of utterance-final falling contours. This corresponds with the mean F0 of
utterance-final plateaus. Table 14.2 gives the results. The indexes appear to
be higher than the optimal values. (The one exception must probably be
attributed to the procedurally inevitable two-step optimizing.) They are
nearly optimal for F1 and F2, somewhat higher for M2, and quite a bit
higher for M1. The increase in da values with increasing number of targets
appears to be independent of Fr and, again, only a small improvement is
obtained with a length-dependent da. Finally, observe that this particular Fr
estimate does not result in speaker-independent da values.
Since the mathematical interdependence of Fr and da is of a nonlinear
nature, a different Fr might give speaker-independent da values. If we take a
value of approximately two-thirds of the endpoint average, the da obtained
Table 14.3 Fr estimates as "intonational locus" and optimal da values for a
length-dependent-da model and for a length-independent-da model, and the
indexes for the four speakers separately
              Length-dependent                     Length-independent
Speaker   Fr   da2   da3   da4   da5   Index       Fr    da    Index
F1       100  0.76  0.79  0.81  0.83    7.42      100   0.81    9.09
F2       100  0.72  0.74  0.77  0.79    7.55      100   0.77    9.03
M1        50  0.73  0.77  0.78  0.79   10.21       50   0.78   10.99
M2        50  0.74  0.78  0.81  0.82    7.26       50   0.81    9.21
Table 14.4 Values for Fr, dp, da, and the goodness-of-fit index for (a) the
optimal parameter combination, (b) the "endpoint average" Fr estimate, and
(c) the "intonational locus" Fr estimate, for four speakers separately
          (a) optimal               (b) "endpoint average"     (c) "intonational locus"
Speaker  Fr    dp    da  Index     Fr    dp    da  Index      Fr    dp    da  Index
F1      156  0.73  0.35   6.64    135  0.82  0.54   6.91     100  0.89  0.72   7.61
F2      160  0.78  0.29   6.07    150  0.81  0.37   6.16     100  0.90  0.63   7.82
M1       70  0.84  0.61   7.85     74  0.83  0.59   7.79      50  0.89  0.71   8.09
M2       90  0.83  0.09  10.13     65  0.88  0.34  11.13      50  0.90  0.46  12.72
compare the two downstep factors, dp and da. The speakers produced the
utterance at four prominence levels with five replications for each level,
reading them from a stock of twenty cards with a different random order for
each speaker.
The same procedure for digitizing, speech-file analysis, and F0 measurement
was followed. Since we needed to fit the H* targets in the first three
AD's, we collected 120 data points for each speaker. These are referred to as
Msmn, M standing for "measured," s indicating the replication (1 ≤ s ≤ 20), m
the AD' within the scaling domain (1 ≤ m ≤ 3), and n the target within the
AD' (1 ≤ n ≤ 2). With only H* targets (T = +1), N and W merge into a single
range parameter R (= N^W). For a given (fixed) value of Fr, the Rs for a
particular replication is estimated as Ms11/Fr. Sp = 1 and Sa = 1, so Psmn, the
predicted (P) pitch value for the nth H* target in the mth AD' in the sth
replication, is given by (13).
(13) Psmn = Fr * Rs^(dp^(m-1) * da^(n-1))
With Fr and the Rs fixed, we optimized dp and da using the same distance
measure as before. In a subsequent procedure, we set dp and da at the
optimal values obtained and optimized the Rs. We then calculated the
goodness-of-fit index.
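The two-factor prediction for this H*-only material can be sketched as follows, assuming (as in (6)) that the phrase and accent functions combine in the exponent. The parameter values are illustrative, loosely echoing the "intonational locus" estimates in table 14.4:

```python
def predict_two_factor(m, n, Fr, R, dp, da):
    """Predicted H* peak for the nth accent of the mth AD' in one
    replication: Fr * R**(dp**(m-1) * da**(n-1)), with R = N**W
    (T = +1, Sa = Sp = 1 throughout)."""
    return Fr * R ** (dp ** (m - 1) * da ** (n - 1))

# Illustrative parameters: Fr = 100, R = 2.0, dp = 0.89, da = 0.72
# (dp and da loosely echo speaker F1's "intonational locus" values):
for m in (1, 2, 3):
    for n in (1, 2):
        print(m, n, round(predict_two_factor(m, n, 100, 2.0, 0.89, 0.72), 1))
```

With dp > da, the first peak of each new AD' lands above the last downstepped peak of the preceding AD' but below that AD''s own first peak; this is reset as a register shift, as argued in 14.1.3.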
14.3.2.2 Results
As before, we ran optimizing procedures for a number of different Fr values
to obtain the best possible fit the model allows. We also optimized dp and da
for both the "endpoint average" and the "intonational locus" estimates of
Fr. Table 14.4 gives the results.
The increase in the index values (compared to the best attainable in table
14.4a) is somewhat smaller for the "endpoint average" estimate of Fr than
for the "intonational locus" estimate. The latter estimate has the advantage
that dp is apparently speaker-independent. However, for both estimates the
da values vary considerably across speakers. In fact, no Fr estimate led to
speaker-independent values for both dp and da, but a combination of the
"intonational locus" Fr, a dp of 0.90, and a da of 0.70 (as observed for F1 and
M1) would seem to be a reasonable choice to be used in a computer
implementation of our model.
For both Fr estimates and for all speakers, dp is larger than da, which
supports our view that phrasal downstep entails a smaller step down than
accentual downstep. Comparing the da values for the "locus" estimate across
the two experiments, we observe that here da is lower (and sometimes
considerably so) than in our study of accentual downstep. It would appear
that speakers tend to use a smaller da, so a larger step down, if they have to
realize accentual downstep in combination with phrasal downstep, possibly
in an attempt to keep the two as distinct as possible.
We conclude that (1) phrasal downstep can be modeled with a single
downstep factor which is independent of prominence, but that (2) the answer
to the question whether it is speaker-independent is determined by the way in
which the general pitch-range parameter is defined.
factor (dp is 0.90) as well as for the accentual downstep factor (da is 0.80 in a
single AD' and 0.70 if there are more AD's).
Comments on chapter 14
NINA GRØNNUM (formerly THORSEN)
Introduction
My comments concern only van den Berg, Gussenhoven, and Rietveld's
proposed analysis and model of Dutch intonation. As I am not a Dutch
speaker, and I do not have first-hand knowledge of data on Dutch inton-
ation, my comments are questions and suggestions which I would like
readers and the authors to consider, rather than outright denials of the
proposals. Nevertheless, it will be apparent from what follows that I think
van den Berg, Gussenhoven, and Rietveld's description obscures the most
important fact about accentuation in Dutch, and that it tends to misrepresent
the relevant difference between contours in some instances because it
disregards linguistic function (in a narrower as well as a wider sense). The
purported phonological analysis thus nearly reduces to a phonetic transcrip-
tion (though a broad one) and not always an adequate one at that, as far as I
can judge. To mute a likely protest from the outset: I am not generally
against trying to reduce rich phonetic detail to smaller inventories of
segments, features or parameters: it is the nature of van den Berg, Gussen-
hoven, and Rietveld's description and its relevance to a functional descrip-
tion that I question.
I begin with some general criticisms which bear upon the concrete
examples below. First, it is difficult to evaluate the adequacy of a description
which is based on examples, in the literal sense of the word, i.e. sample
utterances recorded once, by one speaker. Second, we are led to understand
that the various contours accompanying the same utterance are meaningfully
different, but we are not told in which way they are different, what kind of
difference in meaning is expressed, and whether or not Dutch listeners would
agree with the interpretation. Third, I miss an account of the perceived
prominence of the accented syllables in the examples, which might have been
relevant to the treatment of downstep. And fourth, in that connection I miss
some reflections about what accentual and phrasal downstep are for, what
functions they serve.
Accentuation
From the publications of Cohen, Collier, and 't Hart (e.g. 't Hart and Cohen
1973; 't Hart and Collier 1975; Collier and 't Hart 1981), I have understood
Dutch to epitomize the nature of accentuation: accented syllables are stressed
syllables which are affiliated with a pitch change.1 Beyond that - as far as I
can see - there are few restrictions, i.e. the change may be in either direction,
it may be quick or slow, it may be coordinated with either onset or offset of
the stressed syllable, and it may be bidirectional. Not all the logical
combinations of directions, timings, and slopes occur, I suppose, but many
of them do, as witnessed also by van den Berg, Gussenhoven, and Rietveld's
examples. Nor does a pitch change necessarily evoke an accent, as for
example when it is associated with unstressed syllables at certain boundaries.
From this freedom in the manifestation of accent, a rich variety of patterns
across multiaccented phrases and utterances arises.2
I would therefore like to enquire how far one could take a suggestion that
the underlying mechanism behind accentuation in Dutch is pitch change, and
that the particular choice of how pitch is going to change is a combination of
(1) restrictions at phrase and utterance boundaries, connected with utterance
function and pragmatics, (2) focus distribution, (3) degree and type of
emphasis, (4) syntagmatic restrictions (i.e. certain pitch patterns cannot
precede or follow certain other ones if pitch changes are to be effected and
boundary conditions met), and (5) speech style, i.e. factors outside the realm
of phonology/lexical representation. I realize, of course, that some of these
factors (especially speech style and pragmatics) are universally poorly
understood and I cannot blame van den Berg, Gussenhoven, and Rietveld
for not incorporating them in their model. I do think, however, that they
could have exploited to greater advantage the possibility of separating out
effects from utterance function and boundary conditions, and they could at
least have hinted at the possible effects of the two different elocutionary styles
involved in their examples (complete utterances versus lists of place and
1
Here and throughout I will use "pitch" to refer to both F0 and perceived pitch, unless I need to
be more specific.
2
Lest it be thought that I am implicitly comparing van den Berg, Gussenhoven, and Rietveld's
analysis to Cohen, Collier, and 't Hart's, and preferring the latter: this is not so. Even if the
latter's description in terms of movements is phonetically accurate, it would gain from being
further reduced to a specification of target values, pinning down the perceived pitch (pitch
level) of the accented syllables: many rapid local F0 changes are likely to be produced and
perceived according to their onsets or offsets, although we can (be trained to) perceive them as
movements, especially when they are isolated. However, Cohen, Collier, and 't Hart's purpose
and methodology (to establish melodic identity, similarity, categorization) apparently have
not motivated such a reduction. A further difficulty with them is that they have not committed
themselves to a specification of the functional/pragmatic circumstances under which specific
combinations of accent manifestations are chosen. (Collier [1989] discusses the reasons why.)
This latter is, however, equally true of van den Berg, Gussenhoven, and Rietveld.
proper names). As it is, they represent every contour in their examples as a
sequence of L*H or H*L pitch accents, with utterance final L% and H%
tones added, with conventions for representing slow and level movements,
and a downstep morpheme to characterize series of descending H*s.3
Low-pitched final accents
The final movement in e.g. figures 14.2a and 14.3b is rendered as H*L, i.e. the
stressed syllable is associated with a H in both instances. My own experience
with listening to speech while looking at F0 traces has made me strongly doubt
that the two are perceptually equivalent. I would expect the one in figure
14.3b to be perceived with a low-pitched stressed syllable. To check this, I
recorded two utterances which were to resemble as closely as possible the one
in figure 14.3b, globally and segmentally, but where one was produced and
perceived with a high-pitched, the other with a low-pitched, stressed syllable at
the end. The result is shown in figure 14.10. (Note that neither of these
contours is an acceptable Standard Danish contour; intonationally they are
nonsense.) The two traces look superficially alike, but the fall is timed
differently with respect to the stressed vowel, corresponding to the high
(figure 14.10a) and low (figure 14.10b) pitch level of the stressed syllable.
Figure 14.10b rather resembles figure 14.3b, from which I infer that it should
be given a L* representation. This would introduce a monotonal L* (or a
L*L) to the system.
The same suspicion attaches to the final !H* in figures 14.5a, b, and 14.7a. I
suspect that since van den Berg, Gussenhoven, and Rietveld have decided
that utterances can only terminate in H*LL% or L*HH% (or H*LH%), and
since figure 14.3b (and figures 14.5, 14.7a) is clearly not L*HH%, it must be
H*LL%. I return to this on pages 363-4 below.
I am, likewise, very skeptical about the reality of the assignment of !H*L to
kleren, schoenen, and school in figure 14.7b. I would expect the perceptually
salient part of such steep movements to correspond to the Fo offset rather
3
They do not mention tritonal pitch accents, and in the following I will assume that those
H*LHs which appear in Gussenhoven (1988) have been reinterpreted as L*LH%, which is
reasonable, considering his examples.
Figure 14.10 Two Dutch intonation contours imposed on a Danish sentence, with a final
stressed syllable that sounds high (upper plot) or low (lower plot). The Danish text is Lille
Morten vil have mere manna ("Little Morten wants more manna")
than the onset of the stressed vowel. Is that not what led to the assignment of
H*(L), rather than L*H, to mooiste, duurste (together with the acknowledg-
ment that the Fo rise is a consequence of the preceding low value on de and of
the syllabic structure with a voiced initial consonant and a long vowel, cf.
beste in the last phrase)? In other words, if the Fo rises in mooiste, duurste are
assigned tones in accordance with the offset of their Fo movements, why are
the falls in kleren, schoenen, school not treated in the same way? Even though
phonological assignment need not and should not exactly mirror phonetic
detail, I think these assignments are contradictory. (An utterance where all
the stressed syllables had short vowels in unvoiced obstruent surroundings,
like beste, might have been revealing.) What are the criteria according to
which, e.g. mooiste and kleren are variants of the same underlying pattern,
rather than categorically different?
Summary
I fundamentally agree with van den Berg, Gussenhoven, and Rietveld that
there is a sameness about the first pitch accent across the (a)s in figures
14.1-3, and likewise across the (b)s. I suggest, however, that this sameness applies
to all the accented syllables, and that the different physical manifestations are
forced by circumstances beyond the precincts of accentuation, and presuma-
bly also by pragmatic and stylistic factors which are not in the realm of
phonology at all.
Boundary tones
Van den Berg, Gussenhoven, and Rietveld state that they are uncertain
about the occurrence of the L%, but they suppose it to characterize the
termination of IPs (identical to utterances, in their examples). I find the
concept of boundary tone somewhat suspect, when it can lead them to state
that a T% copies the second T of the last pitch accent (if it is not pushed off
the board by the preceding T), except in attested cases of H*LH%. A
boundary tone proper should be an independently motivated and specified
characteristic, associated with demands for intonation-contour termination
connected to utterance function, in a broad syntactic and pragmatic sense. It
cannot be dictated by a preceding T*T; on the contrary, it can be conceived
of as being able to affect the manifestation of the preceding accent, as
suggested above.
14 Comments
[Fo contours in semitones for the Danish sentences Kofoed og Thorsen skal med
rutebilen fra Gudhjem til Snogebæk klokken fire på tirsdag (HP) and Kofoed og
Thorsen skal med rutebilen fra Tingler til Tønder klokken fire på tirsdag (PBP, n = 6)]
Figure 14.11 Three contours illustrating "final lowering" in the sense introduced in the text. In
each case the final fall from High to Low spans a considerably wider pitch range than the
preceding non-final falls. The top two contours are from regional varieties of standard Danish,
while the bottom contour is (northern) standard German; more detail can be found in Gronnum
(forthcoming), from which the figures are taken
Downstep
Accentual downstep
Could the lack of downstep in figures 14.2a and 14.3a be due to the second
accented syllable being produced and perceived as more prominent than the
first one? (See Gussenhoven and Rietveld's [1988] own experiment on the
perception of prominence of later accents with identical Fo to earlier ones.) If
uneven prominence does indeed account for an apparent lack of downstep
within prosodic phrases, there would be no need to introduce a downstep
morpheme, successive lowering of H*s would characterize any succession of
evenly prominent H*s in a phrase. The disruption of downstep would be a
matter of pragmatic and speech style effects (which I suspect to be at work in
figure 14.7b), not of morphology. In my opinion, the authors owe the reader
an account of the meaning of [!]: when does it and when does it not appear?
Van den Berg, Gussenhoven, and Rietveld find no evidence of downstep-
ping L*s. In fact, it appears from their examples that low turning points, be
they associated with stressed syllables or not, are constantly low during the
utterance, and against this low background the highs are thrown in relief and
develop their downward trend. This makes sense if the Ls are produced so as
to form a reference line against which the upper line (smooth or bumpy as the
case may be) can be evaluated - which is the concept of Garding's (1983)
model for Swedish, by which van den Berg, Gussenhoven, and Rietveld claim
to have been inspired. With such a view of the function of low tones, one
would not expect L*s to downstep, and one would look for the manifestation
of unequal prominence among L*s in the interval to a preceding or
succeeding H.
Phrasal downstep
In the discussion of how to conceive of and model resets, I think the various
options could have been seen in the light of the function resetting may serve.
If it is there for purely phonetic reasons, i.e. to avoid going below the bottom
of the speaking range in a long series of downstepping H*s, then it is
reasonable to expect that H*s after the first reset one would also be higher. If
resetting is there for syntactic or pragmatic reasons - to signal a boundary -
it would be logically necessary only to do something about the first
succeeding H*. However, I imagine that if consecutive H*s did not accom-
pany the reset one upwards, then the reset one would be perceived as being
relatively more prominent. To avoid that, the only way to handle phrasing in
this context is to adjust all the H*s in the unit. Yet it is doubtful whether the
phenomenon is properly labeled a shift of register, since the L*s do not
appear to be affected (cf. the phrase terminations in figures 14.6 and 14.7a). It
seems to be the upper lines only which are shifted.
Otherwise, I entirely endorse van den Berg, Gussenhoven, and Rietveld's
concept of a "wheels-within-wheels" model where phrases relate to each
other as wholes and to the utterance, because it brings hierarchies and
subordination back into the representation of intonation. In fact, their
"lowering model" (i.e. one which lowers successive phrase-initial pitch
accents [to which accentual downstep then refers], rather than one which
treats phrasal onsets as a step up from the last accent in the preceding phrase)
is exactly what I suggested a "tone sequence" representation of intonation
would require in order to accommodate the Danish data on sentences in a
text: "Consecutive lowering of individual clauses/sentences could be handled
by a rule which downsteps the first pitch accent in each component relative to
the first pitch accent in the preceding one" (Thorsen 1984b: 307). However,
data from earlier studies (Thorsen 1980b, 1981, 1983) make it abundantly
clear that the facts of accentual downstep within individual phrases in an
utterance or text are not as simple as van den Berg, Gussenhoven, and
Rietveld represent them.
Conclusion
I have been suggesting that perhaps Dutch should not be treated along the
same lines as tone languages or word-accent languages, or even like English.
Perhaps the theoretical framework does not easily accommodate Dutch data;
forcing Dutch intonation into a description in terms of sequences of
categorically different, noninteracting pitch accents can be done only at the
expense of phonetic (speaker/listener) reality. But even accepting van den
Berg, Gussenhoven, and Rietveld's premise, I think their particular solution
can be questioned on the grounds that it brings into the realm of prosodic
phonology phenomena that should properly be treated as modifications due
to functions at other levels of description.
Modeling syntactic effects on downstep in Japanese
HARUO KUBOZONO
15.1 Introduction
One of the most significant findings about Japanese intonation in the past
decade or so has been the existence of downstep*. At least since the 1960s, the
most widely accepted view had been that pitch downtrend is essentially a
phonetic process which occurs as a function of time, more or less indepen-
dently of the linguistic structure of utterances (see e.g. Fujisaki and Sudo
1971a). Against this view, Poser (1984) showed that downtrend in Japanese is
primarily due to a downward pitch register shift ("catathesis" or "down-
step"), which is triggered by (lexically given) accents of minor intonational
phrases, and which occurs iteratively within the larger domain of the so-
called major phrase. The validity of this phonological account of downtrend
has subsequently been confirmed by Beckman and Pierrehumbert (Beckman
and Pierrehumbert 1986; Pierrehumbert and Beckman 1988) and myself
(Kubozono 1988a, 1989).1
Consider first the pair of examples in (1).
(1) a. uma'i nomi'mono "tasty drink"
b. amai nomi'mono "sweet drink"
The phrase in (la) consists of two lexically accented words while (lb)
*The research reported on in this paper was supported in part by a research grant from the
Japanese Ministry of Education, Science, and Culture (no. 01642004), Nanzan University Pache
Research Grant IA (1989) and travel grants from the Japan Foundation and the Daiko
Foundation. I am grateful to the participants in the Second Conference on Laboratory
Phonology, especially the discussants, whose valuable comments led to the improvement of this
paper. Responsibility for the views expressed is, of course, mine alone.
1
It must be added in fairness to Fujisaki and his colleagues that they now account for
phenomena analogous to downstep by positing an "accent-level rule," which resembles
McCawley's (1968) "accent reduction rule" (see Hirose et al. 1984; Hirose, Fujisaki, and
Kawai 1985).
15 Haruo Kubozono
consists of an unaccented word (of which Tokyo Japanese has many) and an
accented word. Downstep in Japanese looks like figures 15.1a and 15.2 (solid
line), where an accented phrase causes the lowering of pitch register for
subsequent phrases, accented and unaccented alike, in comparison with the
sequences in which the first phrase is unaccented (i.e. figures 15.1b and 15.2,
dotted line). The effect of downstep can also be seen from figure 15.3, which
shows the peak values of the second phrase as a function of those of the first.
The reader is referred to Kubozono (1989) for an account of the difference in
the height of the first phrases. Downstep thus defined is a rather general
intonational process in Japanese, where such syntactic information as
category labels is essentially irrelevant.
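The shift just described can be given a rough numerical rendering. The following sketch is my own illustration, not the authors' model: downstep is treated as an iterative compression of the register above a fixed baseline, triggered only by accented minor phrases; the topline, baseline, and compression factor are invented for the example.

```python
def peak_values(accented, topline=160.0, baseline=100.0, factor=0.75):
    """Return an illustrative peak value (Hz) for each minor phrase.

    `accented` lists, per minor phrase, whether it bears a lexical accent.
    Each accented phrase compresses the register above the baseline by
    `factor` for all subsequent phrases, accented and unaccented alike.
    """
    peaks = []
    register = topline - baseline        # current span above the baseline
    for is_accented in accented:
        peaks.append(baseline + register)
        if is_accented:                  # only accents trigger downstep
            register *= factor
    return peaks

# (1a) uma'i nomi'mono: both phrases accented -> second peak downstepped
print(peak_values([True, True]))       # [160.0, 145.0]
# (1b) amai nomi'mono: first phrase unaccented -> no downstep
print(peak_values([False, True]))      # [160.0, 160.0]
```

The two calls mimic the contrast between figures 15.1a and 15.1b: only when the first phrase is accented is the second realized in a lowered register.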
Previous studies of downstep in Japanese have concentrated on the scaling
of the peak values of phrases, or the values of High tones, which constitute
the ceiling of the pitch register. There is, on the other hand, relatively little
work on the scaling of valley values or Low tones, which supposedly
constitute the bottom of the register. This previous approach to the modeling
of downstep can be justified for the most part by the fact, reported by Kubozono
(1988a), that the values for Low tones vary considerably depending on
factors other than accentual structure.
Downstep seems to play at least two roles in the prosodic structure of
Japanese. It has a "passive" function, so to speak, whereby its absence signals
a sort of pragmatic "break" in the stream of a linguistic message (see
Pierrehumbert and Beckman 1988; Fujimura 1989a). A second and more
active role which this intonational process plays is to signal differences in
syntactic structure by occurring to different degrees in different syntactic
configurations. As we shall see below, however, there is substantial disagree-
ment in the literature as to the details of this syntax-prosody interaction, and
it now seems desirable to consider more experimental evidence and to
characterize this particular instance of syntax-prosody interaction in the
whole prosodic system of Japanese. With this background, this paper
addresses the following two related topics: one is an empirical question of
how the downward pitch shift is conditioned by syntactic structure, while the
other is a theoretical question of how this interaction can or should be
represented in the intonational model of the language.
Figure 15.1 (a) A typical pitch contour for phrase (la), pronounced in the frame sorewa ... desu
("It is ..."); (b) A typical pitch contour for phrase (lb), pronounced in the frame sorewa ... desu
("It is ...")
Figure 15.2 Schematic comparison of the pitch patterns of (la)-type utterances (solid line) and
(lb)-type utterances (dotted line), plotted on the basis of averaged values at peaks and valleys
(2) a. [A B] -> AB
b. [[A B] C] -> ABC
c. [A [B C]] -> A/BC
d. [[A [B C]] D] -> A/BCD
e. [A [B [C D]]] -> A/B/CD
Examples of such prosodic rules include (a) lexical rules like compound
formation ("compound accent rules"; see Kubozono 1988a, b) and sequen-
tial voicing rules also characteristic of the compound process (Sato n.d.; Otsu
1980), and (b) phrasal rules such as the intonational phrasing process
generally known as "minor phrase formation" (see Fujisaki and Sudo 1971a;
Kubozono 1988a).
Given that Japanese shows a left-right asymmetry in these prosodic
processes, it is natural to suspect that the right-branching structure shows a
similar marked behavior in downstep as well. The literature, however, is
divided on this matter. Poser (1984), Pierrehumbert and Beckman (1988),
and Kubozono (1988a, 1989), on the one hand, analyzed many three-phrase
right-branching utterances and concluded that the register shift occurs
between the first two phrases as readily as it occurs between the second and
third phrases in this sentence type: the second phrase is "phonologically
downstepped" relative to the first phrase in the sense that it is realized at a
lower level when preceded by an accented phrase than when preceded by an
unaccented phrase.
On the other hand, there is a second group who claim that downstep is
blocked between the first two phrases in (at least some type of) the right-
branching structure. Fujisaki (see Hirose, Fujisaki, and Kawai 1985), for
instance, claims that the "accent level rule," equivalent to our downstep rule
(see note 1), is generally blocked between the first two phrases of the right-
branching utterances. Fujisaki's interpretation is obviously based on the
observation that three-phrase right-branching utterances generally show a
markedly higher pitch than left-branching counterparts in the second com-
ponent phrase, a difference which Fujisaki attributes to the difference in the
occurrence or absence of downstep in the relevant position (see figure 15.4
below). Selkirk and Tateishi (1988a, b) outline a similar view, although they
differ from Fujisaki in assuming that downstep is blocked only in some type
of right-branching utterances. They report that downstep is blocked in the
sequences of [Noun-«o [Noun-H0 Noun]] (where no is a relativizer or genitive
particle) as against those of [Adjective [Adjective Noun]], taking this as
evidence for the notion of "maximal projection" in preference to the
generalization based on branching structure (see Selkirk 1986 for detailed
discussion). Selkirk and Tateishi further take this as evidence that the "major
phrase," the larger intonational phrase defined as the domain of downstep,
can be defined in a general form by this new concept.
With a view to reconciling these apparently conflicting views in the
literature, I conducted experiments in which I analyzed utterances of the two
different syntactic structures made by two speakers of Tokyo Japanese (see
Kubozono 1988a and 1989 for the details of the experimental design and
statistical interpretations). In these experiments various linguistic factors
such as the accentual properties and phonological length of component
elements were carefully controlled, as illustrated in (3); the test sentences
included the two types of right-branching utterances such as those in (4) and
(5) which, according to Selkirk and Tateishi, show a difference in downstep.
(3) a. [[ao'yama-ni a'ru] daigaku]
Aoyama-in exist university
"a university in Aoyama"
b. [ao'yama-no [a'ru daigaku]]
Aoyama-Gen certain university
"a certain university in Aoyama"
(4) a. [a'ni-no [me'n-no eri'maki]]
brother-Gen cotton-Gen muffler
"(my) brother's cotton muffler"
b. [ane-no [me'n-no eri'maki]]
"(my) sister's cotton muffler"
(5) a. [ao'i [o'okina eri'maki]]
"a blue, big muffler"
b. [akai [o'okina eri'maki]]
"a red, big muffler"
The results obtained from these experiments can be summarized in three
points. First, they confirmed the view that the two types of branching
structure exhibit distinct downtrend patterns, with the right-branching
utterances generally showing a higher pitch for their second elements than
the left-branching counterparts, as illustrated in figure 15.4.3 The distribu-
tion of the values for Peak2(P2), shown in figure 15.5, does not hint that the
averaged pitch values for this parameter represent two distinct subgroups of
tokens for either of the two types of branching structure (see Beckman and
Pierrehumbert, this volume). Second, it was also observed that the pitch
difference very often propagates to the third phrase, as illustrated in figure
15.6, suggesting that the difference in the second phrases represents a
difference in the height of pitch register, not a difference in local prominence.
Third and most important, the experimental results also confirmed the
observation previously made by Poser, Pierrehumbert and Beckman, and
3
Analysis of the temporal structure of the two patterns has revealed no significant difference,
suggesting that the difference in the pitch height of the second phrase is the primary prosodic
cue to the structural difference.
Figure 15.4 Schematic comparison of the pitch patterns of (3a) (dotted line) and (3b) (solid
line), plotted on the basis of averaged values at peaks and valleys
myself with respect to the height of the second component phrases. Figures 15.7 and 15.8
show such a comparison of the pairs in (4) and (5) respectively, where the
peak values of the second phrase are plotted as a function of those of the first
phrase. The results in these figures reveal that the second phrase is realized at
a lower pitch level when preceded by an accented phrase than when preceded
by an unaccented phrase. Noteworthy in this respect is the fact that the two
types of right-branching constructions discriminated by Selkirk and Tateishi
are equally subject to downstep and, moreover, show no substantial differ-
ence from each other in downstep configuration. In fact, the effect of
downstep was observed between the first two phrases of right-branching
utterances irrespective of whether the component phrases were a simple
adjective or a Noun-no sequence, suggesting that at least as far as the results
of my experiments show, it is the notion of branching structure and not that
of maximal projection that leads to a linguistically significant generalization
concerning the occurrence of downstep; in the absence of relevant data and
information, it is not clear where the difference between Selkirk and
Tateishi's experimental results and mine concerning the interaction between
syntax and downstep comes from - it may well be attributable to the factor
of speaker strategies discussed by Beckman and Pierrehumbert (this
volume).
The observation that there are two distinct downstep patterns and that
they can be distinguished in terms of the branching structure of utterances is
Figure 15.7 Distribution of P1 and P2 for the two sentences in (4): P2 values as a function of P1
values in utterances of (4a) (circles) and (4b) (squares)
Figure 15.8 Distribution of P1 and P2 for the two sentences in (5): P2 values as a function of P1
values in utterances of (5a) (circles) and (5b) (squares)
Figure 15.9 Schematic pitch contour of (6a)-type utterances, plotted on the basis of averaged
values at peaks and valleys
Figure 15.10 Distribution of P1 and P3 for the three sentences in (6): P3 values as a function of
P1 values in utterances of (6a) (circles), (6b) (squares) and (6c) (crosses)
Figure 15.11 Effect of metrical boost in (3b)-type utterances: basic downstep pattern (dotted
line) and downstep pattern on which the effect of metrical boost is superimposed (solid line)
triggers the downward register shift. Under this analysis, it can be under-
stood that the downstepped (i.e. third) phrase has been raised by the phonetic
realization rule of metrical boost to such an extent that it is now realized
higher than the second minor phrase (fig. 15.12). This case is a typical
example where the syntactically induced pitch boost modifies the phonologi-
cally defined downstep pattern. Syntactically more complex utterances show
further complex patterns in downstep as shown in Kubozono (1988a), but all
these patterns can be described as interactions of downstep and metrical
boost, the two rules which control the shifting of pitch register in two
opposite directions.
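On this description, the two rules can be combined in a single walk over the minor phrases. The sketch below is my own reading with invented parameters, not Kubozono's formalization: each accent compresses the register, while a metrical boost at a designated left edge expands it again, so that a downstepped phrase can surface higher than its predecessor, as in figure 15.12.

```python
def register_track(accented, boost_at, top=60.0, down=0.75, boost=1.5):
    """Register span (Hz above an assumed baseline) at each minor phrase.

    accented[i] -- phrase i bears an accent and downsteps later phrases;
    boost_at[i] -- a metrical boost applies at the left edge of phrase i.
    All numeric factors are illustrative assumptions.
    """
    spans, register = [], top
    for i, is_accented in enumerate(accented):
        if boost_at[i]:
            register *= boost     # upward register shift (metrical boost)
        spans.append(register)
        if is_accented:
            register *= down      # downward register shift (downstep)
    return spans

# Four accented phrases with a boost at the third, as in the (6a)/(7a) case:
# the downstepped third phrase ends up higher than the second.
print(register_track([True] * 4, [False, False, True, False]))
# [60.0, 45.0, 50.625, 37.96875]
```

With a boost factor larger than the reciprocal of the downstep factor, the third span exceeds the second even though downstep has applied to it, which is the configuration the figure illustrates.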
The notion underlying the rule of metrical boost is supported by yet
another piece of evidence. In previous studies of Japanese intonation, it is
reported that sudden pitch rises occur at major syntactic boundaries such as
sentence, clause, and phrase boundaries. These "juncture phenomena" have
been explained by way of the "resetting" of pitch register or other analogous
machinery in intonational models (see Han 1962; Hakoda and Sato 1980;
Uyeno et al. 1981; Hirose et al. 1984). In the sentences given in (8),
for example, remarkable degrees of pitch rise reportedly occur at the
beginning of the fourth phrase in (8a) and the second phrases in (8b) and
(8c).
Figure 15.12 Effect of metrical boost in (6a)/(7a)-type utterances: basic downstep pattern
(dotted line) and downstep pattern on which the effect of metrical boost is superimposed (solid
line)
syntactic structure. Given the orthodox view that intonational representation
is the only source of information available for intonational (i.e. phonetic
realization) rules to derive correct surface pitch patterns, it follows that the
intonational structure itself is hierarchically organized, at least to such an
extent that left-branching and right-branching structures can be readily
differentiated in the domain where downstep is defined.
If we define the major phrase (MP) as the domain where downstep occurs
between adjacent minor phrases (m.p.), the conventional intonational model
can be illustrated in (10). Obviously, this model is incapable of describing the
kind of syntactic information under consideration.
(10) [Utterance [MP m.p. m.p. m.p.]]

(11) a. [Utterance [MP [IP m.p. m.p.] [IP m.p.]]]
     b. [Utterance [MP [IP m.p.] [IP m.p. m.p.]]]
     c. [Utterance [MP [IP m.p. m.p.] [IP m.p. m.p.]]]
4 "IP" must not be confused with the "intermediate phrase" posited by Beckman and
Pierrehumbert, which they define as the domain of downstep and hence corresponds to our
"major phrase."
(11a) and (11b) are the representations assigned to the two types of three-
phrase sentence in (3a) and (3b) respectively, while (11c) is the representation
assigned to the symmetrically branching four-phrase sentences in (6a)/(7a).
Given the representations in (11), it is possible to account for the syntactic
effect on downstep in two different ways: either we assume that the rule of
metrical boost raises pitch register upwards at the beginning of IPs that do
not begin a MP, or we assume, as proposed by van den Berg, Gussenhoven,
and Rietveld (this volume), that downstep occurs to varying degrees depend-
ing upon whether it occurs within an IP or over two IPs, to a lesser degree in
the latter case than in the former.
Of these two analyses based on the model in (11), the first analysis falls into
several difficulties. One of them is that the motivation of positing the new
intonational phrase (level) lies in and only in accounting for the syntactic
effect upon downstep. Moreover, if this analysis is applied to syntactically
more complex sequences of phrases, it may indeed eventually end up
assuming more than one such additional phrase between the minor phrase
and the MP. If this syntactic effect on downstep can be handled by some
other independently motivated mechanism, it would be desirable to do
without any additional phrase or level. A second and more serious problem
arises from the characterization of IP as the trigger of metrical boost. It has
been argued that metrical boost is a general principle of register shift whose
effects can be defined on an n-ary and not binary basis. If we define
occurrence of metrical boost with reference to "IP," we would be obliged to
posit more than one mechanism for upward pitch register shifts, that of MB,
which applies within the major phrase, and that of the conventional reset
rule, which applies at the beginning of every utterance-internal MP. This is
illustrated in a hypothetical intonational representation in (12).
(12) Utterance
problem with the revised representation. That is, this analysis assumes two
types of the intermediate phrase, MP-initial IPs which do not trigger the
boost and MP-internal IPs which do trigger it.
Similarly, the analysis proposed by van den Berg, Gussenhoven, and
Rietveld (this volume) poses several difficult problems. While this analysis
may dispense with the principle of metrical boost as far as downstep is
concerned, it falls into the same difficulties just pointed out. Specifically, it
fails to capture the general nature of the upstep principle which can be
defined on the basis of syntactic constituency, requiring us instead to posit
either the conventional rule of register reset or a third type of downstep in
order to account for the upward register shift occurring at the beginning of
every utterance-internal major phrase (see (12)).
If the revised model illustrated in (11) is disfavored because of these
problems, the only way to represent the difference of syntactic structure in
intonational representation will be to take a substantially different approach
to intonational phrasing by abandoning the generally accepted hypothesis
that intonational structure is n-ary branching. Noteworthy in this regard is
the recursive model proposed by Ladd (1986a), in which right-branching and
left-branching structures can be differentiated in a straightforward manner
by a binary branching recursive mechanism. Under this approach, the two
types of pitch pattern in figure 15.4 can be represented as in (13a) and (13b)
respectively, and the four-phrase pattern in figure 15.9 as in (13c).
(13) a. [Utterance [MP [MP m.p. m.p.] m.p.]]
     b. [Utterance [MP m.p. [MP m.p. m.p.]]]   (Upstep at the embedded MP)
     c. [Utterance [MP [MP m.p. m.p.] [MP m.p. m.p.]]]   (Upstep at the second embedded MP)
This new approach can solve all the problems with the conventional
approach. That is, it is not necessary to posit any additional intonational
level/phrase which lacks an independent motivation. Nor is it necessary to
postulate more than one mechanism for upstep phenomena because occur-
rence and magnitudes of upsteps can be determined by the prosodic
constituency rather than by intonational category labels, as illustrated in
(14).
(14) Utterance
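This determination of upsteps by prosodic constituency can be given a toy rendering. The sketch below is my own, under stated assumptions rather than Ladd's or Kubozono's implementation: a binary tree of minor phrases is written as nested tuples, downstep applies at every phrase boundary, and one upstep applies at the first phrase of each constituent that does not begin the whole tree; the numeric factors are invented.

```python
def boost_counts(tree):
    """Flatten a binary tree of minor phrases (nested 2-tuples of strings).

    Returns (leaves, counts), where counts[i] is the number of branching
    constituents whose first minor phrase is leaf i; the constituent that
    begins the whole tree is ignored, since nothing precedes it.
    """
    leaves, counts = [], []

    def walk(node):
        start = len(leaves)
        if isinstance(node, str):
            leaves.append(node)
            counts.append(0)
        else:
            for daughter in node:
                walk(daughter)
            if start != 0:            # non-initial constituent: upstep site
                counts[start] += 1

    walk(tree)
    return leaves, counts

def registers(tree, down=0.5, up=1.5):
    """Relative register for each minor phrase: downstep at every phrase
    boundary, scaled up once per constituent that opens there."""
    leaves, counts = boost_counts(tree)
    value, out = 1.0, []
    for i, n in enumerate(counts):
        if i > 0:
            value *= down * up ** n
        out.append(value)
    return leaves, out

# Right-branching [A [B C]] vs. left-branching [[A B] C] (cf. figure 15.4):
print(registers(("A", ("B", "C")))[1])   # [1.0, 0.75, 0.375]
print(registers((("A", "B"), "C"))[1])   # [1.0, 0.5, 0.25]
```

In this sketch the register difference at the second phrase of the right-branching tree propagates to the third phrase, and a symmetric four-phrase tree receives its single upstep at the third phrase, in line with the patterns in figures 15.6 and 15.9.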
14 and 15 Comments
commentary in this section), but the evidence presented in this paper suggests
that the new approach is at least worth exploring in more depth.
experiments would be due to the typical choice of a lower pitch range for the
second phrase of the utterance, reflecting the discourse structure of the
mini-paragraph.
A criticism that has been raised against our characterization in point 3 is
that it introduces too many degrees of freedom. Ladd (1990), for example,
has proposed that instead of independent paradigmatic choices of pitch
range for each phrase and of tonal prominence for each accent, there is only
the limited choice of relative pitch registers that can be represented in binary
branching trees. Kubozono in his paper finds this view attractive and adapts
it to the specification of pitch registers for Japanese major and minor
phrases. Such a phonological characterization may seem in keeping with
results of experiments such as the one that Liberman and Pierrehumbert
(1984) describe, where they had subjects produce sentences with two inton-
ation phrases that were answer and background clauses, and found a very
regular relationship in the heights of the two nuclear accent peaks across ten
different levels of overall vocal effort. Indeed, results such as these are so
reminiscent of the preservation of stress relationships under embedding that
it is easy to see why Ladd wants to attribute the invariant relationship to a
syntagmatic phonological constraint on the pitch range values themselves,
rather than to the constant relative pragmatic saliences.
However, when we consider more closely the circumstances of such results,
this criticism is called into question. The design of Liberman and Pierrehum-
bert's (1984) experiment is typical in that it encouraged the subjects to zero in
on a certain fixed pragmatic relationship - in that case, the relationship of an
answer focus to a background focus. The constant relationship between the
nuclear peak heights for these two foci may well reflect the subject's uniform
strategy for realizing this constant pragmatic relationship. In order to
demonstrate a syntagmatic phonological constraint, we would need to show
that the peak relationships are constant even when we vary the absolute
pragmatic salience of one of the focused elements.
The analogy to stress relationships also fails under closer examination in
that the purely syntagmatic characterization of stress is true only in the
abstract. A relational representation of a stress pattern predicts any number
of surface realizations, involving many paradigmatic choices of different
prominence-lending phonological and phonetic features. For example, the
relatively stronger second syllable of red roses relative to the first might be
realized by the greater prominence of a nuclear accent relative to a prenuclear
accent (typical of the citation form), or by a bigger pitch range for the second
of two nuclear accents (as in a particularly emphatic pronunciation that
breaks the noun phrase into two intermediate phrases), or by the greater
prominence of a prenuclear accent relative to no accent (as in a possible
pronunciation of the sentence The florist's red roses are more expensive).
Similarly, a weak-strong pragmatic relationship for the two nouns in Anna
came with Manny can be realized as a particular choice of pitch ranges for
two intonational phrases, or as the relative prominence of prenuclear versus
nuclear pitch accent if the speaker chooses to produce the sentence as one
intonation phrase. As Jackendoff (1972), Carlson (1983), and others have
pointed out, the utterance in this case has a somewhat different pragmatic
interpretation due to Anna's not being a focus, although Anna still is less
salient pragmatically than Manny.
The possibility of producing either two-foci or one-focus renditions of this
sentence raises an important strategic issue. Liberman and Pierrehumbert
(1984) elicited two-foci productions by constructing a suitable context frame
and by pointing out the precise pragmatic interpretation while demonstrat-
ing the desired intonation pattern. If they had not taken care to do this, some
of their subjects might have given the other interpretation and produced the
other intonation for this sentence, making impossible the desired comparison
of the two nuclear-accent peaks. A more typical method in experiments on
pitch range, however, is to present the subject with a randomized list of
sentences to read without providing explicit cues to the desired pragmatic
and intonational interpretation. In this case, the subject will surely invent an
appropriate pragmatic context, which may vary from experiment to exper-
iment or from utterance to utterance in uncontrolled ways. The effect of this
uncontrolled variation is an uncontrolled influence on the phrasal
pitch ranges and on the prominences of pitch accents within a pitch range.
The variability of results concerning the interaction of syntax and downstep
in the literature on Japanese (e.g., Kubozono 1989, this volume; Selkirk
1990; Selkirk and Tateishi, 1990) may reflect this lack of control more than it
does anything about the interaction between syntactic constituency and
downstep.
The fact that a sentence can have more than one pragmatic interpretation
also raises a methodological point about statistics: before we can use
averages to summarize data, we need to be sure that the samples over which
we are averaging are homogeneous. For example, both in Poser (1984) and in
Pierrehumbert and Beckman (1988), there were experimental results which
could be interpreted as showing that pragmatic focus reduces but does not
block downstep. When we looked at our own data more closely, however, we
found that the apparent lesser downstep was actually the result of a single
outlier in which the phrasing was somewhat different and downstep had
occurred. Including this token in the average made it appear as if the
downstep factor could be chosen paradigmatically in order to give greater
pitch height than normal to prosodic constituents with narrow focus. The
unaveraged data, however, showed that the interaction is less direct; elements
bearing narrow focus tend to be prosodically separated from preceding
elements and thus are realized in pitch ranges that have not been down-
stepped relative to the pitch range of preceding material. It is possible that
Kubozono could resolve some of the apparent contradictions among his
present results and those of other experiments in the literature on Japanese if
he could find appropriate ways of looking at all of the data token by token.
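The arithmetic of the averaging problem can be made concrete. The figures below are invented for illustration only (they are not data from Poser 1984 or Pierrehumbert and Beckman 1988): suppose downstep normally scales the focused peak to about 0.7 of the preceding peak, and one outlier token with different phrasing has a ratio near 1.0.

```python
# Hypothetical peak ratios (focused H / preceding H) for six tokens.
# All values are invented for illustration; they are not from either study.
downstepped = [0.70, 0.68, 0.72, 0.71, 0.69]  # five tokens with downstep applied
outlier = [1.02]                               # one token phrased differently

tokens = downstepped + outlier
mean_ratio = sum(tokens) / len(tokens)

# The average suggests a "weakened" downstep factor of about 0.75 ...
print(f"averaged ratio: {mean_ratio:.2f}")  # prints: averaged ratio: 0.75

# ... but token-by-token inspection shows two distinct, internally
# homogeneous subsamples rather than one reduced downstep.
print("token by token:", sorted(tokens))
```

A single heterogeneous token is thus enough to make averaged data mimic a paradigmatically chosen, reduced downstep factor.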
The specific tactical lesson to draw here is that since our understanding of
pragmatic structure and its relationship to phrasing and tone-scaling is not as
well developed as our understanding of phonological structure and its
interpretation in Fo variation, we need to be very cautious about claiming
from averaged data that downstep occurs to a greater or lesser degree in
some syntactic or rhythmic context. A more general tactical lesson is that we
need to be very ingenious in designing our experiments so as to elicit
productions from our subjects that control all of the relevant parameters. A
major strategic lesson is that we cannot afford to ignore the knotty questions
of semantic and pragmatic representation that are now puzzling linguists
who work in those areas. Indeed, it is possible that any knowledge concern-
ing prosodic structure and prominence that we can bring to these questions
may advance the investigative endeavor in previously unimagined ways.
Returning now to assumptions about control mechanisms, there is another
topic touched on in the paper by van den Berg, Gussenhoven, and Rietveld
which also raises a strategic issue of major importance to future progress in
our understanding of tone and intonation. This is the question of how to
model L tones and the bottom of the pitch range. Modeling the behavior of
tones in the upper part of the pitch range - the H tones - is one of the success
stories of laboratory phonology. The continuing controversies about many
details of our understanding (evident in the two papers in this section) should
not be allowed to obscure the broad successes. These include the fact that
[+H] is the best-understood distinctive-feature value. While work in speech
acoustics has made stunning progress in relating segmental distinctive
features to dimensions of articulatory control and acoustic variation, the
exact values along these dimensions which a segment will assume in any
given context in running speech have not been very accurately modeled. In
contrast, a number of different approaches to H-tone scaling have given rise
to Fo synthesis programs which can generate quite accurately the contours
found in natural speech. Also, work on H-tone scaling has greatly clarified
the division of labor between phonology and phonetics. In general, it has
indicated that surface phonological representations are more abstract than
was previously supposed, and that much of the burden of describing sound
patterns falls on phonetic implementation rules, which relate surface phono-
logical representations to the physical descriptions of speech. Moreover,
attempts to formulate such rules from the results of appropriately designed
experiments have yielded insights into the role of prosodic structure in speech
14 and 15 Comments
production. They have provided additional support for hierarchical struc-
tures in phonology, which now stand supported from both the phonetic and
morphological sides, a fate we might wish on more aspects of phonological
theory.
In view of these successes, it is tempting to tackle L tones with the same
method that worked so well for H tones - namely, algebraic modeling of the
Fo values measured in controlled contexts. Questions suggested under this
approach include: What functions describe the effects of overall pitch range
and local prominence on Fo targets for L tones? What prevents L tones from
assuming values lower than the baseline? In downstep situations, is the
behavior of L tones tied to that of H tones, and if so, by what function?
We think it is important to examine the assumptions that underlie these
questions, particularly the assumptions about control mechanisms. We
suggest that it would be a strategic error to apply too narrowly the precedents
of work on H-tone scaling.
Looking first at the physiological control, we see that L-tone scaling is
different from H-tone scaling. A single dominant mechanism, cricothyroid
contraction, appears to be responsible for H tone production, in the sense
that this is the main muscle showing activity when Fo rises into an H tone. In
contrast, no dominant mechanism for L tone production has been found.
Possible mechanisms include:
Cricothyroid relaxation - e.g., Simada and Hirose (1971), looking at the
production of the initial boundary L in Tokyo Japanese; Sagart et al.
(1986), looking at the fourth (falling) tone in Mandarin.
Reduction of subglottal pressure - e.g., Monsen, Engebretson, and Vermula
(1978), comparing L and H boundary tones.
Strap muscle contraction - e.g., Erickson (1976), looking at L tones in Thai;
Sagart et al. (1986), looking at the third (low) tone in Mandarin;
Sugito and Hirose (1978), looking at the initial L in L-initial words
and the accent L in Osaka Japanese; Simada and Hirose (1971) and
Sawashima, Kakita, and Hiki (1973), looking at the accent L in
Tokyo Japanese.
Cricopharyngeus contraction - Honda and Fujimura (1989), looking at L
phrase accents in English.
Some of these mechanisms involve active contraction, whereas others involve
passive relaxation. There is some evidence that the active gesture of strap
muscle contraction comes into play only for L tones produced very low in the
pitch range. For example, of the four Mandarin tones, only the L of the third
tone seems to show sternohyoid activity consistently (see Sagart et al. 1986).
Similarly, the first syllable of L-initial words in Osaka Japanese shows a
marked sternohyoid activity (see Sugito and Hirose 1978) that is not usually
observed in the higher L boundary tone at the beginning of Tokyo Japanese
accentual phrases (see, e.g., Simada and Hirose 1971). Lacking systematic
work on the relation of the different mechanisms to different linguistic
categories, we must entertain the possibility that no single function controls
L-tone scaling. Transitions from L to H tones may bring in several
mechanisms in sequence, as suggested in Pierrehumbert and Beckman (1988).
One tactical implication of the different mechanisms is that we need to be
more aware of the physiological constraints on transition shape between
tones; we should not simply adopt the most convenient mathematical
functions that served us so well in H-tone scaling models.
Another common assumption that we must question concerns the func-
tional control of the bottom of the pitch range. We need to ask afresh the
question: Is there a baseline? Does the lowest measured value at the end of an
utterance really reflect a constant floor for the speaker, which controls the
scaling of tones above it, and beyond which the speaker does not aim to
produce, nor the hearer to perceive, any L tone?
Tone-scaling models have gained a great deal by assuming a baseline,
on the basis of the common observation that utterance-final L values are
stable for each speaker, regardless of pitch range. On the other hand, it is not
clear how to reconcile this assumption with the observation that nuclear L
tones in English go up with voice level (see e.g. Pierrehumbert 1989). This
anomaly is perturbing because it is crucial that we have accurate measures of
the L tones; estimates of the baseline from H-tone scaling are quite unstable
in the sense that different assumptions about the effective floor can yield
equally good model fits to H-tone data alone. The assumption that the
bottom of the pitch range is controlled via a fixed baseline comes under
further suspicion when we consider that the last measured Fo value can be at
different places in the phrasal contour, depending on whether and where the
speaker breaks into vocal fry or some other aperiodic mode of vocal-fold
vibration. It is very possible that the region past this point is intended as, and
perceived as, lower than the last point where Fo measurement is possible.
A third assumption that relates to both the physiological and the func-
tional control of L tones concerns the nature of overall pitch-range variation.
It has been easy to assume in H-tone modeling that this variation essentially
involves the control of Fo. Patterns at the top of the range have proved
remarkably stable at the different levels of overall Fo obtained in exper-
iments, allowing the phenomenon to be described with only one or two
model parameters.
We note, however, that the different Fo levels typically are elicited by
instructing the subject to "speak up" to varying degrees. This is really a
variation of overall voice effort, involving both an increased subglottal
pressure and a more pressed vocal-fold configuration. It seems likely,
therefore, that the actual control strategy is more complicated than our H-
tone models make it. While the strategy for controlling overall pitch range
interacts with the physiological control of H tones in apparently simple ways,
its interaction with the possibly less uniform control mechanism for L tones
may yield more complicated Fo patterns. In order to find the invariants in this
interaction, we will probably have to obtain other acoustic measures besides
Fo to examine the other physiological correlates of increased pitch range
besides the increased rate of vocal-fold vibration. Also, it may be that pitch-
range variation is not as uniform functionally as the H-tone results suggest. It
is possible that somewhat different instructions to the subject or somewhat
different pragmatic contexts will emphasize other aspects of the control
strategy, yielding different consequences for Fo variation, particularly at the
bottom of the pitch range.
These questions about L-tone scaling have a broader implication for
research strategy. Work on H tones has brought home to us several
important strategic lessons: experimental designs should orthogonally vary
local and phrasal properties; productions should be properly analyzed
phonologically; and data analysis should seek parallel patterns within data
separated by speaker. We clearly need to apply these lessons in collecting Fo
measurements for L tones. However, to understand L tones fully, we will
need something more. We will need more work relating linguistic to
articulatory parameters. We will need to do physiological experiments in
which we fully control the phonological structure of utterances we elicit, and
we will need to develop acoustic measures that will help to segregate the
articulatory dimensions in large numbers of utterances.
16
Secondary stress: evidence from Modern Greek
AMALIA ARVANITI
16.1 Introduction
The need to express stress subordination in English formally has long been
felt, and many attempts to do so have been made, e.g. Trager and Smith
(1951), Chomsky and Halle (1968). However, until the advent of metrical
phonology (Liberman and Prince 1977) all models tried to express stress
subordination through linear analyses. The great advantage of metrical
phonology is that by presenting stress subordination through a hierarchical
structure it captures the difference in stress values between successive stresses
in an economical and efficient way.
When Liberman and Prince presented their model, one of their intentions
was to put forward a "formalization of the traditional idea of 'stress timing'"
(1977: 250) through the use of the metrical grid. This reference to stress-
timing implies that their analysis mainly referred to the rhythm of English.
However, the principles of metrical phonology have been adopted for the
rhythmic description of other languages (Hayes 1981; Hayes and Puppel
1985; Roca 1986), including Polish and Spanish, which are rhythmically
different from English. The assumption behind studies like Hayes (1981) is
that, by showing that many languages follow the same rhythmic principles as
English, it can be proved that the principles of metrical phonology, namely
binarity of rhythmic patterns and, by consequence, hierarchical structure, are
universal. However, such evidence cannot adequately prove the universality
of metrical principles; what is needed is evidence that there are no languages
which do not conform to these principles. Thus, it would be interesting to
study a language that does not seem to exhibit a strictly hierarchical, binary
rhythmic structure. If the study of such a language proves this to be the case,
then the claim for the universality of binary rhythm may have to be revised.
One language that seems to show a markedly different kind of rhythmic
patterning from English is Modern Greek.
In fact, the past decade has seen the appearance of a number of studies of
Modern Greek prosody both in phonology (Malikouti-Drachman and
Drachman 1980; Nespor and Vogel 1986, 1989; Berendsen 1986) and in
phonetics (Dauer 1980; Fourakis 1986; Botinis 1989). These studies show
substantial disagreement concerning the existence and role of secondary
stress in Greek.
By way of introduction I present a few essential and undisputed facts
about Greek stress. First, in Greek, lexical stress conforms to a Stress
Well-Formedness Condition (henceforth SWFC), which allows lexical stress on any
one of the last three syllables of a word but no further to the left (Joseph and
Warburton 1987; Malikouti-Drachman and Drachman 1980). Because of the
SWFC, lexical stress moves one syllable to the right of its original position
when affixation results in the stress being more than three syllables from the
end of the word; e.g.
(1) /'maθima/ "lesson" > /'maθima + ta/ "lesson + s" > /ma'θimata/
"lessons"
Second, as can be seen from example (1), lexical stress placement may depend
on morphological factors, but it cannot be predicted by a word's metrical
structure because there are no phonological weight distinctions either among
the Greek vowels, /i, e, a, o, u/, or among syllables of different structure; i.e.
in Greek, all syllables are of equal phonological weight. Therefore, it is quite
common for Greek words with the same segmental structure to have stress
on different syllables; e.g.
(2) a. /'xo.ros/ "space"
b. /xo.'ros/ "dance" (noun)
It is equally possible to find words like those in (3),
(3) a. /'pli.θos/ "crowd"
b. /'plin.θos/ "brick"
where both words are stressed on their first syllable, although this is open in
(3a) and closed in (3b). Finally, when the SWFC is violated by the addition of
an enclitic to a host stressed on the antepenultimate, a stress is added two
syllables to the right of the lexical stress. For example,
(4) /'maθima tu/ > /'maθi'ma tu/ "his lesson"
(5) /'ðose mu to/ > /'ðose 'mu to/ "give it to me"
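The postlexical pattern in (4) and (5) can be sketched procedurally. This is a toy illustration, not the author's formal analysis: the function name is mine, and syllables are represented abstractly as counts, with stress positions given as 0-based indices from the left.

```python
def add_swfc_stress(host_stress_index, n_host_syllables, n_clitic_syllables):
    """Return the stress positions for a host-and-clitic group.

    If cliticization leaves the host's lexical stress more than three
    syllables from the end of the group (an SWFC violation), a stress is
    added two syllables to the right of the lexical stress; otherwise the
    lexical stress stands alone.
    """
    total = n_host_syllables + n_clitic_syllables
    distance_from_end = total - host_stress_index  # stressed syllable counts as 1
    if distance_from_end > 3:                      # SWFC violated
        return [host_stress_index, host_stress_index + 2]
    return [host_stress_index]

# (4) /'maθima/ + /tu/: stress on syllable 0 of a 3-syllable host plus one
# enclitic syllable -> stress lands 4 syllables from the end, so add one.
assert add_swfc_stress(0, 3, 1) == [0, 2]   # /'maθi'ma tu/

# (5) /'ðose/ + /mu to/: 2-syllable host plus 2 clitic syllables.
assert add_swfc_stress(0, 2, 2) == [0, 2]   # /'ðose 'mu to/

# No violation: penultimate stress on a 3-syllable host plus one clitic.
assert add_swfc_stress(1, 3, 1) == [1]
```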
All investigators (Setatos 1974; Malikouti-Drachman and Drachman
1980; Joseph and Warburton 1987; Botinis 1989) accept that the two stresses
in the host-and-clitic group have different prominence values. However, not
all of them agree as to the relative prominence of the two stresses. Most
(metrical trees not reproduced; the surviving grid reads: w w w s w — ...ðer fi mu > i a ðer fi mu "my sister")
The stresses on /daskalos/ are presented here with the prominence values assumed by Nespor
and Vogel (1989).
(9) [example not reproduced]
16.2 Experiment 1
16.2.1 Method
16.2.1.1 Material
The first experiment is a simple perceptual test whose aim is to see first
whether the lexical stress of the host-and-clitic group is more prominent than
Table 16.1 One of the two test pairs (1a and 1b) and one of the distractors (2a
and 2b) in the context in which they were read. The test phrases and distractors
are in bold type.
the SWFC-induced one as Setatos (1974) and Nespor and Vogel (1986, 1989)
maintain, and second whether Botinis's phrase stress and word stress are two
perceptually distinct stress categories as his analysis suggests.
Two test pairs were designed (see the parts of table 16.1, 1a and 1b, in bold
type): in each test pair the two members are segmentally identical but have
word boundaries at different places and are orthographically distinct in
Greek. The first member, (a), of each pair consists of one word stressed on
the antepenultimate and followed by an enclitic possessive pronoun. As this
pattern violates the SWFC a stress is added on the last syllable of the word.
The second member, (b), consists of two words which together form a phrase
and which are stressed on the same syllables as member (a). Thus, the
difference between members (a) and (b) of each test pair is that in (a) the
phrase contains a lexical and a SWFC-induced stress whereas in (b) each one
of the two words carries lexical stress on the same syllables as (a). According
to Nespor and Vogel, the most prominent stress in (a) phrases is the lexical
stress of the host while in (b) phrases it is the stress of the second word (i.e.
the one that falls on the same syllable as the SWFC-induced stress in (a))
since the second word is the head of the phonological phrase Φ (1986: 168).
Also, in Botinis's terms (a) and (b) phrases have different stress patterns, (a)
containing one word and one phrase stress and (b) containing two word
stresses; these stress patterns are said by Botinis to be perceptually distinct. If
either Nespor and Vogel or Botinis are correct, (a) and (b) phrases should be
distinguishable.
The test phrases were incorporated into meaningful sentences (see table
16.1). Care was taken to avoid stress clashes, and to design, for each pair,
sentences of similar prosodic structure and length. Two distractor pairs were
devised on the same principle as the test pairs (see table 16.1, 2a and 2b). The
difference is that in the distractors one member contains two words, each one
with its own lexical stress (/'mono 'loyo/"only reason"), while in the other
member the same sequence of syllables makes one word with lexical stress on
a different syllable from those stressed in the first member
(/mo' noloyo/"monologue").
The sentences were read by four subjects including the author. Each
subject read the test sentences and the distractors six times from a random-
ized list, typed in Greek. The recorded sentences and the distractors were
digitized at 16 kHz and then were edited so that only the test phrases and
distractors were left. For each test phrase and distractor one token from each
one of the four subjects was selected for the test tape. The tokens chosen
were, according to the author's judgment, those that sounded most natural
by showing minimum coarticulatory interference from the carrier phrase.
To make the listening tape, the test phrases and the distractors were
recorded at a sampling rate of 16 kHz using computer-generated random-
ization by blocks so that each token from each subject was heard twice. Each
test phrase and distractor was preceded by a warning tone. There were 100
msec. of silence between the tone and the following phrase and 2 sec. between
each stimulus and the following tone. Every twenty stimuli there was a 5 sec.
pause. In order for listeners to familiarize themselves with the task, the first
four stimuli were repeated at the end of the tape, and the first four responses
of each listener were discarded. Thus, each subject heard a total of seventy
stimuli: 4 speakers x (4 test phrases + 4 distractors) x 2 blocks + 4
repeated items + 2 stimuli that consisted of two tones each (a result of the
randomization program).
16.2.1.2 Subjects
As mentioned, four subjects took part in the recording. Three of them (two
female, one male) were in their twenties and they were postgraduate students
at the University of Cambridge. The fourth subject was a sixty-year-old
woman visiting Cambridge. All subjects were native speakers of Greek and
spoke the standard dialect. All, apart from the fourth subject, had extensive
knowledge of English. None of the subjects had any history of speech or
hearing problems. Apart from the author all subjects were naive as to the
purpose of the experiment.
Eighteen subjects (seven male and eleven female) did the perceptual test.
They were all native speakers of Greek and had no history of speech or
hearing problems. Twelve of them were between 25 and 40 years old and the
other six were between 40 and 60 years old. Fourteen of them spoke other
languages in addition to Greek but only one had extensive knowledge and
contact with a foreign language (English). All subjects had at least secondary
education and fourteen of them held university degrees. All subjects spoke
Standard Greek, as spoken in Athens, where sixteen of them live. They were
all naive as to the purposes of the experiment.
16.2.1.3 Procedure
The subjects did the test in fairly quiet conditions using headphones and a
portable Sony Stereo Cassette-Corder TCS-450. No subject complained that
their performance might have been marred by noise or poor-quality equip-
ment. The subjects were given a response sheet, typed in Greek, which gave
both possible interpretations of every stimulus in the tape (70 x 2 possible
answers). The task was explained to them and they were urged to give an
answer to all stimuli even if they were not absolutely certain of their answer.
The subjects were not allowed to play back the tape.
16.2.2 Results
The subjects gave a total of 576 responses excluding the distractors (18
subjects x 32 test phrases/answer sheet). There were 290 mistakes, i.e. 50.34
percent of the responses to the test phrases were wrong (identification rate
49.66 percent). The number of mistakes ranged from a minimum of nine (one
subject) to twenty-one (one subject). By contrast, the identification rate of the
distractors was 99.1 percent; out of 18 subjects only two made one and four
mistakes respectively. Most subjects admitted that they could not tell the test
phrases apart although they found the distractors easy to distinguish. Even
the subjects who insisted that they could tell apart the test pairs made as
many mistakes as the rest.
Thus the results of experiment 1 give an answer to the first two questions
addressed here. The results clearly indicate (see table 16.2) that, contrary to
Setatos (1974) and Nespor and Vogel (1986, 1989), the SWFC-induced stress
is the most prominent in the host-and-clitic group, whereas the original
lexical stress of the host weakens. This weakening is similar to that of the
lexical stress of a word which is part of a bigger prosodic constituent, such as
a Φ, without being its head. Also, the results show that in natural speech
Botinis's "phrase stress" is not perceptually distinct from word stress as he
suggests.
16.3 Experiment 2
16.3.1 Method
16.3.1.1 Material
Experiment 2 includes a perceptual test and acoustical analyses of the
utterances used for it. With the answers to questions (a) and (b) established,
this experiment aims at answering the third question addressed here: namely,
whether rhythmic stress and the weakened lexical stress (or "secondary
Table 16.2 Experiment 1: contingency table of type of stimulus by subject
response.
Note: Total deviance (χ²) = 0.026, 1 df. The difference between the relevant conditions
is not significant.
as the test sentences, the difference being that the word pairs in the distractors
differed in the position of the primary stress only; /a'poxi/ "hunting net":
/apo'xi/ "abstention". The sentences were read by six subjects, in conditions
similar to those described for experiment 1, from a typed randomized list
which included six repetitions of each test sentence and distractor. The first
two subjects (El and MK) read the sentences from hand-written cards three
times each.
A listening tape was made in the same way as in experiment 1. The stimuli
were the whole sentences not just the test word pairs. The tape contained one
token of each test sentence and distractor elicited from each subject. There
were 3 sec. of silence between sentences and 5 sec. after every tenth sentence.
The first four sentences were repeated at the end of the tape, and the first four
responses of each listener were discarded. Each subject heard a total of 100
sentences: 6 speakers x (8 test sentences + 8 distractors) + 4 repeated
stimuli.
MK and the first three tokens of HP's recording were digitized at a sampling
rate of 16 kHz and measurements of duration, amplitude, and Fo were
obtained. Comparisons of the antepenultimate and final syllables of the SS
words with the equivalent syllables of the RS words are presented here (see
figures 16.1-16.4, below). For instance, the duration, Fo and amplitude
values of /e/ in SS word /ˌeni'ko/ "tenant" were compared to those of /e/ in
RS word /eni'ko/ "singular."
Duration was measured from spectrograms. The error range was one pitch
period (about 4-5 msec., as all three subjects were female). Measurements
followed common criteria of segmentation (see Peterson and Lehiste 1960).
VOT was measured as part of the following vowel.
Three different measurements of amplitude were obtained: peak amplitude
(PA), root mean square (RMS) amplitude, and amplitude integral (AI). All
data have been normalized so as to avoid statistical artifacts due to
accidental changes such as a subject's leaning towards the microphone etc.
To achieve normalization, the PA of each syllable was divided by the highest
PA in the word in question while the RMS and AI of each syllable were
divided by the word's RMS and AI respectively; thus the RMS and AI of
each syllable are presented as percentages of the word's RMS and AI
respectively. All results refer to the normalized data. All original measure-
ments were in arbitrary units given by the signal processing package used.
For peak amplitude, measurements were made from waveforms at the
point of highest amplitude of each syllable nucleus. RMS and AI were
measured using a computer program which made use of the amplitude
information available in the original sample files.² To calculate the RMS,
the amplitude of each point within the range representing the syllable nucleus
was squared and the sum of squared amplitudes was divided by the number
of points; the square root of this measurement represents the average
amplitude of the sound (RMS) and is independent of the sound's duration.
AI measurements were obtained by simply calculating the square root of the
sum of squared amplitudes of all designated points without dividing the sum
by the number of points. In this way, the duration of the sound is taken into
account when its amplitude is measured, as a longer sound of lower
amplitude can have the same amplitude integral as a shorter sound of higher
amplitude. This way of measuring amplitude is based on Beckman (1986),
who found a strong correlation between stress and AI for English.
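The three amplitude measures and the normalization described above can be sketched as follows. This is a minimal illustration with invented sample values; the original measurements were in the arbitrary units of the signal-processing package, and the function names are mine.

```python
import math

def peak_amplitude(samples):
    """Peak amplitude (PA): largest absolute sample value in the nucleus."""
    return max(abs(s) for s in samples)

def rms(samples):
    """Root-mean-square amplitude: independent of the sound's duration."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def amplitude_integral(samples):
    """Amplitude integral (AI): square root of the sum of squared
    amplitudes, without dividing by the number of points, so that
    duration contributes to the measure."""
    return math.sqrt(sum(s * s for s in samples))

# Two hypothetical syllable nuclei: a longer sound of lower amplitude can
# have the same AI as a shorter sound of higher amplitude.
short_loud = [0.8] * 100   # 100 points at amplitude 0.8
long_soft = [0.4] * 400    # 400 points at amplitude 0.4
assert math.isclose(amplitude_integral(short_loud),
                    amplitude_integral(long_soft))   # equal AI
assert rms(short_loud) > rms(long_soft)              # but different RMS

# Normalization within a word: each syllable's PA as a fraction of the
# word's highest PA; each syllable's AI as a percentage of the word's AI.
word_syllables = [short_loud, long_soft]
word_samples = [s for syl in word_syllables for s in syl]
pa_norm = [peak_amplitude(syl) / max(peak_amplitude(s) for s in word_syllables)
           for syl in word_syllables]
ai_norm = [100 * amplitude_integral(syl) / amplitude_integral(word_samples)
           for syl in word_syllables]
```

The assertion pair makes the stated motivation for AI concrete: the two nuclei are indistinguishable by AI but clearly distinguished by RMS.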
Fundamental frequency was measured using the Fo tracker facility of a
signal processing package (Audlab). To ensure the reliability of the Fo
tracks, narrow-band spectrograms were also made and the contour of the
² I am indebted to Dr. D. Davies and Dr. K. Roussopoulos for writing the program for me.
Table 16.4 Experiment 2: contingency table of type of stimulus by subject
response.
Note: Total deviance (χ²) = 727.10, 1 df. The result is significant; p < 0.001.
Figure 16.1 (a) Means (series 1) and SDs (series 2) of the duration, in msec., of antepenultimate
syllables of SS words (left, upper case) and RS words (right, lower case) for all subjects. (b) Same
measurements for final syllables.
The data from the three subjects are pooled, as t-tests performed on each
subject's data separately showed no differences across subjects. One-tailed t-
tests for the data of all three subjects show that, for antepenultimate
syllables, the duration of the antepenult of SS words is significantly longer
than that of RS words in all word pairs (see table 16.5 for the t-test results).
For vowel durations, one-tailed t-tests show that the duration of the
[bar charts not reproduced: normalized amplitude (% of word's AI) for antepenultimate and final
syllables of the four test word pairs epitropi, simvuli, simetoxi, eniko]
in his data, stressed syllables had significantly higher peak amplitude than
unstressed syllables. On the other hand, he also reports that in perceptual
tests amplitude changes did not affect the subjects' stress judgments. These
results could mean that as amplitude is not a robust stress cue, its acoustical
presence is not necessary and some speakers might opt not to use it. Clearly,
both a more detailed investigation into how to measure amplitude, and data
Table 16.7 Results of one-tailed t-tests performed on
the AI of the antepenultimate syllables of SS and RS
words of all test word pairs, for subject HP. In all
cases, df = 4. The syllables that are being compared
are in upper case letters.
16.4 Discussion
The results of the first experiment show that native speakers of Greek cannot
differentiate between the rightmost lexical stress of a phrase and a SWFC-
induced stress which fall on the same syllable of segmentally identical
phrases. This implies that, contrary to the analyses of Setatos (1974) and of
Nespor and Vogel (1986, 1989), the SWFC-induced stress is the most
prominent stress in a host-and-clitic group, in the same way that the most
prominent stress in a Φ is the rightmost lexical stress. This conclusion agrees
with the description of the phenomenon by most analyses of Greek, both
phonological (e.g. Joseph and Warburton 1987; Malikouti-Drachman and
Drachman 1980) and phonetic (e.g. Botinis 1989), and also with the basic
requirement of the SWFC; namely, that the main stress must fall at most
three syllables to the left of its domain boundary. Moreover, the results
Figure 16.4 Characteristic Fo contours together with the corresponding narrow-band
spectrograms for HP's /ton eniko tis/: (a) SS word; (b) RS word. The thicker line on the plot
represents the smoothed contour.
(12)  SD             SD                 SD
      s  w  w        s  w  w  w        w  s  w  w
      pi ra ma   >   pi ra ma ta   >   pi ra ma ta
      "experiment"   "experiment + s"  "experiments"
The fact that all words constitute independent stress domains is true even
of monosyllabic "content" words. The difference between those and clitics
becomes apparent when one considers examples like (13):
(13) /'anapse to 'fos/ "turn on the light"
which shows that SWFC violations do not arise between words because these
form separate stress domains.
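The lexical application of the SWFC illustrated in (12) can be sketched as a small function. This is a rough illustration only; the zero-based syllable indexing and the function name are my own conveniences, not the chapter's notation:

```python
# Illustrative sketch of the lexical effect of the Greek Stress
# Well-Formedness Condition (SWFC): the main stress must fall at most
# three syllables to the left of its domain boundary, i.e. no further
# left than the antepenult. Syllables are indexed from 0 (hypothetical
# convention, not the book's).

def lexical_stress(stress_index, n_syllables):
    """Within the lexicon the SWFC is satisfied by moving the stress
    itself: a stress left of the antepenult shifts rightward onto it."""
    antepenult = max(0, n_syllables - 3)
    return max(stress_index, antepenult)

# (12): /'pirama/ "experiment" (3 syllables) keeps initial stress, but
# suffixed /'pirama + ta/ "experiments" (4 syllables) would violate the
# SWFC, so the stress moves one syllable right: /pi'ramata/.
print(lexical_stress(0, 3))  # 0: 'pi.ra.ma already within the window
print(lexical_stress(0, 4))  # 1: pi.'ra.ma.ta
```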
Clitics, however, remain unattached weak syllables until they are attached
to a host postlexically. In this way, clitics extend the boundaries of words, i.e.
of stress domains (SDs); clitics form compound SDs with their hosts. For
example,
(14)                                        SD
      w w s w w          w w s w w          w  w  s  w  w
      ton pa te ra mu >  ton pa te ra mu >  to(m) ba te ra mu
      "my father (acc.)"
The stress domain formed by the host and its enclitic still has to conform to
the SWFC. When cliticization does not result in a SWFC violation no change
of stress pattern is necessary. When, however, the SWFC is violated by the
addition of enclitics, the results of the violation are different from those
observed within the lexical component. This is precisely because the host has
already acquired lexical stress and constitutes an independent stress domain
with fixed stress. Thus, in SWFC violations the host's stress cannot move
from its position, as it does within the lexical component. The only
alternative, therefore, is for another stress to be added in such a position that
it can comply with the SWFC, thus producing the stress two syllables to the
right of the host's lexical stress. In this case the compound SD is divided into
two SDs.
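The postlexical repair just described can be sketched in the same illustrative style; again the indexing and names are hypothetical conveniences rather than the chapter's formalism:

```python
# Illustrative sketch of the postlexical SWFC repair: a host word with
# fixed lexical stress acquires enclitic syllables; if the compound
# stress domain then violates the SWFC, a second stress is added two
# syllables to the right of the host's stress (the host's own stress
# cannot move postlexically). Zero-based syllable indices.

def violates_swfc(stress_indices, n_syllables):
    """True when no stress falls within the last three syllables."""
    return all(i < n_syllables - 3 for i in stress_indices)

def attach_enclitics(host_stress, n_host, n_clitics):
    n_total = n_host + n_clitics
    stresses = [host_stress]              # host stress is fixed
    if violates_swfc(stresses, n_total):
        stresses.append(host_stress + 2)  # added stress, as described
    return stresses

# /ton pa'tera/ + /mu/: stress on syllable 2 of 5, no violation.
# /to ti'lefono/ + /mas/: stress on syllable 2 of 6, violation, so a
# second stress is added on syllable 4 (/no/).
print(attach_enclitics(2, 4, 1))  # [2]
print(attach_enclitics(2, 5, 1))  # [2, 4]
```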
(15)  SD                    SD                       SD
                                                   /    \
                                                 SD      SD
      w w s w w w           w w s w w w          w w s w   s w
      to ti le fo no mas >  to ti le fo no mas > to ti le fo no mas
      "our telephone"
In this way, the subordination of the first stress is captured, as well as the
fact that both stresses still belong to one stress domain, albeit a compound
one, and therefore they are at the same prosodic level. The disadvantage of
16.5 Conclusion
It has been shown that the Greek Stress Well-Formedness Condition applies
both lexically, moving lexical stress to the right of its original position, and
postlexically, adding a stress two syllables to the right of the host in a host-
and-clitic group. In the latter case, the SWFC-induced stress becomes the
most prominent in the group; this stress was shown to be perceptually
identical to a lexical stress. The weakened lexical stress of the host was shown
to be acoustically and perceptually similar to a subordinate lexical stress, and
not to a rhythmic stress, as has often been thought. The experimental evidence
together with the absence of strong phonological arguments to the contrary
suggest that Greek might not exhibit rhythmic stresses at all.
Appendix 1
The test phrases (bold type) of experiment 1 in the
context in which they were read
Appendix 2
The distractors (bold type) of experiment 1 in the
context in which they were read
1(a) /pi'stevo 'oti 'ksero to 'mono 'loyo yi'a a'fti tin ka'tastasi/
"I believe I know the only reason for this situation."
(b) /den 'exo a'kusi pi'o vare'to mo'noloyo sto 'θeatro/
"I haven't listened to a more boring theatrical monologue."
2(a) /ðe 'θelo 'pare 'dose me a'fto to 'atomo/
"I don't want to have anything to do with this person."
(b) /'ksero 'oti to pa'redose stus dike'uxus/
"I know that he delivered it to the beneficiaries."
Appendix 3
The test sentences of experiment 2.
The test words are in bold type
Appendix 4
The distractor sentences of experiment 2.
The distractors are in bold type
References
Abbreviations
ARIPUC Annual Report, Institute of Phonetics, University of Copenhagen
CSLI Center for the Study of Language and Information
IPO Instituut voor Perceptie Onderzoek (Institute for Perception Research)
IULC Indiana University Linguistics Club
JASA Journal of the Acoustical Society of America
JL Journal of Linguistics
JPhon Journal of Phonetics
JSHR Journal of Speech and Hearing Research
JVLVB Journal of Verbal Learning and Verbal Behavior
LAGB Linguistics Association of Great Britain
Lg Language
Lg & Sp Language and Speech
LI Linguistic Inquiry
MITPR Massachusetts Institute of Technology, Progress Report
NELS North East Linguistic Society
PERILUS Phonetic Experimental Research at the Institute of Linguistics, University of
Stockholm
Proc. IEEE Int. Conf Ac, Sp. & Sig. Proc. Proceedings of the Institute of Electrical
and Electronics Engineers Conference on Acoustics, Speech and Signal Processing
PY Phonology Yearbook
RILP Report of the Institute of Logopedics and Phoniatrics
STL-QPSR Quarterly Progress and Status Report, Speech Transmission Laboratory,
Royal Institute of Technology (Stockholm)
WCCFL West Coast Conference on Formal Linguistics
Albrow, K. H. 1975. Prosodic theory, Hungarian and English. Festschrift fur Norman
Denison zum 50. Geburtstag (Grazer Linguistiche Studien, 2). Graz: University of
Graz Department of General and Applied Linguistics.
Alfonso, P. and T. Baer. 1982. Dynamics of vowel articulation. Lg & Sp 25: 151-73.
Ali, L. H., T. Gallagher, J. Goldstein, and R. G. Daniloff. 1971. Perception of
coarticulated nasality. JASA 49: 538-40.
Allen, J., M. S. Hunnicutt, and D. Klatt. 1987. From Text to Speech: the MITalk
System. Cambridge: Cambridge University Press.
Anderson, L. B. 1980. Using asymmetrical and gradient data in the study of vowel
harmony. In R. M. Vago (ed.), Issues in Vowel Harmony. Amsterdam: John
Benjamins.
Anderson, M., J. Pierrehumbert, and M. Liberman. 1984. Synthesis by rule of English
intonation patterns. Proc. IEEE Int. Conf. Ac, Sp. & Sig. Proc. 282-4. New
York: IEEE.
Anderson, S. R. 1974. The Organization of Phonology. New York: Academic Press.
1978. Tone features. In V. Fromkin (ed.), Tone: a Linguistic Survey. New York:
Academic Press.
1982. The analysis of French schwa. Lg 58: 535-73.
1986. Differences in rule type and their structural basis. In H. van der Hulst and N.
Smith (eds.), The Structure of Phonological Representations, part 2. Dordrecht:
Foris.
Anderson, S. R. and W. Cooper. Fundamental frequency patterns during sponta-
neous picture description. Ms. University of Iowa.
Archangeli, D. 1984. Underspecification in Yawelmani phonology. Doctoral disser-
tation, Cambridge, MIT.
1985. Yokuts harmony: evidence for coplanar representation in nonlinear phono-
logy. LI 16: 335-72.
1988. Aspects of underspecification theory. Phonology 5.2: 183-207.
Arvaniti, A. 1990. Review of A. Botinis, 1989. Stress and Prosodic Structure in Greek:
a Phonological, Acoustic, Physiological and Perceptual Study. Lund University
Press. JPhon 18: 65-9.
Bach, E. 1968. Two proposals concerning the simplicity metric in phonology. Glossa
4: 3-21.
Barry, M. 1984. Connected speech: processes, motivations and models. Cambridge
Papers in Phonetics and Experimental Linguistics 3.
1985. A palatographic study of connected speech processes. Cambridge Papers in
Phonetics and Experimental Linguistics 4.
1988. Assimilation in English and Russian. Paper presented at the colloquium of
the British Association of Academic Phoneticians, Trinity College, Dublin,
March 1988.
Beattie, G., A. Cutler, and M. Pearson. 1982. Why is Mrs Thatcher interrupted so
often? Nature 300: 744-7.
Beckman, M. E. 1986. Stress and Non-Stress Accent (Netherlands Phonetic Archives
7). Dordrecht: Foris.
Beckman, M. E. and J. Kingston. 1990. Introduction to J. Kingston and M. Beckman
(eds.), Papers in Laboratory Phonology I: Between the Grammar and the Physics
of Speech. Cambridge: Cambridge University Press, 1-16.
Beckman, M. E. and J. B. Pierrehumbert. 1986. Intonational structure in English and
Japanese. PY 3: 255-310.
Beddor, P. S., R. A. Krakow, and L. M. Goldstein. 1986. Perceptual constraints and
phonological change: a study of nasal vowel height. PY 3: 197-217.
Bell-Berti, F. and K. S. Harris. 1981. A temporal model of speech production.
Phonetica 38: 9-20.
Benguerel, A. P. and T. K. Bhatia. 1980. Hindi stop consonants: an acoustic and
fiberscopic study. Phonetica 37: 134-48.
Benguerel, A. P. and H. Cowan. 1974. Coarticulation of upper lip protrusion in
French. Phonetica 30: 41-55.
Berendsen, E. 1986. The Phonology of Cliticization. Dordrecht: Foris.
Bernstein, N. A. 1967. The Coordination and Regulation of Movements. London:
Pergamon.
Bertch, W. F., J. C. Webster, R. G. Klumpp, and P. O. Thomson. 1956. Effects of two
message-storage schemes upon communications within a small problem-solving
group. JASA 28: 550-3.
Bickley, C. 1982. Acoustic analysis and perception of breathy vowels. Working
Papers, MIT Speech Communications 1: 74-83.
Bing, J. M. 1979. Aspects of English prosody. Doctoral dissertation, University of
Massachusetts.
Bird, S. and E. Klein. 1990. Phonological events. JL 26: 33-56.
Bloch, B. 1941. Phonemic overlapping. American Speech 16: 278-84.
Bloomfield, L. 1933. Language. New York: Holt.
Blumstein, S. E. and K. N. Stevens. 1979. Acoustic invariance in speech production:
evidence from the spectral characteristics of stop consonants. JASA 66: 1011-17.
Bolinger, D. 1951. Intonation: levels versus configuration. Word 7: 199-210.
1958. A theory of pitch accent in English. Word 14: 109-49.
1986. Intonation and its Parts. Stanford, CA: Stanford University Press.
Botha, R. P. 1971. Methodological Aspects of Transformational Generative Phonology.
The Hague: Mouton.
Botinis, A. 1989. Stress and Prosodic Structure in Greek: A Phonological, Acoustic,
Physiological and Perceptual Study. Lund: Lund University Press.
Boyce, S. 1986. The "trough" phenomenon in Turkish and English. JASA 80: S.95
(abstract).
Broe, M. 1988. A unification-based approach to prosodic analysis. Edinburgh
Working Papers in Linguistics 21: 63-82.
Bromberger, S. and M. Halle. 1989. Why phonology is different. LI 20.1: 51-69.
Browman, C. P. 1978. Tip of the tongue and slip of the ear: implications for language
processing. UCLA Working Papers in Phonetics, 42.
Browman, C. P. and L. Goldstein. 1985. Dynamic modeling of phonetic structure. In
V. Fromkin (ed.), Phonetic Linguistics. New York: Academic Press.
1986. Towards an articulatory phonology. PY 3: 219-52.
1988. Some notes on syllable structure in articulatory phonology. Phonetica 45:
140-55.
1989. Articulatory gestures as phonological units. Phonology, 6.2: 201-51.
1990. Tiers in articulatory phonology with some implications for casual speech. In
J. Kingston and M. Beckman (eds.), Papers in Laboratory Phonology I: Between
the Grammar and the Physics of Speech. Cambridge: Cambridge University Press,
341-76.
Browman, C. P., L. Goldstein, E. L. Saltzman, and C. Smith. 1986. GEST: a
computational model for speech production using dynamically defined articula-
tory gestures. JASA, 80, Suppl. 1 S97 (abstract).
Browman, C. P., L. Goldstein, J. A. S. Kelso, P. Rubin, and E. L. Saltzman. 1984.
Articulatory synthesis from underlying dynamics. JASA 75: S22-3. (abstract).
Brown, G., K. Currie, and J. Kenworthy. 1980. Questions of Intonation. London:
Croom Helm.
Brown, R. W. and D. McNeill. 1966. The "tip of the tongue" phenomenon. JVLVB
5: 325-37.
Bruce, G. 1977. Swedish word accents in sentence perspective. Lund: Gleerup.
1982a. Developing the Swedish intonation model. Working Papers, Department of
Linguistics, University of Lund, 22: 51-116.
1982b. Textual aspects of prosody in Swedish. Phonetica 39: 274-87.
Bruce, G. and E. Garding. 1978. A prosodic typology for Swedish dialects. In E.
Garding, G. Bruce, and R. Bannert (eds.), Nordic Prosody. Lund: Gleerup.
Bullock, D. and S. Grossberg. 1988. The VITE model: a neural command circuit for
generating arm and articulator trajectories. In J. A. S. Kelso, A. J. Mandell, and
M. F. Shlesinger (eds.), Dynamic Patterns in Complex Systems. Singapore:
World Scientific, 305-26.
Carlson, L. 1983. Dialogue Games: an Approach to Discourse Analysis (Synthese
Language Library 17). Dordrecht: Reidel.
Carnochan, J. 1957. Gemination in Hausa. In Studies in Linguistic Analysis. The
Philological Society, Oxford: Basil Blackwell.
Catford, J. C. 1977. Fundamental Problems in Phonetics. Edinburgh: Edinburgh
University Press.
Chang, N-C. 1958. Tones and intonation in the Chengtu dialect (Szechuan, China).
Phonetica 2: 59-84.
Chao, Y. R. 1932. A preliminary study of English intonation (with American
variants) and its Chinese equivalents. T'sai Yuan Pei Anniversary Volume, Suppl.
Vol. 1 Bulletin of the Institute of History and Philology of the Academica Sinica.
Peiping.
Chatterjee, S. K. 1975. Origin and Development of the Bengali Language. Calcutta:
Rupa.
Chiba, T. and M. Kajiyama. 1941. The Vowel. Its Nature and Structure. Tokyo:
Taseikan.
Choi, J. 1989. Some theoretical issues in the analysis of consonant to vowel spreading
in Kabardian. MA thesis, Department of Linguistics, UCLA.
Chomsky, N. 1964. The nature of structural descriptions. In N. Chomsky, Current
Issues in Linguistic Theory. The Hague: Mouton.
1965. Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.
Chomsky, N. and M. Halle. 1965. Some controversial questions in phonological
theory. JL 1: 97-138.
1968. The Sound Pattern of English. New York: Harper and Row.
Clark, M. 1990. The Tonal System of Igbo. Dordrecht: Foris.
Clements, G. N. 1976. The autosegmental treatment of vowel harmony. In W.
Dressler and O. Pfeiffer (eds.), Phonologica 1976. Innsbruck: Innsbrucker
Beitrage zur Sprachwissenschaft.
1978. Tone and syntax in Ewe. In D. J. Napoli (ed.), Elements of Tone, Stress, and
Intonation. Georgetown: Georgetown University Press.
1979. The description of terrace-level tone languages. Lg 55: 536-58.
1981. The hierarchical representation of tone features. Harvard Studies in Phono-
logy 2: 50-105.
1984. Principles of tone assignment in Kikuyu. In G. N. Clements and J. Goldsmith
(eds.), Autosegmental Studies in Bantu Tone. Dordrecht: Foris, 281-339.
1985. The geometry of phonological features. PY 2: 225-52.
1986. Compensatory lengthening and consonant gemination in Luganda. In L.
Wetzels and E. Sezer (eds.), Studies in Compensatory Lengthening. Dordrecht:
Foris.
1987. Phonological feature representation and the description of intrusive stops.
Papers from the Parasession on Autosegmental and Metrical Phonology. Chicago
Linguistics Society, University of Chicago.
1990a. The role of the sonority cycle in core syllabification. In J. Kingston and M.
Beckman (eds.), Papers in Laboratory Phonology I: Between the Grammar and the
Physics of Speech. Cambridge: Cambridge University Press, 283-333.
1990b. The status of register in intonation theory. In J. Kingston and M. Beckman
(eds.), Papers in Laboratory Phonology I: Between the Grammar and the Physics
of Speech. Cambridge: Cambridge University Press, 58-71.
Clements, G. N. and J. Goldsmith. 1984. Introduction. In G. N. Clements and J.
Goldsmith (eds.), Autosegmental Studies in Bantu Tone. Dordrecht: Foris.
Clements, G. N. and S. J. Keyser. 1983. CV Phonology: a Generative Theory of the
Syllable. Cambridge, MA: MIT Press.
Cohen, A. and J. 't Hart. 1967. On the anatomy of intonation. Lingua 19: 177-92.
Cohen, A., R. Collier, and J. 't Hart. 1982. Declination: construct or intrinsic feature
of speech pitch? Phonetica 39: 254-73.
Cohen, J. and P. Cohen. 1983. Applied Multiple Regression/ Correlation Analysis for
the Behavioral Sciences, 2nd edn. Hillsdale, NJ: Lawrence Erlbaum.
Coleman, J. 1987. Knowledge-based generation of speech synthesis parameters. Ms.
Experimental Phonetics Laboratory, Department of Language and Linguistic
Science, University of York.
1989. The phonetic interpretation of headed phonological structures containing
overlapping constituents. Manuscript.
Coleman, J. and J. Local. 1991. "Constraints" in autosegmental phonology. To
appear in Linguistics and Philosophy.
Collier, R. 1989. On the phonology of Dutch intonation. In F. J. Heyvaert and F.
Steurs (eds.), Worlds behind Words. Leuven: Leuven University Press.
Collier, R. and J. 't Hart. 1981. Cursus Nederlandse intonatie. Leuven: Acco.
Connell, B. and D. R. Ladd 1990. Aspects of pitch realisation in Yoruba. Phonology
7: 1-29.
Cooper, A. forthcoming. Stress effects on laryngeal gestures.
Cooper, W. E. and J. M. Paccia-Cooper. 1980. Syntax and Speech. Cambridge, MA:
Harvard University Press.
Cooper, W. E. and J. Sorensen, 1981. Fundamental Frequency in Sentence Production.
New York: Springer.
Costa, P. J. and I. G. Mattingly. 1981. Production and perception of phonetic
contrast during phonetic change. Status Report on Speech Research Sr-67/68.
New Haven: Haskins Laboratories, 191-6.
Cotton, S. and F. Grosjean. 1984. The gating paradigm: a comparison of successive
and individual presentation formats. Perception and Psychophysics 35: 41-8.
Crompton, A. 1982. Syllables and segments in speech production. In A. Cutler (ed.),
Slips of the Tongue and Language Production. Amsterdam: Mouton.
Crystal, D. 1969. Prosodic Systems and Intonation in English. Cambridge: Cambridge
University Press.
Cutler, A. 1980. Errors of stress and intonation. In V. A. Fromkin (ed.), Errors in
Linguistic Performance. New York: Academic Press.
1987. Phonological structure in speech recognition. PY 3: 161-78.
Cutler, A., J. Mehler, D. Norris, and J. Segui. 1986. The syllable's differing role in the
segmentation of French and English. Journal of Memory and Language 25:
385-400.
Dalby, J. M. 1986. Phonetic Structure of Fast Speech in American English. Bloom-
ington: IULC.
Daniloff, R. and R. E. Hammarberg. 1973. On defining coarticulation. JPhon, 1:
239-48.
Daniloff, R., G. Shuckers, and L. Feth. 1980. The Physiology of Speech and Hearing.
Englewood Cliffs, NJ: Prentice Hall.
Dauer, R. M. 1980. Stress and rhythm in modern Greek. Doctoral dissertation,
University of Edinburgh.
Delattre, P. 1966. Les dix intonations de base du français. French Review 40: 1-14.
1971. Pharyngeal features in the consonants of Arabic, German, Spanish, French,
and American English. Phonetica 23: 129-55.
Dell, F. forthcoming. L'Accentuation dans les phrases en français. In F. Dell, J.-R.
Vergnaud, and D. Hirst (eds.), Les Représentations en phonologie. Paris:
Hermann.
Dev, A. T. 1973. Students' Favourite Dictionary. Calcutta: Dev Sahitya Kutir.
Diehl, R. and K. Kluender. 1989. On the objects of speech perception. Ecological
Psychology 1.2: 121-44.
Dinnsen, D. A. 1983. On the Characterization of Phonological Neutralization. IULC.
1985. A re-examination of phonological neutralisation. JL 21: 265-79.
Dixit, R. P. 1987. In defense of the phonetic adequacy of the traditional term "voiced
aspirated." UCLA Working Papers in Phonetics 67: 103-11.
Dobson, E. J. 1968. English Pronunciation. 1500-1700. 2nd edn. Oxford: Oxford
University Press.
Docherty, G. J. 1989. An experimental phonetic study of the timing of voicing in
English obstruents. Doctoral dissertation, University of Edinburgh.
Downing, B. 1970. Syntactic structure and phonological phrasing in English.
Doctoral dissertation, University of Texas.
Dowty, D. R., R. E. Wall, and S. Peters. 1981. Introduction to Montague Semantics.
Dordrecht: Reidel.
Erikson, D. 1976. A physiological analysis of the tones of Thai. Doctoral dissertation,
University of Connecticut.
Erikson, Y. 1973. Preliminary evidence of syllable locked temporal control of Fo.
STL-QPSR 2-3: 23-30.
Erikson, Y. and M. Alstermark. 1972. Fundamental frequency correlates of the grave
accent in Swedish: the effect of vowel duration. STL-QPSR 2-3: 53-60.
Fant, G. 1959. Acoustic analysis and synthesis of speech with applications to
Swedish. Ericsson Technics 1.
1960. Acoustic Theory of Speech Production. The Hague: Mouton.
Fant, G. and Q. Linn. 1988. Frequency domain interpretation and derivation of
glottal flow parameters. STL-QPSR 2-3: 1-21.
Ferguson, C. A. and M. Chowdhury. 1960. The phonemes of Bengali. Lg 36.1: 22-59.
Firth, J. R. 1948. Sounds and prosodies. Transactions of the Philological Society
127-52; also in F. R. Palmer (ed.) Prosodic Analysis. Oxford: Oxford University
Press.
1957. A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis.
The Philological Society, Oxford: Basil Blackwell.
Fischer-Jorgensen, E. 1975. Trends in Phonological Theory. Copenhagen: Akademisk.
Flanagan, J. L. 1972. Speech Analysis, Synthesis, and Perception, 2nd edn. Berlin:
Springer.
Folkins, J. W. and J. H. Abbs. 1975. Lip and jaw motor control during speech:
responses to resistive loading of the jaw. JSHR 18: 207-20.
Foss, D. J. 1969. Decision processes during sentence comprehension: effects of lexical
item difficulty and position upon decision times. JVLVB 8: 457-62.
Foss, D. J. and M. A. Blank. 1980. Identifying the speech codes. Cognitive Psychology
12: 1-31.
Foss, D. J. and M. A. Gernsbacher. 1983. Cracking the dual code: toward a unitary
model of phoneme identification. JVLVB 22: 609-32.
Foss, D. J. and D. A. Swinney. 1973. On the psychological reality of the phoneme:
perception, identification, and consciousness. JVLVB 12: 246-57.
Foss, D. J., D. A. Harwood, and M. A. Blank. 1980. Deciphering decoding decisions:
data and devices. In R. A. Cole (ed.), The Perception and Production of Fluent
Speech, Hillsdale, NJ: Lawrence Erlbaum.
Fourakis M. 1986. An acoustic study of the effects of tempo and stress on segmental
intervals in modern Greek. Phonetica 43: 172-88.
Fourakis, M. and R. Port. 1986. Stop epenthesis in English. JPhon 14: 197-221.
Fowler, C. A. 1977. Timing Control in Speech Production. Bloomington, IULC.
1980. Coarticulation and theories of extrinsic timing control. JPhon 8: 113-33.
1981a. Perception and production of coarticulation among stressed and unstressed
vowels. JSHR 24: 127-39.
1981b. A relationship between coarticulation and compensatory shortening. Pho-
netica 38: 35-50.
1985. Current perspectives on language and speech perception: a critical overview.
In R. Daniloff (ed.), Speech Science: Recent Advances. San Diego, CA:
College-Hill.
1986. An event approach to the study of speech perception from a direct-realist
perspective. JPhon 14: 3-28.
Fowler, C. A., P. Rubin, R. E. Remez, and M. T. Turvey. 1980. Implications for
speech production of a skilled theory of action. In B. Butterworth (ed.),
Language Production I. London: Academic Press.
Frederiksen, J. R. 1967. Cognitive factors in the recognition of ambiguous auditory
and visual stimuli. (Monograph) Journal of Personality and Social Psychology
7.
Franke, F. 1889. Die Umgangssprache der Nieder-Lausitz in ihren Lauten. Phone-
tische Studien II, 21.
Fritzell, B. 1969. The velopharyngeal muscles in speech. Acta Otolaryngologica.
Suppl. 250.
Fromkin, V. A. 1971. The non-anomalous nature of anomalous utterances. Lg 47:
27-52.
1976. Putting the emPHAsis on the wrong sylLABle. In L. M. Hyman (ed.), Studies
in Stress and Accent. Los Angeles: University of Southern California.
Fry, D. B. 1955. Duration and intensity as physical correlates of linguistic stress.
JASA 27: 765-8.
1958. Experiments in the perception of stress. Lg & Sp 1: 126-52.
Fudge, E. C. 1987. Branching structure within the syllable. JL 23: 359-77.
Fujimura, O. 1962. Analysis of nasal consonants. JASA 34: 1865-75.
1986. Relative invariance of articulatory movements: an iceberg model. In J. S.
Perkell and D. H. Klatt (eds.), Invariance and Variability in Speech Processes.
Hillsdale, NJ: Lawrence Erlbaum.
1987. Fundamentals and applications in speech production research. Proceedings
of the Eleventh International Congress of Phonetic Sciences. 6: 10-27.
1989a. An overview of phonetic and phonological research. Nihongo to Nihongo
Kyooiku 2: 365-89. (Tokyo: Meiji-shoin.)
1989b. Comments on "On the quantal nature of speech", by K. N. Stevens. JPhon
17: 87-90.
1990. Toward a model of articulatory control: comments on Browman and
Goldstein's paper. In J. Kingston and M. Beckman (eds.), Papers in Laboratory
Phonology I: Between the Grammar and the Physics of Speech. Cambridge:
Cambridge University Press, 377-81.
Fujimura, O. and M. Sawashima. 1971. Consonant sequences and laryngeal control.
Annual Bulletin of the Research Institute of Logopedics and Phoniatrics 5: 1-6.
Fujisaki, H. and K. Hirose. 1984. Analysis of voice fundamental frequency contours
for declarative sentences of Japanese. Journal of the Acoustical Society of Japan
5.4: 233-42.
Fujisaki, H. and H. Keikichi. 1982. Modelling the dynamic characteristics of voice
fundamental frequency with applications to analysis and synthesis of intonation.
Preprints of Papers, Working Group on Intonation, Thirteenth International
Congress of Linguists, Tokyo.
Fujisaki, H. and S. Nagashima. 1969. A model for the synthesis of pitch contours of
connected speech. Tokyo University Engineering Research Institute Annual
Report 28: 53-60.
Fujisaki, H. and H. Sudo. 1971a. A generative model for the prosody of connected
speech in Japanese. Tokyo University Engineering Research Institute Annual
Report 30: 75-80.
1971b. Synthesis by rule of prosodic features of Japanese. Proceedings of the
Seventh International Congress of Acoustics 3: 133-6.
Fujisaki, H., M. Sugito, K. Hirose, and N. Takahashi. 1983. Word accent and
sentence intonation in foreign language learning. Preprints of Papers, Working
Group on Intonation, Thirteenth International Congress of Linguists, Tokyo:
109-19.
Gage, W. 1958. Grammatical structures in American English intonation. Doctoral
dissertation, Cornell University.
Gamkrelidze, T. V. 1975. On the correlation of stops and fricatives in a phonological
system. Lingua 35: 231-61.
Garding, E. 1983. A generative model of intonation. In A. Cutler and D. R. Ladd
(eds.), Prosody: Models and Measurements. Heidelberg: Springer.
Garding, E., A. Botinis, and P. Touati. 1982. A comparative study of Swedish, Greek
and French intonation. Working Papers, Department of Linguistics, University of
Lund, 22: 137-52.
Gay, T. 1977. Articulatory movements in VCV sequences. JASA 62: 183-93.
1978. Articulatory units: segments or syllables. In A. Bell and J. Hooper (eds.),
Segments and Syllables. Amsterdam: North Holland.
1981. Mechanisms in the control of speech rate. Phonetica 38: 148-58.
Gazdar, G., E. Klein, G. Pullum, and I. Sag. 1985. Generalised Phrase Structure
Grammar. London: Basil Blackwell.
Gimson, A. C. 1960. The instability of English alveolar articulations. Le Maitre
Phonetique 113: 7-10.
1970. An Introduction to the Pronunciation of English. London: Edward Arnold.
Gobi, C. 1988. Voice source dynamics in connected speech. STL-QPSR 1: 123-59.
Goldsmith, J. 1976. Autosegmental Phonology. MIT Doctoral dissertation. New
York: Garland, 1979.
1984. Tone and accent in Tonga. In G. N. Clements and J. Goldsmith (eds.),
Autosegmental Studies in Bantu Tone (Publications in African Languages and
Linguistics 3). Dordrecht: Foris, 19-51.
Gracco, V. and J. Abbs. Variant and invariant characteristics of speech movements.
Experimental Brain Research 65: 165-6.
Greene, P.H. 1971. Introduction. In I. M. Gelfand, V. S. Gurfinkel, S. V. Fomin, and
M. L. Tsetlin (eds.), Models of Structural Functional Organization of Certain
Biological Systems. Cambridge, MA: MIT Press, xi-xxxi.
Grønnum, N. forthcoming. Prosodic parameters in a variety of regional Danish
standard languages, with a view towards Swedish and German. To appear in
Phonetica.
Grosjean, F. 1980. Spoken word recognition processes and the gating paradigm.
Perception and Psychophysics 28: 267-83.
Günther, H. 1988. Oblique word forms in visual word recognition. Linguistics 26:
583-600.
Gussenhoven, C. 1983. Focus, mode and the nucleus. JL 19: 377-417.
1984. On the Grammar and Semantics of Sentence Accents. Dordrecht: Foris.
1988. Adequacy in intonational analysis: the case of Dutch. In H. van der Hulst
and N. Smith (eds.), Autosegmental Studies in Pitch Accent. Dordrecht: Foris.
forthcoming. Intonational phrasing and the prosodic hierarchy. Phonologica 1988.
Cambridge: Cambridge University Press.
Gussenhoven, C. and T. Rietveld. 1988. Fundamental frequency declination in
Dutch: testing three hypotheses. JPhon 16: 355-69.
Hakoda, K. and H. Sato. 1980. Prosodic rules in connected speech synthesis. Trans.
IECE. 63-D No. 9: 715-22.
Halle, M. and K. N. Stevens. 1971. A note on laryngeal features. MITPR 101:
198-213.
Halle, M. and J. Vergnaud. 1980. Three dimensional phonology. Journal of Linguistic
Research 1: 83-105.
Halliday, M. A. K. 1967. Intonation and Grammar in British English. The Hague:
Mouton.
Hammond, M. 1988. On deriving the well-formedness condition. LI 19: 319-25.
Han, M. S. 1962. Japanese Phonology: An Analysis Based on Sound Spectrograms.
Tokyo: Kenkyusha.
Haraguchi, S. 1977. The Tone Pattern of Japanese: An Autosegmental Theory of
Tonology. Tokyo: Kaitakushi.
Hardcastle, W. J. 1972. The use of electropalatography in phonetic research.
Phonetica 25: 197-215.
1976. Physiology of Speech Production: An Introduction for speech scientists.
London: Academic Press.
Harris, Z. H. 1944. Simultaneous components in phonology. Lg 20: 181-205.
Harshman, R., P. Ladefoged and L. Goldstein. 1977. Factor analysis of tongue
shapes. JASA 62: 693-707.
't Hart, J. 1979a. Naar automatisch genereeren van toonhoogte-contouren voor
tamelijk lange stukken spraak. IPO Technical Report No. 353, Eindhoven.
1979b. Explorations in automatic stylization of Fo curves. IPO Annual Progress
Report 14: 61-5.
1981. Differential sensitivity to pitch distance, particularly in speech. JASA 69:
811-21.
't Hart, J. and A. Cohen. 1973. Intonation by rule: a perceptual quest. JPhon 1:
309-27.
't Hart, J. and R. Collier. 1975. Integrating different levels of intonation analysis.
JPhon 3: 235-55.
1979. On the interaction of accentuation and intonation in Dutch. Proceedings of
The Ninth International Congress of Phonetic Sciences 2: 385-402.
Hawking, S. W. 1988. A Brief History of Time. London: Bantam Press.
Hawkins, S. 1984. On the development of motor control in speech: evidence from
studies of temporal coordination. In N. J. Lass (ed.), Speech and Language:
Advances in Basic Research and Practice 11: 317-74.
Hayes, B. 1981. A Metrical theory of stress rules. Bloomington: IULC.
1986. Inalterability in CV Phonology. Lg 62.2: 321-51.
1989. Compensatory lengthening in moraic phonology. LI 20.2: 253-306.
Hayes, B. and A. Lahiri. 1991. Bengali intonational phonology. Natural Language
and Linguistic Theory 9.1: 47-96.
Hayes, B. and S. Puppel. 1985. On the rhythm rule in Polish. In H. van der Hulst and
N. Smith (eds.), Advances in Non-Linear Phonology. Dordrecht: Foris.
Hebb, D. O. 1949. The Organization of Behavior. New York: Wiley.
Helfrich, H. 1979. Age markers in speech. In K. R. Scherer and H. Giles (eds.), Social
Markers in Speech. Cambridge: Cambridge University Press.
Henderson, J. B. 1984. Velopharyngeal function in oral and nasal vowels: a cross-
language study. Doctoral dissertation, University of Connecticut.
Henke, W. L. 1966. Dynamic articulatory model of speech production using
computer simulation. Doctoral dissertation, MIT.
Hewlett, N. 1988. Acoustic properties of /k/ and /t/ in normal and phonologically
disordered speech. Clinical Linguistics and Phonetics 2: 29-45.
Hirose, K., H. Fujisaki, and H. Kawai. 1985. A system for synthesis of connected
speech - special emphasis on the synthesis of prosodic features. Onsei
Kenkyuukai S85-13: 325-32. The Acoustical Society of Japan.
Hirose, K., H. Fujisaki, M. Yamaguchi, and M. Yokoo. 1984. Synthesis of funda-
mental frequency contours of Japanese sentences based on syntactic structure (in
Japanese). Onsei Kenkyuukai S83-70: 547-54. The Acoustical Society of Japan.
Hirschberg, J. and J. Pierrehumbert. 1986. The intonational structuring of discourse.
Proceedings of the 24th Annual Meeting, Association for Computational Linguis-
tics, 136-44.
Hjelmslev, L. 1953. Prolegomena to a Theory of Language, Memoir 7, translated by
F. J. Whitfield. Baltimore: Waverly Press.
Hockett, C. F. 1958. A Course in Modern Linguistics. New York: Macmillan.
Hombert, J-M. 1986. Word games: some implications for analysis of tone and other
phonological processes. In J. J. Ohala and J. J. Jaeger (eds.), Experimental
Phonology. Orlando, FL: Academic Press.
Honda, K. and O. Fujimura. 1989. Intrinsic vowel Fo and phrase-final lowering:
Phonological vs. biological explanations. Paper presented at the 6th Vocal Fold
Physiology Conference, Stockholm, August 1989.
Hooper, J. B. 1976. An Introduction to Natural Generative Phonology. New York:
Academic Press.
Houlihan, K. and G. K. Iverson. 1979. Functionally constrained phonology. In D.
Dinnsen (ed.), Current Approaches to Phonological Theory. Bloomington:
Indiana University Press.
Householder, F. 1957. Accent, juncture, intonation, and my grandfather's reader.
Word 13: 234-45.
1965. On some recent claims in phonological theory. JL 1: 13-34.
Huang, C-T. J. 1980. The metrical structure of terraced level tones. In J. Jensen (ed.),
NELS 11. Department of Linguistics, University of Ottawa.
Huggins, A. W. F. 1964. Distortion of the temporal pattern of speech: interruption
and alternation. JASA 36: 1055-64.
Huss, V. 1978. English word stress in the post-nuclear position. Phonetica 35: 86-105.
Hyman, L. M. 1975. Phonology: Theory and Analysis. New York: Holt, Rinehart, and
Winston.
1985. A Theory of Phonological Weight. Dordrecht: Foris.
Jackendoff, R. 1972. Semantic Interpretation in Generative Grammar. Cambridge,
MA: MIT Press.
Jakobson, R., G. Fant, and M. Halle. 1952. Preliminaries to Speech Analysis: the
Distinctive Features and their Correlates. Cambridge, MA: MIT Press.
Jespersen, O. 1904. Phonetische Grundfragen. Leipzig: Teubner.
1920. Lehrbuch der Phonetik. Leipzig: Teubner.
Johnson, C. D. 1972. Formal Aspects of Phonological Description. The Hague:
Mouton.
Jones, D. 1909. Intonation Curves. Leipzig: Teubner.
1940. An Outline of English Phonetics. Cambridge: Heffer.
Joos, M. 1957. Readings in Linguistics 1. Chicago: University of Chicago Press.
Joseph, B. D. and I. Warburton. 1987. Modern Greek. London: Croom Helm.
Kahn, D. 1976. Syllable Based Generalizations in English Phonology. Bloomington:
IULC.
Kaisse, E. M. 1985. Connected Speech: the Interaction of Syntax and Phonology. New
York: Academic Press.
Kawasaki, H. 1982. An acoustical basis for universal constraints on sound sequences.
Doctoral dissertation, University of California, Berkeley.
1986. Phonetic explanation for phonological universals: the case of distinctive
vowel nasalization. In J. J. Ohala and J. J. Jaeger (eds.), Experimental Phonology.
Orlando, FL: Academic Press.
Kay, B. A., K. G. Munhall, E. Vatikiotis-Bateson, and J. A. S. Kelso. 1985. A note on
processing kinematic data: sampling, filtering and differentiation. Haskins
Laboratories Status Report on Speech Research SR-81: 291-303.
Kaye, J. 1988. The ultimate phonological units - features or elements? Handout for
paper delivered to LAGB, Durham Spring 1988.
Kaye, J., J. Lowenstamm and J.-R. Vergnaud. 1985. The internal structure of
phonological elements: a theory of charm and government. PY 2: 305-28.
Keating, P. A. 1983. Comments on the jaw and syllable structure. JPhon 11: 401-6.
1985. Universal phonetics and the organization of grammars. In V. Fromkin (ed.),
Phonetic Linguistics: Essays in Honor of Peter Ladefoged. Orlando, FL: Aca-
demic Press.
1988a. Underspecification in phonetics. Phonology 5: 275-92.
1988b. The phonology-phonetics interface. In F. Newmeyer (ed.), Cambridge
Linguistic Survey, vol. 1: Linguistic Theory: Foundations. Cambridge: Cambridge
University Press.
Kelly, J. and J. Local. 1986. Long domain resonance patterns in English. In
International Conference on Speech Input/Output; Techniques and Applications.
IEE Conference Publication 258: 304-9.
1989. Doing Phonology: Observing, Recording, Interpreting. Manchester: Manches-
ter University Press.
Kelso, J. A. S. and B. Tuller. 1984. A dynamical basis for action systems. In M.
Gazzaniga (ed.), Handbook of Cognitive Neuroscience. New York: Plenum,
321-56.
Kelso, J. A. S., E. Saltzman and B. Tuller. 1986a. The dynamic perspective on speech
production: data and theory. JPhon 14: 29-59.
1986b. Intentional contents, communicative context, and task dynamics: a reply to
the commentators. JPhon 14: 171-96.
Kelso, J. A. S., K. G. Holt, P. N. Kugler, and M. T. Turvey. 1980. On the concept of
coordinative structures as dissipative structures, II: Empirical lines of conver-
gence. In G. E. Stelmach and J. Requin (eds.), Tutorials in Motor Behavior.
Amsterdam: North-Holland, 49-70.
Kelso, J. A. S., B. Tuller, E. Vatikiotis-Bateson, and C. A. Fowler. 1984. Functionally
specific articulatory cooperation following jaw perturbations during speech:
evidence for coordinative structures. Journal of Experimental Psychology: Hu-
man Perception and Performance 10: 812-32.
Kelso, J. A. S., E. Vatikiotis-Bateson, E. L. Saltzman, and B. Kay. 1985. A qualitative
dynamic analysis of reiterant speech production: phase portraits, kinematics,
and dynamic modeling. JASA 77: 266-80.
Kenstowicz, M. 1970. On the notation of vowel length in Lithuanian. Papers in
Linguistics 3: 73-113.
Kent, R. D. 1983. The segmental organization of speech. In P. F. MacNeilage (ed.),
The Production of Speech. New York: Springer.
Kerswill, P. 1985. A socio-phonetic study of connected speech processes in Cam-
bridge English: an outline and some results. Cambridge Papers in Phonetics and
Experimental Linguistics. 4.
1987. Levels of linguistic variation in Durham. JL 23: 25-49.
Kerswill, P. and S. Wright. 1989. On the limits of auditory transcription: a
sociophonetic approach. York Papers in Linguistics 14: 35-59.
Kewley-Port, D. 1982. Measurement of formant transitions in naturally produced
stop consonant-vowel syllables. JASA 72.2: 379-81.
King, M. 1983. Transformational parsing. In M. King (ed.), Natural Language
Parsing. London: Academic Press.
Kingston, J. 1990. Articulatory binding. In J. Kingston and M. Beckman (eds.),
Papers in Laboratory Phonology I: Between the Grammar and the Physics of
Speech. Cambridge: Cambridge University Press, 406-34.
Kingston, J. and M. E. Beckman (eds.). 1990. Papers in Laboratory Phonology I:
Between the Grammar and the Physics of Speech. Cambridge: Cambridge
University Press.
Kingston, J. and R. Diehl. forthcoming. Phonetic knowledge and explanation. Ms.,
University of Massachusetts, Amherst, and University of Texas, Austin.
Kiparsky, P. 1979. Metrical structure assignment is cyclic. LI 10: 421-42.
1985. Some consequences of Lexical Phonology. PY 2: 85-138.
Kiparsky, P. and C. Kiparsky. 1967. Fact. In M. Bierwisch and K. E. Heidolph (eds.),
Progress in Linguistics. The Hague: Mouton.
Klatt, D. H. 1976. Linguistic uses of segmental duration in English: acoustic and
perceptual evidence. JASA 59: 1208-21.
1980. Software for a cascade/parallel formant synthesizer. JASA 67.3: 971-95.
Klein, E. 1987. Towards a declarative phonology. Ms., University of Edinburgh.
Kohler, K. J. 1976. Die Instabilität wortfinaler Alveolarplosive im Deutschen: eine
elektropalatographische Untersuchung. Phonetica 33: 1-30.
1979a. Kommunikative Aspekte satzphonetischer Prozesse im Deutschen. In H.
Vater (ed.), Phonologische Probleme des Deutschen. Tübingen: Gunter Narr,
13-40.
1979b. Dimensions in the perception of fortis and lenis plosives. Phonetica 36:
332-43.
1990. Segmental reduction in connected speech in German: phonological facts and
phonetic explanations. In W. J. Hardcastle and A. Marchal (eds.), Speech
Production and Speech Modelling. Dordrecht: Kluwer, 62-92.
Kohler, K. J., W. A. van Dommelen, and G. Timmermann. 1981. Die Merkmalpaare
stimmhaft/stimmlos und fortis/lenis in der Konsonantenproduktion und -perzeption
des heutigen Standard-Französisch. Institut für Phonetik, Universität Kiel.
Arbeitsberichte, 14.
Koutsoudas, A., G. Sanders and C. Noll. 1974. On the application of phonological
rules. Lg 50: 1-28.
Kozhevnikov, V. A. and L. A. Chistovich. 1965. Speech Articulation and Perception
(Joint Publications Research Service, 30). Washington, DC.
Krakow, R. A. 1989. The articulatory organization of syllables: a kinematic analysis
of labial and velar gestures. Doctoral dissertation, Yale University.
Krull, D. 1987. Second formant locus patterns as a measure of consonant-vowel
coarticulation. PERILUS V. Institute of Linguistics, University of Stockholm.
1989. Consonant-vowel coarticulation in continuous speech and in reference
words. STL-QPSR 1: 101-5.
Kruyt, J. G. 1985. Accents from speakers to listeners. An experimental study of the
production and perception of accent patterns in Dutch. Doctoral dissertation,
University of Leiden.
Kubozono, H. 1985. On the syntax and prosody of Japanese compounds. Work in
Progress 18: 60-85. Department of Linguistics, University of Edinburgh.
1988a. The organisation of Japanese prosody. Doctoral dissertation, University of
Edinburgh.
1988b. Constraints on phonological compound formation. English Linguistics 5:
150-69.
1988c. Dynamics of Japanese intonation. Ms., Nanzan University.
1989. Syntactic and rhythmic effects on downstep in Japanese. Phonology 6.1:
39-67.
Kucera, H. and W. N. Francis. 1967. Computational Analysis of Present-day
American English. Providence RI: Brown University Press.
Kuno, S. 1973. The Structure of the Japanese Language. Cambridge, MA: MIT
Press.
Kutik, E., W. E. Cooper, and S. Boyce. 1983. Declination of fundamental frequency
in speakers' production of parenthetical and main clauses. JASA 73: 1731-8.
Ladd, D. R. 1980. The Structure of Intonational Meaning: Evidence from English.
Bloomington: Indiana University Press.
1983a. Phonological features of intonational peaks. Lg 59: 721-59.
1983b. Levels versus configurations, revisited. In F. B. Agard, G. B. Kelley, A.
Makkai, and V. B. Makkai, (eds.), Essays in Honor of Charles F. Hockett.
Leiden: E. J. Brill.
1984. Declination: a review and some hypotheses. PY 1: 53-74.
1986a. Intonational phrasing: the case for recursive prosodic structure. PY 3:
311-40.
1986b. The representation of European downstep. Paper presented at the autumn
meeting of the LAGB, Edinburgh.
1987a. Description of research on the procedures for assigning Fo to utterances.
CSTR Text-to-Speech Status Report. Edinburgh: Centre for Speech Technology
Research.
1987b. A phonological model of intonation for use in speech synthesis by rule.
Proceedings of the European Conference on Speech Technology, Edinburgh.
Ladefoged, P. and I. Maddieson. 1989. Multiply articulated segments and the feature
hierarchy. UCLA Working Papers in Phonetics 72: 116-38.
Lahiri, A. and J. Hankamer. 1988. The timing of geminate consonants. JPhon 16:
327-38.
Lahiri, A. and J. Koreman. 1988. Syllable weight and quantity in Dutch. WCCFL 7:
217-28.
Lakoff, R. 1973. Language and woman's place. Language in Society 2: 45-79.
Langmeier, C., U. Lüders, L. Schiefer, and B. Modi. 1987. An acoustic study on
murmured and "tight" phonation in Gujarati dialects - a preliminary report.
Proceedings of the Eleventh International Congress of Phonetic Sciences 1:
328-31.
Lapointe, S. G. 1977. Recursiveness and deletion. Linguistic Analysis 3.3: 227-65.
Lashley, K. S. 1930. Basic neural mechanisms in behavior. Psychological Review 37:
1-24.
Lass, R. 1976. English Phonology and Phonological Theory: Synchronic and Diachro-
nic Studies. Cambridge: Cambridge University Press.
1984a. Vowel system universals and typology: prologue to theory. PY 1: 75-112.
1984b. Phonology: an Introduction to Basic Concepts. Cambridge: Cambridge
University Press.
Lea, W. A. 1980. Prosodic aids to speech recognition. In W. A. Lea (ed.), Trends in
Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall.
Leben, W. R. 1973. Suprasegmental phonology. Doctoral dissertation, MIT.
1976. The tones in English intonation. Linguistic Analysis 2: 69-107.
1978. The representation of tone. In V. Fromkin (ed.), Tone: a Linguistic Survey.
New York: Academic Press.
Lehiste, I. 1970. Suprasegmentals. Cambridge, MA: MIT Press.
1972. The timing of utterances and linguistic boundaries. JASA 51: 2018-24.
1975. The phonetic structure of paragraphs. In A. Cohen and S. G. Nooteboom
(eds.), Structure and Process in Speech Perception. Heidelberg: Springer.
1980. Phonetic manifestation of syntactic structure in English. Annual Bulletin,
University of Tokyo RILP 14: 1-28.
Liberman, A. M. and I. G. Mattingly. 1985. The motor theory of speech perception
revised. Cognition 21: 1-36.
Liberman, M. Y. 1975. The Intonational System of English. Doctoral dissertation,
MIT. Distributed in 1978 by IULC.
Liberman, M. Y. and J. Pierrehumbert. 1984. Intonational invariance under changes
in pitch range and length. In M. Aronoff and R. T. Oehrle (eds.), Language
Sound Structure: Studies in Phonology Presented to Morris Halle. Cambridge,
MA: MIT Press, 157-233.
Liberman, M. Y. and A. Prince. 1977. On stress and linguistic rhythm. LI 8: 249-
336.
Licklider, J. C. R. and G. A. Miller. 1951. The perception of speech. In S. S. Stevens
(ed.), Handbook of Experimental Psychology. New York: John Wiley.
Lieberman, P. 1967. Intonation, Perception, and Language. Cambridge, MA: MIT
Press.
Lindau, M. and P. Ladefoged. 1986. Variability of feature specifications. In J. S.
Perkell and D. Klatt (eds.), Invariance and Variability of Speech Processes.
Hillsdale, NJ: Lawrence Erlbaum.
Lindblom, B. 1963. Spectrographic study of vowel reduction. JASA 35: 1773-81.
1983. Economy of speech gestures. In P. MacNeilage (ed.), The Production of
Speech. New York: Springer.
1984. Can the models of evolutionary biology be applied to phonetic problems? In
M. P. R. van den Broeke and A. Cohen (eds.), Proceedings of the Ninth
International Congress of Phonetic Sciences, Dordrecht: Foris.
1989. Phonetic invariance and the adaptive nature of speech. In B. A. G.
Elsdendoorn and H. Bouma (eds.), Working Models of Human Perception.
London: Academic Press.
Lindblom, B. and R. Lindgren. 1985. Speaker-listener interaction and phonetic
variation. PERILUS IV, Institute of Linguistics, University of Stockholm.
Lindblom, B. and J. Lubker. 1985. The speech homunculus and a problem of
phonetic linguistics. In V. Fromkin (ed.), Essays in Honor of Peter Ladefoged.
Orlando, FL: Academic Press.
Lindsey, G. 1985. Intonation and interrogation: tonal structure and the expression of
a pragmatic function in English and other languages. Doctoral dissertation,
UCLA.
Linell, P. 1979. Psychological Reality in Phonology. Cambridge: Cambridge Univer-
sity Press.
Local, J. K. 1990. Some rhythm, resonance and quality variations in urban Tyneside
speech. In S. Ramsaran (ed.), Studies in the Pronunciation of English: a Com-
memorative Volume in Honour of A. C. Gimson. London: Routledge, 282-92.
Local, J. K. and J. Kelly. 1986. Projection and "silences": notes on phonetic detail
and conversational structure. Human Studies 9: 185-204.
Lodge, K. R. 1984. Studies in the Phonology of Colloquial English. London: Croom
Helm.
Löfqvist, A., T. Baer, N. S. McGarr, and R. S. Story. 1989. The cricothyroid muscle
in voicing control. JASA 85: 1314-21.
Lubker, J. 1968. An EMG-cinefluorographic investigation of velar function during
normal speech production. Cleft Palate Journal 5.1.
1981. Temporal aspects of speech production: anticipatory labial coarticulation.
Phonetica 38: 51-65.
Lyons, J. 1962. Phonemic and non-phonemic phonology. International Journal of
American Linguistics, 28: 127-33.
McCarthy, J. J. 1979. Formal problems in Semitic phonology and morphology.
Doctoral dissertation, MIT.
1981. A prosodic theory of nonconcatenative morphology. LI 12.3: 373-418.
1989. Feature geometry and dependency. Phonetica 43: 84-108.
McCarthy, J. J. and A. Prince. 1986. Prosodic Morphology. Manuscript to appear
with MIT Press.
McCawley, J. D. 1968. The Phonological Component of a Grammar of Japanese. The
Hague: Mouton.
Macchi, M. 1985. Segmental and suprasegmental features and lip and jaw articula-
tors. Doctoral dissertation, New York University.
1988. Labial articulation patterns associated with segmental features and syllables
in English. Phonetica 45: 109-21.
McClelland, J. L. and J. L. Elman. 1986. The TRACE model of speech perception.
Cognitive Psychology 18: 1-86.
McCroskey, R. L., Jr. 1957. Effect of speech on metabolism. Journal of Speech and
Hearing Disorders 22: 46-52.
MacKay, D. G. 1972. The structure of words and syllables: evidence from errors in
speech. Cognitive Psychology 3: 210-27.
Maddieson, I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press.
Maeda, S. 1974. A characterization of fundamental frequency contours of speech.
MIT Quarterly Progress Report 114: 193-211.
1976. A characterization of American English intonation. Doctoral dissertation,
MIT.
Magen, H. 1989. An acoustic study of vowel-to-vowel coarticulation in English.
Doctoral dissertation, Yale University.
Makkai, V. B. 1972. Phonological Theory: Evolution and Current Practice. New York:
Holt, Rinehart, and Winston.
Malikouti-Drachman, A. and B. Drachman. 1980. Slogan chanting and speech
rhythm in Greek. In W. Dressler, O. Pfeiffer, and J. Rennison (eds.), Phonologica
1980. Innsbruck: Innsbrucker Beiträge zur Sprachwissenschaft.
Mandelbrot, B. 1954. Structure formelle des textes et communication. Word 10: 1-27.
Marslen-Wilson, W. D. 1984. Function and process in spoken word-recognition. In
H. Bouma and D. G. Bouwhuis (eds.), Attention and Performance X: Control of
Language Processes. Hillsdale, NJ: Lawrence Erlbaum.
1987. Functional parallelism in spoken word-recognition. In U. Frauenfelder and
L. K. Tyler (eds.), Spoken Word Recognition. Cambridge, MA: MIT Press.
Mascaro, J. 1983. Phonological levels and assimilatory processes. Ms., Universitat
Autonoma de Barcelona.
Massaro, D. W. 1972. Preperceptual images, processing time and perceptual units in
auditory perception. Psychological Review 79: 124-45.
Mehler, J. 1981. The role of syllables in speech processing. Philosophical Transactions
of the Royal Society B295: 333-52.
Mehler, J., J. Y. Dommergues, U. Frauenfelder, and J. Segui. 1981. The syllable's role
in speech segmentation. JVLVB 20: 298-305.
Mehrotra, R. C. 1980. Hindi Phonology. Raipur.
Menn, L. and S. Boyce. 1982. Fundamental frequency and discourse structure. Lg &
Sp 25: 341-83.
Menzerath, P. and A. de Lacerda. 1933. Koartikulation, Steuerung und Lautabgren-
zung. Bonn.
Miller, J. E. and O. Fujimura. 1982. Graphic displays of combined presentations of
acoustic and articulatory information. The Bell System Technical Journal 61:
799-810.
Mills, C. B. 1980. Effects of context on reaction time to phonemes. JVLVB 19: 75-83.
Mohanan, K. P. 1983. The structure of the melody. Ms., MIT and University of
Singapore.
1986. The Theory of Lexical Phonology. Dordrecht: Reidel.
Monsen, R. B., A. M. Engebretson, and N. R. Vermula. 1978. Indirect assessment of
the contribution of sub-glottal pressure and vocal fold tension to changes
of fundamental frequency in English. JASA 64: 65-80.
Munhall, K., D. Ostry, and A. Parush. 1985. Characteristics of velocity profiles
of speech movements. Journal of Experimental Psychology: Human Perception
and Performance 2: 457-74.
Nakatani, L. and J. Schaffer. 1978. Hearing "words" without words: prosodic cues
for word perception. JASA 63: 234-45.
Nathan, G. S. 1983. The case for place - English rapid speech autosegmentally.
Chicago Linguistic Society 19: 309-16.
Nearey, T. M. 1980. On the physical interpretation of vowel quality: cinefluoro-
graphic and acoustic evidence. JPhon 8: 213-41.
Nelson, W. 1983. Physical principles for economies of speech movements. Biological
Cybernetics 46: 135-47.
Nespor, M. 1988. Rithmika charaktiristika tis Ellinikis (Rhythmic Characteristics of
Greek). Studies in Greek Linguistics: Proceedings of the 9th Annual Meeting
of the Department of Linguistics, Faculty of Philosophy, Aristotelian University
of Thessaloniki.
Nespor, M. and I. Vogel. 1982. Prosodic domains of external sandhi rules. In H. van
der Hulst and N. Smith (eds.) Advances in Non-Linear Phonology. Dordrecht:
Foris.
1983. Prosodic structure above the word. In A. Cutler and D. R. Ladd (eds.),
Prosody: Models and Measurements. Heidelberg: Springer.
1986. Prosodic Phonology. Dordrecht: Foris.
1989. On clashes and lapses. Phonology 6: 69-116.
Nittrouer, S., M. Studdert-Kennedy and R. S. McGowan. 1989. The emergence of
phonetic segments: evidence from the spectral structure of fricative-vowel
syllables spoken by children and adults. JSHR 32: 120-32.
Nittrouer, S., K. Munhall, J. A. S. Kelso, E. Tuller, and K. S. Harris. 1988. Patterns
of interarticulator phasing and their relationship to linguistic structure. JASA
84: 1653-61.
Nolan, F. J. 1986. The implications of partial assimilation and incomplete neutralisa-
tion. Cambridge Papers in Phonetics and Experimental Linguistics, 5.
Norris, D. G. and A. Cutler. 1988. The relative accessibility of phonemes and
syllables. Perception and Psychophysics 45: 485-93.
O'Connor, J. D. and G. F. Arnold. 1973. Intonation of Colloquial English, 2nd edn.
London: Longman.
Ohala, J. J. 1974. Experimental historical phonology. In J. M. Anderson and C. Jones
(eds.), Historical Linguistics, vol. II: Theory and Description in Phonology.
Amsterdam: North-Holland, 353-89.
1975. Phonetic explanations for nasal sound patterns. In C. A. Ferguson, L. M.
Hyman, and J. J. Ohala (eds.), Nasalfest: Papers from a Symposium on Nasals
and Nasalization. Stanford: Language Universals Project.
1976. A model of speech aerodynamics. Report of the Phonology Laboratory
(Berkeley) 1: 93-107.
1978. Phonological notations as models. In W. U. Dressler and W. Meid (eds.),
Proceedings of the Twelfth International Congress of Linguists, Vienna 1977.
Innsbruck: Innsbrucker Beiträge zur Sprachwissenschaft.
1979a. Universals of labial velars and de Saussure's chess analogy. Proceedings of
the Ninth International Congress of Phonetic Sciences, vol. II. Copenhagen:
Institute of Phonetics.
1979b. The contribution of acoustic phonetics to phonology. In B. Lindblom and
S. Öhman (eds.), Frontiers of Speech Communication Research. London:
Academic Press.
1981a. Speech timing as a tool in phonology. Phonetica 43: 84-108.
1981b. The listener as a source of sound change. In C. S. Masek, R. A. Hendrick,
and M. F. Miller (eds.), Papers from the Parasession on Language and Behavior.
Chicago: Chicago Linguistic Society.
1982. Physiological mechanisms underlying tone and intonation. In H. Fujisaki
and E. Gårding (eds.), Preprints, Working Group on Intonation, Thirteenth
International Congress of Linguists, Tokyo, 29 Aug.-4 Sept. 1982. Tokyo.
1983. The origin of sound patterns in vocal tract constraints. In P. F. MacNeilage
(ed.), The Production of Speech. New York: Springer.
1985a. Linguistics and automatic speech processing. In R. De Mori and C.-Y. Suen
(eds.), New Systems and Architectures for Automatic Speech Recognition and
Synthesis. Berlin: Springer.
1985b. Around flat. In V. Fromkin (ed.), Phonetic Linguistics. Essays in Honor of
Peter Ladefoged. Orlando, FL: Academic Press.
1986. Phonological evidence for top down processing in speech perception. In J. S.
Perkell and D. H. Klatt (eds.), Invariance and Variability in Speech Processes.
Hillsdale, NJ: Lawrence Erlbaum.
1987. Explanation in phonology: opinions and examples. In W. U. Dressler, H. C.
Luschützky, O. E. Pfeiffer, and J. R. Rennison (eds.), Phonologica 1984.
Cambridge: Cambridge University Press.
1989. Sound change is drawn from a pool of synchronic variation. In L. E. Breivik
and E. H. Jahr (eds.), Language Change: Do we Know its Causes Yet? (Trends in
Linguistics). Berlin: Mouton de Gruyter.
1990a. The phonetics and phonology of aspects of assimilation. In J. Kingston and
M. Beckman (eds.), Papers in Laboratory Phonology I. Between the Grammar and
Physics of Speech. Cambridge: Cambridge University Press, 258-75.
1990b. The generality of articulatory binding: comments on Kingston's "Articula-
tory binding". In J. Kingston and M. Beckman (eds.), Papers in Laboratory
Phonology I. Between the Grammar and Physics of Speech. Cambridge: Cam-
bridge University Press, 445-50.
forthcoming. The costs and benefits of phonological analysis. In P. Downing,
S. Lima, and M. Noonan (eds.), Literacy and Linguistics. Amsterdam: John
Benjamins.
Ohala, J. J. and B. W. Eukel. 1987. Explaining the intrinsic pitch of vowels. In R.
Channon and L. Shockey (eds.), In Honor of Ilse Lehiste. Ilse Lehiste
Pühendusteos. Dordrecht: Foris, 207-15.
Ohala, J. J. and D. Feder. 1987. Listeners' identity of speech sounds is influenced by
adjacent "restored" phonemes. Proceedings of the Eleventh International Con-
gress of Phonetic Sciences. 4: 120-3.
Ohala, J. J. and J. J. Jaeger (eds.), 1986. Experimental Phonology. Orlando, FL:
Academic Press.
Ohala, J. J. and H. Kawasaki. 1984. Phonetics and prosodic phonology. PY 1:
113-27.
Ohala, J. J. and J. Lorentz. 1977. The story of [w]: an exercise in the phonetic
explanation for sound patterns. Berkeley Linguistic Society, Proceedings, Annual
Meeting 3: 577-99.
Ohala, J. J. and C. J. Riordan. 1980. Passive vocal tract enlargement during voiced
stops. Report of the Phonology Laboratory, Berkeley 5: 78-87.
Ohala, J. J., M. Amador, L. Araujo, S. Pearson, and M. Peet. 1984. Use of synthetic
speech parameters to estimate success of word recognition. JASA 75: S.93.
Ohala, M. 1979. Phonological features of Hindi stops. South Asian Languages
Analysis 1: 79-87.
1983. Aspects of Hindi Phonology. Delhi: Motilal Banarsidass.
Öhman, S. E. G. 1966a. Coarticulation in VCV utterances: spectrographic
measurements. JASA 39: 151-68.
1966b. Perception of segments of VCCV utterances. JASA 40: 979-88.
Olive, J. 1975. Fundamental frequency rules for the synthesis of simple declarative
sentences. JASA 57: 476-82.
Oller, D. K. 1973. The effect of position in utterance on speech segment duration in
English. JASA 54: 1235-47.
O'Shaughnessy, D. 1979. Linguistic features in fundamental frequency patterns.
JPhon 7: 119-45.
O'Shaughnessy, D. and J. Allen. 1983. Linguistic modality effects on fundamental
frequency in speech. JASA 74, 1155-71.
Ostry, D. J. and K. G. Munhall. 1985. Control of rate and duration of speech
movements. JASA 77: 640-8.
Ostry, D. J., E. Keller, and A. Parush. 1983. Similarities in the control of speech
articulators and the limbs: kinematics of tongue dorsum movement in
speech. Journal of Experimental Psychology: Human Perception and Performance
9: 622-36.
Otsu, Y. 1980. Some aspects of rendaku in Japanese and related problems. Theoreti-
cal Issues in Japanese Linguistics (MIT Working Papers in Linguistics 2).
Palmer, F. R. 1970. Prosodic Analysis. Oxford: Oxford University Press.
Perkell, J. S. 1969. Physiology of Speech Production: Results and implications of a
quantitative cineradiographic study (Research Monograph 53). Cambridge, MA:
MIT Press.
Peters, S. 1973. On restricting deletion transformations. In M. Gross, M. H. Halle,
and M. Schützenberger (eds.), The Formal Analysis of Language. The Hague:
Mouton.
Peters, S. and R. W. Ritchie 1972. On the generative power of transformational
grammars. Information Sciences 6: 49-83.
Peterson, G. E. and H. Barney. 1952. Control methods used in a study of vowels.
JASA 24: 175-84.
Peterson, G. E. and I. Lehiste. 1960. Duration of syllable nuclei in English. JASA 32:
693-703.
Pierrehumbert, J. 1980. The Phonetics and Phonology of English Intonation. Doctoral
dissertation, MIT; distributed 1988, Bloomington: IULC.
1981. Synthesizing intonation. JASA 70: 985-95.
forthcoming. A preliminary study of the consequences of intonation for the voice
source. STL-QPSR 4: 23-36.
Pierrehumbert, J. and M. E. Beckman. 1988. Japanese Tone Structure (Linguistic
Inquiry Monograph Series 15). Cambridge, MA: MIT Press.
Pierrehumbert, J. and J. Hirschberg. 1990. The meaning of intonation contours in the
interpretation of discourse. In P. Cohen, J. Morgan, and M. Pollack (eds.), Plans
and Intentions in Communication. Cambridge, MA: MIT Press, 271-312.
Pike, K. L. 1945. The Intonation of American English. Ann Arbor: University of
Michigan Press.
Pollard, C. and I. Sag. 1987. Information-based Syntax and Semantics. Stanford:
CSLI.
Poon, P. G. and C. A. Mateer. 1985. A study of Nepali stop consonants. Phonetica
42: 39-47.
Poser, W. J. 1984. The phonetics and phonology of tone and intonation in Japanese.
Doctoral dissertation, MIT.
Pulleyblank, D. 1986. Tone in Lexical Phonology. Dordrecht: Reidel.
1989. Non-linear phonology. Annual Review of Anthropology 18: 203-26.
Pullum, G. K. 1978. Rule Interaction and the Organization of a Grammar. New York:
Garland.
Recasens, D. 1987. An acoustic analysis of V-to-C and V-to-V coarticulatory effects
in Catalan and Spanish VCV sequences. JPhon 15: 299-312.
Repp, B. H. 1981. On levels of description in speech research. JASA 69.5: 1462-4.
1986. Some observations on the development of anticipatory coarticulation. JASA
79: 1616-19.
Rialland, A. and M. B. Badjime. 1989. Réanalyse des tons du Bambara: des tons du
nom à l'organisation générale du système. Studies in African Linguistics 20.1:
1-28.
Rietveld, A. C. M. and C. Gussenhoven. 1985. On the relation between pitch
excursion size and pitch prominence. JPhon 13: 299-308.
Riordan, C. 1977. Control of vocal tract length in speech. JASA 62: 998-1002.
Roach, P. 1983. English Phonetics and Phonology: a Practical Course. Cambridge:
Cambridge University Press.
Roca, I. 1986. Secondary stress and metrical rhythm. PY 3: 341-70.
Roudet, L. 1910. Éléments de phonétique générale. Paris.
Rubin, P., T. Baer, and P. Mermelstein. 1981. An articulatory synthesizer for
perceptual research. JASA 70: 321-8.
Sagart, L., P. Halle, B. de Boysson-Bardies, and C. Arabia-Guidet. 1986. Tone
production in modern standard Chinese: an electromyographic investigation.
Paper presented at the nineteenth International Conference on Sino-Tibetan
Languages and Linguistics, Columbus, OH, 12-14 September 1986.
Sagey, E. 1986a. The representation of features and relations in non-linear phono-
logy. Doctoral dissertation, MIT.
1986b. On the representation of complex segments and their formulation in
Kinyarwanda. In E. Sezer and L. Wetzels (eds.), Studies in Compensatory
Lengthening. Dordrecht: Foris.
Salasoo, A. and D. Pisoni. 1985. Interaction of knowledge sources in spoken word
identification. Journal of Memory and Language 24: 210-31.
Sereno, J. A., S. R. Baum, G. C. Marean, and P. Lieberman. 1987. Acoustic analysis
and perceptual data on anticipatory labial coarticulation in adults and children.
JASA 81: 512-19.
Setatos, M. 1974. Phonologia tis Kinis Neoellinikis (Phonology of Standard Greek).
Athens: Papazisis.
Sharf, D. J. and R. N. Ohde. 1981. Physiologic, acoustic and perceptual aspects of
coarticulation: implications for the remediation of articulatory disorders. In
N. J. Lass (ed.), Speech and Language: Advances in Basic Research and Practice,
vol. 5. New York: Academic Press, 153-247.
Shattuck-Hufnagel, S. and D. H. Klatt. 1979. Minimal use of features and marked-
ness in speech production: evidence from speech errors. JVLVB 18: 41-55.
Shattuck-Hufnagel, S., V. W. Zue, and J. Bernstein. 1978. An acoustic study of
palatalization of fricatives in American English. JASA 64: S92(A).
Shieber, S. M. 1986. An Introduction to Unification-based Approaches to Grammar.
Stanford: CSLI.
Sievers, E. 1901. Grundzüge der Phonetik. Leipzig: Breitkopf and Härtel.
Silverman, K. E. A. and J. Pierrehumbert. 1990. The timing of prenuclear high
accents in English. In J. Kingston and M. Beckman (eds.), Papers in Laboratory
Phonology I. Between the Grammar and Physics of Speech. Cambridge: Cam-
bridge University Press, 72-106.
Simada, Z. and H. Hirose. 1971. Physiological correlates of Japanese accent patterns.
Annual Bulletin of the Research Institute of Logopedics and Phoniatrics 5: 41-9.
Soundararaj, F. 1986. Acoustic phonetic correlates of prominence in Tamil words.
Work in Progress 19: 16-35. Department of Linguistics, University of
Edinburgh.
Sprigg, R. K. 1963. Prosodic analysis and phonological formulae, in Tibeto-Burman
linguistic comparison. In H. L. Shorto (ed.), Linguistic Comparison in South East
Asia and the Pacific. London: School of Oriental and African Studies.
1972. A polysystemic approach, in proto-Tibetan reconstruction, to tone and
syllable initial consonant clusters. Bulletin of the School of Oriental and African
Studies, 35. 3: 546-87.
Steele, S. 1986. Interaction of vowel Fo and prosody. Phonetica 43: 92-105.
1987. Nuclear accent Fo peak location: effects of rate, vowel, and number of
following syllables. JASA 80: Suppl. 1, S51.
Steele, S. and M. Y. Liberman. 1987. The shape and alignment of rising intonation.
JASA 81: S52.
Steriade, D. 1982. Greek prosodies and the nature of syllabification. Doctoral
dissertation, MIT.
Stetson, R. 1951. Motor Phonetics: a Study of Speech Movements in Action. Amsterdam: North-Holland.
Stevens, K. N. 1972. The quantal nature of speech: evidence from articulatory-
acoustic data. In E. E. David, Jr. and P. B. Denes (eds.), Human Communication:
a Unified View. New York: McGraw-Hill.
1989. On the quantal nature of speech. JPhon 17: 3-45.
Stevens, K. N. and S. J. Keyser 1989. Primary features and their enhancement in
consonants. Lg 65: 81-106.
Stevens, K. N., S. J. Keyser, and H. Kawasaki. 1986. Toward a phonetic and
phonological theory of redundant features. In J. S. Perkell and D. H. Klatt
(eds.), Invariance and Variability in Speech Processes. Hillsdale, NJ: Lawrence
Erlbaum.
Strange, W., R. R. Verbrugge, D. P. Shankweiler, and T. R. Edman. 1976. Consonant
environment specifies vowel identity. JASA 60: 213-24.
Sugito, M. and H. Hirose. 1978. An electromyographic study of the Kinki accent.
Annual Bulletin of the Research Institute of Logopedics and Phoniatrics 12: 35-51.
Summers, W. V. 1987. Effects of stress and final consonant voicing on vowel
production: articulatory and acoustic analyses. JASA 82: 847-63.
Summers, W. V., D. B. Pisoni, R. H. Bernacki, R. I. Pedlow, and M. A. Stokes. 1988.
Effects of noise on speech production: acoustic and perceptual analyses. JASA
84: 917-28.
Sussman, H., P. MacNeilage, and R. Hanson. 1973. Labial and mandibular dynamics
during the production of bilabial consonants: preliminary observations. JSHR
17: 397-420.
Sweet, H. 1877. A Handbook of Phonetics. Oxford: Oxford University Press.
Swinney, D. and P. Prather. 1980. Phoneme identification in a phoneme monitoring
experiment: the variable role of uncertainty about vowel contexts. Perception and
Psychophysics 27: 104-10.
Taft, M. 1978. Evidence that auditory word perception is not continuous: The DAZE
effect. Paper presented at the fifth Australian Experimental Psychology Confer-
ence, La Trobe University.
1979. Recognition of affixed words and the word frequency effect. Memory and
Cognition 7: 263-72.
Talkin, D. 1989. Voicing epoch determination with dynamic programming. JASA 85:
S149(A).
Thorsen, N. 1978. An acoustical analysis of Danish intonation. JPhon 6: 151-75.
1979. Interpreting raw fundamental frequency tracings of Danish. Phonetica 36:
57-78.
1980a. A study of the perception of sentence intonation: evidence from Danish.
JASA 67: 1014-30.
1980b. Intonation contours and stress group patterns in declarative sentences of
varying length in ASC Danish. ARIPUC 14: 1-29.
1981. Intonation contours and stress group patterns in declarative sentences of
varying length in ASC Danish - supplementary data. ARIPUC 15: 13-47.
1983. Standard Danish sentence intonation - phonetic data and their represen-
tation. Folia Linguistica 17: 187-220.
1984a. Variability and invariance in Danish stress group patterns. Phonetica 41:
88-102.
1984b. Intonation and text in standard Danish with special reference to the abstract
representation of intonation. In W. U. Dressler, H. C. Luschützky, O. E.
Pfeiffer, and J. R. Rennison (eds.), Phonologica 1984. Cambridge: Cambridge
University Press.
1985. Intonation and text in Standard Danish. JASA 77: 1205-16.
1986. Sentence intonation in textual context: supplementary data. JASA 80:
1041-7.
Touati, P. 1987. Structures prosodiques du Suedois et du Francais. Lund: Lund
University Press.
Thrainsson, H. 1978. On the phonology of Icelandic aspiration. Nordic Journal of
Linguistics 1: 3-54.
Trager, G. L. and H. L. Smith. 1951. An Outline of English Structure (Studies in
Linguistics, Occasional Papers 3). Norman, OK: Battenburg Press.
Trim, J. L. M. 1959. Major and minor tone-groups in English. Le Maitre Phonetique
111:26-9.
Trubetzkoy, N. S. 1939. Grundzüge der Phonologie. Transl. C. A. M. Baltaxe, 1969, Principles of Phonology. Berkeley: University of California Press.
Turnbaugh, K. R., P. R. Hoffman, R. G. Daniloff, and R. Absher. 1985. Stop-vowel coarticulation in 3-year-old, 5-year-old, and adult speakers. JASA 77: 1256-7.
Tyler, L. K. and J. Wessels. 1983. Quantifying contextual contributions to word
recognition processes. Perception and Psychophysics 34: 409-20.
Uldall, E. T. 1958. American "molar" r and "flapped" r. Revista do Laboratorio de Fonetica Experimental (Coimbra) 4: 103-6.
Uyeno, T., H. Hayashibe, K. Imai, H. Imagawa, and S. Kiritani. 1981. Syntactic
structures and prosody in Japanese: a case study on pitch contours and the
pauses at phrase boundaries. University of Tokyo, Research Institute of Logo-
paedics and Phoniatrics, Annual Bulletin 1: 91-108.
Vaissiere, J. 1988. Prediction of velum movement from phonological specifications.
Phonetica 45: 122-39.
van der Hulst, H. and N. Smith. 1982. The Structure of Phonological Representations.
Part I. Dordrecht: Foris.
Vatikiotis-Bateson, E. 1988. Linguistic Structure and Articulatory Dynamics. Doc-
toral dissertation, Indiana University. Distributed by IULC.
Vogten, L. 1985. LVS-manual. Speech processing programs on IPO-VAX 11/780.
Eindhoven: Institute for Perception Research.
Waibel, A. 1988. Prosody and Speech Recognition. London: Pitman; San Mateo:
Morgan Kaufman.
Wang, W. S-Y. and J. Crawford. 1960. Frequency studies of English consonants. Lg & Sp 3: 131-9.
Warren, P. and W. D. Marslen-Wilson. 1987. Continuous uptake of acoustic cues in
spoken word recognition. Perception and Psychophysics 43: 262-75.
1988. Cues to lexical choice: discriminating place and voice. Perception and
Psychophysics 44: 21-30.
Warren, R. M. 1970. Perceptual restoration of missing speech sounds. Science 167:
392-3.
Wells, J. C. 1982. Accents of English 1: An Introduction. Cambridge: Cambridge
University Press.
Westbury, J. R. 1983. Enlargement of the supraglottal cavity and its relation to stop
consonant voicing. JASA 73: 1322-36.
Whitney, W. 1879. Sanskrit Grammar. Cambridge, MA: Harvard University Press.
Williams, B. 1985. Pitch and duration in Welsh stress perception: the implications for
intonation. JPhon 13: 381-406.
Williams, C. E. and K. N. Stevens. 1972. Emotions and speech: some acoustical
correlates. JASA 52: 1238-50.
Wood, S. 1982. X-ray and model studies of vowel articulation. Working Papers, Dept.
of Linguistics, Lund University 23.
Wright, J. T. 1986. The behavior of nasalized vowels in the perceptual vowel space. In
Ohala, J. J. and J. J. Jaeger (eds.), 1986. Experimental Phonology. Orlando, FL:
Academic Press.
Wright, S. and P. Kerswill. 1988. On the perception of connected speech processes.
Paper delivered to LAGB, Durham, Spring 1988.
Yaeger, M. 1975. Vowel harmony in Montreal French. JASA 57: S69.
Yip, M. 1989. Feature geometry and co-occurrence restrictions. Phonology 6: 349-74.
Zimmer, K. E. 1969. Psychological correlates of some Turkish morpheme structure
conditions. Lg 46: 309-21.
Zsiga, E. and D. Byrd. 1988. Phasing in consonant clusters: articulatory and acoustic
effects. Ms.
Zue, V. W. and S. Shattuck-Hufnagel. 1980. Palatalization of /s/ in American English: when is a /š/ not a /š/? JASA 67: S27.
Name index
Iverson, G. K., 317

Jackendoff, R., 393
Jaeger, J. J., 185
Jakobson, R., 118, 199, 227, 279
Jesperson, O., 66
Johnson, C. D., 191
Jones, D., 204
Joseph, B. D., 399-400, 414

Kahn, D., 122
Kajiyama, M., 170, 175
Kakita, Y., 395
Kawai, H., 372
Kawasaki, H., 167-9, 182, 256
Kay, B. A., 72
Kaye, J., 193, 196
Keating, P., 26, 59, 123, 283
Keller, E., 60, 69
Kelly, J., 5, 142, 190, 204, 210, 213, 226
Kelso, J. A. S., 10-11, 13-14, 18, 20, 27-8, 60, 65, 68-70, 123
Kenstowicz, M., 150
Kent, R. D., 136
Kerswill, P., 264, 267-8, 270, 278
Kewley-Port, D., 194
Keyser, S. J., 150
King, M., 191
Kingston, J., 3, 60, 65, 121, 177
Kiparsky, P., 236
Kiritani, S., 118
Klatt, D. H., 68, 116, 180, 190, 194-5
Klein, E., 200
Kluender, K., 65
Kohler, K. J., 4, 142-3, 224, 225
Koreman, J., 230
Koutsoudas, A., 191
Kozhevnikov, V. A., 141
Krakow, R. A., 181, 258
Krull, D., 129
Kubuzono, H., 125-6, 331, 345, 368-9, 372-3, 379-80, 382, 391-4
Kucera, H., 239
Kuno, S., 369

de Lacerda, A., 139-40
Ladd, D. R., 94, 321, 325-6, 333-6, 342, 343, 346-9, 355, 385-6, 388, 392
Ladefoged, P., 31, 35-6, 93, 137, 141, 159, 191, 193-5, 282, 296-8, 304, 311-12, 314
Lahiri, A., 2, 229-30, 249, 252, 255-7, 274, 327
Lapointe, S. G., 191
Lashley, K. S., 10
Lass, R., 155, 160, 198, 203, 219
Lea, W. A., 116
Leben, W. R., 166, 185
Lehiste, I., 89, 172
Liberman, A. M., 26, 64, 83-4, 90, 178, 398
Liberman, M. Y., 326, 332, 335, 347, 354-5, 389, 391-3, 418
Licklider, J. C. R., 168
Lieberman, P., 129, 327
Lindau, M., 31, 193
Lindblom, B., 5, 65, 129, 143, 167-8, 286
Lindgren, R., 129
Lindsey, G., 328
Linell, P., 191
Linn, Q., 100
Local, J., 4, 142, 190, 196, 204, 210, 213, 216, 224-8
Lodge, K. R., 205
Lofqvist, A., 171
Lorentz, J., 170, 176
Lowenstamm, J., 196
Lubker, J., 168
Lyons, J., 142

McCarthy, J. J., 150, 155, 159, 178, 187, 192, 230
Macchi, M., 87, 123
McCrosky, R. L., 169
McGowan, R. S., 129
MacKay, D. G., 294
McNeil, D., 181
MacNeilage, P. F., 14, 122
Maddieson, I., 176, 282
Maeda, S., 326, 335, 388
Magen, H., 27, 56
Malikouti-Drachman, A., 399-402, 406, 409, 414, 416, 418
Mandelbrot, B., 173
Marslen-Wilson, W. D., 2, 229, 231, 233, 237-8, 255-7
Mascaro, J., 5
Massaro, D. W., 294
Mattingly, I. G., 26, 280
Max-Planck Speech Laboratory, 239
Mehler, J., 293-4
Mehrota, R. C., 299
Menn, L., 326
Menzerath, P., 139-40
Mermelstein, P., 27
Miller, G. A., 168
Miller, J. E., 31
Mills, C. B., 293
Mohanan, K. P., 159, 200, 203, 205
Monsen, R. B., 395
Munhall, K., 9, 13-15, 17, 19, 23, 30, 68-70, 116, 121

Nakatani, L., 333
Nathan, G. S., 205
Neary, T. M., 35
Nelson, W., 60
Nespor, M., 94, 125, 334, 390, 399-403, 405, 414, 418
Nittrouer, S., 71, 129, 136
Nolan, F., 2, 4, 261, 267, 280-8
Noll, C., 191
Norris, D. G., 293

Ohala, J. J., 65, 137, 167-8, 170, 172, 176-9, 181-9, 225, 247, 255-6, 286-7
Ohala, M., 296-8, 310-12
Ohde, R. N., 129, 137
Ohman, S., 26, 67, 173, 178, 201
O'Shaughnessy, D., 328
Ostry, D. J., 60, 68-70, 121
Otsu, Y., 372

Paccia-Cooper, J-M., 213, 333-4
Parush, A., 60, 69, 121
Perkell, S., 35, 67, 201
Peters, P. S., 191
Peters, S., 194
Peterson, G. E., 169
Pickering, B., 5
Pierrehumbert, J., 2-4, 64, 84, 90, 92-4, 97, 117-27, 193, 283, 324-7, 331-2, 334-5, 342-3, 345, 347-8, 354-5, 368-9, 372-3, 375, 385-93, 396
Pike, K., 325
Plato, 180
Pollard, C., 217
Port, R., 279
Poser, W., 368, 372-3, 385, 390, 393
Prather, P., 293
Prince, A., 230, 316, 332, 389, 398, 418
Pulleyblank, D., 153
Pullum, G. K., 191
Puppel, S., 398

Rialland, A., 185
Recasens, D., 5, 56
Repp, B. R., 129
Rietveld, T., 125-6, 326, 331, 334-5, 347, 359-61, 363-4, 366-7, 379, 384-5, 388-91, 394
Riordan, C. J., 171, 301
Ritchie, R. W., 191
Roach, P., 204, 219
Roca, I., 398
Rossi, M., 4
Roudet, L., 66
Rubin, P., 27

Sag, I., 217
Sagart, L., 395
Sagey, E., 153, 160, 187, 198, 282
Saltzman, E., 9-10, 13-15, 17, 19-20, 23, 27-8, 30, 45, 60, 65, 69, 116, 122-3, 288
Samuel, A. G., 294
Sanders, A. G., 191
Sato, H., 372, 380
De Saussure, F., 227
Savin, H. B., 293
Sawashima, M., 118, 171, 395
Schaffer, J., 333
Schein, B., 162
Scherzer, J., 294
Schiefer, L., 296-9, 301, 304-5, 311-18
Schindler, F., 143
Scott, D., 213
Segui, J., 238, 203-4
Selkirk, E., 5, 83, 94, 125, 313, 370, 372-3, 375, 379, 390, 393
Sereno, J. A., 129
Setatos, M., 399-400, 403, 405, 414
Sharf, D. J., 129, 137
Shattuck-Hufnagel, S., 210
Shieber, S. M., 192, 197, 217
Shockey, L., 128, 138-41, 143-5
Shuckers, G., 136
Sievers, E., 66
Silverman, K., 84, 92
Simada, Z., 395-6
Smith, H. L., 398, 418
Smith, N., 332
Sorensen, J., 335
Sprigg, R. K., 199
Steele, S., 86, 178, 347
Steriade, D., 153, 162
Stetson, R., 65
Stevens, K. N., 136, 173, 175, 206-7
Strange, W., 173
Studdert-Kennedy, M., 129
Sudo, H., 368, 372
Sugito, M., 395
Summers, W. V., 75, 143
Sussman, H., 14, 122
Sweet, H., 204
Swinney, D. A., 293

Taft, M., 257
Talkin, D., 2-3, 90, 92, 99, 117-27
Tateishi, K., 5, 370, 372-3, 379, 393
Terken, J., 5
Thorsen, N., see Gronnum
Thrainsson, H., 185
Timmermann, G., 142
Touati, P., 331
Trager, G. L., 398, 418
Trubetzkoy, N. S., 176, 193
Tuller, B., 13-14, 18, 65, 123
Turnbaugh, K. R., 129
Tyler, L. K., 236

Uldall, E. T., 172
Uyeno, T., 380

Vaissiere, J., 119
Vatikiotis-Bateson, E., 86
Verbrugge, D. P., 173
Vergnaud, J-R., 151-2, 196
Vermula, N. R., 395
Vogel, I., 2, 94, 124-5, 334, 379, 390, 399-403, 405, 414, 418
Vogten, L., 352
Voltaire, 176

Waibel, A., 116
Wall, R. E., 194
Wang, W. S-Y., 176
Warburton, I., 399, 414
Warren, P., 233, 237-8, 257
Watson, I., 5
Wessels, J., 236
Westbury, J. R., 58
Whitney, W., 162
Williams, B., 333
Wood, S., 31
Wright, J. T., 179, 270

Yaeger, M., 178

Zimmer, K. E., 178
Zue, V. W., 210
Subject index
passim, 198, 202-3, 213-15, 218, 293; see also assimilation, coproduction
  accounts of; articulatory syllable, 141; feature spreading, 129, 141, 142, 154, 158, 162
  anticipatory effects, 56
  CV coarticulation, 129-30, 133, 138-9, 213-14
  domain of, 141, 203
  in connected speech, 135
  measures of; allophone separation, 129, 133, 136, 140; target-locus relationship, 129, 135, 140
  VV coarticulation, 178
cocatenation, 201-2, ch. 8 passim
coda, see syllable
cohort model of spoken word recognition, 229-33, 238n.
compensation, see immediate compensation
compositionality of phonological features, 199
compound accent rules, 372
connected speech processes, 135, 142, 264
consonant harmony, 162-3
contour segment, 150; see also segment
coordination, temporal, 13, 56-8, 169, 171-2, 174, 176, 177, 181, 184; see also task dynamics
coproduction, 24, 26, 55, 57, 63, 67, 201, 278; see also coarticulation
CV-skeleton (CV-core, CV-tier), 150-2, 158, 166-7, 177, 183, 230; see also autosegmental phonology

damping, 15-18, 43, 89; see also task dynamics
  critical, 15, 18-20, 64, 87
Danish, 345, 361, 366
declarative phonology, 198, 202, ch. 8 passim
declination, 117, 329-31, 346; see also downstep
deletion, 202, 224
demisyllable, 294
dependency phonology, 191
devoicing, 142
diphone, 294
displacement, see task dynamics
distinctive features, see features
downstep, 95, 331, ch. 14 passim, ch. 15 passim; see also declination
  accentual, 335, 343-7, 351-2, 358, 366-7
  in Japanese, ch. 15 passim
  phrasal, 335, 345-7, 350, 356-8, 366-7
duration
  as cue to stress, 333
  of syllables, 76
  speech timing, 68
  see also length, geminates
Dutch, 325-6, 333n., ch. 14 passim, 389-90
dysarthric speech, 58

effector system, see task dynamics
electroglottography (EGG), 93, 97, 99, 118
electromyography (EMG), 114, 267
electropalatography (EPG), 206, 207, 209, 213n., 263-73
emphasis, 348
English, 10, 35, 204, 207, 237-9, 258, 278, 280-1, 283, 325, 328, 333n., 336, 338n., 387, 389-91, 398, 408
  English vowels, 35
  Middle English, 169
  nasalized vowels in, 238-9, 246-8, 255-6
  Old English, 169
epenthesis, 151, 153, 191, 224, 279
  in Palestinian, 151
Ewe, 370n.
exponency, see phonetic exponency

Fo, see intonation, pitch accent, pitch range
features, 117-18, ch. 6 passim, 178-80, 196, 199, 203, 231, 262, 281-3, 294, 297-8, 311-18, 394
  articulator-bound, 159
  category-valued, 156
  emergence of, 93
  interdependencies among, 178
  intrinsic content of, 177
  n-ary valued, 156, 316, 318
  natural classes of, 155, 161, 187, 296, 313
  phonetic correlates of, 93, 174-6, 184, 194-7
  suprasegmental, 189
feature asynchrony, 186, 187, 189
feature dependence, 155
feature geometry, ch. 6 passim, 178-9, 187
feature specification theory, 230
feature spreading, see coarticulation
feature structures, 197, 200, 214-19, 224
fiberscope, 118
final lengthening, 77-9, 83, 84, 89, 110, 116; see also phrasing
  kinematics of, 77
final lowering, 116, 326n., 354, 364, 389; see also phrasing
Firthian phonology, 184, 190, 192, 193-4, 198, 218-19
focus, pragmatic, 390-2
foot, 142, 191
French, 27, 294
fricatives, 129
  spectra of, 129
  voiceless, 171
fundamental frequency (Fo), see intonation, pitch accent, pitch range

gating task, 236, 237-9, 240, 242, 244-5, 250-1, 253, 257-8
  hysteresis effects in, 237n., 256
geminate consonants, 150-1, 153, ch. 9 passim
geminate integrity, 151
generalized phrase-structure grammar (GPSG), 156
generative phonology, 155; see also SPE
German, 226, 326, 336
GEST, 29f., 288
gesture, ch. 1 passim, 27, 71f., 288, 298; see also task dynamics
  opening, 73-9
  closing, 73-9
  relation between observed and predicted duration of, 79-81
  gestural amplitude, 68-70
  gestural blending, 19, 30
  gestural duration, 88-9, 120, 124
  gestural magnitude, 105, 109, 111, 114, 116, 120, 124, 169
  gestural overlap, 14, 22, 28, 289; see also coarticulation
  gestural phasing, 17, 18, 21, 44, 70, 72, 76
  gestural score, 14, 17, 22, 27-9, 30f., 44-6, 52-5, 57, 60, 64, 165
  gestural stiffness, 11, 16, 17, 19, 44, 68-70, 76; lower limit on, 80-2
  gestural structure, 28
  gestural truncation, 70-2
glottal configurations in [h] and [?], 93-4
glottal cycle, 93-4, 98, 100
graph theory, 157, 196-7, 210
Greek, 177, ch. 16 passim
Gujarati, 309n.

Hausa, 387-8
Hindi; stops in, 296, ch. 12 passim; see also stops

iamb, 180
icebergs, 89, 174n.
Icelandic, 185, 187
Igbo, 185-6
immediate compensation, 10, 13
initial lengthening, 116
insertion, see epenthesis
integer coindexing, 218
intonation, 92, 116, ch. 13 passim, 327, 394
  effects on source characteristics, 92
  of Dutch, ch. 14 passim
  notation of, 326, ch. 14 passim
intonational boundary, 109; see also boundary tones
intonational locus, ch. 14 passim
inverse filtering, 93, 95
Italian, 287, 370n.

Japanese, 324, 331, 333n., 345, ch. 15 passim
jaw, see mandible

Kabardian, 283n.
Kikuyu, 185
Kolani, 153

Latin, 287
length (feature), ch. 9 passim
lexical-functional grammar (LFG), 156
lexical items, mental representation of, 229-31, ch. 9 passim, 255, 291
lexical stress, see stress
long vowels, 150
Ludikya, 186; see also word games
Luganda, 185

Maithili, 298
Mandarin, 395
mandible, 10-13, 87-8, 114, 122-3, 140, 144
  durations of movements, 88-9
Marathi, 316
metrical boost (MB), 379-86
metrical phonology, 180, 257, 331, ch. 16 passim
major phrase, 373, 379, 383-5, 392; see also phrasing
  formation of, 372
mass-spring system, see task dynamics
minor phrase, 377, 392; see also phrasing
  formation of, 372
mora, 86, 324
  stability of, 186
morphology and word recognition, 257
morphophonologization, 274
motor equivalence, see task dynamics
movement trajectories, see task dynamics
movement velocity, see task dynamics

n-Retroflexion, 150, 162, 163
nasal prosodies, 178
nasalization, 178n., 179, 182, 243, 247n.; see also assimilation, nasality
nasality, 234
  in English and Bengali, ch. 9 passim
Nati, see n-Retroflexion
Nepali, 309n.
neutralization, 213, 231, 274
neutral attractor, see task dynamics
neutral vowel, 65; see also schwa
nuclear tone, 325
nucleus, see syllable

object, see task dynamics
obligatory contour principle, 166; see also autosegmental phonology
onset, see syllable
optoelectronic tracking, 72
overlap, 28, 30, 56, 57, 141, 289; see also coarticulation, task dynamics

perturbation experiments, 11
phase, 72, 87, 313-14
phase space, see task dynamics
phoneme monitoring task, 292
phonetic exponency, 194, 202-3, 218-19, 225
phonetic representations, 118, 193-4, 224-5, 278, 315, 388
phonological domains, consistency of, 126
phonological representations, 118, 149, 200, 225, 285, 315-18, 388, 394; see also autosegmental phonology, Firthian phonology, metrical phonology
  abstractness of, 93, 229-30
  monostratal, 193
phonology-phonetics interface, 3, 24, 58, 124-7, 193, 195, 205, 224-8, 394
phrase-final lengthening, see final lengthening
phrase-initial lengthening, see initial lengthening
phrasing, 73, 337, 366
  effects on /h/, 109-14
  effects on gestural magnitude, 111-14
  phrase boundary effects, 110-14, 336-8, 363; see also phrasal vs. word prosody
Pig Latin, 292; see also word games
pitch accent, 322-6, 332, ch. 14 passim, 361-7, 387-90; see also accent
  bitonal, 326
pitch change, 360
pitch range, 330, 348, 352, 356, 358, 387, 392-7; see also register shift, declination
point attractor, see task dynamics
prenasalized stops, 150
prevoiced stops, ch. 12 passim
primary stress, 407-9; see also stress, secondary stress
prominence, 348, 387, 392-3
prosody, 91, 180, ch. 13 passim, 399-400
  prosodic effects across gesture type, 120-2
  prosodic structure, ch. 3 passim, 331-2, 338, 369, 394; effect on segmental production, 112
  phrasal vs. word prosody, 91, 92, 94-7, 114-16
Proto-Indo-European, 286
psycholinguistics, 2, 229, 254, 293-5
Punjabi, 297

rhythm, 398-402; see also stress
realization rules, 124, 215-16, 386
redundancy, 173, ch. 9 passim
register shift, 372, 380, 384-6; see also downstep, upstep
relaxation target, see task dynamics
reset, 335, 343-7, 349, 366, 380, 384f.; see also phrasing, pitch range, register shift
  as local boost, 343
  as register shift, 343-7
retroflex sounds, 162; see also n-Retroflexion
rewrite rules, 226
rhyme (rime), see syllable
rule interaction, 191
rule ordering, 191
Russian, 278

Sanskrit, 317
schwa, 26, ch. 2 passim
  epenthetic, 53
  in English, ch. 2 passim
  in French, 27
  intrusive, 59
  simulations of, 51, 52, 65
  targetless, ch. 2 passim, 66
  unspecified, 52, 53
Scots, 160
secondary articulation, 179
secondary stress, ch. 16 passim
segment, 4, 128, 141, ch. 6 passim, ch. 7 passim, 198, 201, ch. 10 passim, ch. 11 passim
  boundaries, 181
  contour segments, 150, 160, 282, 285
  complex segments, 150, 160, 283, 285
  segmental categories, 207
  hierarchical view of segment structure, 161
  relation to prosodic structure, 94, 117-18
  spatial and temporal targets in, 3
  steady-state segments, 172-6
  segment transitions, 172-4, 176, 181
Semitic verb morphology, 186
shared feature convention, 166
skeletal slots, 154; see also CV-skeleton
skeleton, see CV-skeleton
slips of the tongue, see speech errors
sonority, 68, 69, 83-5, 123-4
sonority space, 84, 86
sound change, 182, 286-7
The Sound Pattern of English (SPE), 116, 149, 155, 191-2, 195, 201, 296-7
speech errors, 180, 292, 294
speech rate, 23, 65, 136, 264, 268-9
speech recognition, 294
speech style, 23, 129, 137, 144-5, 267, 362; see also casual speech
speech synthesis, 29, 101, 102, 190, 192, 225, 268
speech technology, 116
stiffness, see task dynamics
stops, ch. 5 passim, 171, 175-6, ch. 12 passim
  bursts, 131-3, 137, 140, 144, 172, 176
  preaspirated stops in Icelandic, 185, 187
stress, 4, 120-3, 151, 191, 392, ch. 16 passim; see also accent
  accent, 332
  contrastive, 91
  effects of phrasal stress on /h/, 103-9
  in English, 332
  in Greek, 4, ch. 16 passim
  lexical stress, 399-402, 405-6, 417-19
  rhythmic stress, 416, 418-19; see also rhythm
  word stress, 120-1
sub-lexical representations, 291, 295
Swedish, 325, 328, 400
syllabification, 294
syllable, 73-5, 84, 90, 109, 120-3, 142-3, 181, 191, 194, 293-4, 304, 324, 332-3, 360-3, 399-401, 408-13, 417
  accented, 73-7, 84-6, 94, 324, 337-8, 341, 359-60
  coda, 199, 213, 216-20, 228
  deaccented, 94
  durations of, 76, 77, 79f., 80-2
  nucleus, 204, 213, 214
  onset, 119, 191, 200, 213, 216-18
  prosodic organization of, 83
  rhyme (rime), 180, 200, 214
  unaccented, 77
syllable target, 294
syntax, effects on downstep, 369, 379
syntax-phonology interaction, 369, ch. 15 passim; see also phrasing
  of English monosyllables, 214
  validation using synthetic signals, 101, 102

Tamil, 333n.
target parameter, 29, 139
targets, phonetic, ch. 2 passim, ch. 3 passim, 322, 325
targets, tonal, ch. 14 passim
  scaling of, 347-51
task dynamics, 3, ch. 1 passim, ch. 2 passim, 68-72; see also gesture, overlap
  abstract mass, 20
  articulator weightings, 17
  articulator motions, 46
  articulator space, 12
  body space, 12
  coordination in, 23, 65
  displacement, 16, 70, 72-3, 76, 80, 122
  effector system, 12, 13
  evidence from acquisition of skilled movements, 22, 24, 143
  mass-spring system, 15, 17, 19, 87-9
  motor equivalence, 10
  movement trajectories, 10, 16
  movement velocity, 70; lower limit on, 82
  neutral attractor, 19, 22
  object, 11
  oscillation in task dynamics, 87
  phase space, 18
  point attractor, 15
  relaxation target, 66
  stiffness, 16, 17, 87
  task dynamic model, 9, 27, 57, 69
  task space, 12
  tract variables, ch. 1 passim, 23, 27, 28f., 122-3, 124; passive, 29
tempo, 68, 79-84; see also speech rate, final lengthening
temporal modulation, 89
temporal overlap, see overlap
timing slots, 166, 177; see also CV-skeleton
tip of the tongue (TOT) recall, 180-1
Tokyo X-ray archive, 31
tonal space, 329-31, 334
tone(s), 84, 116, 150-1, 185, 282, 326-9, ch. 14 passim, ch. 15 passim; see also downstep, upstep
  as a grammatical function, 186
  boundary tones, 322, 326-7, 339, 363-4, 389
  in Igbo, 185, 186
  in Kikuyu, 185
  and prosodic environment, 90, 91, 94
  tone scaling, 84, 347-51, 395-7
  tone spreading, 178, 336, 341
  starred and unstarred, 326, 336
trace model of speech perception, 233n.
tract variables, see task dynamics
transformational generative phonology (TGP), see SPE
transillumination, 121
trills, 179
trochee, 180
tunes, 324; see also CV-skeleton

underspecification, 26, 28, 54, 119, 200, 216-19, 198, 228, 255, 257
unification grammar (UG), 156, 192
upstep, 348, 379, 382, 385

variation
  interspeaker, 68, 119
  intraspeaker, 290
  situation-specific, 290
  contextual variation of phonetic units, 55

voice-onset time (VOT), 121-2, 131, 279, 296-8, 311-12
voicing, strategies for maintaining, 58
vowel centralization, 129-30, 133, 135, 137; see also schwa, neutral vowel
vowel harmony, 142, 152, 154, 178, 187
vowel space, 129, 169

Welsh, 333
within-speaker variability, 290
word games, 180, 181, 294

X-ray micro-beam, 32-40, 46-50, 62-3, 267

Yoruba, 326