
Laboratory phonology uses speech data to research questions about the

abstract categorical structures of phonology. This collection of papers


broadly addresses three such questions: what structures underlie the tem-
poral coordination of articulatory gestures; what is the proper role of
segments and features in phonological description; and what structures -
hierarchical or otherwise - relate morphosyntax to prosody? In order to
encourage the interdisciplinary understanding required for progress in this
field, each of the three groups of papers is preceded by a tutorial paper
(commissioned for this volume) on theories and findings presupposed by
some or all of the papers in the group. In addition, most of the papers are
followed by commentaries, written by noted researchers in phonetics and
phonology, which serve to bring important theoretical and methodological
issues into perspective.
Most of the material collected here is based on papers presented at the
Second Conference on Laboratory Phonology in Edinburgh, 1989. The
volume is therefore a sequel to Kingston and Beckman (eds.), Papers in
Laboratory Phonology I, also published by Cambridge University Press.
PAPERS IN LABORATORY PHONOLOGY
SERIES EDITORS: MARY E. BECKMAN AND JOHN KINGSTON

Papers in Laboratory Phonology II


Gesture, Segment, Prosody

EDITED BY GERARD J. DOCHERTY


Department of Speech, University of Newcastle-upon-Tyne
AND D. ROBERT LADD
Department of Linguistics, University of Edinburgh

The right of the


University of Cambridge
to print and sell
all manner of books
was granted by
Henry VIII in 1534.
The University has printed
and published continuously
since 1584.

CAMBRIDGE UNIVERSITY PRESS


CAMBRIDGE
NEW YORK PORT CHESTER MELBOURNE SYDNEY
Published by the Press Syndicate of the University of Cambridge
The Pitt Building, Trumpington Street, Cambridge CB2 1RP
40 West 20th Street, New York, NY 10011-4211, USA
10 Stamford Road, Oakleigh, Victoria 3166, Australia
© Cambridge University Press 1992
First published 1992
British Library cataloguing in publication data
Gesture, segment, prosody - (Papers in laboratory phonology
v.2)
1. Phonology
I. Docherty, Gerard J. II. Ladd, D. Robert, 1947- III.
Series
414
Library of Congress cataloguing in publication data
Gesture, segment, prosody / edited by Gerard J. Docherty and D. Robert
Ladd.
p. cm. - (Papers in laboratory phonology; 2)
Based on papers presented at the Second Conference in Laboratory
Phonology, held in Edinburgh, 1989.
Includes bibliographical references and index.
ISBN 0 521 40127 5
1. Grammar, Comparative and general - Phonology - Congresses.
I. Docherty, Gerard J. II. Ladd, D. Robert, 1947-
III. Conference in Laboratory Phonology (2nd: 1989: Edinburgh,
Scotland) IV. Series.
P217.G47 1992
414-dc20 91-6386 CIP
ISBN 0 521 40127 5

Transferred to digital printing 2004

Contents

List of contributors page x


Acknowledgments xiii
Introduction 1

Section A Gesture
1 An introduction to task dynamics
SARAH HAWKINS 9

2 "Targetless" schwa: an articulatory analysis


CATHERINE P. BROWMAN AND LOUIS GOLDSTEIN 26
Comments on chapter 2 SARAH HAWKINS 56
Comments on chapter 2 JOHN KINGSTON 60
Comments on chapter 2 WILLIAM G. BARRY 65
3 Prosodic structure and tempo in a sonority model of articulatory
dynamics
MARY BECKMAN, JAN EDWARDS, AND JANET FLETCHER 68
Comments on chapter 3 OSAMU FUJIMURA 87
4 Lenition of /h/ and glottal stop
JANET PIERREHUMBERT AND DAVID TALKIN 90
Comments on chapter 4 OSAMU FUJIMURA 117
Comments on chapters 3 and 4 LOUIS GOLDSTEIN 120
Comments on chapters 3 and 4 IRENE VOGEL 124
5 On types of coarticulation
NIGEL HEWLETT AND LINDA SHOCKEY 128
Comments on chapter 5 WILLIAM G. BARRY AND
SARAH HAWKINS 138

Section B Segment
6 An introduction to feature geometry
MICHAEL BROE 149

7 The segment: primitive or derived?


JOHN J. OHALA 166
Comments on chapter 7 G. N. CLEMENTS 183

8 Modeling assimilation in nonsegmental, rule-free synthesis


JOHN LOCAL 190
Comments on chapter 8 KLAUS KOHLER 224
Comments on chapter 8 MARIO ROSSI 227
9 Lexical processing and phonological representation
ADITI LAHIRI AND WILLIAM MARSLEN-WILSON 229
Comments on chapter 9 JOHN J. OHALA 255
Comments on chapter 9 CATHERINE P. BROWMAN 257
10 The descriptive role of segments: evidence from assimilation
FRANCIS NOLAN 261
Comments on chapter 10 BRUCE HAYES 280
Comments on chapter 10 JOHN J. OHALA 286
Comments on chapter 10 CATHERINE P. BROWMAN 287

11 Psychology and the segment


ANNE CUTLER 290

12 Trading relations in the perception of stops and their implications


for a phonological theory
LIESELOTTE SCHIEFER 296
Comments on chapter 12 ELISABETH SELKIRK 313

Section C Prosody
13 An introduction to intonational phonology
D. ROBERT LADD 321

14 Downstep in Dutch: implications for a model
ROB VAN DEN BERG, CARLOS GUSSENHOVEN, AND
TONI RIETVELD 335
Comments on chapter 14 NINA GRØNNUM 359

15 Modeling syntactic effects on downstep in Japanese


HARUO KUBOZONO 368
Comments on chapters 14 and 15 MARY BECKMAN AND
JANET PIERREHUMBERT 387

16 Secondary stress: evidence from Modern Greek


AMALIA ARVANITI 398

References 424
Name index 452
Subject index 457
Contributors

AMALIA ARVANITI Department of Linguistics, University of Cambridge

WILLIAM G. BARRY Department of Phonetics and Linguistics, University


College, London

MARY BECKMAN Department of Linguistics, Ohio State University

ROB VAN DEN BERG Instituut voor Fonetiek, Katholieke Universiteit,


Nijmegen

MICHAEL BROE Department of Linguistics, University of Edinburgh

CATHERINE P. BROWMAN Haskins Laboratories

G. N. CLEMENTS Department of Modern Languages and Linguistics,


Cornell University

ANNE CUTLER MRC Applied Psychology Unit

JAN EDWARDS Hunter College of Health Sciences

JANET FLETCHER Speech, Hearing, and Language Research Centre,


Macquarie University

OSAMU FUJIMURA Department of Speech and Hearing Science, Ohio


State University

LOUIS GOLDSTEIN Department of Linguistics, Yale University


NINA GRØNNUM (formerly Thorsen) Institut for Fonetik, Copenhagen

CARLOS GUSSENHOVEN Instituut Engels-Amerikaans, Katholieke


Universiteit, Nijmegen

SARAH HAWKINS Department of Linguistics, University of Cambridge

BRUCE HAYES Department of Linguistics, University of California, Los


Angeles

NIGEL HEWLETT School of Speech Therapy, Queen Margaret College

JOHN KINGSTON Department of Linguistics, University of Massachusetts

KLAUS KOHLER Institut für Phonetik, Christian-Albrechts-Universität,


Kiel

HARUO KUBOZONO Department of British and American Studies,


Nanzan University

D. ROBERT LADD Department of Linguistics, University of Edinburgh

ADITI LAHIRI Max Planck Institut für Psycholinguistik, Nijmegen

JOHN LOCAL Department of Language, University of York

WILLIAM MARSLEN-WILSON Department of Psychology, Birkbeck


College

FRANCIS NOLAN Department of Linguistics, University of Cambridge

JOHN J. OHALA Department of Linguistics, University of Alberta and


Department of Linguistics, University of California,
Berkeley

JANET PIERREHUMBERT Department of Linguistics, Northwestern


University

TONI RIETVELD Instituut voor Fonetiek, Katholieke Universiteit,


Nijmegen
MARIO ROSSI Institut de Phonétique, Université de Provence

LIESELOTTE SCHIEFER Institut für Phonetik und Sprachliche


Kommunikation, Universität München

ELISABETH SELKIRK Department of Linguistics, University of


Massachusetts

LINDA SHOCKEY Department of Linguistic Science, University of


Reading

DAVID TALKIN AT&T Bell Labs

IRENE VOGEL Department of Linguistics, University of Delaware

Acknowledgments

The Second Conference on Laboratory Phonology, on which this book is


based, was made possible by the financial and organizational support of a
number of people and institutions. We received outside financial support
from IBM (UK), British Telecom, and the Scottish Development Agency,
and - within the university - contributions both financial and material from
the Centre for Cognitive Science and the Centre for Speech Technology
Research. The advice and assistance of the university's very professional
Conference Centre was invaluable. We were also fortunate to have a number
of enthusiastic student assistants at the conference, who helped us and the
participants with everything from tea breaks and photocopies to taxis and
overseas phone calls: they were Hazel Sydeserff (prima inter pares), Tina
Barr, Keith Edwards, Edward Flemming, Yuko Kondo, Michael Johnston,
Mark Schmidt, and Eva Schulze-Berndt. We also thank Ethel Jack, the
Linguistics Department secretary and one of the unsung heroes of British
linguistics, for keeping track of our finances and generally ensuring that
matters stayed under control.
The task of making the collection of papers and commentaries from two
dozen different authors into a presentable manuscript was made immeasur-
ably easier by the patient assistance of Keith Edwards. Among other things,
he prepared the unified list of references and cross-checked the references in
the individual contributions; he also helped in creating the index. If we had
not had such a competent and dedicated editorial assistant it would certainly
have taken far longer for this volume to see the light of day. We are also
grateful for the advice and assistance we have received from Marion Smith
and Jenny Potts at Cambridge University Press, and for a grant from the
Centre for Speech Technology Research which helped defray the costs of
publication. For their services as referees we are grateful to Tom Baer, Nick
Clements, Jonathan Dalby, Bill Hardcastle, John Harris, Pat Keating,

Ailbhe Ní Chasaide, Stefanie Shattuck-Hufnagel, Kim Silverman, and Ken


Stevens. We would also like to thank an anonymous Cambridge University
Press reviewer for very helpful comments on an earlier version of the
manuscript as a whole.
As with any conference, our greatest debt is to the participants, who held
forth and listened and discussed and ultimately made the conference what it
was. We are pleased to have been part of the development of what appears to
be a successful series of conferences, and, more importantly, what appears to
be a productive approach to learning about the sound patterns of language.
Gerard J. Docherty
D. Robert Ladd
Introduction

The Second Conference on Laboratory Phonology was organized by the


Department of Linguistics at the University of Edinburgh and took place in
Edinburgh from 30 June to 3 July 1989. The conference's primary aim was to
further the general intellectual agenda set by the ambitiously named First
Conference, which brought together researchers in phonological theory and
experimental phonetics to discuss the increasing convergence of their inter-
ests. An important secondary aim was to bring together researchers from
both sides of the Atlantic: whereas the first conference was almost exclusively
American, the second had significant delegations from several European
countries and Japan as well as the USA. This book is the record of the second
conference.
We say "record" rather than "proceedings" because the papers collected
here are neither all nor only those presented in Edinburgh. As in the first
conference, the main papers were circulated in advance and invited discus-
sants gave prepared comments at the conference; this format is reflected in
the organization of this volume as a series of chapters with commentaries.
However, all of the main papers were formally refereed after the conference,
and have been revised, in the light of both referees' comments and discussion
at the conference itself, for publication in their present form. Moreover, for a
variety of reasons, quite a number of contributions to the conference (listed
at the end of this introduction) do not appear in the volume, and as a result,
the volume's organization is rather different from that of the conference
program.
We have grouped the chapters into three main sections, which we have
called Gesture (on temporal coordination of articulatory gestures), Segment
(on the nature and classification of segments), and Prosody (on certain
aspects of the prosodic organization of speech). At the beginning of each
section we have included a tutorial chapter, presenting a synopsis of recent

theoretical developments that are presupposed in some or all of the papers in


the section, which we hope will make the volume accessible to a wider range
of readers. This reorganization means that certain contributions appear in a
context rather different from that in which they were presented at the
conference: for example, the chapter by Cutler was originally prepared as a
general commentary on a series of papers dealing with "The Segment," and
the commentary by Vogel on Pierrehumbert and Talkin's paper was orig-
inally prepared as a general commentary on a series of papers dealing with
prosody and prosodic effects on segmental realization. We hope that in
imposing the organization we have chosen we have nevertheless succeeded in
preserving the sense of productive debate that was present at the conference.
Cutting across the organization into three subject areas we see three major
issues: the methodology and design of laboratory research on phonology; the
psychological reality of lexical and phonological representations; and the
nature of the phonology-phonetics "interface." On the question of methodo-
logy and design, the papers collected here seem to show a growing consensus.
Fujimura finds Pierrehumbert and Talkin's paper an exemplary case of how
laboratory phonology should be carried out, but the rigor exhibited in their
experimental descriptions is seen in several of the papers, such as Lahiri and
Marslen-Wilson's, Beckman, Edwards, and Fletcher's, Browman and Gold-
stein's, and Nolan's. We feel that papers such as these set the baseline for
future work in laboratory phonology. Fujimura, Vogel, and Cutler all note
that if phonology is to be successfully tested in the laboratory, scientific rigor
is essential. This applies to the use of terminology as well as to laboratory
practice; some specific problems in this area are raised in the contributions by
Vogel and by Hewlett and Shockey.
The second theme - the question of psychological reality of lexical
representations - is explicitly addressed only in the papers by Lahiri and
Marslen-Wilson and by Cutler, but we feel that it is an underlying issue
throughout the volume. Lahiri and Marslen-Wilson, in an approach that
departs from much recent work in laboratory phonology, use listeners'
ability to recognize sounds in a gating-paradigm experiment to argue for
abstractness in the lexical representation. Cutler reviews the considerable
body of evidence for the psycholinguistic relevance of the segment. Both
papers raise a fundamental problem: to what extent is it possible to relate
models of phonological and phonetic representation, which are at the center
of the debate in the rest of this volume, to what speakers actually do when
they are producing an utterance? Lahiri and Marslen-Wilson suggest that
psycholinguistic evidence can be used successfully to empirically evaluate the
claims of theoretical phonology, whereas Cutler suggests that phonology and
psycholinguistics may be fundamentally different exercises, and points to the
"orthogonality of existing psycholinguistic research to phonological issues."

Some resolution of this question is obviously crucial for any approach to


phonology that looks to the laboratory and to detailed observations of
speech production for evidence and evaluation.
The central issue in laboratory phonology, however, is and remains the
long-standing problem of the relation between phonetics and phonology.
Where does one end and the other begin - or, to be more specific, can
phonetics be attributed entirely to neuromotor aspects of the vocal mecha-
nism, or is there a case for inclusion of a phonetic implementation compo-
nent in the grammar?
The papers on gestural coordination and "task dynamics" in the first
section of the volume directly address the general issue here. These papers
ask whether a gesture-based phonology couched in terms of task dynamics
(the theory of skilled movement developed at Haskins Laboratories and
presented here in a tutorial chapter by Hawkins) provide a basis for
understanding the phonology-phonetics interface. Browman and Goldstein,
and to some extent also Beckman, Edwards, and Fletcher, argue that this is a
powerful approach, but notes of caution are sounded by both Fujimura and
Kingston in their commentaries on these papers. Kingston, for example,
while applauding the descriptive elegance of Browman and Goldstein's
gestural phonology, raises serious doubts about its explanatory adequacy; he
claims that the model is too powerful and lacks the constraints that would
permit it to predict only those things that are found to occur in speech.
More specific aspects of the phonology-phonetics interface question are
also raised by the papers in the first section of the volume. Hawkins, in her
commentary on Browman and Goldstein, comments on the need to be
specific about which aspects of the realization of an utterance can be
attributed to the gestural score, and which to the model of motor control.
For example, should the gestural score contain all the language-particular
phonetic characteristics of an utterance? If so, this would involve incorporat-
ing a large amount of (redundant) phonetic detail into the gestural score. If
not, it has yet to be demonstrated how such aspects of phonetic realization
could arise from a task-dynamics model of motor control. A related issue is
the degree of specification associated with the spatial and temporal targets in
the gestural score. Browman and Goldstein propose that certain aspects of
the gestural score can be left unspecified and the detailed instantiation of any
particular target determined by the task-dynamics model. In a somewhat
different approach to a comparably complex problem of phonetic variability,
the papers by Beckman, Edwards, and Fletcher and by Pierrehumbert and
Talkin suggest that a notion of "prosodic modulation" - effects of phrase-
level and word-level prosodic structure on the laryngeal component in the
production of consonants - may make it unnecessary to posit a large number
of separate rules for superficially independent types of phonetic variability.
In the second section of the volume the question of the phonology-
phonetics "interface" is attacked on two further fronts. First, papers by
Local and Nolan (with associated commentaries by Hayes, Browman,
Kohler, and Ohala) deal with assimilation: to what extent is assimilation the
result of a phonological "rule" rather than a phenomenon emerging from the
organization and coordination of articulator variables in the execution of an
utterance? The key data discussed in these papers are instrumental measure-
ments showing that assimilations are commonly no more than partial; the
theoretical models which are brought to bear on such findings by Local,
Hayes, and Kohler are extremely varied, and themselves make a number of
predictions which are open to future empirical investigation. The second
question addressed here, in the papers by Ohala and Local (and the
comments by Clements, Kohler, and Rossi), is whether it is justified to posit a
role for the segment in a phonological representation. This is not a new issue,
but as Broe points out in his tutorial chapter, the power of nonsegmental
representations in accounting for the sound pattern of languages has been
recognized by more and more investigators over the last couple of decades.
Finally, in the third section of the volume, we see instrumental phonetic
evidence treated as central in the search for appropriate phonological
descriptions - for example, in Arvaniti's paper on the phenomena to be
accounted for in any phonological description of Greek stress. We feel that
the central role accorded to instrumental data in these papers has somewhat
equivocal implications for laboratory phonology in general. It could be seen
as pointing the way to a new conception of the phonology-phonetics
interface, or it could simply show that segmental and prosodic phenomena
really are different. That is, phonological descriptions of prosodic pheno-
mena such as intonation have tended (in the absence of a pretheoretical
descriptive construct like the orthographically based segment) to differ from
one another in ways that cannot be resolved by reference to common
conceptions of phonology and phonetics. Only with the application of
instrumental evidence to questions of phonological organization rather than
speech production have we begun to approach some consensus; as Beckman
and Pierrehumbert note in their commentary, the modeling of high tones
(fundamental-frequency peaks) is "one of the success stories of laboratory
phonology." It remains to be seen how widely this success can be extended
into the realm of gestures and segments.

Conference contributions not included in this volume


Papers
Lou Boves, "Why phonology and speech technology are different"
Jean-Marie Hombert, "Nasal consonants and the development of vowel
nasalization"
Brian Pickering and John Kelly, "Tracking long-term resonance effects"
Elisabeth Selkirk and Koichi Tateishi, "Syntax, phrasing, and prominence
in the intonation of Japanese"

Commentaries
Stephen R. Anderson
Gösta Bruce
Stephen D. Isard
Björn Lindblom
Joan Mascaró

Posters
Anne Cutler
Janet Fletcher
Sarah Hawkins
Jill House
Daniel Recasens
Jacques Terken
Ian Watson
Section A
Gesture
1
An introduction to task dynamics

SARAH HAWKINS

1.1 Motivation and overview


The aim of this paper is to describe for the nonspecialist the main features of
task dynamics so that research that uses it can be understood and evaluated
more easily.* Necessarily, there are some omissions and simplifications.
More complete accounts can be found in the references cited in the text,
especially Saltzman (1986) and Saltzman and Munhall (1989); Browman and
Goldstein (1989, 1990) offer clear descriptions that focus more on the
phonologically relevant aspects than the mathematical details. The task-
dynamic model is being developed at the same time as it is being used as a
research tool. Consistent with this paper's purpose as a general introduction
rather than a detailed critique, it mainly describes the current model, and
tends not to discuss intentions for how the model should ultimately work or
theoretical differences among investigators.
Task dynamics is a general model of skilled movement control that was
developed originally to explain nonspeech tasks such as reaching and
standing upright, and has more recently been applied to speech. It is based on
general biological and physical principles of coordinated movement, but is
couched in dynamical rather than anatomical or physiological terms. It
involves a relatively radical approach that is more abstract than many more
traditional systems, and has proved to be a particularly useful way of
analyzing speech production, partly because it breaks complex movements
down into a set of functionally independent tasks.
Task dynamics describes movement in terms of the tasks to be done, and
the dynamics involved in doing them. A single skilled movement may involve

*I thank Thomas Baer, Catherine Browman, and Elliot Saltzman for helpful comments on
earlier versions of this paper.
several discrete, abstract tasks in this model. Speech requires a succession of
skilled movements, each of which is modeled as a number of tasks. For
example, to produce English [ʃ], the precise action of the tongue is critical,
the lips are somewhat rounded and protruded, and the vocal folds are moved
well apart to ensure voicelessness and a strong flow of air. In addition to
controlling respiratory activity, therefore, there may be at least five distinct
tasks involved in saying an isolated [ʃ]: keeping the velopharyngeal port
closed, and controlling the degree of tongue constriction and its location, the
degree of lip protrusion, the size of the lip aperture, and the size of the glottal
aperture. The same articulator may be involved in more than one task. In this
example, the jaw contributes to producing both the correct lip aperture and
the correct tongue constriction. To add to the complexity, when the [ʃ] is
spoken as part of a normal utterance, each of its tasks must be carried out
while the same articulators are finishing or starting tasks required for nearby
sounds.
The sort of complicated tasks we habitually do with little conscious effort -
reaching for an object, lifting a cup to the mouth without spilling its contents,
speaking - are difficult to explain using physiological models in which the
angles of the joints involved in the movement are controlled directly. These
models work well enough when there is only one joint involved (a "single-
degree-of-freedom" task), because such movements are arc-shaped. But even
simple tasks usually involve more than one joint. To bring a cup to the
mouth, for example, the shoulder, elbow and wrist joints are used, and the
resultant movement trajectory is not an arc, but a more-or-less straight line.
Although describing how a straight-line trajectory could be controlled
sounds like a simple problem, it turns out to be quite complicated. Task
dynamics offers one way of modeling quasi-straight lines (Saltzman and
Kelso 1987).
A further complication is the fact that in these skilled "multi-degree-of-
freedom" tasks (involving more than one joint) the movement trajectory
usually has similar characteristics no matter where it is located in space.
When reaching outwards for an object, the hand tends to move in a straight
line, regardless of where the target is located with respect to the body before
the movement begins, and hence regardless of whether the arm starts by
reaching away from the body, across the body, or straight ahead. This ability
to do the same task using quite different joint angles and muscle contractions
is a fundamental characteristic of skilled movement, and has been called
"motor equivalence" (Lashley 1930; Hebb 1949).
One important property related to motor equivalence is immediate
compensation, whereby if a particular movement is blocked, the muscles
adjust so that the movement trajectory continues without major disruption
to attain the final goal. Immediate compensation has been demonstrated in
"perturbation" experiments, in which a moving articulator - usually an
elbow, hand, lip, or jaw - is briefly tugged in an unpredictable way so that the
movement is disrupted (e.g. Folkins and Abbs 1975; Kelso et al. 1984). The
articulators involved in the movement immediately compensate for the tug
and tend to return the movement to its original trajectory by adjusting the
behavior of the untugged as well as the tugged articulators. Thus, if the jaw is
tugged downwards during an upward movement to make a [b] closure, the
lips will compensate by moving more than usual, so that the closure occurs at
about the same time in both tug and no-tug conditions. These adjustments
take place more or less immediately (15-30 msec after the perturbation
begins), suggesting an automatic type of reorganization rather than one that
is under voluntary, attentional control. Since speaking is a voluntary action,
however, we have a reflexive type of behavior within a clearly nonreflexive
organization. These reflexive yet flexible types of behavior are called
"functional reflexes."
Task dynamics addresses itself directly to explaining both the observed
quasi-straight-line movement trajectories and immediate compensation of
skilled limb movements. Both properties can be reasonably simply explained
using a model that shapes the trajectories implicitly as a function of the
underlying dynamics, using as input only the end goal and a few parameters
such as "stiffness," which are discussed below. The model automatically
takes into account the conditions at the start of the movement. The resulting
model is elegant and uses simple constructs like masses and springs.
However, it is hard to understand at first because it uses these constructs in
highly abstract ways, and requires sophisticated mathematics to translate the
general dynamical principles into movements of individual parts of the body.
Descriptions typically involve a number of technical terms which can be
difficult to understand because the concepts they denote are not those that
are most tangible when we think about movement. These terms may seem to
be mere jargon at first sight, but they have in fact been carefully chosen to
reflect the concepts they describe. In this paper, I try to explain each term
when I first use it, and I tend to use it undefined afterwards. New terms are
printed in bold face when they are explained.

1.2 Basic constructs


The essence of task dynamics, making it distinct from other systems, is
implicit in its name. It describes movement in terms of the tasks to be done,
using dynamics that are specific to the task but not to the parts of the body
that are doing the task. A task is generally a gesture involving the control of a
single object, or abstract representation of the actual thing being controlled.
For example, the object represents in abstract terms the position of a hand
relative to a target in a reaching task, or an articulatory constriction in a
speech task. Such "objects" are called task variables; they describe the type of
task required. In order to realize the movement, the abstract disembodied
task must be converted into a set of parameters appropriate for the part of
the body that will perform the task, and finally into movements of the actual
articulators. Currently, only one type of task (described below) is modeled
for speech, so most recent discussions make no effective distinction between
task variables, which are associated with the disembodied task, and the
variables associated with specific body parts, which are called tract variables,
as in vocal-tract variables (sometimes, local tract variables). In current
formulations, tract variables are the dimensions allowing particular vocal-
tract constrictions to be specified.
In a reaching task, it is the position of the target that is being reached for
that is most important, so the task is defined on a system of coordinates with
the task target at the common origin. The position of the abstract hand with
respect to the target is the variable that is being controlled. In speech,
similarly, the task is defined in terms of the location (place) and cross-
sectional area (degree) of an ideal constriction. This nonspecific task is then
transformed appropriately for a specific vocal-tract constriction. For
example, the degree of openness suitable for some lip configuration is
regarded in the task-dynamic model as a requirement to achieve a particular
lip aperture, not as a requirement for lips and jaw to be in a certain position.
The lips, in this case, are the terminal devices, or end effectors, since their
position directly defines the lip aperture; the upper and lower lips and the jaw
together form the effector system - the set of organs that includes the
terminal device and gets it to the right place at the right time. Lip aperture
itself is a tract variable.
Thus, whereas earlier models use coordinates of either body space or
articulator space to specify the mathematics of movement, and so describe
movement in terms of the position of a physical object with respect to the
body or in terms of the angles of joints respectively, task dynamics begins by
defining movement in terms of an abstract task space, using spatial coordi-
nates and equations of motion that are natural for the task, rather than for
the particular parts of the body that are performing the task. Coordinating
the different parts of the body to produce the movement is done entirely
within the model.
In summary, to carry out the transformations between task, body, and
articulator spaces, the task-dynamic model translates each task-specific
equation into equivalent equations that apply to motion of particular parts
of the body. The first transformation is from the abstract task space to a
more specific (but still relatively abstract) body space. The task-space
equation is defined solely in terms of the type of task: the place and degree of
an unspecified constriction. The transformation into body space specifies the
actual task: the tract variable involved, which for speech is a specific
constriction such as lip aperture or the location of the constriction formed by
the tongue dorsum. The second transformation is from body space to
articulator space, or from the tract variables to the articulators involved.
This transformation is complicated since in speech there are usually several
articulators comprising the effector system of a single tract variable. There
are hence many degrees of freedom, or many ways for the component
articulators to achieve the same movement trajectory. Detailed accounts of
all these transformations are given in Kelso, Saltzman, and Tuller (1986a,
1986b), Saltzman (1986), and Saltzman and Munhall (1989).
This way of structuring the model allows immediate compensation to be
accounted for elegantly. Control is specified at the tract-variable level. From
this level there is an automatic mapping to the equations that govern the
motions of individual articulators. Since the mapping is constantly updated,
small perturbations of an articulator will be immediately and automatically
adjusted for in response to the demands of the tract-variable goal governing
the effector system; no explicit commands for adjustment are necessary, so
compensation will take place with the very short time lags that are typically
observed.
Speech movements are characterized using abstract, discrete gestures. A
gesture is defined as one of a family of movement patterns that are
functionally equivalent ways of achieving the same goal, such as bilabial
closure. A major distinction between gesture and movement is that gestures
can only be defined with reference to the goal or task, whereas movements
need not be linked to specific tasks. One or more tract variables contribute to
a gesture. So, for speech, gestures are defined in terms of vocal-tract
constrictions. Velic and glottal gestures are modeled as deriving directly
from, respectively, the velic and glottal aperture tract variables. In each case,
a single dynamical equation specifies the gesture. Lingual gestures depend on
two tract variables, so each one is specified by two dynamical equations, one
for the place and one for the degree of constriction.
To make a gesture, the activities of all the articulators that contribute to
the relevant tract variables are coordinated. That is, the components of the
effector systems work as functional units, specified by the dynamical equa-
tions. The zero lip aperture needed to produce a [b] closure is achieved by
some combination of raising the jaw and lower lip and lowering the upper lip.
In a sequence like [aba], the jaw will be in a low (nonraised) position to
achieve the open aperture for the two [a]s, whereas it will be much higher for
the close vowels of [ibi]. Consequently, the lips are likely to show different
patterns of activity to achieve the aperture for [b] in the two sequences. The
task is the same for [b] in both [aba] and [ibi], and the tract variables (and


hence the effector systems and terminal devices) are the same. But the
physical details of how the task is achieved differ (see e.g. Sussman,
MacNeilage, and Hanson 1973).
The observed movements are governed by the dynamics underlying a series
of discrete, abstract gestures that overlap in time and space. At its present
stage of development, the task-dynamic model does not specify the sequenc-
ing or relative timing of the gestures, although work is in progress to
incorporate such intergestural coordination into the model (Saltzman and
Munhall 1989). The periods of activation of the tract variables are thus at
present controlled by a gestural score which is either written by the
experimenter or generated by rule (Browman et al. 1986). In addition to
giving information about the timing of gestures, the gestural score also
specifies a set of dynamic parameters (such as the spatial target) that govern
the behavior of each tract variable in a gesture. Thus the gestural score
contains information on the relative timing and dynamic parameters
associated with the gestures in a given utterance; this information is the input
to the task-dynamic model proper, which determines the behavior of
individual articulators. In this volume, Browman and Goldstein assign to the
gestural score all the language-particular phonetic/phonological structure
necessary to produce the utterance.
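
To make the notion of a gestural score more concrete, the sketch below shows one way such a score could be represented as data: each entry names a tract variable, an activation interval, and the dynamic parameters that govern the gesture. This is my own illustration, not the Haskins implementation; every field name and numerical value is invented.

```python
from dataclasses import dataclass

@dataclass
class Gesture:
    """One entry in a toy gestural score (illustrative values only)."""
    tract_variable: str   # e.g. "lip aperture", "tongue-body constriction degree"
    onset_ms: float       # start of the activation interval
    offset_ms: float      # end of the activation interval
    target: float         # equilibrium position x0 for the tract variable
    stiffness: float      # spring constant k (abstract units)
    damping_ratio: float  # 1.0 = critical damping, as in the text

# A hypothetical fragment of a score for [aba]: an open vocalic gesture
# overlapped by a bilabial closing gesture (all numbers are made up).
score = [
    Gesture("tongue-body constriction degree", 0.0, 250.0,
            target=12.0, stiffness=8.0, damping_ratio=1.0),
    Gesture("lip aperture", 120.0, 200.0,
            target=0.0, stiffness=20.0, damping_ratio=1.0),
]
```

Representing the score this way also makes explicit that the task-dynamic model proper receives only activation intervals and parameter values, and derives the articulator movements from them.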
The system derives part of its success from its assumption that continuous
movement trajectories can be analyzed in terms of discrete gestures. This
assumption is consistent with the characterization of speech as successions of
concurrently occurring tasks described above. Two important consequences
of using gestures are, first, that the basic units of speech are modeled as
movements towards targets rather than as static target positions, and second,
that we can work with abstract, essentially invariant units in a way that
produces surface variability. While the gestures may be invariant, the
movements associated with them can be affected by other gestures, and thus
vary with context. Coarticulation is thus modeled as inherent within the
speech-production process.

1.3 Some details of how the model works


The original ideas for task dynamics grew out of work on movement control
in many laboratories, especially in the Soviet Union and the USA (see
Bernstein 1967; Greene 1971), and were later developed in the context of
action theory (e.g. Fowler et al. 1980; Kelso et al. 1980; Kelso and Tuller
1984). More recently, Saltzman (1986) has contributed the equations that
allow the model to be implemented for speech. (For applications to limb
movement, see Saltzman and Kelso [1987].) The equations involve math-


ematics whose details few nonspecialists are likely to understand - I certainly


do not. The important thing from the point of view of understanding the
significance of task dynamics is that the equations determine the possible
movement trajectories for each tract variable, or functional grouping of
articulators, and also how these trajectories will combine for each gesture
(grouping of tract variables) and for each context of overlapping gestures. In
general, motion associated with a single tract variable is governed by means
of a differential equation, and matrix transformations determine how the
movements for component articulators contribute to this motion.
The type of differential equation that is used in modeling movement
depends on the characteristics of that movement. For speech, the behavior of
a tract variable is viewed by task dynamics as a movement towards a spatial
target. This type of task is called a point-attractor task. It is modeled, for each
tract variable, with a differential equation that describes the behavior of a
mass connected to a spring and a damper. Movements of the mass represent
changes in the value of the tract variable (e.g. changes in lip aperture). The
mass is entirely abstract and is considered to be of constant size, with an
arbitrary value of one. The spring can be thought of as pulling the tract
variable towards its target value. The resting or equilibrium position of the
spring represents the target for the tract variable and the mass moves because
the spring tries to reach equilibrium. It is as if one end of the spring is
attached to the mass, and the other end is moved around in space by the
gestural score to successive target locations. When one end is held at a target
location, the mass will begin to move towards that location, because the
spring will begin to move towards its equilibrium position and drag the mass
with it.
When an undamped mass-spring system is set into motion, the result is a
sinusoidal oscillation. Damping introduces decaying oscillations or even non-
oscillatory movement, and can have an enormous effect on movement
patterns. The type of damping introduced into the mass-spring system affects
the duration and trajectory of a movement. Saltzman and Munhall (1989:
346, 378) imply that the damping coefficient for a tract variable (as well as all
other parameters except mass) will vary with the phonetic category of the
sound being produced, at least when estimated from actual speech data.
While the nature of damping is being actively investigated, in practice only
one form, critical damping, is used for nonlaryngeal gestures at present.
(Laryngeal gestures are undamped.) In a critically damped mass-spring
system, the mass does not oscillate sinusoidally, but only asymptotes towards
the equilibrium (or target) position. In other words, the mass moves
increasingly slowly towards the target and never quite reaches it; there is no
physical oscillation around the target, and, except under certain conditions,

no "target overshoot." One consequence of using critically damped trajec-
tories is that the controlled component of each gesture is realized only as a
movement towards the target.
The assumption of constant mass and the tendency not to vary the degree
of damping mean that we only need consider how changes in the state of the
spring affect the pattern of movement of a tract variable. The rate at which
the mass moves towards the target is determined by how much the spring is
stretched, and by how stiff the spring is. The amount of stretch, called
displacement, is the difference between the current location of the mass and
the new target location. The greater the displacement, the greater the peak
velocity (maximum speed) of movement towards equilibrium, since, under
otherwise constant conditions, peak velocity is proportional to displacement.
A stiff spring will move back to its resting position faster than a less stiff
spring. Thus, changes in the spring's stiffness affect not only the duration of a
movement, but also the ratio of its peak velocity to peak displacement. This
ratio is often used nowadays in work on movement.
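
These relationships can be made explicit with a short derivation (mine, not quoted from the sources cited; it assumes unit mass, critical damping, and movement beginning from rest). Writing y(t) for the distance of the mass from the target, d for the initial displacement, and ω = √k:

$$\ddot{y} + 2\omega\,\dot{y} + \omega^2 y = 0, \qquad y(0) = d, \quad \dot{y}(0) = 0$$
$$y(t) = d\,(1 + \omega t)\,e^{-\omega t}, \qquad \dot{y}(t) = -\,d\,\omega^2\, t\, e^{-\omega t}$$

The speed is greatest at t = 1/ω, where |ẏ| = dω/e. Peak velocity is therefore proportional to displacement when stiffness is held constant, while the ratio of peak velocity to peak displacement, ω/e = √k/e, depends only on stiffness, exactly as stated above.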
Displacement in the model can be quite reasonably related to absolute
physical displacement, for example to the degree of tongue body/jaw
displacement in a vocalic opening gesture. Stiffness, in contrast, represents a
control strategy relevant to the behavior of groups of muscles, and is not
directly equatable to the physiological stiffness of individual muscles
(although Browman and Goldstein [1989] raise the possibility that there may
be a relationship between the value of the stiffness parameter and bio-
mechanical stiffness).
In phonetic terms, changes in stiffness affect the duration of articulatory
movement. Less stiffness results in slower movement towards the target. The
phonetic result depends partly on whether there are also changes in the
durations and relative timing of the gestures. For example, if there were no
change in gestural activation time when stiffness was reduced, then the slower
movement could result in gestures effectively undershooting their targets. In
general, stiffness can be changed within an utterance to affect, for example,
the degree of stress of a syllable, and it can be changed to affect the overall
rate of speech. Changes in displacement (stretch) with no change in stiffness,
on the other hand, affect the maximum speed of movement towards the
target, but not the overall duration of the movement.
To generate movement trajectories, the task-dynamic model therefore
needs to know, for each relevant task variable, the current state of the system
(the current position and velocity of the mass), the new target position, and
values for the two parameters representing, respectively, the stiffness of the
hypothetical spring associated with the task variable and the type of friction
(damping) in the system. The relationships between these various parameters

are described by the following differential equation for a damped mass-
spring system.

mẍ + bẋ + k(x − x₀) = 0

where: m = mass associated with the task variable
b = damping of the system
k = stiffness of the spring
x₀ = equilibrium position of the spring (the target)
x = instantaneous value of the task variable (current location of the mass)
ẋ = instantaneous velocity of the task variable
ẍ = instantaneous acceleration of the task variable
(x − x₀) = instantaneous displacement of the task variable

Since, in the model, the mass m has a value of 1.0, and the damping
ratio, b/(2√(mk)), is also usually set to 1.0, then once the stiffness k and
target x₀ are specified, the equation can be solved for the motion over time
of the task variable x. (Velocity and acceleration are the first and second
time derivatives, respectively, of x, and so can be calculated when the time
function for x is known.) Solving for x at each successive point in time
determines the trajectory and rate of movement of the tract variable. Any
transient perturbation of the ongoing movement, as long as it is not too
great, is immediately and automatically compensated for because the
point-attractor equation specifies the movement characteristics of the
tract variable rather than of individual articulators.
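
A minimal numerical sketch of this point-attractor behavior is given below. It assumes unit mass and critical damping, as in the text; the stiffness, step size, and the brief velocity "tug" used to mimic a perturbation are arbitrary illustrative choices, not values from the actual model.

```python
import math

def point_attractor(target, k, x_init, t_max=1.0, dt=0.001, perturb=None):
    """Integrate x'' + b*x' + k*(x - target) = 0 with unit mass and
    critical damping (b = 2*sqrt(k)) by simple semi-implicit Euler steps.
    `perturb` is an optional (time, velocity_kick) pair that mimics a
    brief mechanical tug on the moving 'mass'."""
    b = 2.0 * math.sqrt(k)                    # critical damping for m = 1
    x, v, t = x_init, 0.0, 0.0
    trajectory = []
    while t < t_max:
        if perturb and abs(t - perturb[0]) < dt / 2:
            v += perturb[1]                   # transient perturbation
        a = -b * v - k * (x - target)         # acceleration from the equation
        v += a * dt
        x += v * dt
        trajectory.append((t, x))
        t += dt
    return trajectory

# Movement from 1.0 towards a target at 0.0, with a tug at t = 0.05 s:
traj = point_attractor(target=0.0, k=400.0, x_init=1.0, perturb=(0.05, 5.0))
print(traj[-1])   # the final position is close to, but never exactly at, 0.0
```

Running this shows the tract variable approaching its target asymptotically and recovering from the mid-movement tug without any explicit corrective command, which is the point of specifying control at the tract-variable level.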
Matrix transformations in the task-dynamic system determine how
much each component articulator contributes to the movement of the
tract variable. These equations use a set of articulator weightings which
specify the relative contributions of the component articulators of the
tract variable to a given gesture. These weightings comprise an additional
set of parameters in the task-dynamic model. They are gesture-specific,
and so are included in the gestural score.
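
The toy function below is a drastically simplified stand-in for those matrix transformations: it merely apportions a required change in one tract variable among its component articulators in proportion to gesture-specific weights. The articulator names and weight values are invented, and the real model distributes tract-variable motion through articulator-space equations of motion rather than static displacements.

```python
def apportion_change(delta, weights):
    """Apportion a required change in a tract variable (e.g. lip aperture)
    among its component articulators in proportion to gesture-specific
    weights; returns each articulator's share of the change."""
    total = sum(weights.values())
    return {name: delta * w / total for name, w in weights.items()}

# For a [b] closure the lip aperture must shrink by, say, 14 units; in this
# made-up example the jaw is weighted to contribute most of the change,
# roughly as it would after an open vowel like [a].
print(apportion_change(delta=-14.0,
                       weights={"jaw": 0.6, "lower_lip": 0.3, "upper_lip": 0.1}))
```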
The gestural score thus specifies all the gestural parameters: the
equilibrium or target position, x₀; the stiffness, k; the damping ratio that,
together with the stiffness, determines the damping factor b; and the
articulator weightings. It also specifies how successive gestures are co-
ordinated in time.
As mentioned above, the issue of how successive gestures are coordi-
nated in time is difficult and is currently being worked on (Saltzman and
Munhall 1989). A strategy that has been used is to specify the phase in one
gesture with respect to which a second gesture is coordinated. The

definition of phase for these purposes has involved the concept of phase
space - a two-dimensional space in which velocity and displacement are
the coordinates (Kelso and Tuller 1984). Phase space allows a phase to be
assigned to any kind of movement, but for movements that are essentially
sinusoidal, phase has its traditional meaning. So, for example, starting a
second gesture at 180 degrees in the sinusoidal movement cycle of a first
gesture would mean that the second one began when the first had just
completed half of its full cycle. But critically damped movements do not
lend themselves well to this kind of strategy: under most conditions, they
never reach a phase of even 90 degrees, as defined in phase space.
Browman and Goldstein's present solution to this problem uses the
familiar notion of phase of a sinusoidal movement, but in an unusual way.
They assume that each critically damped gesture can also be described in
terms of a cycle of an underlying undamped sinusoid. The period of this
undamped cycle is calculated from the stiffness associated with the
particular gesture; it represents the underlying natural frequency of the
gesture, whose realization is critically damped. Two gestures are coordi-
nated in time by specifying a phase in the underlying undamped cycle for
each one, and then making those (underlying) phases coincide in time. For
example, two gestures might be coordinated such that the point in one
gesture that is represented by an underlying phase of 180 degrees
coincided in time with the point represented by an underlying phase of 240
degrees in the other. This approach differs from the use of phase
relationships described above in that phase in a gesture does not depend
on the actual movement associated with the gesture, but rather on the
stiffness (i.e. underlying natural frequency) for that gesture. This ap-
proach, together with an illustration of critical damping, is described in
Browman and Goldstein (1990: 346-8).
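
A small sketch of this phasing scheme as I understand it (the assumption of unit mass and all stiffness values are mine): the underlying undamped cycle of a gesture with stiffness k has period 2π√(m/k), so a phase expressed in degrees picks out a time after the gesture's onset, and requiring two such phases to coincide fixes the onset of the second gesture relative to the first.

```python
import math

def undamped_period(k, m=1.0):
    """Period of the underlying undamped cycle for a gesture of stiffness k."""
    return 2.0 * math.pi * math.sqrt(m / k)

def second_onset(onset1, k1, phase1_deg, k2, phase2_deg):
    """Onset time of gesture 2 such that the point at phase1_deg in gesture 1's
    underlying undamped cycle coincides in time with the point at phase2_deg in
    gesture 2's cycle. (An illustrative reconstruction of the phasing idea, not
    the actual Browman and Goldstein implementation.)"""
    t_phase1 = onset1 + (phase1_deg / 360.0) * undamped_period(k1)
    return t_phase1 - (phase2_deg / 360.0) * undamped_period(k2)

# The example from the text: 180 degrees of one gesture made to coincide with
# 240 degrees of another (stiffness values invented).
print(second_onset(onset1=0.0, k1=400.0, phase1_deg=180.0,
                   k2=900.0, phase2_deg=240.0))
```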
Coordinating gestures in terms of underlying phases means that the
timing of each gesture is specified intrinsically rather than by using an
external clock to time movements. Changing the phases specified by the
gestural score can therefore affect the number of gestures per unit time,
but not the rate at which each gesture is made. If no changes in stiffness
are introduced, then changing the phase specifications will change the
amount of overlap between gestures. In speech, this will affect the amount
of coarticulation, or coproduction as it is often called (Fowler 1980),
which can affect the style of speech (e.g. the degree of casualness) and,
indirectly, the overall rate of speech.
The discussion so far has described how movement starts - by the
gestural score feeding the task-dynamic model with information about
gestural targets, stiffnesses, and relative timing - but what causes a
movement to stop has not been mentioned. There is no direct command
given to stop the movement resulting from a gesture. The gestural score
specifies that a gesture is either activated, in which case controlled
movement towards the target is initiated, or else not activated. When a
gesture is no longer activated, the movements of the articulators involved
are governed by either of two factors: first, an articulator may participate
in a subsequent gesture; second, each articulator has its own inherent rest
position, described as a neutral attractor, and moves "passively" towards
this rest position whenever it is not involved in an "actively" controlled
tract variable that is participating in a gesture. This rest position should
not be confused with the resting or equilibrium position that is the target
of an actively controlled tract variable and is specified in the gestural
score. The inherent rest position is specific to an articulator, not a tract
variable, and is specified by standard equations in the task-dynamic
model. It may be language-specific (Saltzman and Munhall 1989) and
hence correspond to the "base of articulation" - schwa for English - in
which case it seems possible that it could also contribute towards
articulatory setting and thus be specific to a regional accent or an
individual.
A factor that has not yet been mentioned is how concurrent gestures
combine. The gestural score specifies the periods when the gestures are
activated. The task-dynamic model governs how the action of the articula-
tors is coordinated within a single gesture, making use of the articulator
weightings. When two or more gestures are concurrently active, they may
share a tract variable, or they may involve different tract variables but
affect a common articulator. In both cases, the influences of the overlap-
ping gestures are said to be blended. For blending within a shared tract
variable, the parameters associated with each gesture are combined either
by simple averaging, weighted averaging, or addition. (See Saltzman and
Munhall [1989] for more detailed discussion of gestural blending both
within and across tract variables.)
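
As a rough illustration of blending within a shared tract variable, the sketch below combines the target and stiffness parameters of co-active gestures by weighted averaging, one of the options mentioned above. The blending weights and parameter values are invented, and blending across tract variables through a shared articulator is not attempted here.

```python
def blend_weighted_average(gestures):
    """Blend the target and stiffness parameters of concurrently active
    gestures on one shared tract variable by weighted averaging.
    Each gesture is a (target, stiffness, blend_weight) triple."""
    total_w = sum(w for _, _, w in gestures)
    target = sum(t * w for t, _, w in gestures) / total_w
    stiffness = sum(k * w for _, k, w in gestures) / total_w
    return target, stiffness

# Two overlapping gestures competing for the same tract variable, one dominant
# (weight 2.0) and one weaker (weight 1.0); all values are made up.
print(blend_weighted_average([(0.0, 400.0, 2.0), (10.0, 150.0, 1.0)]))
```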

1.4 Evaluative remarks


Given the purpose of this paper, a lengthy critique is not appropriate, but
since the description so far has been uncritical, some brief evaluative
comments may be helpful. A particularly rich source of evaluation is the
theme issue of the Journal of Phonetics (1986, 14 [1]) dedicated to event
perception and action theory, although there have since been changes to
task dynamic theory, particularly as it applies to speech. For a more recent
account, see Saltzman and Munhall (1989).
The damped mass-spring model is attractive because it is simple, and
general enough to be applicable to any form of movement. Its use

represents a significant step forward in explaining skilled movement. But


in order to implement the model in a manageable way, a number of
simplifications and (somewhat) ad hoc decisions have been introduced. I
have a lot of sympathy with this approach, since it often allows us to
pursue the important questions instead of getting bogged down by details.
Eventually, however, the simplifications and ad hoc decisions have to be
dealt with; it is as well not to lose sight of them, so that the model can be
modified or replaced when their disadvantages outweigh their advantages.
One simplification seems to me to be that of abstract mass, and
consequently the relationship between mass and stiffness, since the two
can compensate for one another while both are abstract. This assumption
may require eventual modification. Consider what the tongue tip-blade
does to produce an alveolar trill on the one hand, and a laminal stop on
the other. For the trill, the relationships among the physical mass of the
tip, its biomechanical stiffness, and the aerodynamic forces are critical.
For the stop, the relationships between these factors are less critical
(though still important), but the tip and blade, acting now as a functional
unit, must have a much greater mass than the tip can have for the trill. As
far as I can see, task dynamics in its present form cannot produce a trill,
and I am not sure that it can produce a laminal as opposed to an apical
stop.
The point is that mass is always abstract in the task-dynamic system,
but for some articulations the real mass of an articulator is crucial. If it
turns out that physical mass does have to be used to account for sounds
like trills and apical vs. laminal stops, then it would be reasonable to
reevaluate the relationship between abstract and actual physical mass.
Including physical mass as well as abstract mass will force investigators to
take a close look at some of the other parameters in the movement
equations. The most obvious is how stiffness will be used as a control
variable. As long as stiffness and mass are both abstract, variation in
stiffness can substitute for variation in mass. But if stiffness in the model
comes to be related to biomechanical stiffness, as Browman and Goldstein
(1989) have suggested it might be, then the value of the biomechanical
mass becomes very important. Saltzman and Kelso (1987) describe for
skilled limb activities how abstract and biomechanical dynamics might be
related to one another.
An example of an apparently ad hoc solution to a problem is the choice
of critical damping for all nonglottal gestures. There seem to be two main
motivations for using critical damping: first, critical damping is a straight-
forward way of implementing the point-attractor equation to produce the
asymptotic type of movement typical of target-directed tasks; second, it

represents a simple compromise between the faster but overshooting
underdamped case, and the slower but non-overshooting overdamped
case. Critical damping is straightforward to use because, as mentioned
above, the damping factor b is specified indirectly via the damping ratio,
b/(2√(mk)), which is constrained to equal 1.0 in a critically damped
system. But since it includes the independent stiffness parameter k, then if
the ratio is constrained, b is a function of k. This dependence of damping
on stiffness may be reasonable for movement of a single articulator, as
used for the decay to the neutral rest position of individual uncontrolled
articulators, but it seems less reasonable for modeling gestures. Moreover,
although the damping factor is crucially important, the choice of critical
damping is not much discussed and seems relatively arbitrary in that other
types of damping can achieve similar asymptotic trajectories. (Fujimura
[1990: 377-81] notes that there are completely different ways of achieving
the same type of asymptotic trajectories.) I would be interested to know
why other choices have been rejected. One question is whether the same
damping factor should be used for all gestures. Is the trajectory close to
the target necessarily the same for a stop as for a fricative, for example? I
doubt it.
The method of coordinating gestures by specifying relative phases poses
problems for a model based on critically damped sinusoids. As mentioned
above, a critically damped system does not lend itself to a description in
terms of phase angles. Browman and Goldstein's current solution of
mapping the critically damped trajectory against an undamped sinusoid is
a straightforward empirical solution that has the advantage of being
readily understandable. It is arguably justifiable since they are concerned
more with developing a comprehensive theory of phonology-phonetics,
for which they need an implementable model, than with developing the
task-dynamic model itself. But the introduction of an additional level of
abstractness in the model - the undamped gestural cycle - to explain the
relative timing of gestures seems to me to need independent justification if
it is to be taken seriously.
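Purely as an illustration of that mapping (my own construction, with invented numbers), one can think of it as follows: the critically damped gesture is assigned a phase by reference to an undamped virtual oscillator with the gesture's own natural frequency, and intergestural timing is then stated as a target phase of that virtual cycle.

```python
# Illustrative sketch (assumed details, not Browman and Goldstein's code):
# phase is read off an undamped "virtual" cycle with the gesture's own
# natural frequency sqrt(k/m), even though the real trajectory is damped.
import math

def virtual_phase_deg(t, k=100.0, m=1.0):
    """Phase (degrees) of an undamped oscillator with frequency sqrt(k/m)."""
    omega = math.sqrt(k / m)
    return math.degrees(omega * t) % 360.0

# Start a second gesture when the first reaches 240 degrees of its virtual
# cycle (the 240-degree value is purely hypothetical):
t = 0.0
while virtual_phase_deg(t) < 240.0:
    t += 0.001
print(f"second gesture onset about {t:.3f} s after the first")
```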
Another issue is the relationship between the control of speech move-
ments and the vegetative activity of the vocal tract. The rest position
towards which an articulator moves when it is not actively controlled in
speech is unlikely to be the same as the rest position during quiet
breathing, for example. Similarly, widening of the glottis to resume
normal breathing after phonation is an active gesture, not a gradual
movement towards a rest position. These properties could easily be
included in an ad hoc way in the current model, but they raise questions of
how the model should account for different task-dynamic systems acting

on the same articulators - in this case the realization of volitional
linguistic intention coordinated with automatic behavior that is governed
by neural chemoreceptors and brainstem activity.
The attempt to answer questions such as how the task-dynamic model
accounts for different systems affecting the same articulators partly
involves questions about the role of the gestural score: is the gestural score
purely linguistic, and if so why; and to what extent does it (or something
like it) include linguistic intentions, as opposed to implementing those
intentions? One reason why these questions will be hard to answer is that
they include some of the most basic issues in phonology and phonetics.
The status of the neutral attractor (rest position of each articulator) is a
case in point. Since the neutral attractor is separately determined for each
articulator, it is controlled within the task-dynamic model in the current
system. But if the various neutral attractors define the base of articulation,
and if, as seems reasonable, the base of articulation is language-specific
and hence is not independent of the phonology of the language, then it
should be specified in the gestural score if Browman and Goldstein are
correct in stating (this volume) that all language-specific phonetic/phono-
logical structure is found there.
A related issue is how the model accounts for learning - the acquisition
of speech motor control. Developmental phonologists have tended to
separate issues of phonological competence from motoric skill. But the
role that Browman and Goldstein, for example, assign to the gestural
score suggests that these authors could take a very different approach to
the acquisition of phonology. The relationship between the organization
of phonology during development and in the adult is almost certainly not
simple, but to attempt to account for developmental phonology within the
task-dynamic (or articulatory phonology) model could help clarify certain
aspects of the model. It could indicate, for example, the extent to which
the gestural score can reasonably be considered to encompass phonologi-
cal primitives, and whether learned, articulator-specific skills like articula-
tory setting or base of articulation are best controlled by the gestural score
or within the task-dynamic system. Browman and Goldstein (1989) have
begun to address these issues.
Finally, there is the question of how much the model offers explanation
rather than description. The large number of variables and parameters
makes it likely that some observed movement patterns can be modelled in
more than one way. Can the same movement trajectory arise from
different parameter values, types of blending, or activation periods and
degree of gestural overlap? If this happens, how does one choose between
alternatives? At the tract variable level, should there be only one possible
way to model a given movement trajectory? In other words, if there are
several alternatives, is the diversity appropriate, or does it reduce the
model's explanatory power?
A related problem is that the model is as yet incomplete. This raises the
issue of whether revisions will affect in important ways the conclusions
derived from work with the current model. In particular, it is clear that the
current set of eight or nine tract variables is likely to need revision,
although those used are among the most fundamental. Browman and
Goldstein (1989) name some tongue and laryngeal variables that should
be added; in my discussion to their paper in this volume, I suggest that
aerodynamic variables should also eventually be included. Aerodynamic
control may be difficult to integrate into the task-dynamic model, which is
presently couched entirely in terms of the degree and location of constric-
tions. More importantly, control of aerodynamics involves articulatory as
well as respiratory factors and therefore may influence articulatory
movement patterns. It is an open question whether including such new
variables will invalidate any of the present results. The current set of
variables seems to capture important properties of speech production, and
it seems reasonable to assume that many of the insights obtained by using
them are valid. But as the questions addressed with the model become
more detailed, it becomes important to remember that the model is not yet
complete and that it will be revised.
Any model of a poorly understood complex system can be faulted for
oversimplifying and for ad hoc decisions. The important question at this
stage is what task dynamics has achieved. It is too early to give a final
answer, but there is no question that task dynamics is making significant
contributions. It sets out to account systematically for observed character-
istics of coordinated movement. To do so, it asks what the organizing
principles of coordinated movement are, framing its answers in dynamical
rather than physiological terms. Its explicit mathematical basis means that
it is testable, and its generality makes it applicable to any form of skilled
movement. Although future models might be connected more closely to
physiology, a useful first step is to concentrate on the basic organizational
principles - on the dynamics governing the functional groupings of
coordinated articulator movement.
In the analysis of speech, task dynamics unifies the traditional issues of
coarticulation, speech rate, and speech style into a single framework. By
providing a vocabulary and a mechanism for investigating the similarities,
it brings the differences into sharper focus. This same vocabulary and
framework promises a fresh approach to the discussion of linguistic units.
In the recent work of Browman and Goldstein and of Saltzman and
Munhall, for example, we find a welcome unification and operationalization
of terms in phonology on the one hand and phonetics on the other.
Whether or not the approach embodied in task dynamics becomes
generally accepted, the debate on the issues it raises promises to be both
lively and valuable.

Concluding remarks
Task dynamics offers a systematic, general account of the control of
skilled movement. As befits the data, it is a complicated system involving
many parameters and variables. In consequence, there is tension between
the need to explore the system itself, and the need to use it to explore
problems within (in our case) phonology and phonetics. I restrict these
final comments to the application of task dynamics to phonological and
phonetic processes of speech production, rather than to details of the
execution of skilled movement in general.
As a system in itself, I am not convinced that task dynamics will solve
traditional problems of phonetics and phonology. It is not clear, for
example, that its solutions for linguistic intentions, linguistic units, and the
invariance-variability issue will prove more satisfactory than other solu-
tions. Moreover, there are arguments for seeking a model of speech motor
control that is couched in physiological rather than dynamic terms,
and that accounts for the learning of skilled movements as well as for their
execution once they are established. Nevertheless, partly because it is
nonspecific, ambitious enough to address wide-ranging issues, and explicit
enough to allow alternative solutions to be tried within the system, task
dynamics is well worth developing because it brings these issues into focus.
The connections it draws between the basic organizing principles of skilled
movement and potential linguistic units raise especially interesting ques-
tions. It may be superseded by other models, but those future models are
likely to owe some of their properties to research within the task-dynamic
approach. Amongst the attributes I find particularly attractive are the
emphasis on speech production as a dynamic process, and the treatment
of coarticulation, rephrased as coproduction, as an inherent property of
gestural dynamics, so that changes in rate and style require relatively
simple changes in global parameter values, rather than demanding new
targets and computation of new trajectories for each new type of
utterance.
It is easier to evaluate the contribution of task dynamics in exploring
general problems of phonology and phonetics. The fact that the task-
dynamic model is used so effectively testifies to its value. Browman and
Goldstein, for example, use the model to synthesize particular speech
patterns, and then use the principles embodied in the model to draw out
the implications for the organization of phonetics and phonology. But

beyond the need to get the right general effects, it seems to me that the
details of the implementation are not very important in this approach. The
value of the model in addressing questions like those posed by Browman
and Goldstein is therefore as much in its explicitness and relative ease of
use as in its details. The point, then, at this early stage, is that it does not
really matter if particular details of the task dynamics are wrong. The
value of the task-dynamic model is that it enables a diverse set of problems
in phonology and phonetics to be studied within one framework in a way
that has not been done before. The excitement in this work is that it offers
the promise of new ways of thinking about phonetic and phonological
theory. Insofar as task dynamics allows description of diverse phenomena
in terms of general physical laws, it provides insights that are as near
as we can currently get to explanation.

2
"Targetless" schwa: an articulatory analysis

CATHERINE P. BROWMAN and LOUIS GOLDSTEIN

2.1 Introduction
One of the major goals for a theory of phonetic and phonological structure
is to be able to account for the (apparent) contextual variation of phonologi-
cal units in as general and simple a way as possible.* While it is always
possible to state some pattern of variation using a special "low-level" rule
that changes the specification of some unit, recent approaches have
attempted to avoid stipulating such rules, and instead propose that variation
is often the consequence of how the phonological units, properly defined, are
organized. Two types of organization have been suggested that lead to the
natural emergence of certain types of variation: one is that invariantly
specified phonetic units may overlap in time, i.e., they may be coproduced
(e.g., Fowler 1977, 1981a; Bell-Berti and Harris 1981; Liberman and Matt-
ingly 1985; Browman and Goldstein 1990), so that the overall tract shape and
acoustic consequences of these coproduced units will reflect their combined
influence; a second is that a given phonetic unit may be unspecified for some
dimension(s) (e.g., Ohman 1966b; Keating 1988a), so that the apparent
variation along that dimension is due to continuous trajectories between
neighboring units' specifications for that dimension.
A particularly interesting case of contextual variation involves reduced
(schwa) vowels in English. Investigations have shown that these vowels are
particularly malleable: they take on the acoustic (Fowler 1981a) and articula-
tory (e.g., Alfonso and Baer 1982) properties of neighboring vowels. While
Fowler (1981a) has analyzed this variation as emerging from the coproduc-
tion of the reduced vowels and a neighboring stressed vowel, it might also be

*Our thanks to Ailbhe Ni Chasaide, Carol Fowler, and Doug Whalen for criticizing versions of
this paper. This work was supported by NSF grant BNS 8820099 and NIH grants HD-01994
and NS-13617 to Haskins Laboratories.


the case that schwa is completely unspecified for tongue position. This would
be consistent with analyses of formant trajectories for medial schwa in
trisyllabic sequences (Magen 1989) that have shown that F2 moves (roughly
continuously) from a value dominated by the preceding vowel (at onset) to
one dominated by the following vowel (at offset). Such an analysis would
also be consistent with the phonological analysis of schwa in French
(Anderson 1982) as an empty nucleus slot. It is possible (although this is not
Anderson's analysis), that the empty nucleus is never filled in by any
specification, but rather there is a specified "interval" of time between two
full vowels in which the tongue continuously moves from one vowel to
another.
The computational gestural model being developed at Haskins Laborator-
ies (e.g. Browman et al. 1986; Browman and Goldstein, 1990; Saltzman et al.
1988a) can serve as a useful vehicle for testing these (and other) hypotheses
about the phonetic/phonological structure of utterances with such reduced
schwa vowels. As we will see, it is possible to provide a simple, abstract
representation of such utterances in terms of gestures and their organization
that can yield the variable patterns of articulatory behavior and acoustic
consequences that are observed in these utterances.
The basic phonetic/phonological unit within our model is the gesture,
which involves the formation (and release) of a linguistically significant
constriction within a particular vocal-tract subsystem. Each gesture is
modeled as a dynamical system (or set of systems) that regulates the time-
varying coordination of individual articulators in performing these constric-
tion tasks (Saltzman 1986). The dimensions along which the vocal-tract goals
for constrictions can be specified are called tract variables, and are shown in
the left-hand column of figure 2.1. Oral constriction gestures are defined in
terms of pairs of these tract variables, one for constriction location, one for
constriction degree. The right-hand side of the figure shows the individual
articulatory variables whose motions contribute to the corresponding tract
variable.
The computational system sketched in figure 2.2 (Browman and Gold-
stein, 1990; Saltzman et al. 1988a) provides a representation for arbitrary
(English) input utterances in terms of such gestural units and their organiza-
tion over time, called the gestural score. The layout of the gestural score is
based on the principles of intergestural phasing (Browman and Goldstein
1990) specified in the linguistic gestural model. The gestural score is input to
the task-dynamic model (Saltzman 1986; Saltzman and Kelso 1987), which
calculates the patterns of articulator motion that result from the set of active
gestural units. The articulatory movements produced by the task-dynamic
model are then input to an articulatory synthesizer (Rubin, Baer, and
Mermelstein 1981) to calculate an output speech waveform. The operation of


Tract variable                               Articulators involved

LP    lip protrusion                         upper and lower lips, jaw
LA    lip aperture                           upper and lower lips, jaw
TTCL  tongue-tip constriction location       tongue-tip, tongue-body, jaw
TTCD  tongue-tip constriction degree         tongue-tip, tongue-body, jaw
TBCL  tongue-body constriction location      tongue-body, jaw
TBCD  tongue-body constriction degree        tongue-body, jaw
VEL   velic aperture                         velum
GLO   glottal aperture                       glottis

Figure 2.1 Tract variables and associated articulators

the task-dynamic model is assumed to be "universal." (In fact, it is not even
specific to speech, having originally been developed [Saltzman and Kelso
1987] to describe coordinated reaching movements.) Thus, all of the lan-
guage-particular phonetic/phonological structure must reside in the gestural
score - in the dynamic parameter values of individual gestures, or in their
relative timing. Given this constraint, it is possible to test the adequacy of
some particular hypothesis about phonetic structure, as embodied in a
particular gestural score, by using the model to generate the articulatory
motions and comparing these to observed articulatory data.
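For concreteness, the tract-variable/articulator pairing of figure 2.1 can be written down as plain data; the Python dictionary below is only a transcription of the figure, not the representation GEST itself uses.

```python
# Tract variables and contributing articulators, transcribed from figure 2.1.
# The dictionary format is illustrative, not the format of the gestural model.
TRACT_VARIABLES = {
    "LP":   ("lip protrusion",                    ["upper lip", "lower lip", "jaw"]),
    "LA":   ("lip aperture",                      ["upper lip", "lower lip", "jaw"]),
    "TTCL": ("tongue-tip constriction location",  ["tongue tip", "tongue body", "jaw"]),
    "TTCD": ("tongue-tip constriction degree",    ["tongue tip", "tongue body", "jaw"]),
    "TBCL": ("tongue-body constriction location", ["tongue body", "jaw"]),
    "TBCD": ("tongue-body constriction degree",   ["tongue body", "jaw"]),
    "VEL":  ("velic aperture",                    ["velum"]),
    "GLO":  ("glottal aperture",                  ["glottis"]),
}
```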
The computational model can thus be seen as a tool for evaluating the
articulatory (and acoustic) consequences of hypothesized aspects of gestural
structure. In particular, it is well suited for evaluating the consequences of the
organizational properties discussed above: (1) underspecification and (2)
temporal overlap. Gestural structures are inherently underspecified in the
sense that there are intervals of time during which the value of a given tract
variable is not being controlled by the system; only when a gesture defined
along that tract variable is active is such control in place.

Figure 2.2 Overview of GEST: gestural computational model (flow from intended utterance through the linguistic gestural model to the articulatory synthesizer and output speech)

This underspecification can be seen in figure 2.3, which shows the gestural score for the
utterance /pam/. Here, the shaded boxes indicate the gestures, and are
superimposed on the tract-variable time functions produced when the
gestural score is input to the task-dynamic model. The horizontal dimension
of the shaded boxes indicates the intervals of time during which each of the
gestural units is active, while the height of the boxes corresponds to the
"target" or equilibrium position parameter of a given gesture's dynamical
control regime. See Hawkins (this volume) for a more complete description
of the model and its parameters.
Note that during the activation interval of the initial bilabial closure
gesture, Lip Aperture (LA - vertical distance between the two lips) gradually
decreases, until it approaches the regime's target. However, even after the
regime is turned off, LA shows changes over time. Such "passive" tract-
variable changes result from two sources: (1) the participation of one of the
(uncontrolled) tract variable's articulators in some other tract variable which
is under active gestural control, and (2) an articulator-specific "neutral" or
"rest" regime, that takes control of any articulator which is not currently
active in any gesture. For example, in the LA case shown here, the jaw
contributes to the Tongue-Body constriction degree (TBCD) gesture (for the
vowel) by lowering, and this has the side effect of increasing LA. In addition,
the upper and lower lips are not involved in any active gesture, and so move
towards their neutral positions with respect to the upper and lower teeth,
thus further contributing to an increase in LA. Thus, the geometric structure
of the model itself (together with the set of articulator-neutral values)
predicts a specific, well-behaved time function for a given tract variable, even

Figure 2.3 Gestural score and generated motion variables for /pam/ (panels, top to bottom: velic aperture, tongue-body constriction degree, lip aperture, glottal aperture; time axis 0-400 msec.). The input is specified in ARPAbet, so /pam/ = ARPAbet input string /paam/. Within each panel, the height of the box indicates degree of opening (aperture) of the relevant constriction: the higher the curve (or box), the greater the amount of opening.

when it is not being controlled. Uncontrolled behavior need not be stipulated
in any way. This feature of the model is important to being able to test the
hypothesis that schwa may not involve an active gesture at all.
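A toy simulation of this behavior (entirely my own construction, with invented constants, and lumping the jaw and lip contributions into a single neutral regime) shows the pattern: lip aperture is driven toward a closure target while the bilabial gesture is active, and afterwards drifts toward a neutral value, so the tract variable keeps changing even though it is no longer actively controlled.

```python
# Toy sketch of "passive" tract-variable change (invented constants, not the
# model's actual parameter values): active closure regime, then neutral regime.
import math

def step(x, v, target, k, m=1.0, dt=0.001):
    """One critically damped step toward the current target."""
    b = 2.0 * math.sqrt(m * k)
    a = (-b * v - k * (x - target)) / m
    v += a * dt
    return x + v * dt, v

la, v = 12.0, 0.0            # lip aperture in mm (arbitrary starting value)
trace = []
for i in range(400):         # 400 ms in 1 ms steps
    if i < 150:              # bilabial closure gesture active: target LA = 0
        la, v = step(la, v, target=0.0, k=800.0)
    else:                    # gesture off: neutral/rest regime takes over
        la, v = step(la, v, target=8.0, k=40.0)
    trace.append(la)
print(round(min(trace), 2), "->", round(trace[-1], 2))  # closes, then reopens passively
```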
The second useful aspect of the model is the ability to predict consequences
of the temporal overlap of gestures, i.e., intervals during which there is more
than one concurrently active gesture. Browman and Goldstein (1990) have
shown that the model predicts different consequences of temporal overlap,
depending on whether the overlapping gestures involve the same or different
tract variables, and that these different consequences can actually be
observed in allophonic variations and "casual speech" alternations. Of
particular importance to analyzing schwa is the shared tract variable case,
since we will be interested in the effects of overlap between an active schwa
gesture (if any) and the preceding or following vowel gesture, all of which
would involve the Tongue-Body tract variables (TBCD, and Tongue-Body
constriction location - TBCL). In this case, the dynamic parameter values for
the overlapping gestures are "blended," according to a competitive blending
dynamics (Saltzman et al. 1988a; Saltzman and Munhall 1989). In the
examples we will be examining, the blending will have the effect of averaging
the parameter values. Thus, if both gestures were coextensive for their entire
activation intervals, neither target value would be achieved; rather, the value
of the tract variable at the end would be the average of their targets.
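Under the simplifying assumption of equal blending weights (the numbers below are invented), the competitive blending reduces to parameter averaging, which can be shown in a few lines:

```python
# Sketch: equal-weight blending of concurrently active gestures on the same
# tract variable averages their dynamic parameters (targets here are made up).
def blend(active_gestures):
    """Equal-weight blend of concurrently active regimes on one tract variable."""
    n = len(active_gestures)
    return {
        "target":    sum(g["target"] for g in active_gestures) / n,
        "stiffness": sum(g["stiffness"] for g in active_gestures) / n,
    }

schwa = {"target": 0.9, "stiffness": 6.0}    # hypothetical TBCD parameters
vowel_a = {"target": 1.5, "stiffness": 6.0}
print(blend([schwa, vowel_a]))               # blended target 1.2: neither is reached
```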
In this paper, our strategy is to analyze movements of the tongue in
utterances with schwa to determine if the patterns observed provide evidence
for a specific schwa tongue target. Based on this analysis, specific hypotheses
about the gestural overlap in utterances with schwa are then tested by means
of computer simulations using the gestural model described above.

2.2 Analysis of articulatory data


Using data from the Tokyo X-ray archive (Miller and Fujimura 1982), we
analyzed /pV1pə'pV2pə/ utterances produced by a speaker of American
English, where V1 and V2 were all possible combinations of /i, e, a, A, U/.
Utterances were read in short lists of seven or eight items, each of which had
the same VI and different V2s. One token (never the initial or final item in a
list) of each of the twenty-five utterance types was analyzed. The microbeam
data tracks the motion of five pellets in the mid-sagittal plane. Pellets were
located on the lower lip (L), the lower incisor for jaw movement (J), and the
midline of the tongue: one approximately at the tongue blade (B), one at the
middle of the tongue dorsum (M), and one at the rear of the tongue dorsum
(R).
Ideally, we would use the information in tongue-pellet trajectories to infer
a time-varying representation of the tongue in terms of the dimensions in
which vowel-gesture targets are defined, e.g., for our model (or for Wood
1982), location and degree of tongue-body constriction. (For Ladefoged and
Lindau [1989], the specifications would be rather in terms of formant
frequencies linked to the factors of front-raising and back-raising for the
tongue.) Since this kind of transformation cannot currently be performed
with confidence, we decided to describe the vowels directly in terms of the
tongue-pellet positions. As the tongue-blade pellet (B) was observed to be
largely redundant in these utterances (not surprising, since they involve only
vowels and bilabial consonants), we chose to measure the horizontal (X) and
vertical (Y) positions for the M and R pellets for each vowel. While not ideal,
the procedure at least restricts its a priori assumption about the parameteri-
zation of the tongue shape to that inherent in the measurement technique.
The first step was to find appropriate time points at which to measure the
position of the pellets for each vowel. The time course of each tongue-pellet
dimension (MX, MY, RX, RY) was analyzed by means of an algorithm that
detected displacement extrema (peaks and valleys). To the extent that there is
a characteristic pellet value associated with a given vowel, we may expect to
see such a displacement extremum, that is, movement towards some value,
then away again. The algorithm employed a noise level of one X-ray grid unit

(approximately 0.33 mm); thus, movements of a single unit in one direction
and back again did not constitute extrema. Only the interval that included
the full vowels and the medial schwa was analyzed; final schwas were not
analyzed. In general, an extremum was found that coincided with each full
vowel, for each pellet dimension, while such an extremum was missing for
schwa in over half the cases. The pellet positions at these extrema were used
as the basic measurements for each vowel. In cases where a particular pellet
dimension had no extremum associated with a vowel, a reference point was
chosen that corresponded to the time of an extremum of one of the other
pellets. In general, MY was the source of these reference points for full
vowels, and RY was the source for schwa, as these were dimensions that
showed the fewest missing extrema. After the application of this algorithm,
each vowel in each utterance was categorized by the value at a single
reference point for each of the four pellet dimensions. Since points were
chosen by looking only at data from the tongue pellets themselves, these are
referred to as the "tongue" reference points.
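The following sketch shows one way such an extremum picker could work; the published description does not spell out the algorithm, so the logic below is an assumption, keeping only the stated noise threshold of one grid unit.

```python
# Sketch of a displacement-extremum picker with a noise threshold (assumed
# logic): a candidate peak or valley is recorded only once the signal has
# moved away from it by more than `noise` (one X-ray grid unit, ~0.33 mm).
def extrema(signal, noise=1.0):
    points = []
    anchor = 0                      # index of the current candidate extremum
    direction = 0                   # +1 while rising, -1 while falling
    for i in range(1, len(signal)):
        if direction >= 0 and signal[i] > signal[anchor]:
            anchor, direction = i, 1
        elif direction <= 0 and signal[i] < signal[anchor]:
            anchor, direction = i, -1
        elif abs(signal[i] - signal[anchor]) > noise:
            points.append(anchor)   # genuine reversal: record the extremum
            anchor, direction = i, -direction
    return points

print(extrema([0, 2, 4, 3.5, 4.2, 1, 0]))   # -> [4]; the 0.5-unit dip is ignored
```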
To illustrate this procedure, figure 2.4a shows the time courses of the M, R,
and L pellets (only vertical for L) for the utterance /pipə'pipə/ with the
extrema marked with dashed lines. The acoustic waveform is displayed at the
top. For orientation, note that there are four displacement peaks marked for
LY, corresponding to the raising of the lower lip for the four bilabial-closure
gestures for the consonants. Between these peaks three valleys are marked,
corresponding to the opening of the lips for the three vowels. For MX, MY,
and RX, an extremum was found associated with each of the full vowels and
the medial schwa. For RY, a peak was found for schwa, but not for V1.
While there is a valley detected following the peak for schwa, it occurs during
the consonant closure interval, and therefore is not treated as associated with
V2. Figure 2.4b shows the same utterance with the complete set of "tongue"
reference points used to characterize each vowel. Reference points that have
been copied from other pellets (MY in both cases) are shown as solid lines.
Note that the consonant-closure interval extremum has been deleted.
Figure 2.5 shows the same displays for the utterance /pipə'papə/. Note
that, in (a), extrema are missing for schwa for MX, MY, and RX. This is
typical of cases in which there is a large pellet displacement between V1 and
V2. The trajectory associated with such a displacement moves from V1 to V2,
with no intervening extremum (or even, in some cases, no "flattening" of the
curve).
As can be seen in figures 2.4 and 2.5, the reference points during the schwa
tend to be relatively late in its acoustic duration. As we will be evaluating the
relative contributions of V1 and V2 in determining the pellet positions for
schwa, we decided also to use a reference point earlier in the schwa. To
obtain such a point, we used the valley associated with the lower lip for the
Figure 2.4 Pellet time traces for /pipə'pipə/ (acoustic waveform and traces for MX, MY, RX, RY, and LY over 700 msec.). The higher the trace, the higher (vertical) or more fronted (horizontal) the corresponding movement. (a) Position extrema indicated by dashed lines. (b) "Tongue" reference points indicated by dashed and solid lines (for Middle and Rear pellets).

Figure 2.5 Pellet time traces for /pipə'papə/ (acoustic waveform and traces for MX, MY, RX, RY, and LY over 700 msec.). The higher the trace, the higher (vertical) or more fronted (horizontal) the corresponding movement. (a) Position extrema indicated by dashed lines. (b) "Tongue" reference points indicated by dashed and solid lines (for Middle and Rear pellets).

schwa - that is, approximately the point at which the lip opening is maximal.
This point, called the "lip" reference, typically occurs earlier in the (acoustic)
vowel duration than the "tongue" reference point, as can be seen in figures
2.4 and 2.5. Another advantage of the "lip" reference point is that all tongue
pellets are measured at the same moment in time. Choosing points at
different times for different dimensions might result in an apparent differen-
tial influence of V1 and V2 across dimensions. Two different reference points
were established only for the schwa, and not for the full vowels. That is, since
the full vowels provided possible environmental influences on the schwa, the
measure of that influence needed to be constant for comparisons of the "lip"
and "tongue" schwa points. Therefore, in analyses to follow, when "lip" and
"tongue" reference points are compared, these points differ only for the
schwa. In all cases, full vowel reference points are those determined using the
tongue extremum algorithm described above.
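A hypothetical helper capturing the "lip" reference point (names and indexing are illustrative): within the schwa's interval, take the frame at which the lower lip is lowest, i.e. where the lip opening is maximal, and read every tongue pellet at that same frame.

```python
def lip_reference_frame(ly, start, end):
    """Frame of minimum lower-lip height (maximal opening) in [start, end)."""
    return min(range(start, end), key=lambda i: ly[i])

def measure_at(frame, pellets):
    """pellets: dict of tracks, e.g. {'MX': [...], 'MY': [...], 'RX': [...], 'RY': [...]}."""
    return {name: track[frame] for name, track in pellets.items()}
```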

2.2.1 Results
Figure 2.6 shows the positions of the M (on the right) and R (on the left)
pellets for the full vowels plotted in the mid-sagittal plane such that the
speaker is assumed to be facing to the right. The ten points for a given vowel
are enclosed in an ellipse indicating their principal components (two stan-
dard deviations along each axis). The tongue shapes implied by these pellet
positions are consistent with cinefluorographic data for English vowels (e.g.,
Perkell 1969; Harshman, Ladefoged, and Goldstein, 1977; Nearey 1980). For
example, /i/ is known to involve a shape in which the front of the tongue is
bunched forward and up towards the hard palate, compared, for example, to
/e/, which has a relatively unconstricted shape. This fronting can be seen in
both pellets. In fact, over all vowels, the horizontal components of the
motion of the two pellets are highly correlated (r = 0.939 in the full vowel
data, between RX and MX over the twenty-five utterances). The raising for
/i/ can be seen in M (on the right), but not in R, for which /i/ is low - lower,
for example, than /a/. The low position of the back of the tongue dorsum for
/i/ can, in fact, be seen in mid-sagittal cinefluorographic data. Superimposed
tongue surfaces for different English vowels (e.g. Ladefoged 1982) reveal that
the curves for /i/ and /a/ cross somewhere in the upper pharyngeal region, so
that in front of this point, /i/ is higher than /a/, while behind this point, /a/ is
higher. This suggests that the R pellet in the current experiment is far enough
back to be behind this cross-over point. /u/ involves raising of the rear of the
tongue dorsum (toward the soft palate), which is here reflected in the raising
of both the R and M pellets. In general, the vertical components of the two
pellets are uncorrelated across the set of vowels as a whole (r = 0.020),

Figure 2.6 Pellet positions for full vowels, displayed in mid-sagittal plane with head facing to the right: Middle pellets on the right, Rear pellets on the left. The ellipses indicate two standard deviations along axes determined by principal-component analysis. Symbols I = IPA /i/, U = /u/, E = /e/, X = /A/, and A = /a/. Units are X-ray units (= 0.33 mm).

reflecting, perhaps, the operation of two independent factors such as "front-
raising" and "back-raising" (Ladefoged 1980).
The pellet positions for schwa, using the "tongue" reference points, are
shown in the same mid-sagittal plane in figure 2.7, with the full vowel ellipses
added for reference. The points are labeled by the identity of the following
vowel (V2) in (a) and by the preceding vowel (V1) in (b). Figure 2.8 shows the
parallel figure for schwa measurements at the "lip" reference point. In both
figures, note that the range of variation for schwa is less than the range of
variation across the entire vowel space, but greater than the variation for any
single full vowel. Variation in MY is particularly large compared to MY
variation for any full vowel. Also, while the distribution of the R pellet
positions appears to center around the value for unreduced /A/, which might
be thought to be a target for schwa, this is clearly not the case for the M
pellet, where the schwa values seem to center around the region just above
/e/. For both pellets, the schwa values are found in the center of the region
occupied by the full vowels. In fact, this relationship turns out to be quite
precise.
Figure 2.9 shows the mean pellet positions for each full vowel and for
schwa ("lip" and "tongue" reference points give the same overall means), as
well as the grand mean of pellet positions across all full vowels, marked by a
circle. The mean pellet positions for the schwa lie almost exactly on top of the

Figure 2.7 Pellet positions for schwa at "tongue" reference points, displayed in right-facing mid-sagittal plane as in figure 2.6. The ellipses are from the full vowels (figure 2.6), for comparison. Symbols I = IPA /i/, U = /u/, E = /e/, X = /A/, and A = /a/. Units are X-ray units (= 0.33 mm). (a) Schwa pellet positions labeled by the identity of the following vowel (V2). (b) Schwa pellet positions labeled by the identity of the preceding vowel (V1).

grand mean for both the M and R pellets. This pattern of distribution of
schwa points is exactly what would be expected if there were no independent
target for schwa but rather a continuous tongue trajectory from V1 to V2.
Given all possible combinations of trajectory endpoints (V1 and V2), we
would expect the mean value of a point located at (roughly) the midpoint of
Figure 2.8 Pellet positions for schwa at "lip" reference points, displayed as in figure 2.7 (including ellipses from figure 2.6). Symbols I = IPA /i/, U = /u/, E = /e/, X = /A/, and A = /a/. Units are X-ray units (= 0.33 mm). (a) Schwa pellet positions labeled by the identity of the following vowel (V2). (b) Schwa pellet positions labeled by the identity of the preceding vowel (V1).

these twenty-five trajectories to have the same value as the mean of the
endpoints themselves.
If it is indeed the case that the schwa can be described as a targetless point
on the continuous trajectory from V1 to V2, then we would expect that the
schwa pellet positions could be predicted from knowledge of V1 and V2

Figure 2.9 Mean pellet positions for full vowels and schwa, displayed in right-facing mid-sagittal plane as in figure 2.6. The grand mean of all the full vowels is indicated by a circled square. Units are X-ray units (= 0.33 mm).

positions alone, with no independent contribution of schwa. To test this, we
performed stepwise multiple linear regression analyses on all possible subsets
of the predictors V1 position, V2 position, and an independent schwa factor,
to determine which (linear) combinations of these three predictors best
predicted the position of a given pellet dimension during schwa. The analysis
finds the values of the b coefficients and the constant k, in equations like (1)
below, that give the best prediction of the actual schwa values.
(1) schwa(predicted) = b1*V1 + b2*V2 + k
The stepwise procedure means that variables are added into an equation such
as (1) one at a time, in the order of their importance to the prediction. The
procedure was done separately for equations with and without the constant
term k (using BMDP2R and 9R). For those analyses containing the constant
term (which is the y-intercept), k represents an independent schwa contribu-
tion to the pellet position - when it is the only predictor term, it is the mean
for schwa. Those analyses without the constant term (performed using 9R)
enabled the contributions of V1 and V2 to be determined in the absence of
this schwa component. Analyses were performed separately for each pellet
dimension, and for the "tongue" and "lip" reference points.
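A sketch of the same analysis using ordinary least squares (numpy) as a stand-in for the BMDP routines; the with/without-constant distinction corresponds to including or omitting a column of ones, and the data arrays would be the twenty-five per-utterance pellet values.

```python
# Ordinary-least-squares version of equation (1) (a stand-in for BMDP2R/9R).
import numpy as np

def fit_schwa(schwa, v1, v2, with_constant=True):
    """Least-squares fit of equation (1); returns coefficients and std. error."""
    y = np.asarray(schwa, dtype=float)
    cols = [np.asarray(v1, dtype=float), np.asarray(v2, dtype=float)]
    if with_constant:
        cols.append(np.ones(len(y)))          # k, the independent schwa term
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = len(y) - X.shape[1]
    return coef, float(np.sqrt(resid @ resid / dof))
```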
The results for the "tongue" points are shown in the right-hand columns of
table 2.1. For each pellet, the various combinations of terms included in the
equation are rank-ordered according to the standard error of the schwa
prediction for that combination, the smallest error shown at the top. In all
cases, the equation with all three terms gave the least error (which is
necessarily true). Interestingly, however, for MX, RX, and RY, the predic-
tion using the constant and V2 differed only trivially from that using all three
variables. This indicates that, for these pellets, V1 does not contribute
substantially to schwa pellet positions at the "tongue" reference point. In
addition, it indicates that an independent schwa component is important to
the prediction, because V2 alone, or in combination with V1, gives worse
prediction than V2 plus k.
For MY, all three terms seem to be important - removing any one of them
increases the error. Moreover, the second-best prediction involves deleting
the V2 term, rather than V1. The reduced efficacy of V2 (and increased
efficacy of V1) in predicting the MY value of schwa may be due, in part, to
the peak determination algorithm employed. When V1 or V2 was /a/ or /A/,
the criteria selected a point for MY that tended to be much later in the vowel
than the point chosen for the other pellet dimensions (figure 2.5b gives an
example of this). Thus, for V2, the point chosen for MY is much further in
time from the schwa point than is the case for the other dimensions, while for
V1, the point chosen is often closer in time to the schwa point.
The overall pattern of results can be seen graphically in figure 2.10. Each
panel shows the relation between "tongue" pellet positions for schwa and the
full vowels: V1 in the top row and V2 in the bottom row, with a different
pellet represented in each column. The points in the top row represent the
pellet positions for the utterances with the indicated initial vowel (averaged
across five utterances, each with a different final vowel), while the bottom
row shows the average for the five utterances with the indicated final vowel.
The differences between the effects of V1 (top row) and V2 (bottom row) on
schwa can be observed primarily in the systematicity of the relations. The
relation between schwa and V2 is quite systematic - for every pellet, the lines
do not cross in any of the panels of the bottom row - while for V1 (in the top
row), the relationship is only systematic for RY (and somewhat for MY,
where there is some crossing, but large effects).
Turning now to the "lip" reference points, regression results for these
points are found in the left-hand column of table 2.1. The best prediction
again involves all three terms, but here, in every case except RX, the best two-
term prediction does substantially worse. Thus V1, which had relatively little
impact at the "tongue" point, does contribute to the schwa position at this
earlier "lip" point. In fact, for these three pellets, the second-best prediction
combination always involves V1 (with either V2 or k as the second term).
This pattern of results can be confirmed graphically in figure 2.11. Compar-
ing the V1 effects sketched in the top row of panels in figures 2.10 and 2.11,

Table 2.1 Regression results for X-ray data

              "Lip" reference point                "Tongue" reference point
      Terms            Standard error        Terms            Standard error

MX    k + v1 + v2      4.6              MX   k + v2 + v1      4.5
      k + v1           5.5                   k + v2           4.6
      v1 + v2          5.6                   k                6.2
      k                6.3                   v2 + v1          6.4
      v1               8.6                   v2              10.8
MY    k + v1 + v2      4.2              MY   k + v1 + v2      4.7
      k + v1           5.0                   k + v1           6.4
      v1 + v2          7.6                   k                8.7
      k               10.3                   v1 + v2          9.1
      v1              11.9                   v1              15.1
RX    k + v2 + v1      3.8              RX   k + v2 + v1      4.0
      k + v2           3.9                   k + v2           4.0
      k                4.8                   k                5.8
      v1 + v2          7.0                   v2 + v1          7.7
      v1              11.7                   v2              10.6
RY    k + v2 + v1      4.1              RY   k + v2 + v1      4.5
      v2 + v1          4.9                   k + v2           4.6
      k + v2           5.1                   v2 + v1          5.3
      k                6.8                   v2               6.2
      v2               7.0                   k                7.1

notice that, although the differences are small, there is more spread between
the schwa pellets at the "lip" point (figure 2.11) than at the "tongue" point
(figure 2.10). This indicates that the schwa pellet was more affected by V1 at
the "lip" point. There is also somewhat less cross-over for MX and RX in the
"lip" figure, indicating increased systematicity of the VI effect.
In summary, it appears that the tongue position associated with medial
schwa cannot be treated simply as an intermediate point on a direct tongue
trajectory from V1 to V2. Instead, there is evidence that this V1-V2
trajectory is warped by an independent schwa component. The importance of
this warping can be seen, in particular, in utterances where V1 and V2 are
identical (or have identical values on a particular pellet dimension). For
example, returning to the utterance /pipə'pipə/ in figure 2.4, we can clearly
see (in MX, MY, and RX) that there is definitely movement of the tongue
away from the position for /i/ between V1 and V2. This effect is most
pronounced for /i/. For example, for MY, the prediction error for the

Figure 2.10 Relation between full vowel pellet positions and "tongue" pellet positions for schwa (panels: MX, MY, RX, RY). The top row displays the pellet positions for utterances with the indicated initial vowels, averaged across five utterances (each with a different final vowel). The bottom row displays the averaged pellet positions for utterances with the indicated final vowels. Units are X-ray units (= 0.33 mm).

equation without a constant is worse for /pipə'pipə/ than for any other
utterance (followed closely by utterances combining /i/ and /u/; MY is very
similar for /i/ and /u/). Yet, it may be inappropriate to consider this warping
to be the result of a target specific to schwa, since, as we saw earlier, the mean
tongue position for schwa is indistinguishable from the mean position of the
tongue across all vowels. Rather the schwa seems to involve a warping of the
trajectory toward an overall average or neutral tongue position. Finally, we
saw that V1 and V2 affect schwa position differentially at two points in time.
The influence of the V1 endpoint is strong and consistent at the "lip" point,
relatively early in the schwa, while V2 influence is strong throughout. In the
next section, we propose a particular model of gestural structure for these
utterances, and show that it can account for the various patterns that we
have observed.

2.3 Analysis of simulations


Within the linguistic gestural model of Browman and Goldstein (1990), we
expect to be able to model the schwa effects we have observed as resulting
from a structure in which there is an active gesture for the medial schwa, but
Figure 2.11 Relation between full vowel pellet positions and "lip" pellet positions for schwa (panels: MX, MY, RX, RY). The top row displays the pellet positions for utterances with the indicated initial vowels, averaged across five utterances (each with a different final vowel). The bottom row displays the averaged pellet positions for utterances with the indicated final vowels. Units are X-ray units (= 0.33 mm).

complete temporal overlap of this gesture and the gesture for the following
vowel. The blending caused by this overlap should yield the V2 effect on
schwa, while the V1 effects should emerge as a passive consequence of the
differing initial conditions for movements out of different preceding vowels.
An example of this type of organization is shown in figure 2.12, which is
the gestural score we hypothesized for the utterance /pipə'papə/. As in figure
2.3, each box indicates the activation interval of a particular gestural control
regime, that is, an interval of time during which the behavior of the particular
tract variable is controlled by a second-order dynamical system with a fixed
"target" (equilibrium position), frequency, and damping. The height of the
box represents the tract-variable "target." Four LA closure-and-release
gestures are shown, corresponding to the four consonants. The closure-and-
release components of these gestures are shown as separate boxes, with the
closure components having the smaller target for LA, i.e., smaller interlip
distance. In addition, four tongue-body gestures are shown, one for each of
the vowels - V1, schwa, V2, schwa. Each of these gestures involves
simultaneous activation of two tongue-body tract variables, one for constric-
tion location and one for constriction degree. The control regimes for the V1
and medial schwa gestures are contiguous and nonoverlapping, whereas the
V2 gesture begins at the same point as the medial schwa and thus completely

Figure 2.12 Gestural score for /pipə'papə/. Tract variable channels displayed, from top to bottom, are: velum, tongue-tip constriction location and constriction degree, tongue-body constriction location and constriction degree, lip aperture, lip protrusion, and glottis. Horizontal extent of each box indicates duration of gestural activation; the shaded boxes indicate activation for schwa. For constriction-degree tract variables (VEL, TTCD, TBCD, LA, GLO), the higher the top of the box, the greater the amount of opening (aperture). The constriction-location tract variables (TTCL, TBCL) are defined in terms of angular position along the curved vocal tract surface. The higher the top of the box, the greater the angle, and the further back and down (towards the pharynx) the constriction.

overlaps it. In other words, during the acoustic realization of the schwa
(approximately), the schwa and V2 gestural control regimes both control the
tongue movements; the schwa relinquishes active control during the follow-
ing consonant, leaving only the V2 tongue gesture active in the next syllable.
While the postulation of an explicit schwa gesture overlapped by V2 was
motivated by the particular results of section 2.2, the general layout of
gestures in these utterances (their durations and overlap) was based on
stiffness and phasing principles embodied in the linguistic model (Browman
and Goldstein, 1990).
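The hypothesized organization can be summarized as a schematic score; all activation times and tract-variable labels in the sketch below are invented placeholders, not the model's actual values, but they show how the schwa and V2 regimes relate.

```python
# All times (ms) and labels below are invented placeholders.
score = [
    # (tract variable(s),  gesture,       onset, offset)
    ("LA",        "closure /p/ 1",   40, 110),
    ("LA",        "closure /p/ 2",  260, 330),
    ("LA",        "closure /p/ 3",  480, 550),
    ("LA",        "closure /p/ 4",  700, 770),
    ("TBCL+TBCD", "V1 /i/",         100, 330),
    ("TBCL+TBCD", "schwa",          330, 480),   # contiguous with V1, no overlap
    ("TBCL+TBCD", "V2 /a/",         330, 700),   # starts with the schwa, overlaps it fully
    ("TBCL+TBCD", "final schwa",    700, 850),
]
for tract_variable, gesture, onset, offset in score:
    print(f"{tract_variable:10s} {gesture:15s} {onset:4d}-{offset:4d} ms")
```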
Gestural scores for each of the twenty-five utterances were produced. The
activation intervals were identical in all cases; the scores differed only in the
TBCL and TBCD target parameters for the different vowels. Targets used
for the full vowels were those in our tract-variable dictionary. For the schwa,
the target values (for TBCL and TBCD) were calculated as the mean of the
targets for the five full vowels. The gestural scores were input to the task-dynamic
model (Saltzman 1986), producing motions of the model articulators of the
articulatory synthesizer (see figure 2.1).

Figure 2.13 Gestural score for /pipə'pipə/. Generated movements (curves) are shown for the tongue center and lower lip. The higher the curve, the higher (vertical) or more fronted (horizontal) the corresponding movement. Boxes indicate gestural activation; the shaded boxes indicate activation for schwa. CX is superimposed on TBCL, CY on TBCD, lower lip on LA. Note that the boxes indicate the degree of opening and angular position of the constriction (as described in figure 2.12), rather than the vertical and horizontal displacement of articulators, as shown in the curves.

For example, for utterance
/pipə'pipə/, figure 2.13 shows the resulting motions (with respect to a fixed
reference on the head) of two of the articulators - the center of the tongue-
body circle (C), and the lower lip, superimposed on the gestural score.
Motion of the tongue is shown in both horizontal and vertical dimensions,
while only vertical motion of the lower lip is shown. Note that the lower lip
moves up for lip closure (during the regimes with the small LA value). Figure
2.14 shows the results for /pipə'papə/.
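The schwa target computation is simple enough to show directly; the full-vowel target numbers below are invented placeholders, not the values in the Haskins tract-variable dictionary.

```python
# Schwa TBCL/TBCD targets as the mean of the five full-vowel targets
# (all numbers are hypothetical stand-ins for the dictionary values).
full_vowel_targets = {
    "i": {"TBCL": 95.0,  "TBCD": 0.30},
    "e": {"TBCL": 100.0, "TBCD": 0.60},
    "a": {"TBCL": 120.0, "TBCD": 1.10},
    "A": {"TBCL": 115.0, "TBCD": 0.90},
    "u": {"TBCL": 85.0,  "TBCD": 0.40},
}
schwa_target = {
    tv: sum(v[tv] for v in full_vowel_targets.values()) / len(full_vowel_targets)
    for tv in ("TBCL", "TBCD")
}
print(schwa_target)   # roughly {'TBCL': 103.0, 'TBCD': 0.66}
```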
The articulator motions in the simulations can be compared to those of the
data in the previous section (figures 2.4 and 2.5). One difference between the
model and the data stems from the fact that the major portion of the tongue
dorsum is modeled as an arc of a circle, and therefore all points on this part of
the dorsum move together. Thus, it is not possible to model the differential
patterns of motion exhibited by the middle (M) and rear (R) of the dorsum in
the X-ray data. In general, the motion of CX is qualitatively similar to both
MX and RX (which, recall, are highly correlated). For example, both the
data and the simulation show a small backward movement for the schwa in
/pipə'pipə/; in /pipə'papə/, both show a larger backwards movement for
schwa, with the target for /a/ reached shortly thereafter, early in the acoustic
realization of V2. The motion of CY in the simulations tends to be similar to
that of MY in the data. For example, in /pipə'papə/, CY moves down from

Figure 2.14 Gestural score for /pipə'papə/. Generated movements (curves) are shown for the tongue center and lower lip. The higher the curve, the higher (vertical) or more fronted (horizontal) the corresponding movement. Boxes indicate gestural activation; the shaded boxes indicate activation for schwa. Superimposition of boxes and curves as in figure 2.13.

/i/ to schwa to /a/, and the target for /a/ tends to be achieved relatively late,
compared to CX. Movements corresponding to RY motions are not found in
the displacement of the tongue-body circle, but would probably be reflected
by a point on the part of the model tongue's surface that is further back than
that section lying on the arc of a circle.
The model articulator motions were analyzed in the same manner as the X-
ray data, once the time points for measurement were determined. Since for
the X-ray data we assumed that displacement extrema indicated the effective
target for the gesture, we chose the effective targets in the simulated data as
the points to measure. Thus, points during V1 and V2 were chosen that
corresponded to the point at which the vowel gestures (approximately)
reached their targets and were turned off (right-hand edges of the tongue
boxes in figures 2.13 and 2.14). For schwa, the "tongue" reference point was
chosen at the point where the schwa gesture was turned off, while the "lip"
reference was chosen at the lowest point of the lip during schwa (the same
criterion as for the X-ray data).
The distribution of the model full vowels in the mid-sagittal plane (CX x
CY) is shown in figure 2.15. Since the vowel gestures are turned off only after
they come very close to their targets, there is very little variation across the
ten tokens of each vowel. The distribution of schwa at the "tongue" reference
point is shown in figure 2.16, labeled by the identity of V2 (in a) and V1 (in b),
with the full vowel ellipses added for comparison. At this reference point that
occurs relatively late, the vowels are clustered almost completely by V2, and
the tongue center has moved a substantial portion of the way towards the

Figure 2.15 Tongue-center (C) positions for model full vowels, displayed in mid-sagittal plane with head facing to the right. The ellipses indicate two standard deviations along axes determined by principal-component analysis. Symbols I = IPA /i/, U = /u/, E = /e/, X = /A/, and A = /a/. Units are ASY units (= 0.09 mm), that is, units in the vocal tract model, measured with respect to the fixed structures.

following full vowel. The distribution of schwa values at the "lip" reference
point is shown in figure 2.17, labeled by the identity of V2 (in a), and of V1 (in
b). Comparing figure 2.17(a) with figure 2.16(a), we can see that there is
considerably more scatter at the "lip" point than at the later "tongue" point.
We tested whether the simulations captured the regularities of the X-ray
data by running the same set of regression analyses on the simulations as
were performed on the X-ray data. The results are shown in table 2.2, which
has the same format as the X-ray data results in table 2.1. Similar patterns
are found for the simulations as for the data. At the "tongue" reference
point, for both CX and CY the best two-term prediction involves the schwa
component (constant) and V2, and this prediction is nearly as good as that
using all three terms. Recall that this was the case for all pellet dimensions
except for MY, whose differences were attributed to differences in the time
point at which this dimension was measured. (In the simulations, CX and CY
were always measured at the same point in time.) These results can be seen
graphically in figure 2.18, where the top row of panels shows the relation
between V1 and schwa, and the bottom row shows the relation between V2

Figure 2.16 Tongue-center (C) positions for model schwa at "tongue" reference points, displayed in right-facing mid-sagittal plane as in figure 2.15. The ellipses are from the model full vowels (figure 2.15), for comparison. Symbols I = IPA /i/, U = /u/, E = /e/, X = /A/, and A = /a/. Units are ASY units (= 0.09 mm). (a) Model schwa positions labeled by the identity of the following vowel (V2). (b) Model schwa positions labeled by the identity of the preceding vowel (V1).

Figure 2.17 Tongue-center (C) positions for model schwa at "lip" reference points, displayed as in figure 2.16 (including ellipses from figure 2.15). Symbols I = IPA /i/, U = /u/, E = /e/, X = /A/, and A = /a/. Units are ASY units (= 0.09 mm). (a) Model schwa positions labeled by the identity of the following vowel (V2). (b) Model schwa positions labeled by the identity of the preceding vowel (V1).

Table 2.2 Regression results of simulations

              "Lip" reference point                "Tongue" reference point
      Terms            Standard error        Terms            Standard error

CX    k + v1 + v2      7.5              CX   k + v2 + v1      5.2
      v1 + v2          9.1                   k + v2           5.6
      k + v1          20.4                   v2 + v1         13.2
      k               29.2                   v2              20.0
      v1              33.3                   k               28.6
CY    k + v1 + v2      4.7              CY   k + v2 + v1      3.2
      v1 + v2         13.5                   k + v2           3.9
      k + v1          19.7                   v2 + v1         18.8
      k               29.1                   v2              28.0
      v1              41.6                   k               30.4

and schwa. The same systematic relation between schwa and V2 can be seen
in the bottom row as in figure 2.10 for the X-ray data, that is, no crossover.
(The lack of systematic relations between V1 and schwa in the X-ray data,
indicated by the cross-overs in the top row of figure 2.10, is captured in the
simulations in figure 2.18 by the lack of variation for the schwa in the top
row.) Thus, the simulations capture the major statistical relation between the
schwa and the surrounding full vowels at the "tongue" reference point,
although the patterns are more extreme in the simulations than in the
data.
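
To make the form of these regression analyses concrete, here is a minimal sketch of the model comparison summarized in tables 2.1 and 2.2. It is an assumed reconstruction, not the authors' code: it takes the V1 and V2 terms to be the tongue-center positions measured during the flanking full vowels, predicts the position measured during the schwa, and reports the standard error of estimate for each combination of terms; the numbers are invented purely for illustration.

```python
# A hedged sketch of the regression comparison, not the authors' analysis.
# Each row is one token: tongue-center CX during V1, during V2, and during schwa.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

tokens = pd.DataFrame({
    "v1":    [1010.0,  955.0,  980.0, 1040.0,  995.0, 1025.0,  960.0, 1000.0],
    "v2":    [ 950.0, 1030.0, 1005.0,  970.0, 1020.0,  985.0, 1045.0,  990.0],
    "schwa": [ 985.0, 1005.0,  998.0, 1002.0, 1008.0, 1000.0, 1015.0,  995.0],
})

models = {
    "k + v1 + v2": "schwa ~ v1 + v2",      # constant (schwa component) plus both vowels
    "k + v2":      "schwa ~ v2",           # constant plus following vowel only
    "v1 + v2":     "schwa ~ v1 + v2 - 1",  # no constant: schwa purely transitional
    "k":           "schwa ~ 1",            # constant only: context-free schwa position
}
for name, formula in models.items():
    fit = smf.ols(formula, data=tokens).fit()
    se = np.sqrt(fit.mse_resid)            # standard error of estimate
    print(f"{name:12s}  SE = {se:6.2f}")
```
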
At the earlier "lip" reference point, the simulations also capture the
patterns shown by the data. For both CX and CY, the three-term predictions
in table 2.2 show substantially less error than the best two-term prediction.
This was also the case for the data in table 2.1 (except for RX), where V1, V2
and a schwa component (constant) all contributed to the prediction of the
value during schwa. This can also be seen in the graphs in figure 2.19, which
shows a systematic relationship with schwa for both V1 and V2.
In summary, for the simulations, just as for the X-ray data, V1 contributed
to the pellet position at the "lip" reference, but not to the pellet position at
the "tongue" point, while V2 and an independent schwa component contri-
buted at both points. Thus, our hypothesized gestural structure accounts for
the major regularities observed in the data (although not for all aspects of the
data, such as its noisiness or differential behavior among pellets). The
gestural-control regime for V2 begins simultaneously with that for schwa and
overlaps it throughout its active interval. This accounts for the fact that V2
and schwa effects can be observed throughout the schwa, as both gestures

Figure 2.18 Relation between model full vowel tongue-center positions and tongue-center
positions at "tongue" reference point for model schwas. The top row displays the tongue-center
positions for utterances with the indicated initial vowels, averaged across five utterances (each
with a different final vowel). The bottom row displays the averaged tongue-center positions for
utterances with the indicated final vowels. Units are ASY units (= 0.09 mm)

unfold together. However, V1 effects are passive consequences of the initial


conditions when the schwa and V2 gestures are "turned on," and thus, their
effects disappear as the tongue position is attracted to the "target" (equili-
brium position) associated with the schwa and V2 regimes.
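
The role of V1 as a mere initial condition can be illustrated with a toy calculation (my own simplification, not the task-dynamic model itself): a single tongue-center coordinate is attracted, first-order fashion, to a blend of the schwa and V2 targets, starting from wherever V1 left it. Early in the schwa the V1-dependent starting positions still differ; by the end of the schwa interval they have all but converged.

```python
# A toy first-order illustration (not the task-dynamic model) of carry-over as
# an initial condition: the V1-dependent starting position decays away once
# the overlapping schwa and V2 regimes are turned on.  All values are invented.
import numpy as np

def schwa_interval(v1_position, schwa_target=1000.0, v2_target=1030.0,
                   rate=30.0, dt=0.001, duration=0.12, blend=0.5):
    target = blend * schwa_target + (1.0 - blend) * v2_target
    x = v1_position                      # where the V1 gesture left the tongue
    trace = []
    for _ in range(int(duration / dt)):
        x += rate * (target - x) * dt    # attraction toward the blended target
        trace.append(x)
    return np.array(trace)

traces = {v1: schwa_interval(pos) for v1, pos in [("i", 1040.0), ("a", 950.0)]}
early_difference = abs(traces["i"][10] - traces["a"][10])    # near schwa onset
late_difference = abs(traces["i"][-1] - traces["a"][-1])     # near schwa offset
print(f"V1 effect early in the schwa: {early_difference:5.1f}")
print(f"V1 effect late in the schwa:  {late_difference:5.1f}")
```
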

2.3.1 Other simulations


While the X-ray data from the subject analyzed here argue against the
strongest form of the hypothesis that schwa has no tongue target, we decided
nevertheless to perform two sets of simulations incorporating the strong
form of an "unspecified" schwa to see exactly where and how they would fail
to reproduce the subject's data. In addition, if the synthesized speech were
found to be correctly perceived by listeners, it would suggest that this
gestural organization is at least a possible one for these utterances, and might
be found for some speakers. In the first set of simulations, one of which is

Figure 2.19 Relation between model full vowel tongue-center positions and tongue-center
positions at "lip" reference point for model schwas. The top row displays the tongue-center
positions for utterances with the indicated initial vowels, averaged across five utterances (each
with a different final vowel). The bottom row displays the averaged tongue-center positions for
utterances with the indicated final vowels. Units are ASY units ( = 0.09 mm)

exemplified in figure 2.20, the gestural scores took the same form as in figure
2.12, except that the schwa tongue-body gestures were removed. Thus, active
control of V2 began at the end of V1, and, without a schwa gesture, the
tongue trajectory moved directly from V1 to V2. During the acoustic interval
corresponding to schwa, the tongue moved along this V1-V2 trajectory. The
resulting simulations in most cases showed a good qualitative fit to the data,
and produced utterances whose medial vowels were perceived as schwas. The
problems arose in utterances in which V1 and V2 were the same (particularly
when they were high vowels). Figure 2.20 portrays the simulation for
/pipə'pipə/: the motion variables generated can be compared with the data in
figure 2.4. The "dip" between V1 and V2 was not produced in the simulation,
and, in addition, the medial vowel sounded like /i/ rather than schwa. This
organization does not, then, seem possible for utterances where both V1 and
V2 are high vowels.
We investigated the worst utterance (/pipə'pipə/) from the above set of


Figure 2.20 Gestural score plus generated movements for /pip_'pip_/, with no activations for
schwa. The acoustic interval between the second and third bilabial gestures is perceived as an /i/.
Generated movements (curves) are shown for the tongue center and lower lip. The higher the
curve, the higher (vertical) or more fronted (horizontal) the corresponding movement. Super-
imposition of boxes and curves as in figure 2.13

simulations further, generating a shorter acoustic interval for the second


vowel (the putative schwa) by decreasing the interval (relative phasing)
between the bilabial gestures on either side of it. An example of a score with
the bilabial closure gestures closer together is shown in figure 2.21. At
relatively short durations as in the figure (roughly < 50 msec), the percept of
the second vowel changed from /i/ to schwa. Thus, the completely targetless
organization may be workable in cases where the surrounding consonants
are only slightly separated. In fact, this suggests a possible historical source
for epenthetic schwa vowels that break up heterosyllabic clusters. They could
arise from speakers increasing the distance between the cluster consonants
slightly, until they no longer overlap. At that point, our simulations suggest
that the resulting structure would be perceived as including a schwa-like
vowel.
The second set of simulations involving an "unspecified" schwa used the
same gestural organization as that portrayed in the score in figure 2.20,
except that the V2 gesture was delayed so that it did not begin right at the
offset of the V1 gesture. Rather, the V2 regime began approximately at the
beginning of the third bilabial-closure gesture, as in figure 2.22. Thus, there
was an interval of time during which no tongue-body gesture was active, that
is, during which there was no active control of the tongue-body tract
variables. The motion of the tongue-body center during this interval, then,
was determined solely by the neutral positions, relative to the jaw, associated
with the tongue-body articulators, and by the motion of the jaw, which was
implicated in the ongoing bilabial closure and release gestures. The results,

Figure 2.21 The same gestural score for /pip_'pip_/ as in figure 2.20, but with the second and
third bilabial gestures closer together than in figure 2.20. The acoustic interval between the
second and third bilabial gestures is perceived as a schwa. Generated movements (curves) are
shown for the tongue center and lower lip. The higher the curve, the higher (vertical) or more
fronted (horizontal) the corresponding movement. Superimposition of boxes and curves as in
figure 2.13

displayed in figure 2.22, showed that the previous problem with /pipə'pipə/
was solved, since during the unspecified interval between the two full vowels,
the tongue-body lowered (from /i/ position) and produced a perceptible
schwa. Unfortunately, this "dip" between V1 and V2 was seen for all
combinations of V1 and V2, which was not the case in the X-ray data. For
example, this dip can be seen for /papə'papə/ in figure 2.23; in the X-ray
data, however, the tongue raised slightly during the schwa, rather than
lowering. (The "dip" occurred in all the simulations because the neutral
position contributing to the tongue-body movement was that of the tongue-
body articulators rather than that of the tongue-body tract variables;
consequently the dip was relative to the jaw, which, in turn, was lowering as
part of the labial release). In addition, because the onset for V2 was so late, it
would not be possible for V2 to affect the schwa at the "lip" reference point,
as was observed in the X-ray data. Thus, this hypothesis also failed to
capture important aspects of the data. The best hypothesis remains the one
tested first - where schwa has a target of sorts, but is still "colorless," in that
its target is the mean of all the vowels, and is completely overlapped by the
following vowel.

2.4 Conclusion
We have demonstrated how an explicit gestural model of phonetic structure,
embodying the possibilities of underspecification ("targetlessness") and


Figure 2.22 The same gestural score for /pip_'pip_/ as in figure 2.20, but with the onset of the
second full vowel /i/ delayed. The acoustic interval between the second and third bilabial
gestures is perceived as a schwa. Generated movements (curves) are shown for the tongue center
and lower lip. The higher the curve, the higher (vertical) or more fronted (horizontal) the
corresponding movement. Superimposition of boxes and curves as in figure 2.13


Figure 2.23 The same gestural score as in figure 2.22, except with tongue targets appropriate for
the utterance /pap_'pap_/. The acoustic interval between the second and third bilabial gestures is
perceived as a schwa. Generated movements (curves) are shown for the tongue center and lower
lip. The higher the curve, the higher (vertical) or more fronted (horizontal) the corresponding
movement. Superimposition of boxes and curves as in figure 2.13

temporal overlap ("coproduction"), can be used to investigate the contextual


variation of phonetic units, such as schwa, in speech. For the particular
speaker and utterances that we analyzed, there was clearly some warping of
the V1-V2 trajectory towards a neutral position for an intervening schwa.
The analyses showed that this neutral position has to be defined in the space
of tract variables (the linguistically relevant goal space), rather than being the
consequence of neutral positions for individual articulators. Therefore, a
target position for schwa was specified, although this target is completely
predictable from the rest of the system; it corresponds to the mean tongue
tract-variable position for all the full vowels.
The temporally overlapping structure of the gestural score played a key
role in accounting for the time course of V1 and V2 effects on schwa. These
effects were well modeled by a gestural score in which active control for
schwa was completely overlapped by that for V2. This overlap gave rise to
the observed anticipatory effects, while the carry-over effects were passive
consequences of the initial conditions of the articulators when schwa and V2
begin. (This fits well with studies that have shown qualitative asymmetries in
the nature of carry-over and anticipatory effects [see Recasens 1987].)
How well the details of the gestural score will generalize to other speakers
and other prosodic contexts remains to be investigated. There is known to be
much individual variation in the strength of anticipatory vs. carry-over
coarticulation in utterances like those employed here, and also in the effect of
stress (Fowler 1981a; Magen 1989). In addition, reduced vowels with
different phonological/morphological characteristics, as in the plural (e.g.
"roses") and past tense (e.g. "budded") may show different behavior, either
with respect to overlap or targetlessness. The kind of modeling developed
here provides a way of analyzing the complex quantitative data of articula-
tion so that phonological issues such as these can be addressed.

Comments on Chapter 2
SARAH HAWKINS
The speaker's task is traditionally conceptualized as one of producing
successive articulatory or acoustic targets, with the transitions between them
being planned as part of the production process.* A major goal of studies of
coarticulation is then to identify the factors that allow or prevent coarticula-
tory spread of features, and so influence whether or not targets are reached.
In contrast, Browman and Goldstein offer a model of phonology that is
couched in gestural terms, where gestures are abstractions rather than
movement trajectories. In their model, coarticulation is the inevitable

*The structure of this discussion is influenced by the fact that it originally formed part of a joint
commentary covering this paper and the paper by Hewlett and Shockey. Since the latter's paper
was subsequently considerably revised, mention of it has been removed and a separate
discussion prepared.

consequence of coproduction of articulatory gestures. Coarticulation is
planned only in the sense that the gestural score is planned, and traditional
notions of target modification, intertarget smoothing, and look-ahead pro-
cesses are irrelevant as explanations, although the observed properties they
are intended to explain are still, of course, of central concern.
Similarly, coarticulation is traditionally seen as a task of balancing
constraints imposed by the motoric system and the perceptual system - of
balancing ease of articulation with the listener's need for acoustic clarity.
These two opposing needs must be balanced within constraints imposed by a
third factor, the phonology of the particular language. Work on coarticula-
tion often tries to distinguish these three types of constraint.
For me, one of the exciting things about Browman and Goldstein's work is
that they are being so successful in linking, as opposed to separating,
motoric, perceptual, and phonological constraints. In their approach, the
motoric constraints are all accounted for by the characteristics of the task-
dynamic model. But the task-dynamic model is much more than an expres-
sion of universal biomechanical constraints. Crucially, the task-dynamic
model also organizes the coordinative structures. These are flexible, func-
tional groupings of articulators whose organization is not an inevitable
process of maturation, but must be learned by every child. Coordinative
structures involve universal properties and probably some language-specific
properties. Although Browman and Goldstein assign all language-specific
information to the gestural score, I suspect that the sort of things that are
hard to unlearn, like native accent and perhaps articulatory setting, may be
better modeled as part of the coordinative structures within the task
dynamics. Thus the phonological constraints reside primarily in the gestural
score, but also in its implementation in the task-dynamic model.
Browman and Goldstein are less explicitly concerned with modeling
perceptual constraints than phonological and motoric ones, but they are, of
course, concerned with what the output of their system sounds like. Hence
perceptual constraints dictate much of the organization of the gestural score.
The limits set on the temporal relationships between components of the
gestural score for any given utterance represent in part the perceptual
constraints. Variation in temporal overlap of gestures within these limits will
affect how the speech sounds. But the amount of variation possible in the
gestural score must also be governed by the properties and limits on
performance of the parameters in the task-dynamic model, for it is the task-
dynamic model that limits the rate at which each gesture can be realized. So
the perceptual system and the task-dynamic model can be regarded as in
principle imposing limits on possible choices in temporal variation, as
represented in the gestural score. (In practice, these limits are determined
from measurement of movement data.) Greater overlap will result in greater


measurable coarticulation; too little or too much overlap might sound like
some dysarthric or hearing-impaired speakers. Browman and Goldstein's
work on schwa is a good demonstration of the importance of appropriate
temporal alignment of gestures. It also demonstrates the importance to
acceptable speech production of getting the right relationships between the
gestural targets and their temporal coordination.
Thus Browman and Goldstein offer a model in which perception and
production, and universal and language-specific aspects of the phonology,
are conceptually distinguishable yet interwoven in practice. This, to my way
of thinking, is as it should be.
The crucial issue in work on coarticulation, however, is not so much to say
what constraints affect which processes, as to consider what the controlled
variables are. Browman and Goldstein model the most fundamental con-
trolled variables: tongue constriction, lip aperture, velar constriction, and so
on. There are likely to be others. Some, like fundamental frequency, are not
strongly associated with coarticulation but are basic to phonology and
phonetics, and some, like aerodynamic variables, are very complex.
Let us consider an example from aerodynamics. Westbury (1983) has
shown allophonic differences in voiced stops that depend on position in
utterance and that all achieve cavity enlargement to maintain voicing. The
details of what happens vary widely and depend upon the place of articula-
tion of the stop, and its phonetic context. For example, for initial /b/, the
larynx is lowered, the tongue root moves backwards, and the tongue dorsum
and tip both move down. For final /b/, the larynx height does not change, the
tongue root moves forward, and the dorsum and tip move slightly upwards.
In addition, the rate of cavity enlargement, and the time function, also vary
between contexts. Does it make sense to try to include these differences? If
the task-dynamic system is primarily universal, then details of the sort
Westbury has shown are likely to be in the gestural score. But to include them
would make the score very complicated. Do we want that much detail in the
phonology, and if so, how should it be included? Browman and Goldstein
have elsewhere (1990) suggested a tiered system, and if that solution is
pursued, we could lose much of the distinction between phonetics and
phonology. While I can see many advantages in losing that distinction, we
could, on the other hand, end up with a gestural score of such detail that
some of the things phonologists want to do might become undesirably
clumsy. The description of phonological alternations is a case in point.
So to incorporate these extra details, we will need to consider the structure
and function of the gestural score very carefully. This will include consider-
ation of whether the gestural score really is the phonology-phonetics, or
whether it is the interface between them. In other words, do we see in the
gestural score the phonological primitives, or their output? Browman and
Goldstein say it is the former. I believe they are right to stick to their strong
hypothesis now, even though it may need to be modified later.
Another issue that interests me in Browman and Goldstein's model is
variability. As they note, the values of schwa that they produce are much less
variable than in real speech. There are a number of ways that variability
could be introduced. One, for schwa in particular, is that its target should not
be the simple average of all the vowels in the language, as Browman and
Goldstein suggest, but rather a weighted average, with higher weighting
given to the immediately preceding speech. How long this preceding domain
might be I do not know, but its length may depend on the variety of the
preceding articulations. Since schwa is schwa basically because it is centra-
lized relative to its context, schwa following a lot of high articulations could
be different from schwa in the same immediate context but following a
mixture of low and high articulations.
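
One way to make that suggestion concrete is an exponentially weighted mean over the targets of the preceding articulations, so that recent material counts for more than distant material. The fragment below is only my sketch of the idea; the decay constant and target values are invented.

```python
# A sketch of a context-weighted schwa target (invented values, hypothetical
# weighting): an exponentially weighted mean of the tongue targets of the
# preceding articulations, most recent first.
def weighted_schwa_target(preceding_targets, decay=0.5):
    weights = [decay ** i for i in range(len(preceding_targets))]
    return sum(w * t for w, t in zip(weights, preceding_targets)) / sum(weights)

high_context = [1040.0, 1035.0, 1030.0]    # several preceding high articulations
mixed_context = [1040.0, 955.0, 1030.0]    # preceding mixture of high and low
print(weighted_schwa_target(high_context))   # schwa target pulled upward
print(weighted_schwa_target(mixed_context))  # noticeably lower schwa target
```
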
A second possibility, not specific to schwa, is to introduce errors. The
model will ultimately need a process that generates errors in order to produce
real-speech phenomena like spoonerisms. Perhaps the same type of system
could produce articulatory slop, although I think this is rather unlikely.
If the variability we are seeking for schwa is a type of articulatory slop, it
could also be produced by variability in the temporal domain. In Browman
and Goldstein's terms, the phase relations between gestures may be less
tightly tied together than at present.
A fourth possibility is that the targets in the gestural score could be less
precisely specified. Some notion of acceptable range might add the desired
variability. This idea is like Keating's (1988a) windows, except that her
windows determine an articulator trajectory, whereas Browman and Gold-
stein's targets are realized via the task-dynamic model, which adds its own
characteristics.
Let me finish by saying that one of the nice things about Browman and
Goldstein's work is how much it tells us that we know already. Finding out
what we already know is something researchers usually hope to avoid. But in
this case we "know" a great number of facts of acoustics, movement, and
phonology, but we do not know how they fit together. Browman and
Goldstein's observations on intrusive schwa, for example, fit with my own on
children's speech (Hawkins 1984: 345). To provide links between disparate
observations seems to me to achieve a degree of insight that we sorely need in
this field.


Comments on Chapter 2
JOHN KINGSTON
Introduction
Models are valued more for what they predict, particularly what they predict
not to occur, than what they describe. While the capacity of Browman and
Goldstein's gestural model to describe articulatory events has been demon-
strated in a variety of papers (see Browman et al. 1984; Browman and
Goldstein 1985, 1986, 1990; Browman et al. 1986), and there is every reason
to hope that it will continue to achieve descriptive success, I am less sanguine
about its predictive potential. The foundation of my pessimism is that
gestural scores are not thus far constructed in terms of independent princi-
ples which would motivate some patterns of gestural occurrence and
coordination, while excluding others.
"Independent principles" are either such as constrain nonspeech and
speech movement alike, or such as arise from the listener's demands on the
speaker. That such principles originate outside the narrowly construed events
of speaking themselves guards models built on them from being hamstrung
by the ad hoc peculiarities of speech movements. The scores' content is
constrained by the limited repertoire of gestures used, but because gestures'
magnitude may be reduced in casual speech, even to the point of deletion
(Browman and Goldstein 1990), the variety of gestures in actual scores is
indefinitely large. Further constraints on the interpretation of scores come
from the task dynamics, which are governed by principles that constrain
other classes of movements (see Kelso et al. 1980; Nelson 1983; Ostry, Keller,
and Parush 1983; Saltzman and Kelso 1987). The task dynamics rather than
the gestural score also specify which articulatory movements will produce a
particular gesture. The gestural score thus represents the model's articulatory
goals, while the specific paths to these goals are determined by entirely
dynamical means. Gestural coordination is not, however, constrained by the
task dynamics and so must be stipulated, and again the number of possible
patterns is indefinitely large. Despite this indefiniteness in the content and
coordination of scores, examining the articulation of schwa should be
informative about what is in a score and how the gestures are coordinated,
even if in the end Browman and Goldstein's account does not extend beyond
description.
The next two sections of this commentary examine Browman and Gold-
stein's claim that English schwa has an articulation of its own. This
examination is based on an extension of their statistical analysis and leads to
a partial rejection of their claim. In the final section, the distinction between
predictive vs. descriptive models is taken up again.

Table 2.3 Variances for lip and tongue reference
positions

MX MY RX RY

Lip 910.2 2290.7 364.4 1111.2


Tongue 558.6 959.0 325.9 843.4

Does schwa have its own target?


Browman and Goldstein found that the positions of two tongue pellets
(MX-MY and RX-RY) in a schwa between flanking full vowels closely
match the grand mean of pellet positions in the full vowels, implying that
during schwa the tongue simply has whatever position is dictated by the
transition between the full vowels that flank it. Therefore, when flanking
vowels are identical, the tongue should not deviate from the full vowel
positions during schwa. However, Browman and Goldstein's data show the
tongue does move away from these positions and back again during the
schwa, implying it does have its own target.
Giving schwa its own target is supported by the stepwise regression in
which including a constant factor representing effects independent of either
of the flanking vowels yielded a smaller residual variance. Schwa's target
only looks transitional because it is very close to the grand mean of the
tongue positions of all the full vowels. The stepwise regression also revealed
that the tongue position during schwa was determined more by V2 than V1,
perhaps because V2 was more prominent than V1.
Influences on the tongue position for schwa were assessed by comparing
standard errors for multiple-regression models containing different combi-
nations of terms for V1, and V2, and k, the independent schwa factor.
"Standard error" is the standard error of estimate (SE), a measure of the
residual variance not accounted for by the terms in the regression models.
The standard error of estimate is the square root of the residual mean square,
as given in (1) (Cohen and Cohen 1983: 104):

(1)   SE = √[Σ(Y − Ŷ)² / (n − q − 1)] = √[(1 − R²) · Σ(Y − Ȳ)² / (n − q − 1)]

(q is the number of terms in the regression model), which shows that SE's
magnitude is not only a function of the proportion of variance not accounted
for, 1 − R², but also of the overall magnitude of variance in the dependent
measure, Σ(Y − Ȳ)². Since the magnitude of this latter variance will differ
Table 2.4 Shrunken R²s for lip reference positions

k + V1 + V2   k + V1   k + V2   V1 + V2   k   V1   V2

MX 0.879 0.848 0.846 0.818 0.752


MY 0.957 0.945 0.917 0.882 0.864
RX 0.750 0.731 0.517 0.654 0.157
RY 0.912 0.885 0.880 0.839 0.834

Table 2.5 Shrunken R²s for tongue reference positions

k + V1 + V2   k + V1   k + V2   V1 + V2   k   V1   V2

MX 0.806 0.793 0.712 0.709 0.491


MY 0.882 0.832 0.761 0.761 0.586
RX 0.631 0.631 0.406 0.533 0.145
RY 0.872 0.863 0.842 0.778 0.778

between dependent variables, the absolute magnitude of the SEs for models
with different dependent variables cannot be compared. Accordingly, to
evaluate how well the various regression models fare across the pellet
positions, a measure of variance should be used that is independent of the
effect of different variances among the dependent variables, i.e. R², rather
than SE. More to the point, the R²s can be employed in significance tests of
differences between models of the same dependent variable with different
numbers of terms. The equation in (1) can be solved for R², but only if one
knows Σ(Y − Ȳ)² (solving this equation for R² shows that the values listed
by Browman and Goldstein must be the squared standard error of estimate).
This variance was obtained from Browman and Goldstein's figures, which
plot the four tongue coordinates; measurements were to the nearest division
along each axis, and their precision is thus ± 1 mm for MY and RY,
±0.55 mm for RX, and ±0.625 mm for MX (measurement error in either
direction is roughly equal to half a division for each of the pellet coordi-
nates). The variances obtained do differ substantially for the four measures
(see table 2.3), with the variances for vertical position consistently larger than
for horizontal position at both reference points. The resulting shrunken R²s
for the various regression models at lip and tongue reference positions are
shown in tables 2.4 and 2.5 (the gaps in these tables are for regression models
not considered by the stepwise procedure). Shrunken R²s are given in these
tables because they are a better estimate of the proportion of variance
accounted for in the population from which the sample is taken when the ratio
of independent variables q to n is large, as here, and when independent
variables are selected post hoc, as in the stepwise regression. (Shrunken R²s
were calculated according to formula (3.6.4) in Cohen and Cohen (1983:
106-7), in which q was always the total number of independent variables
from which selections were made by the stepwise procedure, i.e. 3.) The
various models were tested for whether adding a term to the equation
significantly increased the variance, assuming Model I error (see Cohen and
Cohen 1983: 145-7). Comparisons were made of k + V1 + V2 with k + V1 or
k + V2 and with V1 + V2.
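
The computational chain just described - recovering R² from the published standard errors via equation (1), shrinking it, and testing the increment due to added terms - can be sketched as follows. This is my own illustration of the procedure rather than Kingston's actual script; the sample size and input values are placeholders, and a real application would take its degrees of freedom from the Model I error scheme described above.

```python
# A sketch of the re-analysis described above, with placeholder numbers rather
# than the real measurements.
from scipy import stats

def r2_from_se(se, ss_total, n, q):
    """Solve equation (1), SE^2 = (1 - R^2) * SS_total / (n - q - 1), for R^2."""
    return 1.0 - (se ** 2) * (n - q - 1) / ss_total

def shrunken_r2(r2, n, q):
    """Adjusted ("shrunken") R^2, cf. Cohen and Cohen (1983), formula (3.6.4)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - q - 1)

def increment_f(r2_full, r2_reduced, n, q_full, q_added):
    """F-test for the increment in R^2 when q_added terms are added to a model."""
    df1, df2 = q_added, n - q_full - 1
    f = ((r2_full - r2_reduced) / df1) / ((1.0 - r2_full) / df2)
    return f, stats.f.sf(f, df1, df2)          # F statistic and its p-value

n, ss_total = 24, 900.0                        # assumed token count and Sum(Y - Ybar)^2
r2_three = r2_from_se(se=2.1, ss_total=ss_total, n=n, q=3)   # e.g. k + V1 + V2
r2_two = r2_from_se(se=2.9, ss_total=ss_total, n=n, q=2)     # e.g. k + V2
# q = 3 in both shrinkage calls, since three candidate predictors were available.
print(shrunken_r2(r2_three, n, q=3), shrunken_r2(r2_two, n, q=3))
print(increment_f(r2_three, r2_two, n, q_full=3, q_added=1))
```
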
The resulting F-statistics confirmed Browman and Goldstein's contention
that adding V1 to the k + V2 model does not increment the variance
significantly at MX, RX, or RY at the tongue reference positions (for
k + V1 + V2 vs. k + V2: MX F(2,19) = 0.637, p > 0.05; RX F(2,19) = 0, p > 0.05;
and RY F(2,19) = 0.668, p > 0.05) and also supports their observation that
both V1 and V2 increment R² substantially for MY at the tongue reference
positions (for k + V1 + V2 vs. k + V1, F(2,19) = 4.025, p < 0.05).
However, their claim that for the lip reference position, the two-term
models k + V1 or k + V2 account for substantially less variance than the
three-term model k + V1 + V2 is not supported, for any dependent variable
(for k + V1 + V2 vs. k + V1: MX F(2,19) = 2.434, p > 0.05 and MY F(2,19) =
2.651, p > 0.05, and for k + V1 + V2 vs. k + V2: RX F(2,19) = 0.722, p > 0.05
and RY F(2,19) = 2.915, p > 0.05). Comparisons of the other two-term
model, V1 + V2, with k + V1 + V2 yielded significant differences for MY
(F(2,19) = 8.837, p < 0.01) and RX (F(2,19) = 8.854, p < 0.01), but for neither
dependent variable was V1 + V2 the second-best model. At MX and RY,
the differences in amount of variance accounted for by V1 + V2 vs. k + V1
(MX) or k + V2 (RY) are very small (less than half of 1 percent in each
case), so choosing the second-best model is impossible. In any case, there
is no significant increment in the variance in the three-term model,
k + V1 + V2, with respect to the V1 + V2 two-term model at MX (F(2,19) =
2.591, p > 0.05) or RY (F(2,19) = 2.483, p > 0.05). Thus at the lip refe-
rence position, schwa does not coarticulate strongly with V2 at MX
or MY, nor does it coarticulate strongly with V1 at RX or RY. There is
evidence for an independent schwa target at MY and RX, but not MX or
RY. Use of R²s rather than SEs to evaluate the regression models has thus
weakened Browman and Goldstein's claims regarding both schwa's having
a target of its own and the extent to which it is coproduced with flanking
vowels.


Are all schwas the same?


Whether schwa has an independent target depends on how much the schwa
factor contributes to R² when the tokens with identical flanking vowels are
omitted. If schwa's target is the grand mean of all the full vowel articulations,
then the schwa factor should contribute substantially less with this omission,
since the tongue should pass through that position on its path between
differing but not identical full vowels. Schwas may be made in more than one
way, however: between unlike vowels, schwa may simply be a transitional
segment, but between like vowels, a return to a more neutral position might
have to be achieved, either by passive recoil if schwas are analogous to the
"trough" observed between segments which require some active articulatory
gesture (Gay 1977, 1978; cf. Boyce 1986) or by means of an active gesture as
Browman and Goldstein argue. (Given the critical damping of gestures in the
task dynamics, one might expect passive recoil to achieve the desired result,
but then why is a specified target needed for schwa?) On the other hand, if the
schwa factor contributes nearly the same amount to R² in regression models
where the identical vowel tokens are set aside, then there is much less need for
multiple mechanisms. Finally, one may ask whether schwas are articulated
with the same gesture when there is no flanking full vowel, on one side or the
other, as in the first schwa of Pamela or the second schwa in Tatamagouchi.
A need more fundamental than looking at novel instances of the phenome-
non is for principles external to the phenomena on which the modeling is
based, which would explain why one gestural score is employed and not
others. I point to where such external, explanatory principles may be found
in the next section of this commentary.

Description vs. explanation


The difficulty I have with the tests of their gestural model that Browman and
Goldstein present is that they stop when an adequate descriptive fit to the
observed articulatory trajectories was obtained. Lacking is a theory which
would predict the particular gestural scores that most closely matched the
observed articulations, on general principles. (The lack of predictive capacity
is, unfortunately, not a problem unique to Browman and Goldstein's model;
for example, we now have excellent descriptive accounts of how downtrends
in Fo are achieved (Pierrehumbert 1980; Liberman and Pierrehumbert 1984;
Pierrehumbert and Beckman 1988), but still very little idea of why down-
trends are achieved with the mechanisms identified, why these mechanisms
are employed in all languages with downtrends, or even why downtrends are
so ubiquitous.) If V2's gesture overlaps more with the preceding schwa than
V1's because it is more prominent, why should prominence have this effect on

gestural overlap? On the other hand, if more overlap is always observed
between the schwa and the following full vowel, why should anticipatory
coarticulation be more extensive than carry-over? And in this case, what does
greater anticipatory coarticulation indicate about the relationship between
the organization of gestures and the trochaic structure of stress feet in
English? All of these are questions that we might expect an explanatory or
predictive theory of gestural coordination to answer.
The gestural theory developed by Browman and Goldstein may have all
the pieces needed to construct a machine that will produce speech, indeed, it
is already able to produce particular speech events, but as yet there is no
general structure into which these pieces may be put which would produce
just those kinds of speech events that do occur and none of those that do not.
Browman and Goldstein's gestural theory is not incapable of incorporating
general principles which would predict just those patterns of coordination
that occur; the nature of such principles is hinted at by Kelso, Saltzman, and
Tuller's (1986a) replication of Stetson's (1951) demonstration of a shift from
a VC to CV pattern of articulatory coordination as rate increased. Kelso,
Saltzman, and Tuller suggest that the shift reflects the greater stability of CV
over VC coordination, but it could just as well be that place and perhaps
other properties of consonants are more reliably perceived in the transition
from C to V than from V to C (see Ohala 1990 and the references cited there,
as well as Kingston 1990 for a different view). If this latter explanation is
correct, then the search for the principles underlying the composition of
gestural scores must look beyond the facts of articulation, to examine the
effect the speaker is trying to convey to the listener and in turn what
articulatory liberties the listener allows the speaker (see Lindblom 1983,
Diehl and Kluender 1989, and Kingston and Diehl forthcoming for more
discussion of this point).

Comments on Chapter 2
WILLIAM BARRY
In connection with Browman and Goldstein's conclusion that schwa is
"weak but not completely targetless," I should like to suggest that they reach
it because their concept of schwa is not totally coherent with the model
within which the phenomenon "neutral vowel" is being examined. The two
"nontarget" simulations that are described represent two definitions:

1 A slot in the temporal structure which is empty with regard to vowel quality,
the vowel quality being determined completely by the preceding and

following vowel targets. This conflicts, in spirit at least, with the basic
concept of a task-dynamic system, which explicitly evokes the physiologi-
cally based "coordinative structures" of motor control (Browman and
Goldstein 1986). A phonologically targetless schwa could still not escape the
residual dynamic forces of the articulatory muscular system, i.e. it would be
subject to the relaxation forces of that system.
2 A relaxation target. The relaxation of the tongue-height parameter in the
second simulation is an implicit recognition of the objection raised in
point 1, but it still clashes with the "coordinative" assumption of articula-
tory control, which argues against one gesture being relaxed independent of
other relevant gestural vowel parameters.

If an overall "relaxation target" is accepted, then, from a myofunctional


perspective there is no means of distinguishing the hypothesized "targetless"
schwa from the schwa-target as defined in the paper. Any muscle in a
functional system can only be accorded a "neutral" or "relaxation" value as
a function of the forces brought to bear on it by other muscles within the
system. These forces will differ with each functional system. The rest position
for quiet respiration (velum lowered, lips together, mandible slightly low-
ered, tongue tip on alveolar ridge) is different from the preparatory position
found prior to any speech act independent of the character of the utterance
onset (lips slightly apart, velum raised, jaw slightly open, laryngeal
adduction).
The relaxation position may, therefore, be seen as a product of the
muscular tensions required by any functional system, and implicit support
for this view is given by Browman and Goldstein's finding that the mean
back and front tongue height for schwa is almost identical with the mean
tongue heights for all the other vowels. In other words, the mean vowel
specifying the schwa "target" used by Browman and Goldstein is identical
with the relaxation position of the vocalic functional system since it reflects
the balance of forces between the muscle-tension targets specified for all the
other vowels within the system. This accords nicely with the accepted
differences in neutral vowel found between languages, and allows a substan-
tive definition of the concept of "basis of articulation" which has been
recognized qualitatively for so long (Franke 1889; Sievers 1901; Jespersen
1904, 1920; Roudet 1910).
This implies that, phonologically, schwa can in fact be regarded as
undefined or "targetless," a status in keeping with its optional realization in
many cases before sonorant consonants, and its lack of function when
produced epenthetically.
One difficult question is the physiological definition and delimitation of a
functional system, as it is mainly the scientific area of inquiry and the level of
descriptive delicacy which defines a function. Since the same muscles are used

for many different functions, a total physiological independence of one
functional system from another using the same muscles cannot be expected.
A critical differentiation within speech, for example, is between possible
vocalic vs. consonantal functional subsystems. It has long been postulated as
descriptively convenient and physiologically supportable that consonantal
gestures are superimposed on an underlying vocalic base (Ohman 1966a;
Perkell 1969; Hardcastle 1976). Browman and Goldstein's gestural score is
certainly in accordance with this view. A resolution of the problem within the
present discussion is not necessary, however, since the bilabial consonantal
context is maximally independent of the vocalic system and is kept constant.

3
Prosodic structure and tempo in
a sonority model of articulatory dynamics

MARY BECKMAN, JAN EDWARDS, and JANET FLETCHER

3.1 Introduction
One of the most difficult facts about speech to model is that it unfolds in
time.* The phonological structure of an utterance can be represented in
terms of a timeless organization of categorical properties and entities -
phonemes in sequence, syllables grouped into stress feet, and the like. But a
phonetic representation must account for the realization of such structures as
physical events. It must be able to describe, and ultimately to predict, the
time course of the articulators moving and the spectrum changing.
Early studies in acoustic phonetics demonstrated a plethora of influences
on speech timing, with seemingly complex interactions (e.g. Klatt 1976). The
measured acoustic durations of segments were shown to differ widely under
variation in overall tempo, in the specification of adjacent segments, in stress
placement or accentuation, in position relative to phrase boundaries, and so
on. Moreover, the articulatory kinematics implicated in any one linguistic
specification - tempo or stress, say - showed a complicated variation across
speakers and conditions (e.g. Gay 1981). The application of a general model
of limb movement (task dynamics) shows promise of resolving this variation
by relating the durational correlates of tempo and stress to the control of
dynamic parameters such as gestural stiffness and amplitude (e.g. Kelso et al.
1985; Ostry and Munhall 1985). However, the mapping between these

*Haskins Laboratories generously allowed us to use their optoelectronic tracking system to


record the jaw-movement data. These recordings were made and processed with the assistance of
Keith Johnson and Kenneth De Jong. Madalyn Ortiz made the measurements of gesture
durations, displacements, and velocities, and Maria Swora made the Fo tracks and supplied one
transcription of the intonation patterns. The work reported in this paper was supported by the
National Science Foundation under grant number IRI-8617873 to Jan Edwards and grants IRI-
861752 and IRI-8858109 to Mary Beckman. Support was also provided by the Ohio State
University in various ways, including a Postdoctoral Fellowship to Janet Fletcher.

parameters and the underlying phonological specification of prosodic struc-
ture is not yet understood.
A comparison of the articulatory dynamics associated with several different
lengthening effects suggests an approach to this mapping. This paper explores
how the general task-dynamic model can be applied to the durational
correlates of accent, of intonation-phrase boundaries, and of slower overall
speaking tempo. It contrasts the descriptions of these three different effects in a
corpus of articulatory measurements of jaw movement patterns in [pap]
sequences. We will begin by giving an overview of the task-dynamic model
before presenting the data, and conclude by describing what the data suggest
concerning the nature of timing control. We will propose that, underlying the
more general physical representation of gestural stiffness and amplitude, there
must be an abstract language-specific phonetic representation of the time
course of sonority at the segmental and various prosodic levels.

3.2 The task-dynamic model


In the late 1970s, a group of speech scientists at Haskins Laboratories
proposed that speech production can be described using a task-dynamic
model originally developed to account for such things as the coordination of
flexor and extensor muscles in the control of gait (see, e.g., Fowler et al.
1980). In the decade since, safer techniques for recording articulator move-
ments have been refined, allowing large-scale studies in which multiple
repetitions of two or three syllable types can be examined for more
informative comparisons among different linguistic and paralinguistic
influences on durational structure (e.g. Ostry, Keller, and Parush 1983; Kelso
et al. 1985; Ostry and Munhall 1985). These studies showed patterns in the
relationships among kinematic measures in speech gestures that are similar
to those displayed by limb movements in walking, reaching, and the like,
lending plausibility to the proposed application to speech of the task-
dynamic model. More recently, the application has been made feasible by the
development of a mathematics capable of describing crowded gestures in
sequence (see Saltzman 1986; Saltzman and Munhall 1989) and by the
development of a system for representing segmental features directly in terms
of task-dynamic specifications of articulatory gestures (see Browman and
Goldstein 1986, 1988, 1990, this volume).
A fundamental assumption in this application of task dynamics is that
speech can be described as an orchestration of abstract gestures that specifies
(within a gesture) stiffness and displacement and (between gestures) relative
phase (see Hawkins, this volume, for an overview). To interpret articulatory
kinematics in terms of this model, then, we must look at relationships among
movement velocity, displacement, and duration for indications of the under-
lying dynamic specifications of intragestural amplitude and stiffness and of
intergestural phase. (Note that we use the term displacement for the observed
kinematic measure, reserving amplitude for the underlying dynamic specifica-
tion in the model.) Consider first how the model would deal with linguistic
contrasts that essentially change gestural amplitude - for example, featural
specifications for different degrees of vowel constriction. If the model is
correct, the observed velocities of relevant articulators should correlate
positively with the observed displacements. Such a relationship is seen in
Ostry and Munhall's (1985) data on tongue-dorsum movement in /kV/
sequences. In this corpus, the size of the opening gesture into the vowel
varied from a small dorso-velar displacement for /ku/ to a large displacement
for /ka/. At the same time, movement velocity also varied from slow to fast,
and was strongly correlated with displacement, indicating a relatively
constant dorsal stiffness over the different vowel qualities. The task-dynamic
model provides an intuitively satisfying explanation for the variation in
gestural velocity. Without the reference to the underlying dynamic structure,
the inverse relationship between vowel height and velocity is counterintuitive
at best.
Consider next how the model would deal with a contrast that presumably
does not specify different amplitudes, such as overall tempo variation for the
same vowel height. Here we would predict a somewhat different relationship
between observed velocity and displacement. In Kelso et al.'s (1985) study of
lower lip gestures in /ba/ sequences, regression functions for peak velocity
against displacement generally had higher slopes for fast as compared to slow
tokens. As the authors point out, such differences in slope can be interpreted
as indicating different stiffnesses for the gestures at the different tempi. The
gesture is faster at fast tempo because it is stiffer. Because it is stiffer without
being primarily larger in amplitude, it should also be shorter, a prediction
which accords also with other studies.
Finally, consider how phase specifications can affect the observed kine-
matics of a gesture. For example, if an articulator is specifically involved in
the oral gestures for a stop and both adjacent vowels, as in the tongue-
dorsum raising and lowering gestures of an /aka/ sequence, then undershoot
of the stop closure might occur if the opening gesture is phased very early
relative to the closing gesture, resulting in an apparent replacement of /k/
with [x]. Browman and Goldstein (1990) have proposed that this sort of
variation in the phasing of gestures accounts for many consonant lenitions
and even deletions in casual speech. In cases of less extreme overlap, the early
phasing of a following gesture might obscure a gesture's underlying ampli-
tude specification without effecting a perceptible difference in segmental
quality. Bullock and Grossberg (1988) have suggested that such truncation is
the rule in the typically dense succession of gestures in fluent speech. Results


Figure 3.1 Predicted relationships among kinematic measures for sequences of gestures with
(b)-(d) differing stiffnesses, (e)-(g) differing displacements, and (h)-(j) differing intergestural
phasings

by Nittrouer et al. (1988) suggest that varying degrees of truncation


associated with differences in intrasyllabic phase relationships may underlie
minor effects on vowel quality of lexical stress contrasts.
Figure 3.1 summarizes the kinematic patterns that should result from
varying each of the three different dynamic specifications. In a pure stiffness
change, peak velocity would change but observed displacement should
remain unchanged, as shown in figure 3.1c. Duration should be inversely
proportional to the velocity change - smaller velocities going with longer


Table 3.1 Test sentences

Obligatory intonation phrase break
1a  Pop, opposing the question strongly, refused to answer it.
1b  Poppa, posing the question loudly, refused to answer it.

No phrase break likely
2a  Pop opposed the question strongly, and so refused to answer it.
2b  Poppa posed the question loudly, and then refused to answer it.

durations. In this case, the ratio of the displacement to the peak velocity
should be a good linear predictor of the observed duration (fig. 3.1d). In a
pure amplitude change as well, peak velocity should change, but here
observed displacement should also change, in constant proportion to the
velocity change (fig. 3.1f). In accordance with the constant displacement-
velocity ratio, the observed duration is predicted to be fairly constant (fig.
3.1g). Finally, in a phase change, peak velocity and displacement might
remain relatively unchanged (fig. 3.1i), but the observed duration would
change as the following gesture is phased earlier or later; it would be shorter
or longer than predicted by the displacement-velocity ratio (fig. 3.1j). If the
following gesture is phased early enough, the effective displacement might
also be measurably smaller for the same peak velocity ("truncated" tokens
in figs. 3.1i and 3.1j).
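
These predictions can be checked against a toy version of the model's gesture: a critically damped mass-spring driven toward its target. The simulation below is my own illustration with arbitrary parameter values, not the authors' implementation; it shows that, with stiffness held constant, peak velocity grows in proportion to amplitude while the displacement/velocity ratio stays fixed, and that raising stiffness at constant amplitude raises peak velocity and shrinks that ratio (i.e. shortens the movement).

```python
# A toy critically damped gesture (unit mass): x'' = -k (x - target) - 2 sqrt(k) x'.
# Arbitrary values, meant only to illustrate the figure 3.1 predictions.
import numpy as np

def peak_velocity(amplitude, stiffness, dt=0.0005, t_max=0.5):
    x, v, best = 0.0, 0.0, 0.0
    for _ in range(int(t_max / dt)):
        a = -stiffness * (x - amplitude) - 2.0 * np.sqrt(stiffness) * v
        v += a * dt
        x += v * dt
        best = max(best, v)
    return best

for label, amp, k in [("baseline          ", 10.0, 400.0),
                      ("larger amplitude  ", 15.0, 400.0),
                      ("greater stiffness ", 10.0, 900.0)]:
    pv = peak_velocity(amp, k)
    # Analytically, peak velocity = amplitude * sqrt(stiffness) / e, so the
    # displacement/velocity ratio depends on stiffness alone.
    print(f"{label} peak velocity = {pv:6.1f}   displacement/velocity = {amp / pv:5.3f} s")
```
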
3.3 Methods
In our experiment, we measured the kinematics of movements into and out
of a low vowel between two labial stops. These [pap] sequences occurred
in the words pop versus poppa in two different sentence types, shown in
table 3.1. In the first type, the target word is set off as a separate intonation
phrase, and necessarily bears a nuclear accent. In the other type, the noun is
likely to be part of a longer intonation phrase with nuclear accent falling
later.
We had four subjects read these sentences at least five times in random
order at each of three self-selected tempi. We used an optoelectronic tracking
system (Kay et al. 1985) to record jaw height during these productions. We
looked at jaw height rather than, say, lower-lip height because of the jaw's
contribution both to overall openness in the vowel and to the labial closing
gesture for the adjacent stops. We defined jaw opening and closing gestures
as intervals between moments of zero velocity, as shown in figure 3.2, and we
measured their durations, displacements, and peak velocities. We also made


Figure 3.2 Sample jaw height and velocity traces showing segmentation points for vowel
opening gesture and consonant closing gesture in Poppa

Fo tracks of the sentences, and had two observers independently transcribe


the intonation patterns.
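
A sketch of this measurement procedure (an assumed reconstruction, not the authors' analysis software) is given below: the jaw-height signal is differentiated, gesture boundaries are placed at successive moments of zero velocity, and each gesture's duration, displacement, and peak velocity are read off.

```python
# A hedged sketch of the gesture measurements: segment a sampled jaw-height
# trace at zero crossings of its velocity and report duration, displacement,
# and peak velocity for each opening or closing gesture.
import numpy as np

def gesture_measures(jaw_height, sample_rate):
    velocity = np.gradient(jaw_height) * sample_rate
    crossings = np.where(np.diff(np.sign(velocity)) != 0)[0] + 1   # zero-velocity moments
    boundaries = np.concatenate(([0], crossings, [len(jaw_height) - 1]))
    measures = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        segment = jaw_height[start:end + 1]
        measures.append({
            "type": "opening" if segment[-1] < segment[0] else "closing",
            "duration_s": (end - start) / sample_rate,
            "displacement_mm": float(abs(segment[-1] - segment[0])),
            "peak_velocity": float(np.max(np.abs(velocity[start:end + 1]))),
        })
    return measures

# Synthetic 400 ms trace at 1 kHz: the jaw lowers for the vowel, then raises again.
t = np.linspace(0.0, 0.4, 400)
jaw = 2.0 - 8.0 * np.sin(np.pi * t / 0.4) ** 2
for gesture in gesture_measures(jaw, sample_rate=1000):
    print(gesture)
```
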
Figure 3.3 shows sample Fo tracks of utterances of sentences 1a and 2a
(table 3.1). The utterances show the expected phrasings, with an intonation-
phrase boundary after the pop in figure 3.3a but no intonational break of any
kind after the pop in figures 3.3b and 3.3c. All four subjects produced these
two contrasting patterns of phrasing for all of the tokens of the contrasting
types. For sentences 1a and 1b, this phrasing allows only one accentuation, a
nuclear accent on the pop or poppa, as illustrated by subject KDJ's produc-
tion in figure 3.3a. For sentences of type 2, however, the phrasing is
consistent with several different accent patterns. In figure 3.3b, for example,
there is a prenuclear accent on pop, whereas in figure 3.3c, the first accent
Subjects KAJ and CDJ produced a variety of the possible accentuations for
this phrasing. Subjects JRE and KDJ, on the other hand, never produced a
prenuclear accent on the target word. They consistently put the first pitch
accent later, as in JRE's utterance in figure 3.3c. For these two subjects,
therefore, we can use the unaccented first syllable of poppa posed and the
nuclear-accented syllable of poppa, posing to examine the durational corre-
lates of accent.

3.4 The kinematics of accent


Figure 3.4 shows the mean durations, displacements, and peak velocities of
the opening and closing gestures for this accent contrast for subject JRE,
averaged by tempo. Examining the panels in the figure row by row, we see

[Figure 3.3 panels: (a) F0 (Hz) trace of Pop, opposing the question strongly, subject KAJ; (b) Pop opposed the question strongly, subject KAJ; (c) Pop opposed the question strongly, subject JRE; with pitch-accent and boundary-tone labels such as H*+L, H*, H+L*, L, and H%; time in sec.]
Figure 3.3 Sample Fo contours for target portions of sentences la and 2a (table 3.1) produced by
subjects KAJ and JRE

first that the gestures are longer in the nuclear-accented syllable in poppa,
posing. Note also the distribution of the durational increase; at all three
tempi, it affects both the opening and closing gesture of the syllable, although
it affects the closing gesture somewhat more. In the next row we see that the
gestures are larger in the accented syllable. Again, the increase in the
kinematic measure affects both the opening and the closing gesture; both
move about 2 mm further. Finally, in the last row we see the effect of accent
on the last kinematic measure. Here, by contrast, there is no consistent

[Figure 3.4 panels (subject JRE): opening and closing gestures at fast, normal, and slow tempo; accented (o) vs. unaccented (•) syllables.]

Figure 3.4 Mean durations, displacements, and peak velocities of opening and closing gestures
for nuclear-accented syllables (in Poppa, posing) vs. unaccented syllables (in Poppa posed) for
subject JRE

pattern. The opening gesture is faster in the accented syllable, but the closing
gesture is clearly not.
The overall pattern of means agrees with Summers's (1987) results for
accented and unaccented monosyllabic nonsense words. Interpreting this
pattern in terms of intragestural dynamics alone, we would be forced to
conclude that accent is not realized as a uniform change in a single
specification for the syllable as a whole. A uniform increase in the amplitude
specification for the accented syllable would be consistent with the greater

[Figure 3.5: observed syllable duration plotted against predicted syllable duration (sec.) = Σ displacement/velocity (mm/[mm/sec]); accented O vs. unaccented # tokens.]

Figure 3.5 Observed syllable durations against predicted syllable durations for the accented vs.
unaccented syllables in figure 3.4

displacements of both gestures and with the greater velocity of the opening
gesture, but not with the velocity of the closing gesture. A decrease in
stiffness for the accented syllable could explain the increased durations of the
two gestures, but it must be just enough to offset the velocity increase caused
by the increased displacement of the closing gesture.
If we turn to the intergestural dynamics, however, we can explain both the
displacement and the length differences in terms of a single specification
change: a different phasing for the closing gesture relative to the opening
gesture. That is, the opening gesture is longer in the accented syllable because
its gradual approach towards the asymptotic target displacement is not
interrupted until later by the onset of the closing gesture. Consequently, its
effective displacement is larger because it is not truncated before reaching its
target value in the vowel. The closing gesture, similarly, is longer because the
measured duration includes a quasi-steady-state portion where its rapid
initial rise is blended together with the gradual fall of the opening gesture's
asymptotic tail. And its displacement is larger because it starts at a greater
distance from its targeted endpoint in the following consonant.
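This truncation account can be made concrete with a small numerical sketch
(an illustration only, not part of the original study; the 12 mm target, the
stiffness value, and the two phasing times are invented):

    import numpy as np

    def gesture(t, start, target, omega):
        # Critically damped second-order response from `start` toward `target`;
        # omega stands in for gestural "stiffness" (1/omega is the time constant).
        return target + (start - target) * (1.0 + omega * t) * np.exp(-omega * t)

    def opening_measures(onset_of_closing, omega=30.0, jaw_closed=0.0, jaw_open=12.0):
        # The opening gesture is measured only up to the moment the closing gesture
        # is phased in, so its duration and effective displacement both depend on
        # that phasing rather than on the gesture's own (fixed) target.
        duration = onset_of_closing
        displacement = gesture(onset_of_closing, jaw_closed, jaw_open, omega)
        return duration, displacement

    for label, phase in [("unaccented", 0.10), ("accented", 0.16)]:
        dur, disp = opening_measures(phase)
        print(f"{label:10s}: opening duration {dur*1000:.0f} ms, displacement {disp:.1f} mm")

With the same stiffness and target in both conditions, the later phasing alone
makes the opening gesture longer and roughly 2 mm larger, and the closing
gesture then starts from a more open position, so it too is longer and larger.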
Figure 3.5 shows some positive evidence in favor of this interpretation.
The value plotted along the y-axis in this figure is the observed overall
duration of each syllable token, calculated by adding the durations of the
opening and closing gestures. The value along the x-axis is a relative measure


of the predicted duration, calculated by adding the displacement-velocity
ratios. Recalling the relationships described above in figure 3.1, we expect
that as long as the phasing of the closing gesture relative to the opening
gesture is held constant, the observed durations of the gestures should be in
constant proportion to their displacement-velocity ratios. Therefore, if the
closing gesture's phase is the same between the accented and unaccented
syllables in figure 3.5, all of the tokens should lie along the same regression
line, with the longer nuclear-accented syllables lying generally further to the
upper right. As the figure shows, however, the relationship between
measured and predicted durations is different for the two accent conditions.
For any given predicted duration, the measured duration of an accented
syllable is larger than that for an unaccented syllable.
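A minimal sketch of the computation behind figure 3.5 (illustrative only; the
single example token is invented, and per-gesture measurements are assumed to
be available for each token):

    def predicted_duration(disp_open, vpeak_open, disp_close, vpeak_close):
        # Predicted syllable duration = sum of the displacement / peak-velocity
        # ratios of the opening and closing gestures (mm / [mm/sec] = sec).
        return disp_open / vpeak_open + disp_close / vpeak_close

    def observed_duration(dur_open, dur_close):
        # Observed syllable duration = sum of the two measured gesture durations.
        return dur_open + dur_close

    # A hypothetical token: a 10 mm opening gesture peaking at 160 mm/sec and a
    # 12 mm closing gesture peaking at 140 mm/sec.
    print(predicted_duration(10.0, 160.0, 12.0, 140.0))   # about 0.148 sec

Fitting observed against predicted duration separately for the two accent
conditions (for example with numpy.polyfit) then tests whether all tokens
share a single regression line, as constant phasing would require.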
This difference supports our interpretation of the means. An accented
syllable is longer not because its opening and closing gestures are less stiff,
but because its closing gesture is substantially later relative to its opening
gesture; the accented syllable is bigger, in the sense that the vocal tract is
maximally open for a longer period of time. We note further that this
horizontal increase in size is accompanied by an effective vertical increase;
the jaw moves further in the accented syllable because the opening gesture is
not truncated by the closing gesture. Since the sound pressure at the lips is a
function not just of the source pressure at the glottis but also of the general
openness of the vocal tract, an important consequence of the different jaw
dynamics is that the total acoustic intensity should be substantially larger in
the accented syllable, a prediction which accords with observations in earlier
acoustic studies (see Beckman 1986 for a review).

3.5 The kinematics of final lengthening


These kinematic patterns for the accentual contrast differ substantially from
those for the effect of phrasal position, as can be seen by comparing figure 3.4
with 3.6. This figure plots the durations, displacements, and peak velocities
of opening and closing gestures in the intonation-phrase-final [pap] of pop,
opposing vs. the nonfinal [pap] of poppa, posing for the same subject. Looking
first at the mean durations, we see that, compared to the durational increase
associated with accent, the greater length in intonation-phrase-final position
is not distributed so evenly over the syllable. It affects the opening gesture
considerably less and the closing gesture substantially more. The patterns for
the mean displacements also are different. Unlike the greater length of a
nuclear-accented syllable, intonation-phrase-final lengthening is not accom-
panied by any significant difference in articulator displacement. At all three
tempi, the jaw moves about the same distance for final as for nonfinal
syllables. The peak velocities for these gestures further illuminate these

Figure 3.6 Mean durations, displacements, and peak velocities of opening and closing gestures
(at fast, normal, and slow tempi) for phrase-final syllables (in Pop, opposing) versus nonfinal
syllables (in Poppa, posing) for subject JRE

differences. The opening gesture of a final syllable is as fast as that of a


nonfinal syllable, as would be expected, given their similar durations and
displacements. However, the closing gestures are much slower for the longer
phrase-final syllables. The extreme difference in the closing gesture velocities
unaccompanied by any difference in displacement suggests a change in
articulator stiffness; the phrase-final gesture is longer because it is less stiff.
In sum, by contrast to the lengthening associated with accent, final

Figure 3.7 Observed syllable durations against predicted syllable durations (Σ displacement/velocity,
mm/[mm/sec] = sec.) for the final vs. nonfinal syllables in figure 3.6

lengthening makes a syllable primarily slower rather than bigger. That is,
phrase-final syllables are longer not because their closing gestures are phased
later, but rather because they are less stiff. In terms of the underlying
dynamics, then, we might describe final lengthening as an actual targeted
slowing down, localized to the last gesture at the edge of a phrase.
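The stiffness account can likewise be illustrated with a toy computation
(again an illustration only; the stiffness values, the 10 mm amplitude, and
the 95% completion criterion are arbitrary):

    import math

    def kinematics(amplitude_mm, omega, completion=0.95):
        # Critically damped gesture toward its target: the time needed to cover a
        # fixed fraction of the target displacement scales with 1/omega, while the
        # peak velocity (reached at t = 1/omega) scales with amplitude * omega.
        t = 0.0
        while (1.0 + omega * t) * math.exp(-omega * t) > 1.0 - completion:
            t += 1e-4
        return t, amplitude_mm * completion, amplitude_mm * omega * math.exp(-1.0)

    for label, omega in [("nonfinal (stiffer)", 35.0), ("final (less stiff)", 22.0)]:
        dur, disp, vel = kinematics(10.0, omega)
        print(f"{label:25s}: {dur*1000:.0f} ms, {disp:.1f} mm, {vel:.0f} mm/sec")

Lowering the stiffness lengthens the gesture and lowers its peak velocity
while leaving the displacement unchanged, which is the pattern observed for
the phrase-final closing gestures.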
Applying the reasoning that we used to interpret figure 3.5 above, this
description predicts that the relationship between observed duration and
predicted duration should be the same for final and nonfinal closing gestures.
That is, in figure 3.7, which plots each syllable's duration against the sum of
its two displacement-velocity ratios, the phrase-final tokens should be part of
the same general trend, differing from nonfinal tokens only in lying further
towards the upper-right corner of the graph.
However, the figure does not show this predicted pattern. While the fast
tempo and normal tempo tokens are similar for the two phrasal positions,
the five slow-tempo phrase-final tokens are much longer than predicted by
their displacement-velocity ratios, making the regression curve steeper and
pulling it away from the curve for nonfinal tokens. The meaning of this
unexpected pattern becomes clear when we compare our description of final
lengthening as a local slowing down to the overall slowing down of tempo
change.


Figure 3.8 Mean syllable duration for fast, normal, and slow tempi productions of the final and
nonfinal syllables in figure 3.6

3.6 The kinematics of slow tempo


Our first motivation for varying tempo in this experiment was to provide a
range of durations for each prosodic condition to be able to look at
relationships among the different kinematic measurements. We also wanted
to determine whether there is any upper limit on lengthening due to, say, a
lower limit on gestural stiffness. We paid particular attention, therefore, to
the effects of tempo variation in the prosodic condition that gives us the
longest durational values: the nuclear-accented phrase-final syllables of pop,
opposing.
Figure 3.8 shows the mean overall syllable duration of this phrase-final
[pap] for the same subject shown above in figures 3.4-3.7. For comparison to
tempo change in shorter syllables, the figure also shows means for the
nonfinal [pap] of poppa, posing. As acoustic studies would predict, there is a
highly significant effect of tempo on the overall duration of the syllable for
both phrasal positions. This is true for slowing down as well as for speeding
up. Speeding-up tempo resulted in shorter mean syllable durations in both
phrasal positions. Conversely, going from normal to slow tempo resulted in
longer mean syllable durations. Note that these tempo effects are sym-
metrical; slowing down tempo increases the syllable's duration as much as
speeding up tempo reduces its duration.
Figure 3.9 shows the mean durations, displacements, and peak velocities
for the opening and closing gestures in these syllables. (These are the same
Figure 3.9 Mean durations, displacements, and peak velocities of opening and closing gestures
for fast, normal, and slow tempi productions of final and nonfinal syllables shown previously in
figure 3.8

data as in fig. 3.6, replotted to emphasize the effect of tempo.) For the
opening gesture, shown in the left-hand column, slowing down tempo
resulted in longer movement durations and lower peak velocities, unaccom-
panied by any change in movement size. Conversely, speeding up tempo
resulted in overall shorter movement durations and higher peak velocities,
again unaccompanied by any general increase in displacement. (The small
increase in displacement for the nonfinal syllables is statistically significant,
but very small when compared to the substantial increase in velocity.) These

patterns suggest that, for JRE, the primary control parameter in tempo
variation is gestural stiffness. She slows down tempo by decreasing stiffness
to make the gestures slower and longer, whereas she speeds up tempo by
increasing stiffness to make the gestures faster and shorter. This general
pattern was true for the opening gestures of the three other subjects as well,
although it is obscured somewhat by their differing abilities to produce three
distinct rates.
The closing gestures shown in the right-hand column of Figure 3.9, by
contrast, do not show the same consistent symmetry between speeding up
and slowing down. In speeding up tempo, the closing gestures pattern like the
opening gestures; at fast tempo the gesture has shorter movement durations
and higher peak velocities for both the final and nonfinal syllables. In slowing
down tempo, however, there was an increase in movement duration and a
substantial decrease in movement velocity only for syllables in nonfinal
position; phrase-final closing gestures were neither longer nor significantly
slower at slow than at normal overall tempo. Subject CDJ showed this same
asymmetry.
The asymmetry can be summarized in either of the following two ways:
subjects JRE and CDJ had little or no difference between normal and slow
tempo durations and velocities for final syllables; or, these two subjects had
little or no difference between final and nonfinal closing gesture durations
and velocities at slow tempo. It is particularly noteworthy that, despite this
lack of any difference in closing gesture duration, the contrast in overall
syllable duration was preserved, as was shown above for JRE in figure 3.8. It
is particularly telling that these two subjects had generally longer syllable
durations than did either subject KAJ or KDJ.
We interpret this lack of contrast in closing gesture duration for JRE and
CDJ as indicating some sort of lower limit on movement velocity or stiffness.
That this limit is not reflected in the overall syllable duration, on the other
hand, suggests that the subjects use some other mechanism - here probably a
later phasing of the closing gesture - to preserve the prosodic contrast in the
face of a limit on its usual dynamic specification. This different treatment of
slow-tempo final syllables would explain the steeper regression curve slope in
figure 3.7 above. Suppose that for fast and normal tempi, the nonfinal and
final syllables have the same phasing, and that the difference in observed
overall syllable duration results from the closing gestures being longer
because they are less stiff in the phrase-final position. For these two tempi,
the relationship between observed duration and predicted duration would be
the same. At slow tempo, on the other hand, the phrase-final gestures may
have reached the lower limit on gestural stiffness. Perhaps this is a physio-
logical limit on gestural speed, or perhaps the gesture cannot be slowed any
further without jeopardizing the identity of the [p] as a voiceless stop. In

order to preserve the durational correlates of the prosodic contrast, however,
the closing gesture is phased later, making the observed period of the syllable
longer relative to its predicted period, and thus preserving the prosodic
contrast in the face of this apparent physiological or segmental limit.

3.7 The linguistic model


That the prosodic contrast is preserved by substituting a different dynamic
specification suggests that the final lengthening has an invariant specification
at some level of description above the gestural dynamics. In other words, our
description of final lengthening as a local tempo change is accurate at the
level of articulatory dynamics only to a first approximation, and we must
look for an invariant specification of the effect at a yet more abstract level of
representation.
Should this more abstract level be equated with the phonological struc-
tures that represent categorical contrast and organization? We think not.
Although final lengthening is associated with phonologically distinct
phrasings, these distinctions are already represented categorically by the
hierarchical structures that describe the prosodic organization of syllables,
feet, and other phonological units. A direct phonological representation of
the phrase-final lengthening would be redundant to this independently
necessary prosodic structure. Moreover, we would like the abstract represen-
tation of final lengthening to motivate the differences between it and
accentual lengthening and the similarities between it and slowing down
tempo overall. A discrete phonological representation of lengthening at
phrase edges, such as that provided by grid representations (e.g. Liberman
1975; Selkirk 1984), would be intractable for capturing these differences and
similarities. A more promising approach is to try to describe the quantitative
properties of the lengthenings associated with nuclear accent, phrase-final
position, and overall tempo decrease in terms of some abstract phonetic
representation that can mediate between the prosodic hierarchy and the
gestural dynamics.
We propose that the relevant level of description involves the time course
of a substantive feature "sonority." We chose sonority as the relevant
phonetic feature because of its role in defining the syllable and in relating
segmental features to prosodic structures. Cross-linguistically, unmarked
syllables are ones whose associated segments provide clear sonority peaks.
The syllable is also a necessary unit for describing stress patterns and their
phonological properties in time, including the alignment of pitch accents.
The syllable is essential for examining the phonetic marks of larger intonat-
ional phrases, since in the prosodic hierarchy these units are necessarily
coterminous with syllables.

Our understanding of sonority owes much to earlier work on tone scaling
(Liberman and Pierrehumbert 1984; Pierrehumbert and Beckman 1988). We
see the inherent phonological sonority of a segment as something analogous
to the phonological specification of a tone, except that the intrinsic scale is
derived from the manner features of a segment, as proposed by Clements
(1990a): stops have L (Low) sonority and open vowels have H (High)
sonority. These categorical values are realized within a quantitative space
that reflects both prosodic structure and paralinguistic properties such as
overall emphasis and tempo. That is, just as prosodic effects on F0 are
specified by scaling H and L tones within an overall pitch range, prosodic
effects on phonetic sonority are specified by scaling segmental sonority values
within a sonority space.
This sonority space has two dimensions. One dimension is Silverman and
Pierrehumbert's (1990) notion of sonority: the impedance of the vocal-tract
looking forward from the glottis. We index this dimension by overall vocal
tract openness, which can be estimated by jaw height in our target [pap]
sequences. The other dimension of the sonority space is time; a vertical
increase in overall vocal-tract openness is necessarily coupled to a horizontal
increase in the temporal extent of the vertical specification. In this two-
dimensional space, prosodic constituents all can be represented as rectangles
having some value for height and width, as shown in figure 3.10b for the
words poppa and pop. The phonological properties of a constituent will help
to determine these values. For example, in the word poppa, the greater
prosodic strength of the stressed first syllable as compared to the unstressed
second syllable is realized by the larger sizes of the outer box for this syllable
and of the inner box for its bimoraic nucleus. (Figure 3.10a shows the
phonological representation for these constituents, using the moraic analysis
of heavy syllables first proposed by Hyman 1985.)
We understand the lengthening associated with accent, then, as part of an
increase in overall sonority for the accented syllable's nucleus. The larger
mean displacements of accented gestures reflect the vertical increase, and the
later phasing represents the coupled horizontal increase, as in figure 3.10c.
This sort of durational increase is analogous to an increase in local tonal
prominence for a nuclear pitch accent within the overall pitch range.
Final lengthening and slowing down tempo are fundamentally different
from this lengthening associated with accent, in that neither of these effects is
underlyingly a sonority change. Instead, both of these are specified as increases
in box width uncoupled to any change in box height; they are strictly
horizontal increases that pull the sides of a syllable away from its centre, as
shown in figure 3.10d and e.
The two strictly horizontal effects differ from each other in their locales.
Slowing down tempo overall is a more global effect that stretches out a


Figure 3.10 (a) Prosodic representations and (b) sonority specifications for the words pop and
poppa. Effects on sonority specification of (c) accentual lengthening, (d) final lengthening, and
(e) slowing down tempo. (f) Prosodic representation and sonority representation of palm

syllable on both sides of its moraic center. It is analogous in tone scaling to a
global change of pitch range for a phrase. Final lengthening, by contrast, is
local to the phrase edge. It is analogous in tone scaling to final lowering.
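The geometry of these three operations can be summarized in a small
data-structure sketch (illustrative only; the class, the function names, and
the scaling factors are invented and do not implement the quantitative model):

    from dataclasses import dataclass

    @dataclass
    class Box:
        width: float     # temporal extent
        height: float    # overall vocal-tract openness (phonetic sonority)

    def accent(nucleus: Box, factor: float = 1.3) -> Box:
        # Accentual lengthening: a vertical sonority increase on the nucleus box,
        # necessarily coupled to a horizontal (temporal) increase.
        return Box(nucleus.width * factor, nucleus.height * factor)

    def final_lengthen(last_syllable: Box, extra: float = 0.4) -> Box:
        # Final lengthening: a strictly horizontal stretch, local to the phrase
        # edge (only the edge-adjacent side of the last box is pulled outward).
        return Box(last_syllable.width + extra, last_syllable.height)

    def slow_tempo(syllables: list, factor: float = 1.25) -> list:
        # Slowing down tempo: a strictly horizontal stretch applied globally,
        # pulling both sides of every box away from its moraic center.
        return [Box(s.width * factor, s.height) for s in syllables]

    pop = Box(width=1.0, height=1.0)
    print(accent(pop), final_lengthen(pop), slow_tempo([pop]))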
In proposing that these structures in a sonority-time space mediate
between the prosodic hierarchy and the gestural dynamics, we do not mean
to imply that they represent a stage of processing in a hypothetical derivatio-
nal sequence. Rather, we understand these structures as a picture of the
rhythmic framework for interpreting the dynamics of the segmental gestures
associated to a prosodic unit. For example, in our target [pap] sequences,
accentual lengthening primarily affected the phasing of the closing gesture
into the following consonant. We interpret this pattern as an increase in the
sonority of the accented syllable's moraic nucleus. Since the following [p] is
associated to the following syllable, the larger sonority decreases its overlap
with the vowel gesture. If the accented syllable were [pam] (palm), on the
other hand, we would not predict quite so late a phasing for the [m] gesture,
since the syllable-final [m] is associated to the second mora node in the
prosodic tree, as shown in figure 3.10f.
This understanding of the sonority space also supports the interpretation
of Fo alignment patterns. For example, Steele (1987) found that the relative
position of the Fo peak within the measured acoustic duration of a nuclear-
accented syllable remains constant under durational increases due to overall
tempo change. This is just what we would expect in our model of the
sonority-time space if the nuclear accent is aligned to the sonority peak for
the syllable nucleus. We would model this lengthening as a stretching of the
syllable to both sides of the moraic center. In phrase-final position, on the
other hand, Steele found that the Fo peak comes relatively earlier in the
vowel. Again, this is just what we would predict from our representation of
final lengthening as a stretching that is local to the phrase edge. This more
abstract representation of the rhythmic underpinnings of articulatory
dynamics thus allows us to understand the alignment between the F o pattern
and the segmental gestures.
In further work, we hope to extend this understanding to other segmental
sequences and other prosodic patterns in English. We also hope to build on
Vatikiotis-Bateson's (1988) excellent pioneering work in cross-linguistic
studies of gestural dynamics, to assess the generality of our conclusions to
analogous prosodic structures in other languages and to rhythmic structures
that do not exist in English, such as phonemic length distinctions.

Comments on chapter 3
OSAMU FUJIMURA
The paper by Beckman, Edwards, and Fletcher has two basic points: (1) the
sonority contour defines temporal organization; and (2) mandible height is
assumed to serve as a measure of sonority. In order to relate mandible
movement to temporal patterns, the authors propose to use the task-
dynamics model. They reason as follows. Since task dynamics, by adopting a
given system time constant ("stiffness," see below), defines a fixed relation
between articulatory movement excursion ("amplitude") and the duration of
each movement, measuring the relation between the two quantities should
test the validity of the model and reveal the role of adjusting control variables
of the model for different phonetic functions. For accented vs. unaccented
syllables, observed durations deviate from the prediction when a fixed
condition for switching from one movement to the next is assumed, while
under phrase-final situations, data conform with the prediction by assuming
a longer time constant of the system itself. (Actually, the accented syllable
does indicate some time elongation in the closing gesture as well.)
Based on such observations, the authors suggest that there are two
different mechanisms of temporal control: (1) "stiffness," which in this model
(as in Browman and Goldstein 1986, 1990) means the system time constant;
and (2) "phase," which is the timing of resetting of the system for a new
target position. This resetting is triggered by the movement across a preset
threshold position value, which is specified in terms of a percentage of the
total excursion. The choice between these available means of temporal
modulation depends on phonological functions of the control (amplitude
should not interact with duration in this linear model). This is a plausible
conclusion, and it is very interesting. There are some other observations that
can be compared with this; for example, Macchi (1988) demonstrated that
different articulators (the lower lip vs. the mandible) carry differentially
segmental and suprasegmental functions in lip-closing gestures.
I have some concern about the implication of this work with respect to the
model used. Note that the basic principle of oscillation in task dynamics most
naturally suggests that the rest position of the hypothetical spring-inertia
system is actually the neutral position of the articulatory mechanism. In a
sustained repetition of opening and closing movements for the same syllable,
for example, this would result in a periodic oscillatory motion which is
claimed to reflect the inherent nature of biological systems. In the current
model, however (as in the model proposed by Browman and Goldstein
[1986, 1990]), the rest position of the mass, which represents the asymptote of
a critically damped system, is not the articulatory-neutral position but the
target position of the terminal gesture of each (demisyllabic) movement.

Thus the target position must be respecified for each new movement. This
requires some principle that determines the point in the excursion at which
target resetting should take place. For example, if the system were a simple
undamped oscillatory system consisting of a mass and a spring, it could be
argued that one opening gesture is succeeded immediately by a closing
gesture after the completion of the opening movement (i.e. one quarter cycle
of oscillation); this model would result in a durational property absolutely
determined by the system characteristics, i.e. stiffness of the spring (since
inertia is assumed to equal unity), for any amplitude of the oscillation. In
Beckman, Edwards, and Fletcher's model, presumably because of the need to
manipulate the excursion for each syllable, a critically damped second-order
(mass-spring) system is assumed. This makes it absolutely necessary to
control the timing of resetting. However, this makes the assertion of task
dynamics - that the biological oscillatory system dictates the temporal
patterning of speech - somewhat irrelevant. Beckman, Edwards, and
Fletcher's model amounts to assuming a critically damped second-order
linear system as an impulse response of the system. This is a generally useful
mathematical approximation for implementing each demisyllabic movement
on the basis of a command. The phonetic control provides specifications of
timing and initial (or equivalently target) position, and modifies the system
time constant.
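The amplitude-independence appealed to here for the undamped case is easy to
check numerically (an illustration only; the stiffness value is arbitrary and,
as in the text, inertia is taken to be unity):

    import math

    k = 900.0                              # spring stiffness; omega = sqrt(k/m), m = 1
    omega = math.sqrt(k)
    quarter_cycle = (math.pi / 2) / omega  # duration of one (opening) movement

    for amplitude in (4.0, 8.0, 12.0):     # mm
        peak_velocity = amplitude * omega  # reached at the end of the quarter cycle
        print(f"amplitude {amplitude:4.1f} mm: duration {quarter_cycle*1000:.0f} ms, "
              f"peak velocity {peak_velocity:.0f} mm/sec")

The duration is the same for every amplitude, so in the undamped model timing
would indeed be fixed by stiffness alone; only once critical damping and
target resetting are introduced does an explicit decision about the timing of
resetting become necessary.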
The duration of each (upgoing or downgoing) mandibular movement is
actually measured as the interval between the two extrema at the end of each
(demisyllabic) movement. Since the model does not directly provide such
smooth endpoints, but rather a discontinuous switching from an incomplete
excursion towards the target position to the next movement (presumably
starting with zero velocity), there has to be some ad hoc interpretation of the
observed smooth function relative to the theoretical break-point representing
the system resetting. Avoiding this difficulty, this study measures peak
velocity, excursion, and duration for each movement. Peak velocity is
measurable relatively accurately, assuming that the measurement does not
suffer from excessive noise. The interpretation of excursion, i.e. the observed
displacement, as the distance between the initial position and the terminating
position (at the onset of next movement) is problematic (according to the
model) because the latter cannot be related to the former unless a specific
method is provided of deriving the smooth time function to be observed.
A related problem is that the estimation of duration is not accurate.
Specifically, Beckman, Edwards, and Fletcher's method of evaluating end-
points is not straightforward for two reasons. (1) Measuring the time value
for either endpoint at an extremum is inherently inaccurate, due to the nature
of extrema. Slight noise and small bumps, etc. affect the time value
considerably. In particular, an error seems to transfer a portion of the
opening duration to the next closing duration according to Beckman,
Edwards, and Fletcher's algorithm. The use of the time derivative zero-
crossing is algorithmically simple, but the inherent difficulty is not resolved.
(2) Such measured durations cannot be compared accurately with prediction
of the theory, as discussed above. Therefore, while the data may be useful on
their own merit, they cannot evaluate the validity of the model assumed.
If the aim is to use specific properties of task dynamics and determine
which of its particular specifications are useful for speech analyses, then one
should match the entire time function by curve fitting, and forget about the
end regions (and with them amplitude and duration, which depend too much
on arbitrary assumptions). In doing so, one would probably face hard
decisions about the specific damping condition of the model. More impor-
tantly, the criteria for newly identifying the rest position of the system at each
excursion would have to be examined.
The finding that phrase-final phenomena are different from accent or
utterance-speed control is in conformity with previous ideas. The commonly
used term "phrase-final (or preboundary) elongation" (Lehiste 1980) implies
qualitatively and intuitively a time-scale expansion. The value of Beckman,
Edwards, and Fletcher's work should be in the quantitative characterization
of the way this modification is done in time. The same can be said about the
conclusion that in phrase-final position, the initial and final parts of the
syllable behave differently. One interesting question is whether such an
alteration of the system constant, i.e. the time scale, is given continuously
towards the end, or uniformly for the last phonological unit, word, foot,
syllable, or demisyllable, in phrase-final position. Beckman, Edwards, and
Fletcher suggest that if it is the latter, it may be smaller than a syllable, but a
definitive conclusion awaits further studies.
My own approach is different: iceberg measurement (e.g. Fujimura 1986)
uses the consonantal gesture of the critical articulator, not the mandible, for
each demisyllable. It depends on the fast-moving portions giving rather
accurate assessment of timing. Some of the results show purely empirically
determined phrasing patterns of articulatory movements, and uniform
incremental compression/stretching of the local utterance speed over the
pertinent phrase as a whole, depending on prominence as well as phrasing
control (Fujimura 1987).
I think temporal modulation in phrasal utterances is a crucial issue for
phonology and phonetics. I hope that the authors will improve their
techniques and provide definitive data on temporal control, and at the same
time prove or refute the validity of the task-dynamics model.

4
Lenition of /h/ and glottal stop

JANET PIERREHUMBERT AND DAVID TALKIN

4.1 Introduction

In this paper we examine the effect of prosodic structure on how segments are
pronounced. The segments selected for study are /h/ and glottal stop /?/.
These segments permit us to concentrate on allophony in source characteris-
tics. Although variation in oral gestures may be more studied, source
variation is an extremely pervasive aspect of obstruent allophony. As is well
known, /t/ is aspirated syllable-initially, glottalized when syllable-final and
unreleased, and voiced throughout when flapped in an intervocalic falling
stress position; the other unvoiced stops also have aspirated and glottalized
variants. The weak voiced fricatives range phonetically from essentially
sonorant approximants to voiceless stops. The strong voiced fricatives
exhibit extensive variation in voicing, becoming completely devoiced at the
end of an intonation phrase. Studying /h/ and /?/ provides an opportunity to
investigate the structure of such source variation without the phonetic
complications related to presence of an oral closure or constriction. We hope
that techniques will be developed for studying source variation in the
presence of such complications, so that in time a fully general picture
emerges.
Extensive studies of intonation have shown that phonetic realization rules
for the tones making up the intonation pattern (that is, rules which model
what we do as we pronounce the tones) refer to many different levels of
prosodic structure. Even for the same speaker, the same tone can correspond
to many different Fo values, depending on its prosodic environment, and a
given Fo value can correspond to different tones in different prosodic
environments (see Bruce 1977; Pierrehumbert 1980; Liberman and Pierre-
humbert 1984; Pierrehumbert and Beckman 1988). This study was motivated
by informal observations that at least some aspects of segmental allophony

Figure 4.1 Wide-band spectrogram and waveform of the word hibachi produced with contras-
tive emphasis. Note the evident aspiration and the movement in F1 due to the spread glottis
during the /h/. The hand-marked segment locators and word boundaries are indicated in the
lower panel: m is the /m/ release; v marks the vowel centers; h the /h/ center; b the closure onset
of the /b/ consonant. The subject is DT

behave in much the same way. That is, we suspected that tone has no special
privilege to interact with prosody; phonetic realization rules in general can be
sensitive to prosodic structure. This point is illustrated in the spectrograms
and waveforms of figures 4.1 and 4.2. In figure 4.1 the word hibachi carries
contrastive stress and is well articulated. In figure 4.2, it is in postnuclear
position and the /h/ is extremely lenited; that is, it is produced more like a
vowel than the /h/ in figure 4.1. A similar effect of sentence stress on /h/
articulation in Swedish is reported in Gobl (1988).
Like the experiments which led to our present understanding of tonal
realization, the work reported here considers the phonetic outcome for
particular phonological elements as their position relative to local and
nonlocal prosodic features is varied. Specifically, the materials varied pos-
ition relative to the word prosody (the location of the word boundary and the
word stress) and relative to the phrasal prosody (the location of the phrase
boundary and the phrasal stress as reflected in the accentuation). Although
there is also a strong potential for intonation to affect segmental source
characteristics (since the larynx is the primary articulator for tones), this
issue is not substantially addressed in the present study because the difficul-


Figure 4.2 Wide-band spectrogram and waveform of the word hibachi in postnuclear position.
Aspiration and F1 movement during /h/ are less than in figure 4.1. The subject is DT

ties of phonetic characterization for /h/ and /?/ led us to an experimental
design with Low tones on all target regions. Pierrehumbert (1989) and a
study in progress by Silverman, Pierrehumbert, and Talkin do address the
effects of intonation on source characteristics directly by examining vocalic
regions, where the phonetic characterization is less problematic.
The results of the experiment support a parallel treatment of segmental
source characteristics and tone by demonstrating that the production of
laryngeal consonants depends strongly on both word- and phrase-level
prosody. Given that the laryngeal consonants are phonetically similar to
tones by virtue of being produced with the same articulator, one might ask
whether this parallel has a narrow phonetic basis. Studies such as Beckman,
Edwards, and Fletcher (this volume), which report prosodic effects on jaw
movement, indicate that prosody is not especially related to laryngeal
articulations, but can affect the extent and timing of other articulatory
gestures as well. We would suggest that prosody (unlike intonational and
segmental specifications) does not single out any particular articulator, but
instead concerns the overall organization of articulation.
A certain tradition in phonology and phonetics groups prosody and
intonation on the one hand as against segments on the other. Insofar as
segments behave like tones, the grouping is called into question. We would
like instead to argue for a point of view which contrasts structure (the
prosodic pattern) with content (the substantive demands made on the
articulators by the various tones and segments). The structure is represented
by the metrical tree, and the content by the various autosegmental tiers and
their decomposition into distinctive features. This point of view follows from
recent work in metrical and autosegmental phonology, and is explicitly put
forward in Pierrehumbert and Beckman (1988). However, many of its
ramifications remain to be worked out. Further studies are needed to clarify
issues such as the degree of abstractness of surface phonological represen-
tations, the roles of qualitative and quantitative rules in describing allo-
phony, and the phonetic content of distinctive features in continuous speech.
We hope that the present study makes a contribution towards this research
program.

4.2 Background
4.2.1 /h/ and glottal stop
Both /h/ and glottal stop /?/ are produced by a laryngeal gesture. They make
no demands on the vocal-tract configuration, which is therefore determined
by the adjacent segments. They are both less sonorous than vowels, because
both involve a gesture which reduces the strength of voicing. For /h/, the
folds are abducted. /?/ is commonly thought to be produced by adduction
(pressing the folds together), as is described in Ladefoged (1982), but
examination of inverse-filtering results and electroglottographic (EGG) data
raised doubts about the generality of this characterization. We suggest that a
braced configuration of the folds produces irregular voicing even when the
folds are not pressed together (see further discussion below).

4.2.2 Source characterization


The following broad characteristics of the source are crucial to our character-
ization. (1) For vowels, the main excitation during each pitch period occurs
at the point of contact of the vocal folds, because the discontinuity in the
glottal flow injects energy into the vocal tract which can effectively excite the
formants (see Fant 1959). This excitation point is immediately followed by
the "closed phase" of the glottal cycle during which the formants have their
most stable values and narrowest band widths. The "open phase" begins
when the vocal folds begin to open. During this phase, acoustic interaction at
the glottis results in greater damping of the formants as well as shifts in their
location. (2) "Softening" of vocal-fold closure and an increase in the open
quotient is associated with the "breathy" phonation in /h/. The abduction
gesture (or gesture of spreading the vocal folds) associated with this type of
phonation brings about an increase in the frequencies and bandwidths of the
formants, especially F1; an increase in the overall rate of spectral roll-off; an
additional abrupt drop in the magnitude of the second and higher harmonics
of the excitation spectrum; and an increase in the random noise component
of the source, especially during the last half of the open phase. For some
speakers, a breathy or soft voice quality is found during the postnuclear
region of declaratives, as a reflex of phrasal intonation. (3) A "pressed" or
"braced" glottal configuration is used to produce /?/. This is realized
acoustically as period-to-period irregularities in the timing and spectral
content of the glottal excitation pulses. A full glottal stop (with complete
obstruction of airflow at the glottis) is quite unusual. Some speakers use
glottalized voicing, rather than breathy voicing, during the postnuclear
region of declaratives.

4.2.3 Prosody and intonation


We assume that the word and phrase-level prosody is represented by a
hierarchical structure along the lines proposed by Selkirk (1984), Nespor and
Vogel (1986), and Pierrehumbert and Beckman (1988) (see Ladd, this
volume). The structure represents how elements are grouped phonologically,
and what relationships in strength obtain among elements within a given
grouping. Details of these theories will not be important here, provided that
the representation makes available to the phonetic realization rules all
needed information about strength and grouping.
Substantive elements, both tones and segments, are taken to be autoseg-
mentally linked to nodes in the prosodic tree. The tones and segments are
taken to occur on separate tiers, and in this sense have a parallel relationship
to the prosodic structure (see Broe, this volume). In this study, the main focus
is on the relationship of the segments to the prosodic structure. The
relationship of the tones to the prosodic structure enters into the study
indirectly, as a result of the fact that prosodic strength controls the location
of pitch accents in English. In each phrase, the last (or nuclear) pitch accent
falls on the strongest stress in the phrase, and the prenuclear accents fall on
the strongest of the preceding stresses. For this reason, accentuation can be
used as an index of phrasal stress, and we will use the word "accented" to
mean "having sufficient phrasal stress to receive an accent." "Deaccented"
will mean "having insufficient phrasal stress to receive an accent"; in the
present study, all deaccented words examined are postnuclear.
Rules for pronouncing the elements on any autosegmental tier can
reference the prosodic context by examining the position and properties of
the node the segment is linked to. In particular, studies of fundamental

frequency lead us to look for sensitivity to boundaries (Is the segment at a
boundary or not? If so, what type of boundary?) and to the strength of the
nodes above the segment.
Pronunciation rules are also sensitive to the substantive context. For
example, in both Japanese and English, downstep or catathesis applies only
when the tonal sequence contains particular tones. Similarly, /h/ has a less
vocalic pronunciation in a consonantal environment than in a vocalic one.
Such effects, widely reported in the literature on coarticulation and assimila-
tion, are not investigated here. Instead, we control the segmental context in
order to concentrate on the less well understood prosodic effects.
Although separate autosegmental tiers are phonologically independent,
there is a strong potential for phonetic interaction between tiers in the case
examined here, since both tones and laryngeal consonants make demands on
the laryngeal configuration. This potential interaction was not investigated,
since our main concern was the influence of prosodic structure on segmental
allophony. Instead, intonation was carefully controlled to facilitate the
interpretation of the acoustic signal.

4.3 Experimental methods


4.3.1 Guiding considerations
The speech materials and algorithms for phonetic characterization were
designed together in order to achieve maximally interpretable results. Source
studies such as Gobl (1988) usually rely on inverse filtering, a procedure in
which the effects of vocal-tract resonances are estimated and removed from
the signal. The residue is an estimate of the derivative of the flow through the
glottis. This procedure is problematic for both /?/ and /h/. For /?/, it is
difficult to establish the pitch periods to which inverse filtering should be
applied. (Inverse filtering carried out on arbitrary intervals of the signal can
have serious windowing artifacts). Inverse filtering of /h/ is problematic
because of its large open quotient. This can introduce subglottal zeroes,
rendering the all-pole model of the standard procedure inappropriate, and it
can increase the frequency and bandwidth of the first formant to a point
where its location is not evident. The unknown contribution of noise also
makes it difficult to evaluate the spectral fit to the periodic component of the
source. These considerations led us to design materials and algorithms which
would allow us to identify differences in source characteristics without first
estimating the transfer function.
Specific considerations guiding the design of the materials were: (1) F1 is
greater than three times F0. This minimizes the effects of F1 bandwidth and
location on the low-frequency region, allowing it to reflect source character-
istics in a more straightforward fashion. (2) Articulator movement in the
upper vocal tract is minimal during target segments. (3) The consonants
under study are produced by glottal gestures in an open-vowel environment
to facilitate interpretation of changes to the vocal source.

4.3.2 Materials
In the materials for the experiment, the position of /h/ and /?/ relative to
word-level and phrase-level prosodic structure is varied. We lay out the full
experimental design here, although we will only have space to discuss some
subsets of the data which showed particularly striking patterns.
In the materials /h/ is exemplified word-initially and -medially, before both
vowels with main word stress and vowels with less stress:

mahogany tomahawk hogfarmers hawkweed


hibachi Omaha

The original intention was to distinguish between a secondary stress in
tomahawk and an unstressed syllable at the end of Omaha, but it did not
prove possible to make this distinction in the actual productions, so these
cases will be treated together as "reduced" /h/s.
Intervocalic /?/ occurs only word-initially. (/?/ as an allophone of /t/ is
found before syllabic nasals, as in "button," but not before vowels.) So, for
/?/ we have the following sets of words, providing near minimal comparisons
to the word-initial /h/s:

August awkwardness abundance Augustus


augmentation

This set of words was generated using computerized searches of several on-
line dictionaries. The segmental context of the target consonant was designed
to have a high first formant value and minimize formant motion, in order to
simplify the acoustic phonetic analysis. The presence of a nasal in the vicinity
of the target consonant is undesirable, because it can introduce zeroes which
complicate the evaluation of source characteristics. This suboptimal choice
was dictated by the scarcity of English words with medial /h/, even in the
educated vocabulary. We felt that it was critical to use real words rather than
nonsense words, in order to have accurate control over the word-level
prosody and in order to avoid exaggerated pronunciations.
Words in which the target consonant was word-initial were provided with
a /ma/ context on the left by appropriate choice of the preceding word.
Phrases such as the following were used:

Oklahoma August lima abundance figures
plasma augmentation

The position of the target words in the phrasal prosody was also manipu-
lated. The phrasal positions examined were (1) accented without special
focus, (2) accented and under contrast, (3) accented immediately following
an intonational phrase boundary, and (4) postnuclear. In order to maximize
the separation of F0 and F1, the intonation patterns selected to exemplify
these positions all had L tones (leading to low Fo) at the target location. This
allows the source influences on the low-frequency region to be seen more
clearly. The intonation patterns were also designed to display a level (rather
than time-varying) intonational influence on F0, again with a view to
minimizing artifacts. The accented condition used a low-rising question
pattern (L* H H% in the transcription of Pierrehumbert [1980]):
(1) Is he an Oklahoma hogfarmer?

Accent with contrast was created by embedding the "contradiction" pattern
(L* L H%) in the following dialogue:
(2) A: Is it mahogany?
B: No, it's rosewood.
C: It's mahogany!

In the "phrase boundary" condition, a preposed vocative was followed by a
list, as in the following example:
(3) Now Emma, August is hot here, July is hot here, and even June is hot here.

The vocative had a H* L pattern (that is, it had a falling intonation and was
followed by an intermediate phrase boundary rather than a full intonation
break). Non-final list items had a L* H pattern, while the final list item (which
was not examined) had a H* L L% pattern. The juncture of the H* L vocative
pattern with the L* H pattern of the first list item resulted in a low, level F o
configuration at the target consonant. Subjects were instructed to produce
the sentences without a pause after the vocative, and succeeded in all but a
few utterances (which were eliminated from the analysis). No productions
lacked the desired intonational boundary.
In the "postnuclear" condition, the target word followed a word under
contrast:
(4) They're Alabama hogfarmers, not Oklahoma hogfarmers.

In this construction, the second occurrence of the target word was the one
analyzed.


4.3.3 Recording procedures


Since pilot studies showed that subjects had difficulty producing the correct
patterns in a fully randomized order, a blocked randomization was used.
Each block consisted of twelve sentences with the same phrasal intonation
pattern; the order within each block was randomized. The four blocks were
then randomized within each set. Six sets were recorded. The first set was
discarded, since it included a number of dysfluent productions for both
speakers.
The speech was recorded in an anechoic chamber using a 4165 B&K
microphone with a 2231 B&K amplifier, and a Sony PCM-2000 digital audio
tape recorder. The speakers were seated. A distance of 30 cm from the mouth
to the microphone was maintained. This geometry provides intensity sensiti-
vity due to head movement of approximately 0.6 dB per cm change in
microphone-to-mouth distance. Since we have no direct interface between
the digital tape recorder and the computer, the speech was played back and
redigitized at 12 kHz with a sharp-cutoff anti-alias filter set at 5.8 kHz, using
an Ariel DSP-16 rev. G board, yielding 16 bits of precision. The combined
system has essentially constant amplitude and phase response from 20 Hz to
over 5 kHz. The signal-to-noise ratio for the digitized data was greater than
55 dB. Electroglottographic signals were recorded and digitized simulta-
neously on a second channel to provide a check for the acoustically
determined glottal epochs which drive the analysis algorithms.

4.4 Analysis algorithms and their motivation


The most difficult part of the study was developing the phonetic characteriza-
tion, and the one used is not fully successful. Given both the volume of
speech to be processed and the need for replicability, it is desirable to avoid
measurement procedures which involve extensive fitting by eye or other
subjective judgment. Instead, we would argue the need for semi-automatic
procedures, in which the speech is processed using well-defined and tested
algorithms whose results are then scanned for conspicuous errors.

4.4.1 Pitch-synchronous analysis


The acoustic features used in this study are determined by "pitch-synchro-
nous" analyses in which the start of the analysis window is phase-locked on
the time of glottal closure and the duration of the window is determined by
the length of the glottal cycle. Pitch-synchronous analysis is desirable
because it offers the best combination of physical insight and time resolution.
One glottal cycle is the minimum period of interest, since it is difficult to draw

conclusions about the laryngeal configuration from anything less. When the
analysis window is matched to the cycle in both length and phase, the results
are well behaved. In contrast, when analysis windows the length of a cycle are
applied in arbitrary phase to the cycle, extensive signal-processing artifacts
result. Therefore non-pitch-synchronous moving-window analyses are typi-
cally forced to a much longer window length in order to show well-behaved
results. The longer window lengths in turn obscure the speech events, which
can be extremely rapid.
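As a minimal illustration of the windowing involved (a sketch only, not the
procedure actually used; glottal closure times are assumed to have already
been converted to sample indices):

    import numpy as np

    def pitch_synchronous_frames(signal, epoch_samples):
        # One analysis frame per glottal cycle: each window starts at an epoch
        # (instant of glottal closure) and ends at the next epoch, so both the
        # phase and the length of the window are locked to the cycle.
        signal = np.asarray(signal, dtype=float)
        return [signal[start:end]
                for start, end in zip(epoch_samples[:-1], epoch_samples[1:])]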
Pitch-synchronous analysis is feasible for segments which are voiced
throughout because the point of glottal closure can be determined quite
reliably from the acoustic waveform (Talkin 1989). We expected it to be
applicable in our study since the regions of speech to be analyzed were
designed to be entirely voiced. For subject DT, our expectations were
substantially met. For subject MR, strong aspiration and glottalization in
some utterances interfered with the analysis.
Talkin's algorithm for determining epochs, or points of glottal closure,
works as follows: speech, recorded and digitized using a system with known
amplitude and phase characteristics, is amplitude- and phase-corrected and
then inverse-filtered using a matched-order linear predictor to yield a rough
approximation to the derivative of the glottal volume velocity (U'). The
points in the U' signal corresponding to the epochs have the following
relatively stable characteristics: (1) Constant polarity (negative), (2) Highest
amplitude within each cycle, (3) Rapid return to zero after the extremum, (4)
Periodically distributed in time, (5) Limited range of inter-peak intervals, and
(6) Similar size and shape to adjacent peaks. A set of peak candidates is
generated from all local maxima in the U' signal. Dynamic programming is
then used to find the subset of these candidates which globally best matches
the known characteristics of U' at the epoch. The algorithm has been
evaluated using epochs determined independently from simultaneously
recorded EGG signals and was found to be quite accurate and robust. The
only errors that have an impact on the present study occur in strongly
glottalized or aspirated segments.
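A greatly simplified sketch of the candidate-generation step (illustrative
only; the dynamic-programming selection over the six criteria listed above is
not reproduced):

    import numpy as np

    def epoch_candidates(u_prime):
        # Candidate glottal-closure instants: local extrema of the approximate
        # glottal-flow derivative U' that have the expected negative polarity.
        # The full algorithm then uses dynamic programming to choose the subset of
        # candidates that best satisfies the periodicity and similarity criteria.
        u = np.asarray(u_prime, dtype=float)
        i = np.arange(1, len(u) - 1)
        local_min = (u[i] < u[i - 1]) & (u[i] <= u[i + 1])
        return i[local_min & (u[i] < 0.0)]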

4.4.2 Measures used


Pitch-synchronous measurements used in the current study are (1) root mean
square of the speech samples in the first 3 msec following glottal closure,
expressed in dB re unity (RMS), (2) ratio of per-period energy in the
fundamental to that in the second harmonic (HR), and (3) local standard
deviation in period length (PDEV). RMS and HR are applied to /h/, and
PDEV is applied to /?/.
Given the relatively constant intertoken phonetic context, RMS provides
an intertoken measure closely related to the strength of the glottal excitation
in corresponding segments. The integration time for RMS was held constant
to minimize interactions between Fo and formant band widths. RMS was not
a useful measure for /?/, since even a strongly articulated /?/ may have glottal
excitation as strong as the neighboring vowels; the excitation is merely more
irregular. RMS is relatively insensitive to epoch errors in /h/s, since epoch-
location uncertainty tended to occur when energy was more evenly distri-
buted through the period, which in turn renders the measurement point for
RMS less critical.
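A rough implementation of the RMS measure as just described (a sketch only;
the signal is assumed to be a sample array, fs the sampling rate in Hz, and
the epochs sample indices):

    import numpy as np

    def rms_db(signal, epoch, fs, window_ms=3.0):
        # Root mean square of the speech samples in the first 3 msec following the
        # glottal closure at `epoch`, expressed in dB re unity.
        n = int(round(window_ms * 1e-3 * fs))
        frame = np.asarray(signal[epoch:epoch + n], dtype=float)
        return 20.0 * np.log10(np.sqrt(np.mean(frame ** 2)))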
HR is computed as the ratio (expressed in dB) of the magnitudes of the
first and second harmonics of an exact DFT (Discrete Fourier Transform)
computed over one glottal period. The period duration is from the current
epoch to the next, but the start time of the period is taken to be at the zero
crossing immediately preceding the current epoch. This minimizes periodicity
violations introduced by cycle-to-cycle excitation variations, since the
adjusted period end will usually also fall in a low-amplitude (near zero)
region of the cycle. HR is a relevant measure because the increase in open
quotient of the glottal cycle and the lengthening of the time required to
accomplish glottal closure associated with vocal-fold abduction tends to
increase the power in the fundamental relative to the higher harmonics. This
increase in the average and minimum glottal opening also changes the vocal-
tract tuning and sub- to superglottal coupling. The net acoustic effect is to
introduce a zero in the spectrum in the vicinity of F1 and to increase both the
frequencies and bandwidths of the formants, especially F1. Since our speech
material was designed to keep F1 above the second harmonic, these effects all
conspire to increase HR with abduction. The reader is referred to Fant and
Lin (1988) for supporting mathematical derivations. Figure 4.3 illustrates the
behavior of the HR over the target intervals which include the two /h/s
shown in figures 4.1 and 4.2.
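The HR computation can be sketched along the same lines (again illustrative;
epochs are sample indices, and the backward search for the preceding zero
crossing is simplified):

    import numpy as np

    def hr_db(signal, epoch, next_epoch):
        # Ratio (in dB) of the magnitudes of the first and second harmonics of an
        # exact DFT over one glottal period: the period length runs from the current
        # epoch to the next, but its start is moved back to the zero crossing
        # immediately preceding the current epoch.
        signal = np.asarray(signal, dtype=float)
        start = epoch
        while start > 0 and signal[start - 1] * signal[start] > 0.0:
            start -= 1
        period = signal[start:start + (next_epoch - epoch)]
        spectrum = np.fft.rfft(period)   # bin k is the k-th harmonic of the period
        return 20.0 * np.log10(np.abs(spectrum[1]) / np.abs(spectrum[2]))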
Fant and Lin's derivations do not attempt to model the contribution of
aspiration to the spectral shape, and the relation of abduction to HR indeed
becomes nonmonotonic at the point at which aspiration noise becomes the
dominant power source in the second-harmonic region. One of the subjects,
DT, has sufficiently little aspiration during the /h/s that this nonmonotoni-
city did not enter substantially into the analysis, but for subject MR it was a
major factor, and as a result RMS shows much clearer patterns. HR is also
sensitive to serious epoch errors, rendering it inapplicable to glottalized
regions.
PDEV is the standard deviation of the seven glottal period lengths
surrounding the current epoch. This measure represents an effort to quantify
the irregular periodicity which turned out to be the chief hallmark of /?/. It
was adopted after detailed examination of the productions failed to support

"n
34 dB

^^Y /v Y r ^ / V X/ V u ^^

Figure 4.3 HR and RMS measured for each glottal period throughout the target intervals of the
utterances introduced in figures 4.1 and 4.2. Note that the difference of ~34 dB between the HR
in the /h/ and following vowel for the well-articulated case (top) is much greater than the ~2 dB
observed in the lenited case (bottom). The RMS values discussed in the text are based on the
(linear) RMS displayed in this figure

the common understanding that /?/ is produced by partial or complete
adduction of the vocal folds. This view predicts that the spectrum during
glottalization should display a lower HR and a less steep overall spectral roll-
off than are found in typical vowels. However, examination of the EGG
signal in conjunction with inverse-filtering results showed that many tokens
had a large, rather than a small, open quotient and even showed aspiration
noise during the most closed phase, indicating that the closure was incom-
plete. The predicted spectral hallmarks were not found reliably, even in
utterances in which glottalization was conspicuously indicated by irregular
periodicity. We surmise that DT in particular produces /?/ by bracing or
tensing partially abducted vocal folds in a way which tends to create irregular
vibration without a fully closed phase.
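Returning to the definition of PDEV itself, the computation is straightforward; a minimal sketch follows (the centering of the seven-period window on the current epoch is our guess at a detail the text leaves unspecified):

    import numpy as np

    def pdev(epochs, fs, i, n_periods=7):
        """Standard deviation, in seconds, of the seven glottal period lengths
        surrounding the i-th epoch."""
        periods = np.diff(epochs) / float(fs)          # period lengths in seconds
        lo = max(0, i - n_periods // 2)
        return float(np.std(periods[lo:lo + n_periods]))
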

4.4.3 Validation using synthetic signals


In order to validate the measures, potential artifactual variation due to F0-F1
interactions was assessed. A six-formant cascade synthesizer excited by a

Liljencrants-Fant voice source run at 12 kHz sampling frequency was used
to generate synthetic voiced-speech-like sounds. These signals were then
processed using the procedures outlined above. F1 and F0 were orthogonally
varied over the ranges observed in the natural speech tokens. F1 bandwidth
was held constant at 85 Hz while its frequency took on values of 500 Hz, 700
Hz and 800 Hz. For each of these settings the source fundamental was swept
from 75 Hz to 150 Hz during one second with the open quotient and leakage
time held constant. The bandwidths and frequencies of the higher formants
were held constant at nominal 17 cm, neutral vocal-tract values. Note that
the extremes in F1 and F0 were not simultaneously encountered in the natural
data, so that this test should yield conservative bounds on the artifactual
effects.
As expected, PDEV did not vary significantly throughout the test signal.
The range of variation in HR for these test signals was less than 3 dB. The
maximum peak-to-valley excursion in RMS due to F0 harmonics crossing the
formants was 2 dB with a change in F0 from 112 Hz to 126 Hz and F1 at 500
Hz. This is small compared to RMS variations observed in the natural speech
tokens under study.

4.4.4 Analysis of the data


Time points were established for the /m/ release, the first vowel center, the
center of the /h/ or glottal stop, the center of the following vowel, and the
point of oral constriction for the consonant. This was done by inspection of
the waveform and broad-band spectrogram, and by listening to parts of the
signal.
The RMS, HR and PDEV values for the vowel were taken to be the values
at the glottal epoch nearest to the measured vowel center.
RMS was used to estimate the /h/ duration, since it was almost invariably
lower at the center of the /h/ than during the following vowel. The /h/
interval was defined as the region around the minimum RMS observed for
the /h/ during which RMS did not exceed a locally determined threshold.
Taking RMS(C) as the minimum RMS observed and RMS(V2) as the
maximum RMS in the following vowel, the threshold was defined as
RMS(C) + 0.25*[RMS(V2) - RMS(C)]. The measure was somewhat
conservative compared to a manual segmentation, and was designed to avoid
spurious inclusions of the preceding schwa when it was extremely lenited.
The consonantal value for RMS was taken to be the RMS minimum, and its
HR value was taken to be the maximum HR during the computed /h/
interval.
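The segmentation rule can be stated compactly: starting from the RMS minimum, the /h/ interval is grown outward as long as RMS stays at or below the threshold RMS(C) + 0.25*[RMS(V2) - RMS(C)]. The sketch below is our reconstruction; the five-epoch search window around the hand-marked /h/ center is an assumption.

    import numpy as np

    def h_interval(rms, h_center, v2_start, v2_stop):
        """Return (left, right) epoch indices bounding the /h/ interval, given
        per-epoch RMS values, the epoch nearest the marked /h/ center, and a
        range of epochs covering the following vowel."""
        lo = max(0, h_center - 5)
        c = lo + int(np.argmin(rms[lo:h_center + 6]))  # RMS minimum near the /h/ center
        rms_c = rms[c]
        rms_v2 = float(np.max(rms[v2_start:v2_stop]))  # maximum RMS in the following vowel
        threshold = rms_c + 0.25 * (rms_v2 - rms_c)
        left = right = c
        while left > 0 and rms[left - 1] <= threshold:
            left -= 1
        while right < len(rms) - 1 and rms[right + 1] <= threshold:
            right += 1
        return left, right
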
The PDEV value for the /?/ was taken at the estimated center, since the

intensity behaviour of the /?/s did not support the segmentation scheme
developed for the /h/s. Durations for /?/ were not estimated.

4.5 Results
After mentioning some characteristics of the two subjects' speech, we first
present results for /h/ and then make some comparisons to /?/.

4.5.1 Speaker characteristics


There were some obvious differences in the speech patterns of the two
subjects. When these differences are taken into account, it is possible to
discern strong underlying parallels in the effects of prosody on /h/ and /?/
production.
MR had vocal fry in postnuclear position. This was noticeable both in the
deaccented condition and at the end of the preposed vocative Now Emma. He
had strong aspiration in /h/, leading to failure of the epoch finding in many
cases and also to nonmonotonic behavior of the HR measure. As a result, the
clearest patterns are found in RMS (which is less sensitive than HR to
epoch errors) and in duration. In general, MR had clear articulation of
consonants even in weak positions.
DT had breathiness rather than fry in postnuclear position. Aspiration in
/h/ was relatively weak, so that the epoch finder and the HR measure were
well behaved. Consonants in weak positions were strongly reduced.

4.5.2 Effects of word prosody and phrasal stress on /h/


Both the position in the word and the phrasal stress were found to affect how
/h/ was pronounced. In order to clarify the interpretation of the data, we
would first like to present some schematic plots. Figure 4.4 shows a blank
plot of RMS in the /h/ against RMS in the vowel. Since the /h/ articulation
decreases the RMS, more /h/-like /h/s are predicted to fall towards the left of
the plot while more vowel-like /h/s fall towards the right of the plot.
Similarly, more /h/-like vowels are predicted to fall towards the bottom of
the plot, while more vowel-like vowels should fall towards the top of the plot.
The line y = x, shown as a dashed diagonal, represents the case where the
/h/ and the vowel had the same measured RMS. That is, as far as RMS is
concerned, there was no local contrast between the /h/ and the vowel. Note
that this case, the case of complete neutralization, is represented by a wide
range of values, so that the designation "complete lenition" does not actually
fix the articulatory gesture. In general, we do not expect to find /h/s which are

Figure 4.4 A schema for interpreting the relation of RMS in the /h/ to RMS in the following
vowel. Greater values of RMS correspond to more vowel-like articulations, and lesser values
correspond to more /h/-like articulations. The line y = x represents the case in which the /h/ and
the following vowel do not contrast in terms of RMS. Distance perpendicular to this line
represents the degree of contrast between the /h/ and the vowel. Distance parallel to this line
cannot be explained by gestural magnitude, but is instead attributed to a background effect on
both the /h/ and the vowel. The area below and to the right of y = x is predicted to be empty


Figure 4.5 A schema for interpreting the relation of HR in the /h/ to HR in the following vowel.
It has the same structure as in figure 4.4, except that greater rather than lesser values of the
parameter represent more /h/-like articulations

more vocalic than the following vowel, so that the lower-right half is
expected to be empty. In the upper-left half, the distance from the diagonal
describes the degree of contrast between the /h/ and the vowel. Situations in
which both the /h/ and the vowel are more fully produced would exhibit
greater contrast, and would therefore fall further from the diagonal. Note
again that a given magnitude of contrast can correspond to many different
values for the /h/ and vowel RMS.
Figure 4.5 shows a corresponding schema for HR relations. The structure
is the same except that higher, rather than lower, x and y values correspond
to more /h/-like articulations.
In view of this discussion, RMS and HR data will be interpreted with
respect to the diagonal directions of each plot. Distance perpendicular to the
y = x line (shown as a dotted line in each plot) will be related to the strength
or magnitude of the CV gesture. Location parallel to this line, on the other
hand, is not related to the strength of the gesture, but rather to a background
effect on which the entire gesture rides. One of the issues in interpreting the
data is the linguistic source of the background effects.
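The decomposition amounts to a 45-degree rotation of the (/h/, vowel) plane; a minimal sketch is given below (the function name and sign convention are ours; note that for RMS a positive contrast means the vowel exceeds the /h/, whereas for HR the /h/-like direction is reversed).

    import numpy as np

    def decompose(x_h, y_v):
        """Rotate an (/h/, vowel) measurement pair by 45 degrees: the first
        component is the signed distance perpendicular to y = x (the local
        contrast), the second the position parallel to y = x (the background
        level shared by consonant and vowel)."""
        contrast = (y_v - x_h) / np.sqrt(2.0)
        background = (y_v + x_h) / np.sqrt(2.0)
        return contrast, background
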
Figures 4.6 and 4.7 compare the RMS relations in word-initial stressed /h/,
when accented in questions and when deaccented. The As are farther from
the y = x line than the Ds, indicating that the magnitude of the gesture is
greater when /h/ begins an accented syllable. For subject DT, the two clouds
of points can be completely separated by a line parallel to y = x. Subject MR
shows a greater range of variation in the D case, with the most carefully
articulated Ds overlapping the gestural magnitude of the As.
Figures 4.8 and 4.9 make the same comparison for word-medial /h/
preceding a weakly stressed or reduced vowel. These plots have a conspi-
cuously different structure from figures 4.6 and 4.7. First, the As are above
and to the right of the Ds, instead of above and to the left. Second, the As
and Ds are not as well separated by distance from the y = x line; whereas this
separation was clear for word-initial /h/s, there is at most a tendency in this
direction for the medial reduced /h/s.
The HR data shown for DT in figures 4.10 and 4.11 further exemplifies this
contrast. Word-initial /h/ shows a large effect of accentuation on gestural
magnitude. For medial reduced /h/ there is only a small effect on magnitude;
however, the As and Ds are still separated due to the lower HR values during
the vowel for the As. HR data is not presented for MR because strong
aspiration rendered the measure a nonmonotonic function of abduction.
Since the effect of accentuation differs depending on position in the word,
we can see that both phrasal prosody and word prosody contribute to
determining how segments are pronounced. In decomposing the effects, let us
first consider the contrasts in gestural magnitude, that is perpendicular to the
x = y line. In the case of hawkweed and hogfarmer, the difference between As

A accented in questions; D deaccented; lines: y = x and y = -x + b

Figure 4.6 RMS in /h/ of hawkweed and hogfarmer plotted against RMS in the following vowel,
when the words are accented in questions (the As) and deaccented (the Ds). The subject is DT

A accented in questions; D deaccented; lines: y = x and y = -x + b

Figure 4.7 RMS in /h/ of hawkweed and hogfarmer plotted against RMS in the following vowel,
when the words are accented in questions (the As) and deaccented (the Ds). The subject is MR


A accented in questions; D deaccented; lines: y = x and y = -x + b

Figure 4.8 RMS in /h/ of Omaha and tomahawk plotted against RMS in the following vowel,
when the words are accented in questions and deaccented. The subject is DT

A accented in questions; D deaccented; lines: y = x and y = -x + b

Figure 4.9 RMS in /h/ of Omaha and tomahawk plotted against RMS in the following vowel,
when the words are accented in questions and deaccented. The subject is MR


A accented in questions; D deaccented; lines: y = x and y = -x + b

Figure 4.10 HR in /h/ of hawkweed and hogfarmer plotted against HR in the following vowel,
when the words are accented in questions and deaccented. The subject is DT

A accented in questions; D deaccented; lines: y = x and y = -x + b
Figure 4.11 HR in /h/ of Omaha and tomahawk plotted against HR in the following vowel,
when the words are accented in questions and deaccented. The subject is DT

and Ds is predominately in this direction. The Omaha and tomahawk As and
Ds exhibit a small difference in this direction, though this is not the most
salient feature of the plot. From this we deduce that accentuation increases
gestural magnitude, making vowels more vocalic and consonants more
consonantal. The extent of the effect depends on location with respect to the
word prosody; the main stressed word-initial syllable inherits the strength of
accentuation, so to speak, more than the medial reduced syllable does. At the
same time we note that in tomahawk and Omaha, the As are shifted relative
to the Ds parallel to the y = x line. That is, both the consonant and the vowel
are displaced in the vocalic direction, as if the more vocalic articulation of the
main stressed vowel continued into subsequent syllables. The data for
tomahawk and Omaha might thus be explicated in terms of the superposition
of a local effect on the magnitude of the CV gesture and a larger-scale effect
which makes an entire region beginning with the accented vowel more
vocalic.
The present data do not permit a detailed analysis of what region is
affected by the background shift in a vocalic direction. Note that the effect of
a nuclear accent has abated by the time the deaccented postnuclear target
words are reached, since these show a more consonantal background effect
than the accented words do. In principle, data on the word mahogany would
provide critical information about where the effect begins, indicating, for
example, whether the shift in the vocalic direction starts at the beginning of
the first vowel in an accented word or at the beginning of the stressed vowel
in a foot carrying an accent. Unfortunately, the mahogany data showed
considerable scatter and we are not prepared at this point to make strong
claims about their characterization.

4.5.3 The effect of the phrase boundary on /h/


It is well known that syllables are lengthened before intonational boundaries.
Phrase-final voiced consonants are also typically devoiced. An interesting
feature of our data is that it also demonstrated lengthening and suppression
of voicing after an intonational boundary, even in the absence of a pause.
Figures 4.12 and 4.13 compare duration and RMS in word-initial /h/ after a
phrase boundary (that is, following Now Emma with word-initial /h/), in
accented but phrase-medial position, and in deaccented (also phrase-medial)
position. In both plots, the "%" points are below and to the right of the A
and D points, indicating a combination of greater length and less strong
voicing.
DT shows a strong difference between A and D points, with Ds being
shorter and more voiced than As. MR shows at most a slight difference
between As and Ds, reflecting his generally small degree of lenition of


% phrase boundary; A accented in questions; D deaccented

Figure 4.12 Duration vs. RMS in /h/ for hawkweed and hogfarmer when accented at a phrase
boundary, accented but phrase-medial in questions, and deaccented. The subject is DT

% phrase boundary; A accented in questions; D deaccented

Figure 4.13 Duration vs. RMS in /h/ for hawkweed and hogfarmer when accented at a phrase
boundary, accented but phrase-medial in questions, and deaccented. The subject is MR

consonants in weak positions. For MR, the effect of the phrase boundary is
thus a more major one than the effect of accentual status.
A subset of the data set, the sentences involving tomahawk, makes it
possible to extend the result to a nonlaryngeal consonant. The aspiration
duration for the /t/ was measured in the four prosodic positions. The results
are displayed in figures 4.14 and 4.15. The lines represent the total range of
observations for each condition, and each individual datum is indicated with
a tick. For DT, occurring at a phrase boundary approximately doubled the
aspiration length, and there was no overlap between the phrase-boundary
condition and the other conditions. For MR, the effect was somewhat
smaller, but the overlap can still be attributed to only one point, the smallest
value for the phrase-boundary condition. For both subjects, a smaller effect
of accentuation status can also be noted.
The effect of the phrase boundary on gestural magnitude can be investi-
gated by plotting the RMS in the /h/ against RMS in the vowel, for the word-
initial accented /h/ in phrase-initial and phrase-medial position. This com-
parison, shown in figures 4.16 and 4.17, indicates that the gestural magnitude
was greater in phrase-initial position. The main factor was lower RMS (that
is, a more extreme consonantal outcome) for the /h/ in phrase-initial
position; the vowels differed slightly, if at all. Returning to the decomposition
in terms of gestural-magnitude effects and background effects, we would
suggest that the phrase boundary triggers both a background shift in a
consonantal direction (already observed in preboundary position in the
"deaccented" cases) and an increase in gestural magnitude. The effect on
gestural magnitude must be either immediately local to the boundary, or
related to accentual strength, if deaccented words in the middle of the
postnuclear region are to be exempted as observed.
It is interesting to compare our results on phrase-initial articulation with
Beckman, Edwards, and Fletcher's results (this volume) on phrase-final
articulation. Their work has shown that stress-related lengthening is asso-
ciated with an increase in the extent of jaw movement while phrase-final
lengthening is not, and they interpret this result as indicating that stress
triggers an underlying change in gestural magnitude while phrase-final
lengthening involves a change in local tempo but not gestural magnitude.
Given that our data do show an effect of phrase-initial position on gestural
magnitude, their interpretation leads to the conclusion that phrase-initial and
phrase-final effects are different in nature.
However, we feel that the possibility of a unified treatment of phrase-
peripheral effects remains open, pending the resolution of several questions
about the interpretation of the two experiments. First, it is possible that the
gestural-magnitude effect observed in our data is an artifact of the design of
the materials, since the Now Emma sentences may have seemed more unusual

Conditions (top to bottom): questions, deaccented, contradiction, phrase boundary; x-axis: duration in seconds

Figure 4.14 Voice-onset time in /t/ of tomahawk for all four prosodic positions; subject DT.
Ticks indicate individual data points

Conditions (top to bottom): questions, deaccented, contradiction, phrase boundary; x-axis: duration in seconds

Figure 4.15 Voice-onset time in /t/ of tomahawk for all four prosodic positions; subject MR.
Ticks indicate individual data points


I phrase-initial; M phrase-medial; lines: y = x and y = -x + b

Figure 4.16 RMS of /h/ and final vowel for subject DT in accented phrase-initial (I) and phrase-
medial (M) question contexts

I phrase-initial; M phrase-medial; lines: y = x and y = -x + b
Figure 4.17 RMS of /h/ and final vowel for subject MR in accented phrase-initial (I) and
phrase-medial (M) question contexts

or semantically striking than those where the target words were phrase-
internal. If this is the case, semantically matched sentences would show a
shift towards the consonantal direction in the vowel following the consonant,
as well as the consonant itself. Second, it is possible that an effect on intended
magnitude is being obscured in Beckman, Edwards, and Fletcher's (this
volume) data by the nonlinear physical process whose outcome serves as an
index. Possibly, jaw movement was impeded after the lips contacted for the
labial consonant in their materials, so that force exerted after this point did
not result in statistically significant jaw displacement. If this is the case,
measurements of lip pressure or EMG (electromyography) might yield
results more in line with ours. Third, it is possible that nonlinearities in the
vocal-fold mechanics translate what is basically a tempo effect in phrase-
initial position into a difference in the extent of the acoustic contrast. That is,
it is possible that the vocal folds are no more spread for the phrase-initial /h/s
than for otherwise comparable /h/s elsewhere but that maintaining the
spread position for longer is in itself sufficient to result in greater suppression
of the oscillation. This possibility could be evaluated using high-speed optical
recording of the laryngeal movements.

4.5.4 Observations about glottalization


Although all /h/s in the study had some noticeable manifestation in the
waveform, this was not the case for /?/. In some prosodic positions,
glottalization for /?/ appeared quite reliably, whereas in others it did not.
One might view /?/ insertion as an optional rule, whose frequency of
application is determined in part by prosodic position. Alternatively, one
might take the view that the /?/ is always present, but that due to the
nonlinear mechanics involved in vocal-fold vibration, the characteristic
irregularity only becomes apparent when the strength and/or duration of the
gesture is sufficiently great. That is, the underlying control is gradient, just as
for /h/, but a nonlinear physical system maps the gradient control signal into
two classes of outcomes. From either viewpoint, an effect of prosodic
structure on segmental production can be demonstrated; the level of rep-
resentation for the effect cannot be clarified without further research on
vocal-fold control and mechanics.
Table 4.1 summarizes the percentage of cases in which noticeable glottali-
zation for /?/ appeared. The columns represent phrasal prosody and the rows
indicate whether the target syllable is stressed or reduced in its word. The
most striking feature of the table is that the reduced, non-phrase-boundary
entries are much lower than the rest, for both subjects. That is, although
stressed syllables had a high percentage of noticeable /?/s in all positions,
reduced syllables had a low percentage except at a phrase boundary. This
Table 4.1 Percentage of tokens with noticeable /?/

Subject   Stress      %-boundary   Accented   Deaccented
MR        stressed    100          85         100
          reduced     93           33         44
DT        stressed    90           95         80
          reduced     97           17         27

A accented in questions; D deaccented; % phrase boundary

Figure 4.18 PDEV for /?/ beginning August and awkwardness plotted against PDEV in the
following vowel, for subject DT

result shows that word-level and phrase-level prosody interact to determine
the likelihood of observed glottalization. It does not provide any information
about the degree of glottalization in cases where it was equally likely. In
figure 4.18, PDEV is used to make this comparison for subject DT, for
syllables with word stress. Only utterances in which glottalization was
observed are included. In the deaccented tokens, PDEV during /?/ was
overall much lower than in the accented phrase-medial tokens or the phrase-
initial tokens.


4.6 Discussion and conclusions


The experiment showed that the pronunciation of both /h/ and /?/ depends
on word- and phrase-level prosody. We decompose these effects into effects
on gestural magnitude and background effects. An overall shift in a vocalic
direction was associated with accent, beginning at the rhyme of the accented
syllable and affecting even later syllables in the same word. The phrase
boundary was found to shift articulation on both sides in a more consonantal
direction; related phrase-initial lengthening of the consonant, analogous to
the phrase-final lengthening observed by many other researchers, was also
observed. Superimposed on the background effects we observe effects on
gestural magnitude related to the strength of a segment's prosodic position in
the word and in the phrase. Accent affected the gestural magnitude both for
main stressed and reduced syllables within the accented word, but it affected
the stressed syllables more. There is also some evidence for a phrase-
boundary effect on gestural magnitude, although further investigation is
called for.
The interaction of effects on gestural magnitude and background effects is
highly reminiscent of the interactions between local and large-scale effects
which have proved critical for modeling the manifestations of tone in F0
contours. The effects on gestural magnitude for /h/ and /?/ are broadly
analogous to the computations involved in mapping tones into F0 target
levels or excursions, while the background effects are reminiscent of effects
such as declination and final lowering which affect the F0 values achieved for
tones in an entire region. Thus, the experimental results support a parallel
treatment of segments and tones in terms of their phonological represen-
tation and the phonetic realization rules which interpret them. They argue
against the view which segregates tone and intonation into a "suprasegmen-
tal" component, a view which still underlies current speech technology (Lea
1980; Allen, Hunnicutt, and Klatt 1987; Waibel 1988). This view provides for
prosodic effects on F0, intensity, and duration, but does not support the
representations or rules needed to describe prosodic effects on segmental
allophony of the kind observed here.
Our observations about /h/ and /?/ production broadly support the ideas
about phonetic representation expressed in Browman and Goldstein (1990)
and Saltzman and Munhall (1989), as against the approach of the Internatio-
nal Phonetic Association or The Sound Pattern of English (Chomsky and
Halle 1968). Gradient or n-ary features on individual segments would not
well represent the pattern of lenition observed here; for example, equally
lenited /h/s can be pronounced differently in different positions, and equally
voiced /h/s can represent different degrees of lenition in different positions.
An intrinsically quantitative representation, oriented towards critical aspects
of articulation, appears to offer more insight than the traditional fine
phonetic transcription. At the same time, the present results draw attention
to the need for work on articulatory representation to include a proper
treatment of hierarchical structure and its manifestations. A quantitative
articulatory description will still fail to capture the multidimensional
character of lenition if it handles only local phonological and phonetic
properties.

Comments on chapter 4
OSAMU FUJIMURA
First of all, I must express my appreciation of the careful preparation by
Pierrehumbert and Talkin of the experimental material. Subtle phonetic
interactions among various factors such as F0, F1, and vocal-tract constric-
tion are carefully measured and assessed using state-of-the-art signal-
processing technology. This makes it possible to study subtle but critical
effects of prosodic factors on segmental characteristics with respect to vocal-
source control. In this experiment, every technical detail counts, from the
way the signals are recorded to the simultaneous control of several phono-
logical conditions.
Effects of suprasegmental factors on segmental properties, particularly of
syntagmatic or configurational factors, have been studied by relatively few
investigators beyond qualitative or impressionistic description of allophonic
variation. It is difficult to prepare systematically controlled paradigms of
contrasting materials, partly because nonsense materials do not serve the
purpose in this type of work, and linguistic interactions amongst factors to be
controlled prohibit an orthogonal material design. Nevertheless, this work
exemplifies what can be done, and why it is worth the effort. It is perhaps a
typical situation of laboratory phonology.
The general point this study attempts to demonstrate is that so-called
"segmental" aspects of speech interact strongly with "prosodic" or "supra-
segmental" factors. Paradoxically, based on the traditional concept of
segment, one might call this situation "segmental effects of suprasegmental
conditions." As Pierrehumbert and Talkin note, such basic concepts are
being challenged. Tones as abstract entities in phonological representations
manifest themselves in different fundamental frequencies. Likewise, pho-
nemes or distinctive-feature values in lexical representations are realized with
different phonetic features, such as voiced and voiceless or with and without
articulatory closure, depending on the configuration (e.g. syllable- or word-
initial vs. final) and accentual situations in which the phoneme occurs. The
same phonetic segments, to the extent that they can be identified as such, may
correspond to different phonological units. Pierrehumbert and Talkin,
clarifying the line recently proposed by Pierrehumbert and Beckman (1988),
use the terms 'structure' and 'content' to describe the general framework of
phonological/phonetic representations of speech. The structure, in my inter-
pretation (Fujimura 1987, 1990), is a syntagmatic frame (the skeleton) which
Jakobson, Fant, and Halle (1952) roughly characterized by configurational
features. The content (the melody in each autosegmental tier) was described
in more detail in distinctive (inherent and prosodic) features in the same
classical treatise. Among different aspects of articulatory features, Pierre-
humbert and Talkin's paper deals with voice-source features, in particular
with glottal consonants functioning as the initial margin of syllables in
English.
What is called a glottal stop is not very well understood and varies greatly.
The authors interpret acoustic EGG signal characteristics to be due to braced
configurations of the vocal folds. What they mean by "braced" is not clear to
me. They "surmise" that the subject DT in particular produces the glottal
stop by bracing or tensing partially abducted vocal folds in a way that tends
to create irregular vibration without a fully closed phase. Given the current
progress of our understanding of the vocal-fold vibration mechanism and its
physiological control, and the existence of advanced techniques for direct
and very detailed optical observation of the vocal folds, such qualitative and
largely intuitive interpretation will, I hope, be replaced by solid knowledge in
the near future. Recent developments in the technique of high-speed optical
recording of laryngeal movement, as reported by Kiritani and his co-workers
at the University of Tokyo (RILP), seem to promise a rapid growth of our
knowledge in this area.
A preliminary study using direct optical observation with a fiberscope
(Fujimura and Sawashima 1971) revealed that variants of American English
/t/ were accompanied by characteristic gestures of the false vocal folds.
Physiologically, laryngeal control involves many degrees of freedom, and
EGG observations, much less acoustic signals, reveal little information about
specific gestural characteristics. What is considered in the sparse distinctive-
feature literature about voice-source features tends to be grossly impression-
istic or even simply conjectural with respect to the production-control
mechanisms. The present paper raises good questions and shows the right
way to carry out an instrumental study of this complex issue. Particularly in
this context, Pierrehumbert and Talkin's detailed discussion of their speech
material is very timely and most welcome, along with the inherent value of
the very careful measurement and analysis of the acoustic-signal characteris-
tics. This combination of phonological (particularly intonation-theoretical)
competence and experimental-phonetic (particularly speech-signal engineer-

ing) expertise is a necessary condition for this type of study, even just for
preparing effective utterances for examination. Incidentally, it was in these
carefully selected sample sentences that the authors recently made the
striking discovery that a low-tone combination of voice-source characteris-
tics gives rise to a distinctly different spectral envelope (personal
communication).
One of the points of this paper that particularly attracts my attention is the
apparently basic difference between the two speakers examined. In recent
years, I have been impressed by observations that strong interspeaker
variation exists even in what we may consider to be rather fundamental
control strategies of speech production (see Vaissiere 1988 on velum move-
ment strategies, for example). One may hypothesize that different production
strategies result in the same acoustic or auditory consequence. However, I do
not believe this principle explains the phenomena very well, even though in
some cases it is an important principle to consider. In the case of the "glottal
stop," it represents a consonantal function in the syllable onset in opposition
to /h/, from a distributional point of view. Phonetically (including acousti-
cally), however, it seems that the only way to characterize this consonantal
element of the onset (initial demisyllable) is that it lacks any truly consonan-
tal features. This is an interesting issue theoretically in view of some of the
ideas related to nonlinear phonology, particularly with respect to underspeci-
fication. The phonetic implementation of unspecified features is not neces-
sarily empty, being determined by coarticulation principles only, but can
have some ad hoc processes that may vary from speaker to speaker to a
large extent. In order to complete our description of linguistic specifica-
tion for sound features, this issue needs much more attention and serious
study.
In many ways this experimental work is the first of its kind, and it may
open up, together with some other pioneering work of similar nature, a new
epoch in speech research. I could not agree more with Pierrehumbert and
Talkin's conclusion about the need for work on articulatory representation
to include a proper treatment of hierarchical structure and its manifestations.
Much attention should be directed to their assertion that a quantitative
articulatory description will still fail to capture the multidimensional char-
acter of lenition if it handles only the local phonological and phonetic
properties. But the issue raised here is probably not limited to the notion of
lenition.

Comments on chapters 3 and 4
LOUIS GOLDSTEIN
Introduction
The papers in this section, by Pierrehumbert and Talkin and by Beckman,
Edwards, and Fletcher, can both be seen as addressing the same fundamental
question: namely, how are the spatiotemporal characteristics of speech
gestures modulated (i.e., stretched and squeezed) in different prosodic
environments?* One paper examines a glottal gesture (laryngeal abduction/
adduction for /h/- Pierrehumbert and Talkin), the other an oral gesture
(labial closure/opening for /p/- Beckman, Edwards, and Fletcher). The
results are similar for the different classes of gestures, even though differences
in methods (acoustic analysis vs. articulator tracking) and materials make a
point-by-point comparison impossible. In general, the studies find that
phrasal accent increases the magnitude of a gesture, in both space and
time, while phrasal boundaries increase the duration of a gesture without
a concomitant spatial change. This striking similarity across gestures
that employ anatomically distinct (and physiologically very different) struc-
tures argues that general principles are at work here. This similarity (and
its implications) are the focus of my remarks. I will first present addi-
tional evidence showing the generality of prosodic effects across gesture
type. Second, I will examine the oral gestures in more detail, asking
how the prosodic effects are distributed across the multiple articula-
tors whose motions contribute to an oral constriction. Finally, I will ask
whether we yet have an adequate understanding of the general principles
involved.

Generality of prosodic effects across gesture type


The papers under discussion show systematic effects of phrasal prosodic
variables that cut across gesture type (oral/laryngeal). This extends the
parallelism between oral and laryngeal gestures that was demonstrated for
word stress by Munhall, Ostry, and Parush (1985). In their study, talkers
produced the utterance /kakak/, with stress on either the first or second
syllable. Tongue-lowering and laryngeal-adduction gestures for the inter-
vocalic /k/ were measured using pulsed ultrasound. The same effects were
observed for the two gesture types: words with second-syllable stress showed
larger gestures with longer durations. In addition, their analyses showed that

*This work was supported by NSF grant BNS 8820099 and NIH grants HD-01994 and HD-
13617 to Haskins Laboratories.

the two gesture types had the same velocity profile, a mathematical charac-
terization of the shape of the curve showing how velocity varies over time in the
course of the gestures. On the basis of this identity of velocity profiles, the
authors conclude that "the tongue and vocal folds share common principles
of control" (1985: 468).
Glottal gestures involving laryngeal abduction and adduction may occur
with a coordinated oral-consonant gesture, as in the case of the /k/s analyzed
by Munhall, Ostry, and Parush, or without such an oral gesture, as in the /h/s
analyzed by Pierrehumbert and Talkin. It would be interesting to investigate
whether the prosodic influences on laryngeal gestures show the same patterns
in these two cases. There is at least one reason to think that they might
behave differently, due to the differing aerodynamic and acoustic conse-
quences. Kingston (1990) has argued that the temporal coordination of
laryngeal and oral gestures could be more tightly constrained when the oral
gesture is an obstruent than when it is a sonorant, because there are critical
aerodynamic consequences of the glottal gesture in obstruents (allowing
generation of release bursts and frication). By the same logic, we might
expect the size (in time and space) of a laryngeal gesture to be relatively
more constrained when it is coordinated with an oral-obstruent gesture
than when it is not (as in /h/). The size (and timing) of a laryngeal gesture
coordinated with an oral closure will determine the stop's voice-onset
time (VOT), and therefore whether it is perceived as aspirated or not,
while there are no comparable consequences in the case of /h/. On the
other hand, these differences may prove to be irrelevant to the prosodic
effects.
In order to test whether there are differences in the behavior of the
laryngeal gesture in these two cases, I compared the word-level prosodic
effects in Pierrehumbert and Talkin's /h/ data (some that were discussed
by the authors and others that I estimated from their graphs) with the data
of a recent experiment by Cooper (forthcoming). Cooper had subjects
produce trisyllabic words with varying stress patterns (e.g. percolate,
passionate, Pandora, permissive, Pekingese), and then reproduce the pro-
sodic pattern on a repetitive /pipipip/ sequence. The glottal gestures in
these nonsense words were measured by means of transillumination. I
was able to make three comparisons between Cooper and Pierrehumbert
and Talkin, all of which showed that the effects generalized over the
presence or absence of a coordinated oral gesture. (1) There is very
little difference in gestural magnitude between word-initial and word-
medial positions for a stressed /h/, hawkweed vs. mahogany. (2) There is,
however, a word-position effect for unstressed syllables (hibachi shows a
larger gesture than Omaha or tomahawk). (3) The laryngeal gesture in
word-initial position is longer when that syllable is stressed than when it

is unstressed (hawkweed vs. hibachi). All of these effects can be seen in
Cooper's data.
In addition, Cooper's data show a very large reduction of the laryngeal
gesture in a reduced syllable immediately following the stressed vowel (in the
second /p/ of utterances modeled after percolate and passionate). In many
cases, no laryngeal spreading was observable at all. While this environment
was not investigated by Pierrehumbert and Talkin, they note that such /h/s
have been considered by phonologists as being deleted altogether (e.g. vehicle
vs. vehicular). The coincidence of these effects is again striking. Moreover,
this is an environment where oral gestures may also be severely reduced:
tongue-tip closure gestures reduce to flaps (Kahn 1976). Thus, there is strong
parallelism between prosodic effects on laryngeal gestures for /h/ and on
those that are coordinated with oral stops. This similarity is particularly
impressive in face of the very different acoustic consequences of laryngeal
gesture in the two cases: generation of breathy voice (/h/) and generation
of voiceless intervals. It would seem, therefore, that it is the gestural
dynamics themselves that are being directly modulated by stress and
position, rather than the output variables such as VOT. The changes
can be stated most generally at the level of gestural kinematics and/or
dynamics.

Articulatory locus of prosodic effects for oral gestures


The oral gestures analyzed by Beckman, Edwards, and Fletcher are bilabial
closures and openings into the following vowel. Bilabial closures are
achieved by coordinated action of three separate articulatory degrees of
freedom: jaw displacement, displacement of the lower lip with respect to the
jaw, and displacement of the upper lip with respect to the upper teeth. The
goal of bilabial closure can be defined in terms of the vertical distance
between the upper and lower lips, which needs to be reduced to zero (or to a
negative value, indicating lip compression). This distance has been shown to
be relatively invariant for a bilabial stop produced in different vowel contexts
(Sussman, MacNeilage, and Hanson 1973; Macchi 1988), while the contribu-
tions of the individual articulators vary systematically - high-vowel contexts
show both higher jaw positions and less displacement of the lower lip
with respect to the jaw than is found in low-vowel contexts. Token-
to-token variation shows a similarly systematic pattern (Gracco and Abbs
1986). This vertical interlip distance, or lip aperture, is an example of what
we call "vocal-tract variables" within the computational gestural model
being developed at Haskins Laboratories (Browman and Goldstein 1986,
1990; Saltzman 1986; Saltzman et al. 1988a). Gestures are the primitive

phonological units; each is modeled as a dynamical system, or control
regime, whose spatial goals are defined in terms of tract variables such as
these. When a given gesture is active, the individual articulatory compo-
nents that can contribute to a given tract variable constitute a "coordina-
tive structure" (Kelso, Saltzman, and Tuller, 1986) and cooperate to
achieve the tract-variable goal. Individual articulators may compensate for
one another, when one is mechanically restrained or involved in another
concurrent speech gesture. In this fashion, articulatory differences in
different contexts are modeled (Saltzman 1986). With respect to the pro-
sodic effects discussed by Beckman, Edwards, and Fletcher, it is important
to know whether the gesture's tract-variable goals are being modified,
or rather if some individual articulator's motions are being amplified or
reduced, and if so, whether these changes are compensated for by other
articulators.
Since Beckman, Edwards, and Fletcher present data only for the jaw,
the question cannot be answered directly. However, Macchi (1988)
has attempted to answer this question using data similar to the type em-
ployed by them. Macchi finds that prosodic variation (stress, syllable
structure) influences primarily the activity of the jaw, but that, unlike
variation introduced by vowel environment, the prosodic effects on the
jaw are not compensated for by the displacement of the lower lip with
respect to the jaw. The displacement remains invariant across prosodic
contexts. Thus, the position of the lower lip in space does vary as a
function of prosodic environment, but this variation is caused almost
exclusively by jaw differences. That is, the locus of the prosodic effects is
the jaw.

Unifying oral and laryngeal effects


If Macchi's analysis is correct (and generalizes beyond the speakers and
utterances analyzed), it fits well with Beckman, Edwards, and Fletcher's
characterization of prosodic effects in terms of a sonority variable. It suggests
that it would be possible to develop a model in which "segmental" structure
is modeled by gestures defined in the space of tract variables, such as Lip
Aperture, while prosodic effects would be modeled by long-term "gestures"
defined in the space of Sonority, which would be related directly to jaw
position. The association between the jaw height and sonority (or vocal-tract
"openness") is an attractive one (although Keating [1983] presents some
problems with using it as a basis for explaining why segments order
according to the sonority hierarchy).
A major problem with this view emerges, however, if we return to the

laryngeal data. Here, as we have seen, the effects parallel those for oral
gestures, in terms of changes in gesture magnitude and duration. Yet it would
be hard to include the laryngeal effects under the rubric of sonority, at least
as traditionally defined. Phrasal accent results in a more open glottis, but a
more open glottis would result in a less sonorous output. Thus the sonority
analysis fails to account, in a unified way, for the parallel prosodic
modulations of the laryngeal and oral gestures. An alternative analysis
would be to examine the effects in terms of the overall amount of energy
expended by the gestures, accented syllables being more energetic. However,
this would not explain why the effects on oral gestures seem to be restricted
to the jaw (which is explained by the sonority account). Finding
a unified account of laryngeal and oral effects remains an exciting
challenge.

Comments on chapters 3 and 4


IRENE VOGEL
Prosodic structure
Chapters 3 and 4, in common with those in the prosody section of this
volume, all view the structure of phonology as consisting of hierarchi-
cally arranged phonological, or prosodic, constituents.* The phonetic
phenomena under investigation, furthermore, are shown to depend
crucially on such constituents in the correct characterization of their
domains of application. Of particular importance is the position of
specific elements within the various constituents. As Pierrehumbert and
Talkin suggest, "phonetic realization rules in general can be sensitive to
prosodic structure," whether they deal with tonal, segmental, or, presum-
ably, durational phenomena. In fact, a large part of the phonology-
phonetics interface seems to involve precisely the matching up of the
hierarchical structures - the phonology - and the physical realizations
of the specific tonal, segmental, and durational phenomena - the
phonetics.
This issue - the role of phonological constituents in phonetics implemen-
tation - leads directly to the next point: precisely, what are the phonological
constituents that are relevant for phonetics? A common view of phonology

*This was presented at the conference as a commentary on several papers, but because of the
organization of the volume appears here with chapters 3 and 4.

groups speech sounds into the following set of constituents (from the word
up):
(1) phonological utterance;
intonational phrase;
phonological phrase;
clitic group;
phonological word;
Phonological constituents referred to in this volume, however, include the
following:
(2) (a) Pierrehumbert and Talkin:
(intonational) phrase
(phonological) word
(b) Beckman, Edwards, and Fletcher:
(intonation) phrase
(phonological word)
(c) Kubozono:
major phrase
minor phrase
(d) van den Berg, Gussenhoven, and Rietveld:
association domain
association domain'

Given such an array of proposed phonological constituents, it is important
to stop and ask a number of basic questions. First of all, do we expect any, or
possibly all, of the various levels of phonological structure to be universal? If
not, we run the risk of circularity: a phenomenon P in some language is found
to apply within certain types of strings which we thus define as a phonologi-
cal constituent C; C is then claimed to be motivated as a phonological
constituent of the language because it is the domain of application of P. In
so doing, however, we lose any predictive power phonological constituents
may have, not to mention the fact that we potentially admit an infinite
number of language types in terms of their phonological constituent struc-
ture. It would thus be preferable to claim that there is some finite, indepen-
dently motivated, universal set of phonological constituents. But what are
these?
The constituents in (1) were originally proposed and motivated as such
primarily on the basis of phonological rules (e.g. Selkirk 1978; Nespor and
Vogel 1986). In various papers in this volume we find phonetic data arguing
in favor of phonological constituents, but with some different names.
Pierrehumbert and Talkin as well as Beckman, Edwards, and Fletcher
assume essentially the structures in (1), though they do not state what

definitions they are using for their constituents. In Kubozono's paper,
however, we find major phrase and minor phrase, and, since neither is
explicitly defined, we do not know how they relate to the other proposed
constituents. Similarly, van den Berg, Gussenhoven, and Rietveld explicitly
claim that their association domain and association domain' do not coincide
with any phonological constituents proposed elsewhere in the literature.
Does this mean we are, in fact, adopting the position that essentially
anything goes, where we just create phonological constituent structure as we
need it? Given the impressive cross-linguistic insights that have been gained
in phonology by identifying a small finite set of prosodic constituents along
the lines of (1), it would be surprising, and dismaying, if phonetic investi-
gation yielded significantly different results.

Phonology and phonetics


It might be said that anything that is rule-governed and thus predictable is
part of competence and should therefore be considered phonology. If this is
so, one could also argue that "phonetic implementation rules" are phono-
logy since they, too, follow rule-governed patterns (expressed, for example,
in parametric models). Is phonetics, then, just the "mechanical" part of the
picture? This is probably too extreme a position to defend, but it is not clear
how and where exactly we are to draw the line between what is phonological
and what is phonetic.
One of the stock (if simplified) answers to this question is something like
the following: phonology deals with unique, idealized representations of
speech, while phonetics deals with their actual manifestations. Since in
theory infinite variation is possible for any idealized phonological represen-
tation, a question arises as to how we know whether particular variations are
acceptable for a given phenomenon. Which variations do we consider in
our research and which may/must we exclude? Some of the papers in this
volume report that it was necessary to set aside certain speakers and/or data
because the speakers were unable to produce the necessary phenomena. This
is all the more surprising since the data were collected in controlled
laboratory settings where we would expect there to be less variation
that usual. Furthermore, in more than one case, it seems that the data found
to be most reliable and crucial to the study were produced by someone in-
volved in the research. This is not meant necessarily as a methodological
criticism, since much can be gained by examining constrained sets of data.
It does, however, raise serious questions about interpretation of the
results. Moreover, if in phonetic analyses, too, we have to pull back from
the reality of variation, we blur the distinction between phonology and
phonetics, since abstraction and idealization may no longer be considered

defining characteristics of phonology. Of course, if the goal of phonetics
is to model specific phenomena, as is often the case, we do need to end
up with abstractions again. We must still ask, though, what is actually
being modeled when this model itself is based on such limited sets of
data.

5
On types of coarticulation

NIGEL HEWLETT and LINDA SHOCKEY

5.1 Introduction
With few exceptions, both phonetics and phonology have used the "seg-
ment" as a basis of analysis in the past few decades.* The phonological
segment has customarily been equated with a phonemic unit, and the nature
of the phonetic segment has been different for different applications. Coarti-
culation has, on the whole, been modeled as intersegmental influences, most
frequently for adjacent segments but also at a greater distance, and the
domain of coarticulatory interactions has been assumed to be controlled by
cognitive (language-specific, structural, phonological) factors as well as by
purely mechanical ones. As such, the concept of coarticulation has been a
useful conceptual device for explaining the gap between unitary phonological
representations and variable phonetic realizations. Less segment-based views
of phonological representation and of speech production/perception have led
to different ways of talking about coarticulation (see, for example, Browman
and Goldstein, this volume).
We are concerned here with examining some segment-based definitions of
coarticulation, asking some questions about the coherence of these defini-
tions, and offering some experimental evidence for reevaluation of
terminology.

5.1.1 Ways of measuring coarticulation


Previous studies show a general agreement that coarticulation consists of
intersegmental influences. However, different studies have used different

*We are grateful to Peter Ladefoged, John Ohala, and Christine Shadle for their comments and
advice about this work. We also thank Colin Watson of Queen Margaret College for his work in
designing and building analysis equipment.

approaches to measuring these influences. We will compare two of these
approaches. One could be called "testing for allophone separation." A
number of studies (Turnbaugh et al. 1985; Repp 1986; Sereno and Lieberman
1987; Sereno et al. 1987; Hewlett 1988; Nittrouer et al. 1989) have examined
spectral characteristics of the consonant in CV syllables spoken by adults and
children with a view to estimating the amount of influence of the vowel on the
realization of the consonant (typically a voiceless stop), i.e. to determine
whether and to what extent realizations of the same consonant phoneme
were spectrally distinct before different vowels. The precise techniques are
not at issue here. Some have involved simply measuring the frequency of the
tallest peak in the spectrum or identifying whether the vowel F2 is anticipated
in the consonant spectrum; others have used more elaborate techniques. A
recent and extensive investigation of this type is that of Nittrouer, Studdert-
Kennedy, and McGowan (1989). In this study, the centroids of /s/ spectra in
/si/ and /su/ syllables, as spoken by young children and adults, were
calculated and compared. The authors conclude that the fricative spectrum is
more highly influenced by a following vowel in the children's productions
than in the adults'.
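For readers unfamiliar with the measure, the centroid is simply the amplitude-weighted mean frequency of the frication spectrum. The sketch below is a generic illustration, not Nittrouer, Studdert-Kennedy, and McGowan's exact procedure; the windowing and the use of magnitude (rather than power) weighting are assumptions.

    import numpy as np

    def spectral_centroid(frame, fs):
        """Amplitude-weighted mean frequency (Hz) of one frication frame."""
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
        return float(np.sum(freqs * spectrum) / np.sum(spectrum))
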
Another approach could be termed "finding the target-locus relationship"
(Lindblom and Lindgren 1985; Krull 1987, 1989). According to this criter-
ion, the difference between F2 at vowel onset (the "locus") and F2 at mid-
vowel (the "target") is inversely related to the amount of CV coarticulation
present. The reasoning here is that coarticulation is a process which makes
two adjacent sounds more similar to each other, so if it can be shown that
there is a shallower transition between consonant and vowel in one condition
than in another, then that condition can be thought of as more coarticulated.
This approach has been applied to a comparison of more careful and less
careful speech styles and the findings have indicated greater coarticulation
associated with a decrease in care.
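To make the two criteria concrete, the following minimal sketch (in Python, and not part of any of the studies cited) computes each measure from single measurements; the function names are ours, and the illustrative values are of the same order as those reported later in this chapter.

    def allophone_separation(burst_peak_v1, burst_peak_v2):
        # Allophone-separation criterion: how far apart (in Hz) realizations of
        # the same consonant phoneme are before two different vowels; a larger
        # value is read as more influence of the vowel on the consonant.
        return abs(burst_peak_v1 - burst_peak_v2)

    def locus_target_difference(f2_onset, f2_midvowel):
        # Target-locus criterion: difference between F2 at vowel onset (the
        # "locus") and F2 at mid-vowel (the "target"); a smaller absolute
        # difference (a flatter transition) is read as more CV coarticulation.
        return f2_midvowel - f2_onset

    print(allophone_separation(3046, 2020))     # /ki/ vs. /ku/ burst peaks: 1026 Hz
    print(locus_target_difference(2226, 2375))  # /ki/, citation form: +149 Hz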
It is difficult to use the first approach (measuring allophone separation) in
dealing with connected speech because of yet another phenomenon, which is
often termed "coarticulation": that of vowel centralization. It is well known
that on average the vowel space used in conversation is much reduced from
that found for citation-form speech. Sharf and Ohde (1981) distinguish
two types of coarticulation which they call "feature spreading" and
"feature reduction," the latter of which includes vowel centralization. If
one wants to compare coarticulation in citation form with coarticulation
in connected speech by testing for allophone separation, one's task is very
much complicated by the fact that two variables are involved: difference
in phonological identity of the vowel and difference in overall vowel-space
shape. The same objection cannot be made to the method of finding the
target-locus relationship because it is insensitive to the difference

between (for example) a "change up" in target and a "change down" in
the locus.
As far as we know, no previous study has attempted to determine whether
the application of the two approaches to the same data leads to a similar
conclusion concerning the relative degree of coarticulation in each case.

5.2 The experiment


5.2.1 Method
The study described here was designed to test whether there is a difference in
degree of coarticulation in CV syllables between (1) very carefully spoken
(citation-form) syllables and (2) the same syllables produced in connected
(read) speech. One speaker's productions of /k/ and /t/ before the vowels /i/
and /u/ were investigated. /k/ was chosen because its allophonic variation
before front and back vowels is well established.
The subject was a linguistically naive male speaker in his thirties who
spoke with a Received Pronunciation (RP) accent, though a long-term
resident of Edinburgh. In younger speakers of RP, the vowel /u/ is often
fronted: this subject may show an exaggeration of this trend as a result of his
residence in Scotland, where /u/ is notably fronted phonetically ([ʉ]).
A passage of text was composed in such a way as to contain eight words
beginning with the sequences: /ki/, /ku/, /ti/, /tu/; a total of thirty-two words.
This text is reproduced as the appendix to this chapter (p. 137), with the
experimental words in bold print. All words are either monosyllabic or,
where they are disyllabic, the relevant CV sequence occurs in an initial
stressed syllable. The items were chosen so as to be sufficiently contentful not
to undergo vowel reduction in connected speech. The subject was asked to
read through the text twice, with no special instructions being given
concerning style. He had no knowledge of the purpose of the experiment or
the identity of the experimental words.
He then pronounced the words key, coo, tea, and two sixteen times each, in
randomized order and at an interval of about 5 seconds, as cued by the words
written on index cards. He was asked to say the words "very carefully".
The session thus yielded sixteen each of what will subsequently be called
"citation forms" and "reading forms" of each of the CV forms cited above.
These were recorded using a high-quality digital cassette recorder. Digitized
waveforms of the experimental items were stored on a Hewlett-Packard
Vectra computer, using the A-to-D facility of a Kay Digital Sonagraph, with
a sampling rate of 25,600 samples per second. Fourier spectra were obtained
of the initial part of each consonant burst, using a 20 msec time frame with a
Gaussian window. The frequency of the most prominent peak was measured
Table 5.1 Mean frequencies in Hz of the most prominent peak of consonant burst,
second formant of the vowel at onset and second formant of the vowel at steady
state (standard deviations in parentheses)

                       Citation                      Reading
                  [ki]           [ku]           [ki]           [ku]

Burst peak        3,046 (248)    2,020 (160)    2,536 (167)    1,856 (93)
F2 vowel onset    2,226 (90)     1,648 (228)    2,083 (138)    1,579 (227)
F2 steady state   2,375 (131)    1,403 (329)    2,127 (112)    1,585 (239)

Figure 5.1 Means and standard deviations of velar bursts [burst frequencies in Hz,
plotted for ki (citation), ki (reading), ku (citation), and ku (reading)]

in each case. Where there was more than one very prominent peak, spectra of
subsequent time frames were examined in order to determine which pole was
most robust overall.
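As an illustration only (this is not the authors' analysis software), the burst-peak measurement just described could be approximated as follows in Python with NumPy and SciPy, assuming the sample index of the burst onset has already been located; the sampling rate matches the 25,600 samples per second quoted above, while the window width is our own assumption.

    import numpy as np
    from scipy.signal.windows import gaussian

    def burst_peak_hz(waveform, burst_onset, fs=25600, frame_ms=20.0):
        # Take a 20 msec frame from the start of the burst, apply a Gaussian
        # window, and return the frequency of the most prominent spectral peak.
        n = int(fs * frame_ms / 1000.0)
        frame = waveform[burst_onset:burst_onset + n]
        frame = frame * gaussian(len(frame), std=len(frame) / 6.0)  # width assumed
        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
        return freqs[np.argmax(spectrum)]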
The frequencies of the first three formants of each vowel were measured at
onset and in the middle, using the continuous spectrogram facility of a
Spectrophonics system, which produces traditional broad-band spectral
displays on a CRT (cathode ray tube). Where frequencies were in doubt due

Table 5.2 Mean frequencies in Hz of the most prominent peak of consonant burst,
second formant of the vowel at onset and second formant of the vowel at steady
state (standard deviations in parentheses)

                       Citation                      Reading
                  [ti]           [tu]           [ti]           [tu]

Burst peak        6,264 (714)    6,234 (905)    4,780 (705)    4,591 (848)
F2 vowel onset    2,104 (140)    1,666 (63)     2,012 (95)     1,635 (169)
F2 steady state   2,320 (104)    1,517 (155)    2,075 (114)    1,596 (226)

Figure 5.2 Means and standard deviations of alveolar bursts [burst frequencies in Hz,
plotted for ti (citation), ti (reading), tu (citation), and tu (reading)]

to there being two formants close together or an unusually weak formant,
spectral sections were used, since they allow for finer discrimination of
formant amplitude.
5.2.2 Results
The /k/ spectra were well differentiated in shape from the /t/ spectra,
regardless of the style or following vowel, and in the expected fashion. That
is, /k/ spectra were characterized by a prominent peak at mid-frequency and
/t/ spectra by a concentration at higher frequencies. This can be seen in the
frequencies of the most prominent peak of the burst spectra, which ranged
from 1,656 to 3,500 Hz for /k/ (mean = 2,364) and from 3,437 to 7,703 Hz
for /t/ (mean = 5,467).
Table 5.1 gives the means and standard deviations of the most prominent
peak of the burst spectra, the second formant of the vowel at its onset, and
the second formant of the middle of the vowel for each of the four
experimental conditions, for the velar stops. Figure 5.1 shows a graphic
representation of the means and standard deviations of the burst frequencies
for all velars. Table 5.2 and figure 5.2 give the same information for the
alveolars.
The /k/ spectra reveal both a strong vowel effect and a strong effect from
speech style: in all cases the /k(u)/ spectra have a lower frequency peak than
the analogous /k(i)/ spectra. The /k(i)/ and /k(u)/ spectra were widely
separated by the measure of burst peak frequency in the citation forms. They
were also separated, to a lesser extent, in the reading forms. The /k(i)/
reading forms had a lower burst peak frequency than in the citation forms, as
predicted for a situation of lesser coarticulation. The /k(u)/ reading forms,
however, also had a lower burst peak frequency compared to the /k(u)/
citation forms: in this case, the difference of burst peak frequency was in the
opposite direction to the difference of the F2 at the middle of the vowel (see
below). A T-test revealed that the /k/ forms were significantly different from
each other as regards both phonetic environment and style. The /k/ releases
were also highly significantly different from the /t/ releases.
The /t/ spectra show no significant effect from the following vowel, but do
reveal the same strong effect from speech style: both /t(i)/ and /t(u)/ have a
lower burst peak frequency in the reading mode than in the citation mode.
Thus all CV forms showed a lowering of burst frequency in the reading
forms. The citation-form /t/ releases proved significantly different from the
reading /t/ releases when a T-test was applied, but within conditions the
differences were not significant, i.e. the vocalic environment did not figure
statistically as a differentiator. (See table 5.3 for T-test values.)
With regard to the vowels, centralization was observed in reading mode, as
can be seen in figure 5.3. Average formant frequency at vowel onset,
however, showed a different pattern, with F2 being lower in all cases for the
reading forms than is seen for citation forms. In this respect, the vowel onset
reflects the pattern seen at consonant release. Figure 5.4 shows relative
positions of averaged tokens in F1-F2 space at vowel onset.

Table 5.3 T-test results for comparisons described in column 1
(df = 75 throughout; * = probability less than 0.001)

Variable (release from)         T-value
citation [ki]/citation [ku]       13.7  *
read [ki]/read [ku]               16.1  *
citation [ki]/read [ki]            7.7  *
citation [ku]/read [ku]            4.2  *
citation [ti]/citation [tu]       -0.03
read [ti]/read [tu]                0.63
citation [ti]/read [ti]            4.9  *
citation [tu]/read [tu]            5.8  *
citation [ki]/citation [ti]      -17.2  *
read [ki]/read [ti]              -12.3  *
citation [ku]/citation [tu]      -19.3  *
read [ku]/read [tu]              -12.4  *

Figure 5.3 Formant frequencies at middle of vowel [F1-F2 plot, F2 in kHz; averaged
ki, ku, ti, tu tokens for citation and reading]


Figure 5.4 Formant frequencies at vowel onset [F1-F2 plot, F2 in kHz; averaged
ki, ku, ti, tu tokens for citation and reading]

5.3 Two types of coarticulation?


Testing for allophone separation gives us a negative or marginal result based
on these data: the major burst peaks for the two variants of /k/ are separated
by 1,026 Hz in very careful speech and by 780 Hz in read speech, so they
could be said to show less, rather than more, coarticulation in the connected
speech. However, the standard deviations are very large relative to the size of
the effect, and we have verified that vowel centralization is present, so we
know that the vowel targets are also closer together in the connected speech.
It would therefore be difficult to make a convincing case for any significant
difference between the /k/ spectra in the two styles. In addition, we find
contradictory results for /t/: allophone separation is less in citation form than
in read speech (citation-form difference = 30 Hz, read-speech difference =
209 Hz), even though the vowel centralization is about the same in the two
read-speech corpora. Allophone separation criteria do not, in this case, give
us a satisfactory basis for claiming a difference in coarticulation in the two
styles.
When we use locus-target difference as a criterion, however, the results
give us a completely different picture. The difference between F2 frequency at
vowel onset and vowel steady state is consistently much greater for citation-
form speech than for reading, as can be seen from table 5.4. Based on this
criterion, there is consistently more coarticulation in connected speech.

Table 5.4 Difference between the frequencies of F2 of the vowel steady state
and vowel onset (Hz)

                 [ki]    [ku]    [ti]    [tu]

Citation form    +149    -245    +216    -149
Read form         +44      +6     +63     -39

The question which immediately arises is: how can reading and citation
forms be simultaneously not different and different with respect to coarticu-
lation? A similar anomalous situation can be adduced from the literature on
acquisition by children: studies such as that by Nittrouer et al. (1989), based
on allophone separation, suggest that greater coarticulation is found in
earlier stages of language development and that coarticulation decreases with
age. Kent (1983), however, points to the fact that if greater coarticulation is
associated with greater speed and fluency of production, it would be liable to
increase with greater motor skill and hence with the age of children. He
observes that this is compatible with the finding that children's speech is
slower than adults', and he offers as evidence the fact that the F2 trajectory in
sequences such as /oju/ (in the phrase We saw you!) is flatter in adults'
pronunciation than in children's, indicating greater coarticulation in adult
speech. The criterion used in this argument is comparable to that of the
target-locus relation.
These contradictory findings suggest the possibility that F2 trajectories and
allophone separation are really measuring two different kinds of coarticula-
tion which behave differently in relation to phonetic development as well as
in relation to speech style.
Pertinent to this point is a fact which is well known but often not
mentioned in discussions of coarticulation: very carefully articulated speech
manifests a great deal of allophone separation. Striking examples are
frequently seen to come from items pronounced in isolation or in frame
sentences (e.g. Daniloff, Shuckers, and Feth 1980: 328-33). Further evidence
of a large amount of allophone separation in careful speech comes from the
search for acoustic invariance (Blumstein and Stevens 1979). The extent of
the differences in the spectra of velar consonants in particular before
different vowels testifies to the amount of this type of coarticulation in
maximally differentiated tokens. Whatever sort of coarticulation may be
measured by allophone separation, it seems very unlikely to be the sort which
is increased by increases in speech rate or in degree of casualness. We assume
it to reflect local variation in lip/tongue/jaw movement. As such, it could be
easily accommodated in the gestural framework suggested by Browman and
Goldstein (1986, this volume).
Our results show the characteristic vowel centralization which is normally
attributed to connected speech. We have found, in addition, another quite
general effect which might, if found to be robust, come to be considered
coarticulation of the same sort as vowel centralization; this is the marked
lowering of burst frequencies in read speech. The cause of this lowering has
yet to be discovered experimentally. Suggested reasons are: (1) that in
connected speech /ki/ and /ku/ are liable to be produced with a more open
jaw and therefore a larger cavity in front of the release, which would have the
effect of lowering the frequency (Ladefoged, personal communication); and
(2) that since there is probably greater energy expended in the production of
citation forms, it is very likely that there is a greater volume velocity of
airflow (Ohala, personal communication). Given a higher rate of flow
passing through approximately the same size aperture, the result is a higher
center frequency in the source spectrum of the burst. These explanations are
not incompatible and both are amenable to experimental investigation.
It is quite likely that vowel centralization, lowering of burst frequencies,
and flattening of locus-target trajectories in connected speech are all parts of
the same type of long-term effect (which we may or may not want to term
"coarticulation") which can be attributed to larger mandible opening
combined with smaller mandible movements. It certainly appears in our data
that lowered vowel-onset frequencies (which are directly linked to lowered
burst frequencies) and vowel centralization conspire to produce flatter
trajectories, but this hypothesis requires further investigation. Such long-
term effects would be more difficult to explain using a gestural model, but
could possibly be described as a style-sensitive overall weighting on articula-
tor movement. How style can, in practice, be used to determine this
weighting is an open question.
Our results and the others discussed above support Sharf and Ohde's
(1981) notion that it is fruitful to divide what is currently called "coarticula-
tion" into at least two separate areas of study: relatively short-term effects
and longer-term settings. In addition, our results suggest that the former may
not be much influenced by differences in style while the latter show strong
style effects.

Appendix: text of read story


"Just relax," said Coutts. "Now we can have a proper chat." "To begin
with," he went on coolly, "I don't appreciate the intruders."
The gun that Coutts was holding was a .32 Smith and Wesson, Keith noted.
It was its function rather than its make which mattered, of course, but noting

such details helped to steady his nerves. To the same end, he studied the titles
of the books that were propping the broken sash window open at the bottom,
providing a welcome draught of air into the room. The Collected Poems of
T.S. Eliot were squashed uncomfortably between A Tale of Two Cities and
Teach Yourself Icelandic.
"Perhaps you'd like to tell me the purpose of this unexpected visit?" Courts
smiled, but only with his teeth. The eyes above them remained perfectly
expressionless.
"You know already. I want my client's document back. And the key that
goes with it."
"Now what key would that be, I wonder?"
"About two inches long, grey in colour and with BT5024 stamped on the
barrel."
"Oh, that key."
"So, you know about this too," Courts mused. "Well, we've got lots of time
to talk about it. As it happens, I'm willing to devote the rest of my afternoon
to your client's little problem."
He laughed again, a proper laugh this time, which revealed a badly
chipped tooth. That might have been Stella's handiwork of the previous day
with the teapot. There was a photograph of her which was lying on its back
on the highly polished teak surface of the desk. Next to it was another
photograph, of a teenage boy who didn't seem to bear any resemblance to
Coutts.
"I'm not too keen on spending the rest of the afternoon cooped up in your
pokey little office," said Keith.
He tried to think of something more interesting to say, something that was
guaranteed to keep Coutts's attention directed towards him. For just along
from the Collected Poems of T.S. Eliot, a face was peering through the gap at
the bottom of the window. This was somewhat surprising, since to the best of
his recollection they were two floors up.

Comments on chapter 5
WILLIAM G. BARRY and SARAH HAWKINS
Hewlett and Shockey are concerned with coarticulation from three different
angles. Firstly, they present data from an experiment designed to allow the
comparison of single-syllable CV utterances with the same CV sequences
produced in a continuous-speech passage. The question is whether the degree
of coarticulation in these sequences differs between speech-production
conditions. Secondly, they are concerned with the methodological question
of how to quantitatively assess degree of coarticulation. Thirdly, following
their title, they suggest different types of coarticulation based on different
speech-production factors underlying the measured coarticulatory
phenomena.
The simultaneous concern with three aspects of coarticulation studies is
very illuminating. The application for the first time of two different ways of
quantifying CV coarticulation, using the same speech data, provides a clear
illustration of the extent to which theoretical models can be the product of a
particular methodological approach. Given this dependency, however, it is
rather a bold step to conclude the discussion by postulating two categories of
coarticulation, neatly differentiated by analytic method.
This question mark over the conclusions results more from the structure of
the material on which the discussion of coarticulation types is based than
from the interpretation of the data. The contradictory trends found by the
two analysis methods for the two production conditions might be related to
separate aspects of production, namely a "local" tongue/lip/jaw effect and a
"long-term" effect due to jaw movement. This is an interesting way of
looking at the data, although we agree with Hewlett and Shockey's comment
that the latter effect may be an articulatory setting rather than due to
coarticulation. But in order to place these observations into a wider
theoretical perspective, a discussion of types of coarticulation is needed to
avoid post hoc definitions of type along the divisions of analytic method. We
discuss two aspects of (co)articulation that are particularly relevant to
Hewlett and Shockey's work: first, the types of articulatory processes that are
involved in coarticulation; and second, the domains coarticulation can
operate over.
In an early experimental investigation of speech, Menzerath and De
Lacerda (1933) distinguish "Koartikulation," involving the preparatory or
perseveratory activity of an articulator not primarily involved in the current
segment, and "Steuerung" ("steering" or "control"), which is the deviation
of an articulator from its target in one segment due to a different target for
the same articulator in a neighboring segment. This distinction not only
illuminates aspects of Hewlett and Shockey's experimental material, but also
points to a potential criterion for distinguishing (one type of) longer-term
coarticulatory effect from more local effects. The /ki ku ti tu/ syllables can be
seen to carry both factors: (1) the lip-rounding for /u/ is completely
independent of the consonantal tongue articulation for /k/ and /t/, and is
therefore "Koartikulation" in the Menzerath and De Lacerda sense; (2) the
interaction between the tongue targets for /k/ and the two vowels, and for /t/
and the two vowels is a clear case of "Steuerung." "Steuerung" is likely to be
a more local effect, in that the trajectory of a single articulator involved in
consecutive "targets" will depend upon the relation between those targets.
The "independent" articulator, on the other hand, is free to coarticulate with
any segments to which it is redundant (see Benguerel and Cowan 1974;
Lubker 1981). Note that Menzerath and De Lacerda's distinction, based on
articulatory criteria, encourages different analyses from Hewlett and
Shockey's, derived from acoustic measurements. The latter suggest that their
more long-term effect stems from differences in jaw height, whereas in the
former's system jaw involvement is primarily regarded as a local (Steuerung)
effect. Thus Hewlett and Shockey's "local" effect seems to incorporate both
Menzerath and De Lacerda's terms, and their own "long-term" effect is not
included in the earlier system. The danger of defining coarticulatory types
purely in terms of the acoustic-analytic method employed is now clear. The
acoustically defined target-locus relationship cannot distinguish between
"Koartikulation" and "Steuerung," which are worth separating.
The emphasis on the method of acoustic analysis rather than on articula-
tion leads to another design problem, affecting assessment of allophone
separation. Allophone separation can only be assessed in the standard way
(see Nittrouer et al. 1989) if there are no (or negligible) differences in the
primary articulation. In Menzerath and De Lacerda's terminology, we can
only assess allophone separation for cases of Koartikulation rather than
Steuerung. Allophone separation can thus be assessed in the standard way
for /ti/ and /tu/ but not for /ki/ and /ku/. In careful speech, the position of the
tongue tip and blade for /t/ is much the same before /i/ as before /u/, but
vowel-dependent differences due to the tongue body will have stronger
acoustic effects on the burst as the speech becomes faster and less careful. For
/ki/ and /ku/, on the other hand, different parts of the tongue body form the
/k/ stop closure: the allophones are already separated as [k̟] and [k̠], even
(judging from the acoustic data) for the speaker with his fronted /u/. Thus, as
Hewlett and Shockey hint in their discussion, velar and alveolar stops in
these contexts represent very different cases in terms of motor control, and
should not be lumped together. We would therefore predict that the
difference in the frequency of the peak amplitude of alveolar burst should
increase as the /t/ allophones become more distinct, but should decrease for
velar bursts as the /k/ allophones become less distinct. This is exactly what
Hewlett and Shockey found. Hence we disagree with their claim that
allophone-separation measures fail to differentiate citation and reading
forms.
Seen in this light, citation and reading forms in Hewlett and Shockey's
data seem to differ in a consistent way for both the target-locus and the
allophone-separation measure. These observations need to be verified statis-
tically: a series of independent T-tests is open to the risk of false significance
errors and, more importantly, does not allow us to assess interactions
between conditions. The factorial design of the experiment described in
chapter 5 is ideally suited to analysis of variance which would allow a more
sensitive interpretation of the data.
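By way of illustration (the data frame below is invented and merely shows the layout, not the study's measurements), such an analysis might be run as a two-way ANOVA over style and vowel, for example with Python's statsmodels:

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # One row per token: burst-peak frequency (Hz), speaking style, vowel context.
    df = pd.DataFrame({
        "burst_hz": [3046, 2536, 2020, 1856, 3100, 2480, 1990, 1900],  # invented
        "style":    ["citation", "reading"] * 4,
        "vowel":    ["i", "i", "u", "u"] * 2,
    })

    # Main effects of style and vowel plus their interaction.
    model = ols("burst_hz ~ C(style) * C(vowel)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))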
Turning now to the domain of coarticulation, we question whether it is
helpful to distinguish between local and long-term effects, and, if it is,
whether one can do so in practice. First, we need to say what local and long
term mean. One possibility is that local coarticulatory effects operate over a
short timescale, while long-term effects span longer periods. Another is that
local influences affect only adjacent segments while long-term influences
affect nonadjacent segments. The second possibility is closest to traditional
definitions of coarticulation in terms of "the overlapping of adjacent
articulations" (Ladefoged 1982: 281), or "the influence of one speech
segment upon another... of a phonetic context upon a given segment"
(Daniloff and Hammarberg 1983: 239). The temporal vs. segmental defini-
tions are not completely independent, and we could also consider a four-way
classification: coarticulatory influences on either adjacent or nonadjacent
segments, each extending over either long or short durations. Our attempts
to fit data from the literature into any of these possible frameworks lead us to
conclude that distinguishing coarticulatory influences in terms of local versus
long-term effects is unlikely to be satisfactory.
There are historical antecedents for a definition of coarticulation in terms
of local effects operating over adjacent segments in Kozhevnikov and
Chistovich's (1965) notion of "articulatory syllable," and in variants of
feature-spreading models (e.g. Henke 1966; Bell-Berti and Harris 1981) with
which Hewlett and Shockey's focus on CV structures implicitly conforms.
There is also plenty of evidence of local coarticulatory effects between
nonadjacent segments. Cases in point are Ohman's (1966a) classic study of
vowels influencing each other across intervening stops and Fowler's (1981a)
observations on variability in schwa as a function of stressed-vowel context.
(In these examples of vowel-to-vowel influences, there are also, of course,
coarticulatory effects on the intervening consonant or consonants.)
The clarity of this division into adjacent and nonadjacent depends on how
one defines "segment." If the definition involved mapping acoustic segments
onto phone or phoneme strings, the division could be quite clear, but if the
definition is articulatory, which it must be in any production model, then the
distinction is not at all clear: units of articulatory control could overlap such
that acoustically nonadjacent phonetic-phonological segments influence one
another. This point also has implications for experimental design: to assess
coarticulatory influences between a single consonant and vowel in connected
speech, the context surrounding the critical acoustic segments must either be
constant over repetitions, or sufficiently diverse that it can be treated as a
random variable. Hewlett and Shockey's passage has uneven distributions of
sounds around the critical segments. For example, most of the articulations
surrounding /ku/ were dental or alveolar (five before and seven after the
syllable, of eight repetitions), whereas less than half of those surrounding /ti/
and /tu/ were dental or alveolar.
The above examples are for effects over relatively short periods of time;
segmentally induced modifications to features or segments can also extend
over longer stretches of time. For example, Kelly and Local (1989) have
described the spread of cavity features such as velarity and fronting over the
whole syllable and the foot. This spreading affects adjacent (i.e. uninter-
rupted strings of) segments and is reasonably regarded as segment-induced,
since it occurs in the presence of particular sounds (e.g. /r/ or /l/); it is
therefore categorizable as coarticulation. Long-term effects on nonadjacent
segments have also been observed. Slis (personal communication) and
Kohler, van Dommelen, and Timmermann (1981) have found for Dutch and
French, respectively, that devoicing of phonologically voiced obstruents is
more likely in a sentence containing predominantly voiceless consonants.
Such sympathetic voicing or devoicing in an utterance is the result of a
property of certain segments spreading to other (nonadjacent) segments.
However, classifying these longer-term effects as coarticulation runs the
risk that the term becomes a catch-all category, with a corresponding loss in
usefulness. In their articulatory (and presumably also acoustic) manifes-
tation there appears to be nothing to differentiate at least these latter cases of
long-term effects from the acquired permanent settings of some speakers, and
Hewlett and Shockey in fact point out that their longer-term effect (mandible
movement) may be best regarded as an articulatory setting. Articulatory
setting is probably a useful concept to retain, even though in terms of
execution it may not always be very distinct from coarticulation.
There are also similarities between the above long-term effects and certain
connected-speech processes of particular languages, such as the apparently
global tendency in French and Russian for voicing to spread into segments
that are phonologically voiceless, which contrasts with German, English, and
Swedish, where the opposing tendency of devoicing is stronger. If these
general properties of speech are classed as coarticulation, it seems a relatively
small step to include umlaut usage in German, vowel harmony, and various
other phonological prosodies (e.g. Firth 1948; Lyons 1962) as forms of
coarticulation. Many of these processes may have had a coarticulatory basis
historically, but there are good grounds in synchronic descriptions for
continuing to distinguish aspects of motor control from phonological rules
and sociophonetic variables. We do not mean to advocate that the term
coarticulation should be restricted to supposed "universal" tendencies and
all language-, accent-, or style-dependent variation should be called some-
thing else. But we do suggest that there is probably little to be gained by

describing all types of variation as coarticulation, unless that word is used as
a synonym for speech motor control.
We suggest that the type of long-term effect that Hewlett and Shockey
have identified in their data is linked with the communicative redundancy of
individual segments in continuous speech compared to their communicative
value in single-syllable citation. This relationship between communicative
context and phonetic differentiation (Kohler 1989; Lindblom 1983) is
assumed to regulate the amount of articulatory effort invested in an
utterance. Vowel centralization has been shown to disappear in communica-
tive situations where greater effort is required, e.g. noise (Schindler 1975;
Summers et al. 1988). Such "communicative sets" would presumably also
include phenomena such as extended labiality in "pouted" speech, but not
permanent settings that characterize speakers or languages (see the comment
on voicing and devoicing tendencies above), though these too are probably
better not considered to be "coarticulation."
The exclusion of such phenomena from a precise definition of coarticula-
tion does not imply that they have no effect on coarticulatory processes. The
"reduction of effort" set, which on the basis of Hewlett and Shockey's data
might account for vowel undershoot and lowered release-burst frequencies,
can also be invoked to explain indirectly the assimilation of alveolars
preceding velars and labials. By weakening in some sense the syllable-final
alveolar target, it allows the anticipated velar or labial to dominate acousti-
cally. Of course, this goes no way towards explaining the fact (Gimson 1960;
Kohler 1976) that final alveolars are somehow already more unstable in their
definition than other places of articulation and therefore susceptible to
coarticulatory effects under decreased effort.
To conclude, then, we suggest that although there seems to be a great deal
of physiological similarity between segment-induced modifications and the
settings mentioned above that are linked permanently to speakers or
languages, or temporarily to communicative context, it is useful to
distinguish them conceptually. This remains true even though a model
of motoric execution might treat them all similarly. To constrain the use
of the term coarticulation, we need to include the concept of "source
segment(s)" and "affected segments" in its definition. The existence of
a property or feature extending over a domain of several segments, some
of which are not characterized by that feature, does not in itself indicate
coarticulation.
Our final comment concerns Hewlett and Shockey's suggested acoustic
explanation for their finding that the most prominent peaks of the burst
spectra were all lower in frequency in the read speech than in the citation
utterances. One of their suggested possibilities is that less forceful speech

could be associated with a lower volume velocity of flow through the
constriction at the release, which would lower the center frequency of the
noise-excitation spectrum. The volume velocity of flow might well have
been lower in Hewlett and Shockey's connected-speech condition, but we
suggest it was probably not responsible for the observed differences. The
bandwidth of the noise spectrum is so wide that any change in its center
frequency is unlikely to have a significant effect on the output spectrum at
the lips.
The second suggestion is that a more open jaw could make a larger cavity
in front of the closure. It is not clear what is being envisaged here, and it is
certainly not simple to examine this claim experimentally. We could measure
whether the jaw is more open, but if it were more open, how would this affect
the burst spectrum? Hewlett and Shockey's use of the word "size" suggests
that they may be thinking of a Helmholtz resonance, but this seems unlikely,
given the relationship between the oral-cavity volume and lip opening that
would be required. A more general model that reflects the detailed area
function (length and shape) of the vocal tract is likely to be required (e.g.,
Fant 1960). The modeling is not likely to be simple, however, and it is
probably inappropriate to attribute the observed pattern to a single cause.
The most important section of the vocal tract to consider is the cavity from
the major constriction to the lips. It is unclear how front-cavity length and
shape changes associated with jaw height would produce the observed
patterns. However, if an explanation in terms of cavity size is correct, the
most likely explanation we know of is not so much that the front cavity itself
is larger, as that the wider lip aperture (that would probably accompany a
lowered jaw position) affected the radiation impedance.1 Rough calcula-
tions following Flanagan (1982: 36) indicate that changes in lip aperture
might affect the radiation impedance by appropriate amounts at the relevant
frequencies.
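As a purely illustrative order-of-magnitude sketch (ours, not a calculation taken from the chapter or from Flanagan), the direction of such a lip-aperture effect can be seen by treating the front cavity as a quarter-wave resonator whose effective length is extended by the textbook end correction of roughly 0.85 times the aperture radius; the cavity length and radii below are invented.

    C = 35000.0  # speed of sound in cm/s

    def front_cavity_resonance_hz(length_cm, lip_radius_cm):
        # Closed at the constriction, open at the lips: quarter-wave resonance,
        # with the effective length extended by the radiation end correction.
        effective_length = length_cm + 0.85 * lip_radius_cm
        return C / (4.0 * effective_length)

    for radius in (0.3, 0.6, 0.9):  # narrower to wider lip opening (cm, invented)
        print(radius, round(front_cavity_resonance_hz(2.0, radius)))
    # The resonance falls as the lip aperture, and hence the end correction, grows.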
Other details should also be taken into consideration, such as the degree to
which the cavity immediately behind the constriction is tapered towards the
constriction, and the acoustic compliance of the vocal-tract walls. There is
probably more tapering in citation than reading form of alveolars and
fronted /k/, and the wall compliance is probably less in citation than reading
form. Changes in these parameters due to decreased effort could contribute
to lowering the frequencies and changing the intensities of vocal-tract
resonances. The contribution of each influencing factor is likely to depend on
the place of articulation of the stop.
To summarize, we feel that in searching for explanations of motor

1 We are indebted to Christine Shadle for pointing this out to us.

behavior from acoustic measurements, it is important to use models (acoustic
and articulatory) that can represent differences as well as similarities between
superficially similar things. The issues raised by Hewlett and Shockey's study
merit further research within a more detailed framework.

Section B
Segment

6
An introduction to feature geometry

MICHAEL BROE

6.0 Introduction
This paper provides a short introduction to a theory which has in recent
years radically transformed the appearance of phonological representations.
The theory, following Clements's seminal (1985) paper, has come to be known
as feature geometry.1 The rapid and widespread adoption of this theory as
the standard mode of representation in mainstream generative phonology
stems from two main factors. On the one hand, the theory resolved a debate
which had been developing within nonlinear phonology over competing
modes of phonological representation (and resolved it to the satisfaction of
both sets of protagonists), thus unifying the field at a crucial juncture. But
simultaneously, the theory corrected certain long-standing and widely
acknowledged deficiencies in the standard version of feature theory, which
had remained virtually unchanged since The Sound Pattern of English
(Chomsky and Halle 1968; hereafter SPE).
The theory of feature geometry is rooted firmly in the tradition of nonlinear
phonology - an extension of the principles of autosegmental phonology to
the wider phonological domain - and in section 6.1 I review some of this
essential background. I then show, in section 6.2, how rival modes of
representation developed within this tradition. In section 6.3 I consider a
related problem, the question of the proper treatment of assimilation
phenomena. These two sections prepare the ground for section 6.4, which
1 Clements (1985) is the locus classicus for the theory, and the source of the term "feature
geometry" itself. Earlier suggestions along similar lines (in unpublished manuscripts) by
Mascaro (1983) and Mohanan (1983) are cited by Clements, together with Thrainsson (1978)
and certain proposals in Lass (1976), which can be seen as an early adumbration of the leading
idea. A more detailed survey of the theory may be found in McCarthy (1988), to which I am
indebted. Pulleyblank (1989) provides an excellent introduction to nonlinear phonology in
general.

shows how feature geometry successfully resolves the representation prob-
lem, and at the same time provides a more adequate treatment of assimila-
tion. I then outline the details of the theory, and in section 6.5 show how the
theory removes certain other deficiencies in the standard treatment of
phonological features. I close with an example of the theory in operation: a
treatment of the Sanskrit rule of n-Retroflexion or Nati.

6.1 Nonlinear phonology


Feature geometry can be seen as the latest stage in the extension of principles
of autosegmental phonology - originally developed to handle tonal pheno-
mena (Goldsmith 1976) - to the realm of segmental phonology proper.
Essential to this extension is the representation of syllabicity on a separate
autosegmental tier, the so-called "CV-skeleton" (McCarthy 1979, 1981;
Clements and Keyser 1983). Segmental features are then associated with the
slots of the CV-tier just as tones and tone-bearing units are associated in
autosegmental phonology; and just as in tonal systems, the theory allows for
many-to-one and one-to-many associations between the CV-skeleton and the
segmental features on the 'melodic' tier.
For example, contour segments - affricates, prenasalized stops, short
diphthongs, and so on, whose value for some feature changes during the
course of articulation - can be given representations analogous to those used
for contour tones, with two different values for some feature associated with
a single timing slot:

(1)     H   L         [-cont] [+cont]        [+nas] [-nas]
         \ /               \   /                  \   /
          V                  C                      C
     Contour tone         Affricate         Prenasalized stop

And conversely, geminate consonants and long vowels can be represented as
a single quality associated with two CV-positions:

(2)      H             i              m
        / \           / \            / \
       V   V         V   V          C   C
   Tonal spread    Long vowel    Geminate consonant

As Clements (1985) notes, such a representation gives formal expression to
the acknowledged ambiguity of such segments/sequences. Kenstowicz (1970)
has observed that, for the most part, rules which treat geminates as atoms are
rules affecting segment quality, while rules which require a sequence represen-
tation are generally "prosodic," affecting stress, tone, and length itself. Within
a nonlinear framework, the distinction can be captured in terms of rule
application on the prosodic (quantitative) tier or the melodic (qualitative)
tier respectively.
Further exceptional properties of geminates also find natural expression in
the nonlinear mode of representation. One of the most significant of these is
the property of geminate integrity: the fact that, in languages with both
geminates and rules of epenthesis, epenthetic segments cannot be inserted
"within" a geminate cluster. In nonlinear terms, this is due to the fact that the
structural representation of a geminate constitutes a "closed domain"; it is
thus impossible to construct a coherent representation of an epenthesized
geminate:

(3) [diagrams: a geminate melody linked to two C slots, and the starred
    representation of a vowel epenthesized between those two slots, in which
    the association lines cross]
Here, the starred representation contains crossing association lines. Such
representations are formally incoherent: they purport to describe a state of
affairs in which one segment simultaneously precedes and follows another.
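The well-formedness condition at work here can be stated very simply; the following sketch (ours, purely for illustration) represents association lines between a melodic tier and the skeletal tier as pairs of indices and tests whether any two lines cross.

    def has_crossing(associations):
        # associations: (melody_index, slot_index) pairs; two lines cross when
        # one melodic element precedes another but its skeletal slot follows
        # the other's slot.
        for (m1, s1) in associations:
            for (m2, s2) in associations:
                if m1 < m2 and s1 > s2:
                    return True
        return False

    # A geminate melody (index 0) linked to slots 0 and 2, with an epenthetic
    # vowel (melody index 1) linked to the intervening slot 1:
    print(has_crossing([(0, 0), (0, 2), (1, 1)]))  # True: the structure is ill-formed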
It was quickly noted that assimilated clusters also exhibit the properties of
true geminates. The following Palestinian data from Hayes (1986) show
epenthesis failing to apply both in the case of underlying geminates and in the
case of clusters arising through total assimilation:
(4)  /ʔakl/             → [ʔakil]           food
     /ʔimm/             → [ʔimm]            mother
     /jisr kbiir/       → [jisrikbir]       bridge big
     /l-walad l-zyiir/  → [lwaladizzyiir]   DEF-boy DEF-small

Halle and Vergnaud (1980) suggested that assimilation rules be stated as
manipulations of the structural relations between an element on the melodic
tier and a slot in the CV-skeleton, rather than as a feature changing operation
on the melodic tier itself. They provide the following formulation of Hausa
regressive assimilation:
(5)  /littaf+taaf+ai/ → [littattaafai]

     [diagram over the skeleton C V C C V C C V V C V V: the [t] melody spreads
     regressively to the preceding C slot and the [f] melody is delinked]

Here - after the regressive spread of the [t] melody and delinking of the [f] -
the output of the assimilation process is structurally identical to a geminate
consonant, thus accounting for the similarities in their behavior. This in turn
opens up the possibility of a generalized account of assimilation phenomena
as the autosegmental spreading of a melody to an adjacent CV-slot (see the
articles by Hayes, Nolan, and Local this volume).
However, a problem immediately arises when we attempt to extend this
spreading account to cases of partial assimilation, as in the following
example from Hausa (Halle and Vergnaud 1980):
(6)  /gaddam+dam+ii/ → [gaddandamii]
The example shows that, in their words, "it is possible to break the link
between a skeleton slot and some features on the melody tier only while
leaving links with other features intact" (p. 92): in this case, to delink place
features while leaving nasality and sonority intact. Halle and Vergnaud do
not extend their formalism to deal with such cases.
The problem is simply to provide an account of this partial spreading
which is both principled and formally adequate. The radical autosegmental-
ism of phonological properties gives rise to a general problem of organiza-
tion. As long as there is just one autosegmental tier - the tonal tier -
organized in parallel to the basic segmental sequence, the relation between
the two is straightforward. But when the segmental material is itself
autosegmentalized on a range of tiers - syllabicity, nasality, voicing, contin-
uancy, tone, vowel-harmony features, place features, all of which have been
shown to exhibit autosegmental behavior - then it is no longer clear how the
tiers are organized with respect to each other. It is important to notice that
this issue is not resolved simply by recourse to a "syllabic core" or CV-
skeleton, although the orientation this provides is essential. As Pulleyblank
(1986) puts it: "Are there limitations on the relating of tiers? Could a tone
tier, for example, link directly to a nasality tier? Or could a tone tier link
directly to a vowel harmony tier?"


6.2 Competing representations


Steriade (1982) shows that in Kolami, partially assimilated clusters
(obstruent followed by homorganic nasal) resist epenthesis just as geminates
do, suggesting that the output of the assimilation rule should indeed exhibit
linked structure (in the case of geminates, illicit *CC codas are simplified by
consonant deletion, rather than epenthesis):
(7)  Present      Imperative   Gloss
     melp-atun    melep        shake
     idd-atun     id           tell
     poŋg-atun    poŋ          boil over
Steriade sketches two possible formulations of this process, which came to be
known as multiplanar and coplanar respectively (see Archangeli 1985 for
more extensive discussion of the two approaches):
(8)  Multiplanar: the [place features] and [manner features] of each segment lie
     on separate planes, each linked directly to the skeletal slot.
     Coplanar: the [manner features] of each segment are linked directly to the
     skeletal slot, and the [place features] are linked to the slot only through them.
These two solutions induce very different types of representation, and
make different empirical predictions. Note first that, in the limiting case, a
multiplanar approach would produce a representation in which every feature
constituted a separate tier. Elements are associated directly with the skeletal
core, ranged about it in a so-called "paddlewheel" or "bottlebrush" forma-
tion (Pulleyblank 1986, Sagey 1986b). Here, then, the limiting case would
exhibit the complete structural autonomy of every feature:

(9)                    [coronal]
                           |
      [consonantal] ------ x ------ [sonorant]
                           |
                     [continuant]

In the coplanar model, on the other hand, tiers are arrayed in parallel, but
within the same plane, with certain primary features associated directly with
skeletal slots, and secondary or subordinate features associated with a CV-
position only indirectly, mediated through association to the primary
features. This form of representation, then, exhibits intrinsic featural depen-
dencies. Compare the coplanar representation in (8) above with the following
reformulation:
(10)   [ manner features ]     [ manner features ]
                |                       |
       [ place features  ]     [ place features  ]
                |                       |
                x                       x
Here, counterfactually, place features have been represented as primary,
associated directly to the skeletal tier, while manner features are secondary,
linked only indirectly to a structural position. But under this regime of
feature organization it would be impossible to represent place assimilation
independently of manner assimilation. Example (10) is properly interpreted
as total assimilation: any alteration in the association of the place features is
necessarily inherited by features subordinate to them. The coplanar model
thus predicts that assimilation phenomena will display implicational asym-
metries. If place features can spread independently of manner features, then
they must be represented as subordinate to them in the tree; but we then
predict that it will be impossible to spread manner features independently.
The limiting case here, then, would be the complete structural dependency of
every feature. In the case of vowel-harmony features, for example, Archan-
geli (1985: 370) concludes: "The position we are led to, then, is that Universal
Grammar provides for vowel features arrayed in tiers on a single plane in the
particular fashion illustrated below."
        [back]
           |
        [round]
           |
        [high]
           |
           V
Which model of organization is the correct one? It quickly becomes
apparent that both models are too strong: neither complete dependence nor
complete independence is correct. The theory of feature geometry brings a
complementary perspective to this debate, and provides a solution which
displays featural dependence in the right measure.

6.3 The internal structure of the segment


To get a sense of this complementary perspective, we break off from the
intellectual history of nonlinear phonology for a moment to consider a
related question: what kinds of assimilation process are there in the world's
languages? The standard theory of generative phonology provides no answer
to this question. The fact that an assimilation process affecting the features,
{coronal, anterior, back} is expected, while one affecting {continuant,
anterior, voice} is unattested, is nowhere expressed by the theory. In SPE,
there is complete formal independence between features: a feature matrix is
simply an unstructured bundle of properties. The class of assimilation
processes predicted is simply any combination of values of any combination
of features. The constant recurrence of particular clusters of features in
assimilation rules is thus purely accidental. What is required is some formal
way of picking out natural classes of features (Clements 1987), a "built-in
featural taxonomy" (McCarthy 1988), thus making substantive claims about
possible assimilation processes in the world's languages.
There is a further problem with the standard account of assimilation, first
pointed out by Bach (1968). Consider the following rule:

(11)   [ α cor  ]      [ α cor  ]
       [ β ant  ]  /   [ β ant  ]
       [ γ back ]      [ γ back ]

The alpha-notation, which seems at first sight such an elegant way of
capturing assimilatory dependencies, also allows us to express such
undesirable dependencies as the following:

(12)   [ α cor  ]      [ β cor  ]
       [ β ant  ]  /   [ γ ant  ]
       [ γ back ]      [ α back ]

The problem is in fact again due to the complete formal independence of the
features involved. This forces us to state assimilation as a set of pairwise
agreements, rather than agreement of a set as a whole (Lass 1976: 163).


A possible solution is to introduce an n-ary valued PLACE feature, where
the variable [α PLACE] ranges over places of articulation directly. Apart from
giving up binarity, the disadvantage of such a solution is that it loses the
cross-classificatory characterization of place: retroflex sounds, for example,
receiving the description [ret PLACE], would no longer be characterized
simultaneously in terms of coronality and anteriority, and a process affecting
or conditioned by just one of these properties could not be formulated.
Access to a PLACE feature is intuitively desirable, however, and a refine-
ment of this idea is able to meet the above criticisms: we introduce a category-
valued feature. That is, rather than restricting features to atomic values only
(" +," " —," "ret"), we allow a feature matrix itself to constitute the value for
some feature. Thus, rather than a representation such as [ret PLACE], we may
adopt the following:

(13)   [ + cor ]
       [ - ant ]  PLACE

Such a move allows us to preserve the fine-grained, multifeatured characterization
of place of articulation; the advantage derives from allowing variables
to range over these complex values just as they would over atomic ones. Thus
the category (13) above is still matched by the variable [α PLACE].
Category-valued features have been used extensively in syntactic theories
(such as lexical-functional grammar [LFG] and generalized phrase-structure
grammar [GPSG]) which adopt some form of unification-based represen-
tation (Shieber 1986), and we will adopt such a representation here (see
Local, this volume):

(14)   PLACE = [ cor = + ]
               [ ant = - ]

Here, values have simply been written to the right of their respective features.
We may extend the same principle to manner features; and further, group
together place and manner features under a superordinate ROOT node, in
order to provide a parallel account of total assimilation (that is, in terms of
agreement of ROOT specifications):
(15)   ROOT = [ PLACE  = [ cor = +, ant = - ]
                MANNER = [ nas = - ]          ]

Such a representation has the appearance of a set of equations between
features and their subordinate values, nested in a hierarchical structure. This
is in fact more than analogy: any such category constitutes a mathematical
function in the strict sense, where the domain is the set of feature names, and
the range is the set of feature values. Now just as coordinate geometry
provides us with a graphical representation of any function, the better to
represent certain salient properties, so we may give a graph-theoretic
representation of the feature structure above:

(16)              ROOT
                 /    \
            PLACE      MANNER
            /    \          \
          cor    ant        nas
           |      |          |
           +      -          -

Here, each node represents a feature, and dependent on the feature is its
value, be it atomic - as at the leaves of the tree - or category-valued.
Formally, then, the unstructured feature matrix of the standard theory has
been recast in terms of a hierarchically structured graph, giving direct
expression to the built-in feature taxonomy.
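A rough computational analogue (ours, not the formalism of any of the works cited) may help: the feature structure in (15) can be written in Python as a nested mapping from feature names to values, and a spreading rule as the sharing of one class node between two segments, leaving the rest of each segment untouched.

    import copy

    def spread(trigger, target, node):
        # Partial assimilation: the target comes to share the trigger's value for
        # one class node (e.g. "PLACE"); assigning the same object models a linked
        # (shared) node in autosegmental terms. Spreading the whole structure
        # would correspond to total assimilation.
        result = copy.deepcopy(target)
        result[node] = trigger[node]
        return result

    # Illustrative segments only; the feature values are not meant to be exact.
    nasal = {"PLACE": {"cor": "+", "ant": "+"}, "MANNER": {"nas": "+"}}
    stop  = {"PLACE": {"cor": "-", "ant": "-"}, "MANNER": {"nas": "-"}}
    print(spread(stop, nasal, "PLACE"))  # homorganic nasal: place copied, nasality kept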

6.4 Feature geometry


How is this digression through functions and graphs relevant to the notion of
feature geometry? Consider the following question: if (16) above is a
segment, what is a string? What is the result of concatenating two such
complex objects? A major use of the graph-theoretic modeling of functions in
coordinate geometry is to display interactions - points of intersection, for
example - between functions. The same holds good for the representations
we are considering here. Rather than representing our functions on the x/y-
axis of coordinate geometry, however, we arrange them on the temporal axis
of feature geometry, adding a third dimension to the representation as in
figure 6.1. Such is the mode of representation adopted by feature geometry.
Historically, the model developed not as a projection of the graph-theoretic
model of the segment, but through an elaboration of the theory of planar
organization informed by the concerns of featural taxonomy.2
2 As is usual in formal linguistics, various formalizations of the same leading idea, each with
subtly different ramifications, are on offer. The fundamental literature (Clements 1985; Sagey
1986a) is couched in terms of planar geometry; Local (this volume) adopts a rigorous graph-
theoretic approach; while work by Hammond (1988), Sagey (1988), and Bird and Klein (1990)
is expressed in terms of algebras of properties and relations.


Figure 6.1 Geometrical representation of two successive segments in a phonological string.
When the figure is viewed "end on," the structure of each segment appears as a tree like that in
example (16). The horizontal dimension represents time

Figure 6.2 The representation of autosegmental assimilation in a geometrical structure of the
sort shown in figure 6.1 [the PLACE node spreads between adjacent segments; the MANNER
node is unaffected]

The essential insight of feature geometry is to see that, in such a
representation, "nodes of the same class or feature category are ordered
under the relation of concatenation and define a TIER" (Clements 1985: 248).
With respect to such tiers, the familiar processes of nonlinear phonology
such as spreading, delinking, and so on, may be formulated in the usual way.
Such a representation thus provides a new basis for autosegmental organiza-
tion. The rule of place assimilation illustrated in (8) above, for example,
receives the formulation in figure 6.2. Thus whole clusters of properties may
be spread at a single stroke, and yet may do so quite independently of
features which are represented on a different tier, in some other subfield of
the taxonomy.
Clements (1985) helpfully describes these elaborate representations in the
following manner: "This conception resembles a construction of cut and
glued paper, such that each fold is a class tier, the lower edges are feature
tiers, and the upper edge is the CV tier" (p. 229). Feature geometry solves the
organization problem mentioned above by adopting a representation in
which autosegmental tiers are arranged in a hierarchical structure based on a
featural taxonomy. The theory expresses the intimate connection between

the internal organization of the segment and the kinds of phonological
processes such segments support.

Figure 6.3 An overview of the hierarchical organization of phonological features, based on Sagey 1986 and others: a ROOT node dominates a LARYNGEAL node (with dependents such as [constricted glottis], [spread glottis], and [stiff vocal cords]) and a SUPRALARYNGEAL node, under which the PLACE articulators carry dependent features such as [round], [anterior], [distributed], [high], [low], [back], and [ATR], together with a RADICAL node.

As Mohanan (1983) and Clements (1985)
point out, the hierarchical organization of autosegmental tiers, combined
with a spreading account of assimilation, immediately predicts the existence
of three common types of assimilation process in the world's languages: total
assimilation processes, in which the spreading element is a root node; partial
assimilation processes, in which the spreading element is a class node; and
single-feature assimilation, in which a single feature is spread. Clements
(1985: 231f.) provides exemplification of this three-way typology of assimila-
tion processes. More complex kinds of assimilation can still be stated, but at
greater cost.
While the details of the geometry are still under active development, a
gross architecture has recently received some consensus. A recurrent shaping
idea is that many features are "articulator-bound" (Ladefoged and Halle
1988) and that these should be grouped according to the articulator which
executes them: larynx, soft palate, lips, tongue blade, tongue body, tongue
root (see Browman and Goldstein, this volume). The features [high], [low],
and [back], for example, are gestures of the tongue body, and are grouped
together under the DORSAL node. Controversy still surrounds features which
are "articulator free" - such as stricture features - and the more abstract
major class features [sonorant] and [consonantal]; we will have little to say
about such features here (see McCarthy 1988 for discussion).
In addition, the geometry reflects the proclivity of features associated with
certain articulators to pattern together in assimilatory processes, gathering
them under the SUPRALARYNGEAL and PLACE nodes. The picture that emerges


looks like figure 6.3. At the highest level of branching, the framework is able
to express Lass's (1976: 152f.) suggestion that feature matrices can be
internally structured into laryngeal and supralaryngeal gestures:
in certain ways [ʔ] is similar to (the closure phase of) a postvocalic voiceless stop.
There is a complete cut-off of airflow through the vocal tract: i.e. a configuration that
can reasonably be called voiceless, consonantal (if we rescind the ad hoc restriction
against glottal strictures being so called), and certainly noncontinuant. In other
words, something very like the features of a voiceless stop, but MINUS SUPRA-
LARYNGEAL ARTICULATION . . . Thus [ʔ] and [h] are DEFECTIVE . . . they are missing an
entire component or parameter that is present in "normal" segments.

Moreover, Lass's suggested formalization of this insight bears a striking


resemblance to the approach we have sketched above:
Let us represent any phonological segment, not as a single matrix, but as containing
two submatrices, each specifying one of the two basic parameters: the laryngeal
gesture and the oral or supralaryngeal gesture. The general format will be:

    [ [oral]       ]
    [ [laryngeal]  ]

Lass cites alternations such as the following from Scots:


(17)  kaʔ    cap       kaːrʔredʒ   cartridge     oʔm     open
      baʔ    bat       wenʔʌr      winter        bʌʔn    button
      baʔ    back      fɪlʔʌr      filter        broʔn   broken
and notes that the rule neutralizing [p t k] to [ʔ] can now be formulated
simply as deletion of the supralaryngeal gesture: in our terms, delinking of
the SUPRALARYNGEAL node. The LARYNGEAL node, then, dominates a set of
features concerning the state of the glottis and vocal cords in the various
phonation-types.
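
The delinking analysis can be mocked up in a few lines. This is my own toy rendering of the idea, not Lass's notation; the feature content of each submatrix is deliberately minimal and assumed.

    # Toy sketch of the Scots pattern in (17): deleting the SUPRALARYNGEAL
    # submatrix of a final /p t k/ leaves only the laryngeal gesture, so all
    # three neutralize to the glottal stop.

    segments = {
        "p": {"LARYNGEAL": {"voice": "-"}, "SUPRALARYNGEAL": {"place": "labial"}},
        "t": {"LARYNGEAL": {"voice": "-"}, "SUPRALARYNGEAL": {"place": "alveolar"}},
        "k": {"LARYNGEAL": {"voice": "-"}, "SUPRALARYNGEAL": {"place": "velar"}},
    }

    def delink_supralaryngeal(segment):
        reduced = dict(segment)
        reduced.pop("SUPRALARYNGEAL", None)   # the "defective" segment that remains
        return reduced

    print({name: delink_supralaryngeal(seg) for name, seg in segments.items()})
    # every value is now just {'LARYNGEAL': {'voice': '-'}}: the glottal stop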
The organization of features below the PLACE node has been greatly
influenced by Sagey's (1986a) thesis. This work is largely concerned with
complex segments - labiovelars, labiocoronals, clicks, and so on, sounds with
more than one place of articulation - and the development of a model which
allows the expression of just those combinations which occur in human
languages. (These are to be distinguished from the contour segments men-
tioned above. The articulations within a complex segment are simultaneous -
at least as far as the phonology is concerned.) It is to be noted, then, that a
simple segment will be characterized by just one of the PLACE articulators,
while a complex segment will be characterized by more than one - LABIAL and
DORSAL, say.


6.5 More problems with standard feature theory


The hierarchical view of segment structure also suggests a solution to certain
long-standing problems in standard feature theory. Consider the following
places of articulation, together with their standard classification: 3
(18)
             labial   alveolar   palatal   retroflex   velar   uvular
             -cor     +cor       +cor      +cor        -cor    -cor
             +ant     +ant       -ant      -ant        -ant    -ant

Such an analysis predicts the following natural classes:


(19)  [+cor]:  {alveolar, palatal, retroflex}
      [-cor]: *{labial, velar, uvular}

But while the [+cor] class is frequently attested in phonological rules, the
[-cor] class is never found. The problem here is that standard feature theory
embodies an implicit claim that, if one value of a feature denotes a natural
class, then so will the opposite value. This is hard-wired into the theory: it is
impossible to give oneself the ability to say [+F] without simultaneously
giving oneself the ability to say [-F].
Consider now a classification based on active articulators:
(20)
             labial   alveolar   palatal   retroflex   velar   uvular
             LAB      COR        COR       COR         DOR     DOR

Such a theory predicts the following classes:


(21) [LABIAL]: {labial}
[CORONAL]: {alveolar, palatal, retroflex}
[DORSAL]: {velar, uvular}

Under this approach, the problematic class mentioned above simply cannot
be mentioned - the desired result.
Consider now the same argument with respect to the feature [anterior]; we
predict the following classes:
(22)  [+ant]: *{labial, alveolar}
      [-ant]: *{palatal, retroflex, velar, uvular}
Here the problem is even worse. As commonly remarked in the literature,
there seems to be no phonological process for which this feature denotes a
natural class; [anterior] is effectively an ancillary feature, whose function is
the subclassification of [+coronal] segments, not the cross-classification of
the entire consonant inventory. We can express this ancillary status in a
hierarchical model by making the feature [anterior] subordinate to the
CORONAL node:
(23)
             labial   alveolar   palatal   retroflex   velar   uvular
             LAB      COR        COR       COR         DOR     DOR
                       |          |          |
                      +ant       -ant       -ant

3 The following discussion is based on Yip (1989).

Similarly, the feature [distributed] is also represented as a dependent of the


coronal node, distinguishing retroflex sounds.
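
The asymmetry between the two systems can also be checked mechanically. The sketch below is only an illustration (the encodings simply transcribe tables (18) and (20)/(23)): under binary features the unattested class {labial, velar, uvular} is nameable as [-cor], whereas under articulator nodes no label picks it out.

    # Illustrative sketch: which classes each system can name. The values
    # below simply transcribe tables (18) and (20)/(23) from the text.

    binary = {   # (18): every segment bears a value for every binary feature
        "labial":    {"cor": "-", "ant": "+"},
        "alveolar":  {"cor": "+", "ant": "+"},
        "palatal":   {"cor": "+", "ant": "-"},
        "retroflex": {"cor": "+", "ant": "-"},
        "velar":     {"cor": "-", "ant": "-"},
        "uvular":    {"cor": "-", "ant": "-"},
    }

    articulator = {  # (20)/(23): privative articulator nodes; [ant] lives only under COR
        "labial": "LAB", "alveolar": "COR", "palatal": "COR",
        "retroflex": "COR", "velar": "DOR", "uvular": "DOR",
    }

    def binary_class(feature, value):
        return {seg for seg, feats in binary.items() if feats[feature] == value}

    def articulator_class(node):
        return {seg for seg, art in articulator.items() if art == node}

    print(sorted(binary_class("cor", "-")))     # ['labial', 'uvular', 'velar']: statable but unattested
    print(sorted(articulator_class("COR")))     # ['alveolar', 'palatal', 'retroflex']
    # There is simply no node whose extension is {labial, velar, uvular}.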

6.6 An example

This organization of the PLACE articulators can be seen in action in an


account of Sanskrit n-Retroflexion or Nati (Schein and Steriade 1986). The
rule, simplified, may be quoted direct from Whitney:
189. The dental nasal n . . . is turned into the lingual [i.e. retroflex] ṇ if preceded in the
same word by . . . ṣ or ṛ: and this, not only if the altering letter stands immediately
before the nasal, but at whatever distance from the latter it may be found: unless,
indeed, there intervene (a consonant moving the front of the tongue: namely) a
palatal . . . , a lingual or a dental. (Whitney 1889: 64)

Note Whitney's attention to the active articulator in his formulation of the
rule: "a consonant moving the front of the tongue." The rule is exemplified in
the data in table 6.1 (Schein and Steriade 1986). We may represent this
consonant harmony as the autosegmental spreading of the CORONAL node,
targeting a coronal nasal, as shown in figure 6.4.

Figure 6.4 Sanskrit n-retroflexion (Nati) expressed as spreading of the CORONAL node (with its dependents [-anterior, -distributed]) onto a following [+nasal] segment.

Table 6.1 Data exemplifying the Sanskrit n-Retroflexion rule shown in figure 6.4

       Form              Gloss
Present
  1    mṛd-naa-          be gracious
  2    iṣ-ṇaa-           seek
  3    pṛ-ṇaa-           fill
Passive
  4    bhug-na-          bend
  5    puur-ṇa-          fill
  6    vṛk-ṇa-           cut up
Middle participle 1
  7    marj-aana-        wipe
  8    kṣved-aana-       hum
  9    puṛ-aaṇa-         fill
 10    kṣubh-aaṇa-       quake
 11    cakṣ-aaṇa-        see
Middle participle 2
 12    kṛt-a-maana-      cut
 13    kṛp-a-maaṇa-      lament

If a labial, velar, or vowel intervenes - segments characterized by LABIAL


and DORSAL articulators - no ill-formedness results, and the rule is free to
apply. If, however, a coronal intervenes, the resultant structure will be ruled
out by the standard no-crossing constraint of autosegmental phonology. The
featural organization thus explains why non-coronals are transparent to the
coronal harmony (see figure 6.5). This accords nicely with Whitney's
account:

We may thus figure to ourselves the rationale of the process: in the marked proclivity
of the language toward lingual utterance, especially of the nasal, the tip of the tongue,
when once reverted into the loose lingual position by the utterance of non-contact
lingual element, tends to hang there and make its next nasal contact in that position:
and does so, unless the proclivity is satisfied by the utterance of a lingual mute, or the
organ is thrown out of adjustment by the utterance of an element which causes it to
assume a different posture. This is not the case with the gutturals or labials, which do
not move the front part of the tongue. (Whitney 1879: 65)

Note that, with respect to the relevant (CORONAL) tier, target and trigger are
adjacent. This gives rise to the notion that all harmony rules are "local" in an
extended sense: adjacent at some level of representation.

Figure 6.5 The operation of the rule shown in figure 6.4, illustrating the transparency of an intervening labial node (the figure labels the LARYNGEAL, SUPRALARYNGEAL, PLACE, LABIAL, and CORONAL nodes and the CORONAL dependents [anterior] and [distributed]).
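
The pattern of triggers, targets, and blockers can also be rendered procedurally. The sketch below is only a toy illustration of the description above (it is not Schein and Steriade's analysis); it uses ASCII stand-ins for the diacritics and a deliberately simplified segment classification.

    # Toy sketch of the Nati pattern: a trigger licenses retroflexion of a
    # later dental n; an intervening coronal blocks it; labials, velars, and
    # vowels are transparent. ASCII stand-ins: "s." = retroflex s,
    # "r." = vocalic r, "n." = retroflex n.

    TRIGGERS = {"s.", "r", "r."}
    CORONALS = {"t", "d", "n", "s", "s.", "c", "j", "t.", "d.", "n."}  # dentals, palatals, retroflexes

    def nati(segments):
        """Rewrite a plain n as retroflex n. wherever an unblocked trigger precedes it."""
        output = list(segments)
        licensed = False                  # is there a trigger with no coronal since?
        for i, seg in enumerate(segments):
            if seg == "n" and licensed:
                output[i] = "n."          # the trigger's CORONAL node spreads
            elif seg in TRIGGERS:
                licensed = True
            elif seg in CORONALS:
                licensed = False          # a consonant "moving the front of the tongue" blocks
        return output

    print(nati(["k", "s.", "u", "b", "h", "aa", "n", "a"]))   # labial b is transparent -> n.
    print(nati(["k", "r.", "t", "a", "m", "aa", "n", "a"]))   # dental t blocks -> plain n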
It may be helpful to conclude with an analogy from another cognitive
faculty, one with a long and sophisticated history of notation: music. The
following representation is a perceptually accurate transcription of a piece of
musical data:

(24)  [musical notation not reproduced]
In this transcription, x and y are clearly nonadjacent. But now consider this
performance score, where the bass line is represented "autosegmentalized" on
a separate tier:

(25)  [musical notation not reproduced]

Here, x and y are adjacent, on the relevant tier. Note, too, that there is an
articulatory basis to this representation, being a transcription of the gestures
of the right and left hands respectively: x and y are adjacent in the left hand.
An articulator-bound notion of feature geometry thus lends itself naturally
to a conception of phonological representation as gestural score (Browman
and Goldstein 1989 and this volume).
The most notable achievement of feature geometry, then, is the synthesis it
achieves between a theory of feature classification and taxonomy, on the one
hand, and the theory of autosegmental representation - in its extended
application to segmental material - on the other. Now as Chomsky (1965:
172) points out, the notion of the feature matrix as originally conceived gives
direct expression to the notion of "paradigm" - the system of oppositions
operating at a given place in structure: "The system of paradigms is simply
described as a system of features, one (or perhaps some hierarchic configu-
ration) corresponding to each of the dimensions that define the system of
paradigms." Feature geometry makes substantive proposals regarding the
"hierarchic configuration" of the paradigmatic dimension in phonology. But
it goes further, and shows what kind of syntagmatic structure such a
hierarchy supports. A new syntagm, a new paradigm.

7
The segment: primitive or derived?

JOHN J. OHALA

7.1 Introduction
The segmental or articulated character of speech has been one of the
cornerstones of phonology since its beginnings some two-and-a-half millen-
nia ago.* Even though segments were broken down into component features,
the temporal coordination of these features was still regarded as a given.
Other common characteristics of the segment, not always made explicit, are
that they have a roughly steady-state character (or that most of them do),
and that they are created out of the same relatively small set of features used
in various combinations.
Autosegmental phonology deviates somewhat from this by positing an
underlying representation of speech which includes autonomous features
(autosegments) uncoordinated with respect to each other or to a CV core or
"skeleton" which is characterized as "timing units."1 These autonomous
features can undergo a variety of phonological processes on their own.
Ultimately, of course, the various features become associated with given Cs
or Vs in the CV skeleton. These associations or linkages are supposed to be
governed by general principles, e.g. left-to-right mapping (Goldsmith 1976),
the obligatory contour principle (Leben 1978), the shared feature convention
(Steriade 1982). These principles of association are "general" in the sense

*I thank Björn Lindblom, Nick Clements, Larry Hyman, John Local, Maria-Josep Solé, and an
anonymous reviewer for helpful comments on earlier versions of this paper. The program which
computed the formant frequencies of the vocal-tract shapes in figure 7.2 was written by Ray
Weitzman, based on earlier programs constructed by Lloyd Rice and Peter Ladefoged. A grant
from the University of California Committee on Research enabled me to attend and present this
paper in Edinburgh.
1 As far as I have been able to tell, the terms "timing unit" or "timing slot" are just arbitrary
labels. There is no justification to impute a temporal character to these entities. Rather, they
are just "place holders" for the site of linkage that the autosegments eventually receive.

that they do not take into account the "intrinsic content" of the features
(Chomsky and Halle 1968: 400ff.); the linkage would be the same whether the
autosegments were [± nasal] or [ + strident]. Thus autosegmental phonology
preserves something of the traditional notion of segment in the CV-tier but
this (auto)segment at the underlying level is no longer determined by the
temporal coordination of various features. Rather, it is an abstract entity
(except insofar as it is predestined to receive linkages with features proper to
vowels or consonants).
Is the primitive or even half-primitive nature of the segment justified or
necessary? I suggest that the answer to this question is paradoxically both
"no" and "yes": "no" from an evolutionary point of view, but "yes" in every
case after speech became fully developed; this latter naturally includes the
mental grammars of all current speakers. I will argue that it is impossible to
have articulated speech, i.e., with "segments," without having linked, i.e.
temporally coordinated, features. However, it will be necessary to justify
separately the temporal linkage of features, the existence of steady-states,
and the use of a small set of basic features; it will turn out that these
characteristics do not all occur in precisely the same temporal domain or
"chunk" in the stream of speech. Thus the "segment" derived will not
correspond in all points to the traditional notion of segment.
For the evolutionary part of my story I am only able to offer arguments
based primarily on the plausibility of the expected outcome of a "gedanken"
simulation; an actual simulation of the evolution of speech using computer
models has not been done yet.2 However, Lindblom (1984, 1989) has
simulated and explored in detail some aspects of the scenario presented here.
Also relevant is a comparison of speech-sound sequences done by Kawasaki
(1982) and summarized by Ohala and Kawasaki (1984). These will be
discussed below. In any case, much of my argument consists of bringing well-
known phonetic principles to bear on the issue of how speech sounds can be
made different from each other - the essential function of the speech code.

7.2 Evolutionary development of the segment


7.2.1 Initial conditions
Imagine a prespeech state in which all that existed was the vocal tract and the
ear (including their neurological and neuromuscular underpinnings). The
vocal tract and the ear would have the same physical and psychophysical
constraints that they have now (and which presumably can be attributed to
2 A computational implementation of the scenario described was attempted by Michelle Caisse
but was not successful due to the enormity of the computations required and the need to
define more carefully the concept of "segmentality," the expected outcome.

natural physical and physiological principles and the constraints of the
ecological niche occupied by humans). We then assign the vocal tract the task
of creating a vocabulary of a few hundred different utterances (words) which
have the following properties:
1 They must be inherently robust acoustically, that is, easily differentiated
from the acoustic background and also sufficiently different from each other.
I will refer to both these properties as "distinctness". Usable measures of
acoustic distinctness exist which are applicable to all-voiced speech with no
discontinuities in its formant tracks; these have been applied to tasks
comparable to that specified here (Kawasaki 1982; Lindblom 1984; Ohala et
al. 1984). Of course, speech involves acoustic modulations in more than just
spectral pattern; there are also modulations in amplitude, degree of periodi-
city, rate of periodicity (fundamental frequency), and perhaps other para-
meters that characterize voice quality. Ultimately all such modulations have
to be taken into account.
2 Errors in reception are inevitable, so it would be desirable to have some
means of error correction or error reduction incorporated into the code.
3 The rate and magnitude of movements of the vocal tract must operate
within its own physical constraints and within the constraints of the ear to
detect acoustic modulations. What I have in mind here is, first, the obser-
vation that the speech organs, although having no constraint on how slowly
they can move, definitely have a constraint on how rapidly they can move.
Furthermore, as with any muscular system, there is a trade-off between
amplitude of movement and the speed of movement; the movements of
speech typically seem to operate at a speed faster than that which would
permit maximal amplitude of movement but much slower than the maximal
rate of movement (Ohala 1981b, 1989). (See McCroskey 1957; Lindblom
1983; Lindblom and Lubker 1985 on energy expenditure during speech.) On
the auditory side, there are limits to the magnitude of an optimal acoustic
modulation, i.e., any change in a sound. Thus, as we know from numerous
psychophysical studies, very slow changes are hardly noticeable and very
rapid changes present a largely indistinguishable "blur" to the ear. There is
some optimal range of rates of change in between these extremes (see
Licklider and Miller 1951; Bertsch et al. 1956). Similar constraints govern the
rate of modulations detectable by other sense modalities and show up in, e.g.,
the use of flashing lights to attract attention.
4 The words should be as short as possible (and we might also establish an
upper limit on the length of a word, say, 1 sec). This is designed to prevent a
vocabulary where one word is /ba/, another /baba/, another /bəbəbə/ etc.,


with the longest word consisting of a sequence of n /bə/s where n = the size
of the vocabulary.

7.2.2 Anticipated results


7.2.2.1 Initially: random constrictions and expansions
What will happen when such a system initially sets out to make the required
vocabulary? Notice that no mention has been made of segments, syllables or
any other units aside from "word" which in this case is simply whatever
happens between silences. One might imagine that it would start by making
sequences of constrictions and expansions randomly positioned along the
vocal tract, which sequences had some initially chosen arbitrary time
duration. At the end of this exercise it would apply its measure of acoustic
robustness and distinctness and, if the result was unsatisfactory, proceed to
create a second candidate vocabulary, now trying a different time duration
and a different, possibly less random, sequence of constrictions and expan-
sions and again apply its acoustic metric, and so on until the desired result
was achieved. Realistically the evaluation of a vocabulary occurs when
nonrobust modulations are weeded out because listeners confuse them, do
not hear them, etc., and replaced by others. Something of this sort may
be seen in the loss of [w] before back rounded vowels in the history of the
pronunciations of English words sword (now [sojd]), swoon (Middle English
sun), and ooze (from Old English wos) (Dobson 1968: 979ff.). The acoustic
modulation created when going from [w] to back rounded vowel is particu-
larly weak in that it involves little change in acoustic parameters (Kawasaki
1982).
My prediction is that this system would "discover" that segments were
necessary to its task, i.e. that segments, far from being primitives, would "fall
out" as a matter of course, given the initial conditions and the task
constraints. My reasons for this speculation are as follows.

7.2.2.2 Temporal coordination


The first property that would evolve is the necessity for temporal coordina-
tion between different articulators. A single articulatory gesture (constriction
or expansion) would create an acoustic modulation of a certain magnitude,
but the system would find that by coordinating two or more such gestures it
could create modulations that were greater in magnitude and thus more
distinct from other gestures.
A simple simulation will demonstrate this. Figure 7.1 shows a possible
vowel space defined by the frequencies of the first two formants. To relate
this figure to the traditional vowel space, the peripheral vowels from the adult
male vowels reported by Peterson and Barney (1952) are given as filled

Figure 7.1 Vowel space with five hypothetical vowels corresponding to the vocal-tract con-
figurations shown in figure 7.2. Abscissa: Formant 1; ordinate: Formant 2. For reference, the
average peripheral vowels produced by adult male speakers, as reported by Peterson and Barney
(1952) are shown by filled squares connected by solid lines

squares connected by solid lines; hypothetical vowels produced by the shapes


given in figure 7.2 are shown as filled circles. Note that the origin is in the
lower left corner, thus placing high back vowels in the lower left, high front
vowels in the upper left, and low vowels on the far right. Point 1 marks the
formant frequencies produced by a uniform vocal tract of 17 cm length, i.e.
with equal cross-dimensional area from glottis to lips. Such a tract is
represented schematically as " 1 " in figure 7.2. A constriction at the lips
(schematically represented as "2" in figure 7.2) would yield the vowel labelled
2. If vowel 1 is the "neutral" central vowel, then vowel 2 is somewhat higher
and more back. Now if, simultaneous with the constriction in vowel 2, a
second constriction were made one-third of the way up from the glottis,
approximately in the uvular or upper pharyngeal region (shown schemati-
cally as " 3 " in figure 7.2) vowel 3 would result. This is considerably more
back and higher than vowel 2. As is well known (Chiba and Kajiyama 1941;
Fant 1960) we get this effect by placing constrictions at both of the nodes in
the pressure standing wave (or equivalently, the antinodes in the velocity
standing wave) for the second resonance of the vocal tract. (See also Ohala
and Lorentz 1977; Ohala 1979a, 1985b.)
Consider another case. Vowel 4 results when a constriction is made in the
palatal region. With respect to vowel 1, it is somewhat higher. But if a
pharyngeal expansion is combined with the palatal constriction, as in vowel
5, we get a vowel that is, in fact, maximally front and high.
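
The "neutral" starting point of this construction is easy to verify. The following is a minimal sketch (it is not the synthesis program credited in the acknowledgments, and the speed-of-sound value is a conventional assumption): a uniform tube closed at the glottis and open at the lips, of the 17 cm length assumed for point 1, has resonances at odd multiples of c/4L; the coordinated constrictions and expansions just described are what move the formants away from these values.

    # Minimal sketch: resonances of a uniform tube closed at one end,
    # F_n = (2n - 1) * c / (4 * L). With L = 17 cm this gives roughly the
    # formants of the "neutral" vowel labelled 1 in figure 7.1.

    SPEED_OF_SOUND = 35_000.0   # cm/s, a conventional value for warm, moist air

    def uniform_tube_formants(length_cm, n_formants=3):
        return [(2 * n - 1) * SPEED_OF_SOUND / (4.0 * length_cm)
                for n in range(1, n_formants + 1)]

    for i, f in enumerate(uniform_tube_formants(17.0), start=1):
        print(f"F{i} = {f:.0f} Hz")   # ~515, ~1544, ~2574 Hz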

Figure 7.2 Five hypothetical vocal-tract shapes corresponding to the formant frequency positions in figure 7.1. Vertical axis: vocal-tract cross-dimensional area; horizontal axis: vocal-tract length from glottis (right) to lips (left). See text for further explanation.

This system, I maintain, will discover that coordinated articulations are


necessary in order to accomplish the task of making a vocabulary consisting
of acoustically distinct modulations. This was illustrated with vowel articula-
tions, but the same principle would apply even more obviously in the case of
consonantal articulations. In general, manner distinctions (the most robust
cues for which are modulations of the amplitude envelope of the speech
signal) can reach extremes only by coordinating different articulators.
Minimal amplitude during an oral constriction requires not only abduction
of the vocal cords but also a firm seal at the velopharyngeal port. Similar
arguments can be made for other classes of speech sounds.
In fact, there is a growing body of evidence that what might seem like quite
distant and anatomically unlinked articulatory events actually work
together, presumably in order to create an optimal acoustic-auditory signal.
For example, Riordan (1977) discovered interactions between lip rounding
and larynx height, especially for rounded vowels. Sawashima and Hirose
(1983) have discovered different glottal states for different manners of
articulation: a voiceless fricative apparently has a wider glottis than a
comparable voiceless stop does. Lofqvist et al. (1989) find evidence of
differing tension - not simply the degree of abduction - in the vocal cords
between voiced and voiceless obstruents. It is well known, too, that voiceless
obstruents have a greater closure duration than cognate voiced obstruents

(Lehiste 1970: 28); thus there is interaction between glottal state and the
overall consonantal duration. The American English vowel [ɚ], which is
characterized by the lowest third formant of any human vowel, has three
constrictions: labial, mid-palatal, and pharyngeal (Uldall 1958; Delattre
1971; Ohala 1985b). These three locations are precisely the locations of the
three antinodes of the third standing wave (the third resonance) of the vocal
tract. In many languages the elevation of the soft palate in vowels is
correlated with vowel height or, what is probably more to the point, inversely
correlated with the first formant of the vowel (Lubker 1968; Fritzell 1969;
Ohala 1975). There is much cross-linguistic evidence that [u]-like vowels are
characterized not only by the obvious lip protrusion but also by a lowered
larynx (vis-a-vis the larynx position for a low vowel like [a]) (Ohala and
Eukel 1987). Presumably, this lengthening of the vocal tract helps to keep the
vowel resonances as low as possible and thus maximally distinct from other
vowels.
As alluded to above, it is well known in sensory physiology that modula-
tions of stimulus parameters elicit maximum response from the sensory
receptor systems only if they occur at some optimal rate (in time or space,
depending on the sense involved). A good prima facie case can be made that
the speech events which come closest to satisfying this requirement for the
auditory system are what are known as "transitions" or the boundaries
between traditional segments, e.g. bursts, rapid changes in formants and
amplitude, changes from silence to sound or from periodic to aperiodic
excitation and vice versa. So all that has been argued for so far is that
temporally coordinated gestures would evolve - including, perhaps, some
acoustic events consisting of continuous trajectories through the vowel space,
clockwise and counterclockwise loops, S-shaped loops, etc. These may not
fully satisfy all of our requirements for the notion of "segment," so other
factors, discussed below, must also come into play.

7.2.2.3 "Steady-state"
Regarding steady-state segments, several things need to be said. First of all,
from an articulatory point of view there are few if any true steady-state
postures adopted by the speech organs. However, due to the nonlinear
mapping from articulation to aerodynamics and to acoustics there do exist
near steady-states in these latter domains.3 In most cases the reason for this
3 Thus the claim, often encountered, that the speech signal is continuous, that is, shows few
discontinuities and nothing approximating steady-states in between (e.g. Schane 1973: 3;
Hyman 1975: 3), is exaggerated and misleading. The claim is largely true in the articulatory
domain (though not in the aerodynamic domain). And it is true that in the perceptual domain
the cues for separate segments or "phonemes" may overlap, but this by itself does not mean
that the perceptual signal has no discontinuities. The claim is patently false in the acoustic
domain as even a casual examination of spectrograms of speech reveals.

nonlinear relationship is not difficult to understand. Given the elasticity of
the tissue and the inertia of the articulators, during a consonantal closing
gesture the articulators continue to move even after complete closure is
attained. Nevertheless, for as long as the complete closure lasts it effectively
attenuates the output sound in a uniform way. Other parts of the vocal tract
can be moving and still there will be little or no acoustic output to reveal it.
Other nonlinearities govern the creation of steady-states or near-steady-
states for other types of speech events (Stevens 1972, 1989).
But there may be another reason why steady-states would be included in
the speech signal. Recall the task constraint that the code should include
some means for error correction or error reduction. Benoit Mandelbrot
(1954) has argued persuasively that any coded transmission subject to errors
could effect error reduction or at least error limitation by having "break-
points" in the transmission. Consider the consequences of the alternative,
where everything transmitted in between silence constituted the individual
cipher. An error affecting any part of that transmission would make the
entire transmission erroneous. Imagine, for example, a Morse-code type of
system which for each of the 16 million possible sentences that could be
conveyed had a unique string of twenty-four dots and dashes. An error on
even one of the dots and dashes would make the whole transmission fail. On
the other hand if the transmission had breakpoints often enough, that is,
places where what had been transmitted so far could be decoded, then any
error could be limited to that portion and it would not nullify the whole of
the transmission. Checksums and other devices in digital communications
are examples of this strategy. I think the steady-states that we find in speech,
from 50 to 200 msec, or so in duration, constitute the necessary "dead"
intervals or breakpoints that clearly demarcate the chunks with high infor-
mation density. During these dead intervals the listener can decode these
chunks and then get ready for the subsequent chunks. What I am calling
"dead" intervals are, of course, not truly devoid of information but I would
maintain that they transmit information at a demonstrably lower rate than
that during the rapid acoustic modulations they separate. This, in fact, is the
interpretation I give to the experimental results of Ohman (1966b) and
Strange, Verbrugge, and Edman (1976).
It must be pointed out that if there is a high amount of redundancy in the
code, which is certainly true of any human language's vocabulary, then the
ability to localize an error of transmission allows error correction, too.
Hearing "skrawberry" and knowing that there is no such word while there is
a word strawberry allows us to correct a (probable) transmission error.
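
The error-limiting role of breakpoints can be illustrated with a toy transmission example. Nothing below models speech; it only shows, under an assumed chunking and checksum scheme, that a single corrupted symbol invalidates one chunk rather than the whole message, leaving the rest available for redundancy-based correction.

    # Toy illustration (not a model of speech) of Mandelbrot's breakpoint
    # idea: decoding in independent chunks localizes an error to the chunk
    # that contains it, instead of spoiling the whole transmission.

    import hashlib

    def chunk_checksums(chunks):
        return [hashlib.sha256(c.encode()).hexdigest()[:8] for c in chunks]

    sent = ["straw", "berry", "jam"]              # chunks delimited by "breakpoints"
    sums = chunk_checksums(sent)

    received = ["skraw", "berry", "jam"]          # one error in the first chunk
    ok = [h == hashlib.sha256(c.encode()).hexdigest()[:8]
          for c, h in zip(received, sums)]

    print(ok)    # [False, True, True]: the error is confined to one chunk,
                 # which redundancy in the vocabulary can then repair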
I believe that these chunks or bursts of high-density information flow are
what we call "transitions" between phonemes. I would maintain that these
are the kind of units required by the constraints of the communication task.
These are what the speaker is intending to produce when coordinating the
movements of diverse articulators4 and these are what the listener attends
to.
Nevertheless, these are not equivalent to our traditional conception of the
"segment." The units arrived at up to this point contain information on a
sequential pair of traditional segments. Furthermore, the inventory of such
units is larger than the inventory of traditional segments by an order of
magnitude. Finally, what I have called the "dead interval" between these
units is equivalent to the traditional segment (the precise boundaries may be
somewhat ambiguous but that, in fact, corresponds to reality).
I think that our traditional conception of the segment arises from the fact
that adjacent pairs of novel segments, i.e. transitions, are generally corre-
lated. For example, the transition found in the sequence /ab/ is almost
invariably followed by one of a restricted set of transitions, those characteris-
tic of /bi/, /be/, /bu/, etc., but not /gi/, /de/. As it happens, this correlation
between adjacent pairs of transitions arises because it is not so easy for our
vocal tract to produce uncorrelated transitions: the articulator that makes a
closure is usually the same one that breaks the closure. The traditional
segment, then, is an entity constructed by speakers-listeners; it has a
psychological reality based on the correlations that necessarily occur between
successive pairs of the units that emerge from the underlying articulatory
constraints.
The relationship between the acoustic signal, the transitions which require
the close temporal coordination between articulators, and the traditional
segments is represented schematically in figure 7.3.

7.2.2.4 Features
If an acoustically salient gesture is "discovered" by combining labial closure,
velic elevation, and glottal abduction, will the same velic elevation and
glottal abduction be "discovered" to work well with apical and dorsal
closures? Plausibly, the system should also be able to discover how to
"recycle" features, especially in the case of modulations made distinct by the
combination of different "valves" in the vocal tract. There are, after all, very
few options in this respect: glottis, velum, lips, and various actions of the
tongue (see also Fujimura 1989b). A further limitation exists in the options
available for modulating and controlling spectral pattern by virtue of the fact
that the standing wave patterns of the lowest resonances have nodes and
4 The gestures which produce these acoustic modulations may require not only temporal
coordination between articulators but also precision in the articulatory movements them-
selves. This may correspond to what Fujimura (1986) calls "icebergs": patterns of temporally
localized invariant articulatory gestures separated by periods where the gestures are more
variable.

Figure 7.3 Relationship between acoustic speech signal (a), the units with high-rate-of-information transmission that require close temporal coordination between articulators (b), and the traditional segment (c). The units labelled in panel (b) are ep, pi, ik, ka; the horizontal axis is time.

antinodes at discrete and relatively few locations in the vocal tract (Chiba
and Kajiyama 1941; Fant 1960; Stevens 1972, 1989): an expansion of the
pharynx would serve to keep F1 as low as possible when accompanying a
palatal constriction (for an [i]) as well as when accompanying simultaneous
labial and uvular constrictions (for an [u]) due to the presence there of an
antinode in the pressure standing wave of the lowest resonance.5
Having said this, however, it would be well not to exaggerate (as
phonologists often do) the similarity in state or function of what is
considered to be the "same" feature when used with different segments. The
same velic coupling will work about as well with a labial closure as an apical
one to create [m] and [n] but as the closure gets further back the nasal
consonants that result get progressively less consonantal. This is because an
5 Pharyngeal expansion was not used in the implementation of the [u]-like vowel 3 in figure 7.1,
but if it had been it would have approached more closely the corner vowel [u] from the
Peterson and Barney study.

important element in the creation of a nasal consonant is the "cul-de-sac"
resonating cavity branching off the pharyngeal-nasal resonating cavity. This
"cul-de-sac" naturally gets shorter and acts less effectively as a separate
cavity the further back the oral closure is (Fujimura 1962; Ohala 1975, 1979a,
b; Ohala and Lorentz 1977). I believe this accounts for the lesser incidence or
more restricted distribution of [rj] in the sound systems of the languages of
the world6. Similarly, although a stop burst is generally a highly salient
acoustic event, all stop bursts are not created equal. Velar and apical stop
bursts have the advantage of a resonating cavity downstream which serves to
reinforce their amplitude; this is missing in the case of labial stop bursts.
Accordingly, among stops that rely heavily on bursts, i.e. voiceless stops
(pulmonic or glottalic), the labial position is often unused, has a highly
restricted distribution, or simply occurs less often in running speech (Wang
and Crawford 1960; Gamkrelidze 1975; Maddieson 1984: ch. 2). The more
one digs into such matters, the more differences are found in the "same"
feature occurring in different segments: as mentioned above, Sawashima and
Hirose have found differences in the character of glottal state during
fricatives vis-a-vis cognate stops. The conclusion to draw from this is that
what matters most in speech communication is making sounds which differ
from each other; it is less important that these be made out of recombinations
of the same gestures used in other segments. The orderly grid-like systems of
oppositions among the sounds of a language which one finds especially in
Prague School writings (Trubetzkoy 1939 [1969]) are quite elusive when
examined phonetically. Instead, they usually exhibit subtle or occasionally
not-so-subtle asymmetries. Whether one can make a case for symmetry
phonologically is another matter but phonologists cannot simply assume
that the symmetry is self-evident in the phonetic data.

7.2.2.5 Final comment on the preceding evolutionary scenario


I have offered plausibility arguments that some of the properties we
commonly associate with the notion "segment," i.e. temporal coordination
of articulators, steady-states, and use of a small set of combinable features,
are derivable from physical and physiological constraints of the speaking and
hearing mechanisms in combination with constraints of the task of forming a
vocabulary. High bit-rate transitions separated by "dead" intervals are
suggested to be the result of this effort. The traditional notion of the
"segment" itself - which is associated with the intervals between the
6 Given that [ŋ] is much less "consonantal" than other nasal consonants and given its long
transitions (which it shares with any velar consonant), it is often more a nasalized velar glide
or even a nasalized vowel. I think this is the reason it often shows up as an alternant of, or
substitute for, nasalized vowels in coda position, e.g., in Japanese, Spanish, Vietnamese. See
Ohala (1975).


transitions - is thought to be derived from the probabilities of cooccurrence


of successive transitions. It is important to note that temporal coordination
of articulators is a necessary property of the transitions not of the traditional
segment.
The evolutionary scenario presented above is admittedly speculative. But
the arguments regarding the necessity for temporal coordination of articula-
tors in order to build a vocabulary exhibiting sufficient contrast are based on
well-known phonetic principles and have already been demonstrated in
numerous efforts at articulatory-based synthesis.

7.3 Interpretation

7.3.1 Segment is primitive now


If the outcome of this "gedanken" simulation is accepted, then it must also
be accepted that spoken languages incorporate, indeed, are based on, the
segment. To paraphrase Voltaire on God's existence: if segments did not
exist, we would have invented them (and perhaps we did). Though not a
primitive in the prespeech stage, it is a primitive now, that is, from the point
of view of anyone having to learn to speak and to build a vocabulary. All of
the arguments given above for why temporal coordination between articula-
tors was necessary to create an optimal vocabulary would still apply for the
maintenance of that vocabulary. I suggest, then, that autosegmental phono-
logy's desegmentalization of speech, especially traditional segmental sounds
(as opposed to traditional suprasegmentals) is misguided.
Attempts to link up the features or autosegments to the "time slots" in the
CV tier by purely formal means (left-to-right association, etc.) are missing an
important function of the segment. The linkages or coordination between
features are there for an important purpose: to create contrasts, which
contrasts exploit the capabilities of the speech apparatus by coordinating
gestures at different locations within the vocal tract. Rather than being linked
by purely formal means that take no account of the "intrinsic content" of the
features, the linkages are determined by physical principles. If features are
linked of necessity then they are not autonomous. (See also Kingston [1990];
Ohala [1990b].)
It is true that the anatomy and physiology of the vocal tract permit the
coordination between articulators to be loose or tight. A relatively loose link
is found especially between the laryngeal gestures which control the funda-
mental frequency (F0) of voice and the other articulators. Not coincidentally,
it was tone and intonation that were the first to be autosegmentalized (by the
Greeks, perhaps; see below). Nevertheless, even F0 modulations have to be
coordinated with the other speech events. Phonologically, there are cases
where tone spreading is blocked by consonants known to perturb F0 in
certain ways (Ohala 1982 and references cited there). Phonetically, it has
been demonstrated that the F0 contours characteristic of word accent in
Swedish are tailored in various ways so that they accommodate to the voiced
portions of the syllables they appear with (Erikson and Alstermark 1972;
Erikson 1973). Evidence of a related sort for the F0 contours signaling stress
in English has been provided by Steele and Liberman (1987).
Vowel harmony and nasal prosodies, frequently given autosegmental
treatment, also do not show themselves to be completely independent of
other articulations occurring in the vocal tract. Vowel harmony shows
exceptions which depend on the phonetic character of particular vowels and
consonants involved (Zimmer 1969; L. Anderson 1980). Vowel harmony that
is presumably still purely phonetic (i.e., which has not yet become phonolo-
gized) is observable in various languages (Ohman 1966a; Yaeger 1975) but
these vowel-on-vowel effects are (a) modulated by adjacent consonants and
(b) are generally highly localized in their domain, being manifested just on
that fraction of one vowel which is closest to another conditioning vowel. In
this latter respect, vowel-vowel coarticulation shows the same temporal
limitation characteristic of most assimilations: it does not spread over
unlimited domains. (I discuss below how assimilations can enlarge their
temporal domain through sound change, i.e. the phonologization of these
short-span phonetic assimilations.) Assimilatory nasalization, another pro-
cess that can develop into a trans-syllabic operation, is sensitive to whether
the segments it passes through are continuant or not and, if noncontinuant,
whether they have a constriction further forward of the uvula (Ohala 1983).
All of this, I maintain, gives evidence for the temporal linkage of features.7

7.3.2 Possible counterarguments


Let me anticipate some counterarguments to this position.

7.3.2.1 Feature geometry


It might be said that some (all?) of the interactions between features will be
taken care of by so-called "feature geometry" (Clements 1985; McCarthy
1989) which purports to capture the corelatedness between features through
a hierarchical structure of dependency relationships. These dependencies are
said to be based on phonetic considerations. I am not optimistic that the
interdependencies among features can be adequately represented by any
network that posits a simple, asymmetric, transitive, dependency relationship
7 Though less common, there are also cases where spreading nasalization is blocked by certain
continuants, too; see Ohala (1974, 1975).

between features. The problem is that there exist many different types of
physical relationships between the various features. Insofar as a phonetic
basis has been considered in feature geometry, it is primarily only that of
spatial anatomical relations. But there are also aerodynamic and acoustic
relations, and feature geometry, as currently proposed, ignores these. These
latter domains link anatomically distant structures. Some examples (among
many that could be cited): simultaneous oral and velic closures inhibit vocal-
cord vibration; a lowered soft palate not only inhibits frication and trills in
oral obstruents (if articulated at or further forward of the uvula) but also
influences the F1 (height) of vowels; the glottal state of high airflow segments
(such as /s/, /p/), if assimilated onto adjacent vowels, creates a condition that
mimics nasalization and is apparently reinterpreted by listeners as nasaliza-
tion; labial-velar segments like [w,kp] pattern with plain labials ([ + anterior])
when they influence vowel quality or when frication or noise bursts are
involved, but they frequently pattern like velars ([ — anterior]) when nasal
consonants assimilate to them; such articulatorily distant and disjoint
secondary articulations as labialization, retroflexion, and pharyngealization
have similar effects on high vowels (they centralize [i] maximally and have
little effect on [u]) (Ohala 1976, 1978, 1983, 1985a, b; Beddor, Krakow, and
Goldstein 1986; Wright 1986).
I challenge the advocates of "feature geometry" to represent such criss-
crossing and occasionally bidirectional dependencies in terms of asymmetric,
transitive, relationships. In any case, the attempt to explain these and a host
of other dependencies other than by reference to phonetic principles will be
subject to the fundamental criticism: even if one can devise a formal
relabeling of what does happen in speech, one will not be able to show in
principle - that is, without ad hoc stipulations - why certain patterns do not
happen. For example, why should [ + nasal] affect primarily the feature [high]
in vowels and not the feature [back]? Why should [ — continuant] [ — nasal]
inhibit [ +voice] instead of [ — voice]?

7.3.2.2 Grammar, not physics


Also, it might legitimately be objected that the arguments I have offered so
far have been from the physical domain, whereas autosegmental represen-
tations have been posited for speakers' grammars, i.e. in the psychological
domain.8 The autosegmental literature has not expended much effort
gathering and evaluating evidence on the psychological status of autoseg-
8 There is actually considerable ambiguity in current phonological literature as to whether
physical or psychological claims are being made or, indeed, whether the claims should apply
to both domains or neither. I have argued in several papers that phonologists currently assign
some events that are properly phonetic to the psychological domain (Ohala 1974, 1985b,
forthcoming). But even this is not quite so damaging as assigning to the synchronic
psychological domain events which properly belong to a language's history.


ments, but there is at least some anecdotal and experimental evidence that
can be cited and it is not all absolutely inconsistent with the autosegmental
position (though, I would maintain, it does not unambiguously support it
either). Systematic investigation of the issues is necessary, though, before any
confident conclusions may be drawn.
Even outside of linguistics analyzers of song and poetry have for millennia
extracted metrical and prosodic structures from songs and poems. An
elaborate vocabulary exists to describe these extracted prosodies, e.g. in the
Western tradition the Greeks gave us terms and concepts such as iamb,
trochee, anapest, etc. Although worth further study, it is not clear what
implication this has for the psychological reality of autosegments. Linguisti-
cally naive (as well as linguistically sophisticated) speakers are liable to the
reification fallacy. Like Plato, they are prone to regard abstract concepts as
real entities. Fertility, war, learning, youth, and death are among the many
fundamental abstract concepts that people have often hypostatized, some-
times in the form of specific deities. Yet, as we all know, these concepts only
manifest themselves when linked with specific concrete people or objects.
They cannot "float" as independent entities from one object to another.
Though more prosaic than these (so to speak), is iamb any different? Are
autosegments any different?
But even if we admit that ordinary speakers are able to form concepts of
prosodic categories paralleling those in autosegmental phonology, no cul-
ture, to my knowledge, has shown an awareness of comparable concepts
involving, say, nasal (to consider one feature often treated autosegmentally).
That is, there is no vocabulary and no concept comparable to iamb and
trochee for the opposite patterns of values for [nasal] in words like dam vs.
mad or mountain vs. damp. The concepts and vocabulary that do exist in this
domain concerning the manipulation of nonprosodic entities are things like
rhyme, alliteration, and assonance, all of which involve the repetition of
whole segments.
Somewhat more to the point, psychologically, is evidence from speech
errors, word games and things like "tip of the tongue" (TOT) recall. Errors
of stress placement and intonation contour do occur (Fromkin 1976; Cutler
1980), but they are often somewhat difficult to interpret. Is the error of
ambiguty for the target ambiguity a grafting of the stress pattern from the
morphologically related word ambiguous (which would mean that stress is an
entity separable from the segments it sits on) or has the stem of this latter
word itself intruded? Regarding the shifting of other features, including
[nasal] and those for places of articulation, there is some controversy.
Fromkin (1971) claimed there was evidence of feature interchange, but
Shattuck-Hufnagel and Klatt (1979) say this is rare - usually whole bundles
of features, i.e. phonemes, are what shift. Hombert (1986) has demonstrated


using word games that the tone and vowel length of words can, in some cases
(but not all), be stripped off the segments they are normally realized on and
materialized in new places. In general, though, word games show that it is
almost invariably whole segments that are manipulated, not features. TOT
recall (recall of some aspects of the pronunciation of a word without full
retrieval of the word) frequently exhibits awareness of the prosodic character
of the target word (including the number of syllables, as it happens; Brown
and McNeill 1966; Browman 1978). Such evidence is suggestive but unfortu-
nately, even when knowledge of features is demonstrated, it does not provide
crucial evidence for differentiating between the psychological reality of
traditional (mutually linked) features and autonomous features, i.e., autoseg-
ments. There is as yet no hitching post for autosegmental theory in this data.

7.3.2.3 Features migrate across segment boundary lines


If, as I maintain, there is temporal coordination between features in order to
create contrasts, how do I account for the fact that features are observed to
spill over onto adjacent segments as in assimilation? My answer to this is first
to reemphasize that the temporal coordination occurs on the transitions, not
necessarily during the traditional segment. Second, "coordination" does not
imply that the participating articulators all change state simultaneously at
segment boundaries, rather, that at the moment the rapid acoustic modula-
tion, i.e. the transition, is to occur, e.g., onset of a postvocalic [ʒ], the various
articulators have to be in specified states. These states can span two or more
traditional segments. For a [ʒ] the soft palate must be elevated and the
tongue must be elevated and bunched in the palatal region. Given the inertia
of these articulators, these actions will necessarily have to start during the
preceding vowel (if these postures had not already been attained). Typically,
although many of these preparatory or perseveratory gestures leave some
trace in the speech signal, listeners learn to discount them except at the
moment when they contribute to a powerful acoustic modulation. The
speech-perception literature provides many examples showing that listeners
factor out such predictable details unless they are presented out of their
normal context (i.e. where the conditioning environment has been deleted)
(Ohala 1981b; Beddor, Krakow, and Goldstein 1986). My own pronunci-
ation of "measure," with what I regard as the "monophthong" [e] in the first
syllable, has a very noticeable palatal glide at the end of that vowel which
makes it resemble the diphthong [ej] in a word like "made." I do not perceive
that vowel as [ej], presumably because when I "parse"9 the signal I assign
the palatal on-glide to the [ʒ], not the vowel. Other listeners may parse this
glide with the vowel and thus one finds dialectally mergers of /ej/ and /e/
9 I use "parse" in the sense introduced by Fowler (1986).

before palato-alveolars, e.g., spatial and special become homophones. (See
also Kawasaki [1986] regarding the perceptual "invisibility" of nasalization
near nasal consonants.)
Thus, from a phonetic point of view such spill-over of articulatory gestures
is well known (at least since the early physiological records of speech using
the kymograph) and it is a constant and universal feature of speech, even
before any sound change occurs which catches the attention of the linguist.
Many features thus come "prespread," so to speak; they do not start
unspread and then migrate to other segments. Such spill-over only affects the
phonological interpretation of neighboring elements if a sound change
occurs. I have presented evidence that sound change is a misapprehension or
reinterpretation on the part of the listener (Ohala 1974, 1975, 1981b, 1985a,
1987,1989). Along with this reinterpretation there may be some exaggeration
of aspects of the original pronunciation, e.g. the slight nasalization on a
vowel may now be heavy and longer. Under this view of sound change, no
person, neither the speaker nor the listener, has implemented a change in the
sense of having in their mental grammar a rule that states something like /e/
→ /ej/ / _ /ʒ/; rather, the listener parses the signal in a way that differs from the
way the speaker parses it. Similarly, if a reader misinterprets a carelessly
handwritten "n" as the letter "u," we would not attribute to the writer or the
reader the psychological act or intention characterized by the rule "n" →
"u." Such a rule would just be a description of the event from the vantage
point of an observer (a linguist?) outside the speaker's and listener's domains.
In the case of sound patterns of language, however, we are now able to go
beyond such external, "telescoped," descriptions of events and provide
realistic, detailed, accounts in terms of the underlying mechanisms.
The migration of features is therefore not evidence for autosegmental
representations and is not evidence capable of countering the claim that
features are nonautonomous. There is no mental representation requiring
unlinked or temporally uncoordinated features.

7.4 Conclusions
I have argued that features are so bound together due to physical principles
and task constraints that if we started out with uncoordinated features they
would have linked themselves of their own accord.10 Claims that features
can be unlinked have not been made with any evident awareness of the full
phonetic complexity of speech, including not only the anatomical but also
the aerodynamic and the acoustic-auditory principles governing it. Thus,
more than twenty years after the defect was first pointed out, phonological
representations still fail to reflect the "intrinsic content" of speech (Chomsky
and Halle 1968: 400ff.). They also suffer from a failure to consider fully the
kind of diachronic scenario which could give rise to apparent "spreading" of
features, one of the principal motivations for unlinked features.

10. Similar arguments are made under the heading of "feature enhancement" by Stevens, Keyser,
and Kawasaki (1986) and Stevens and Keyser (1989).
What has been demonstrated in the autosegmental literature is that it is
possible to represent speech-sound behavior using autosegments which
eventually become associated with the slots in the CV skeleton. It has not
been shown that it is necessary to do so. The same phonological phenomena
have been represented adequately (though still not explanatorily) without
autosegments. But there must be an infinity of possible ways to represent
speech (indeed, we have seen several in the past twenty-five years and will no
doubt see several more in the future); equally, it was possible to represent
apparent solar and planetary motion with the Ptolemaic epicycles and the
assumption of an earth-centered universe. But we do not have to relive
history to see that simply being able to "save the appearances" of pheno-
mena is not justification in itself for a theory. However, even more damaging
than the lack of a compelling motivation for the use of autosegments is that
the concept of autosegments cannot explain the full range of phonological
phenomena which involve interactions between features, a very small sample
of which was discussed above. This includes the failure to account for what
does not occur in phonological processes or what occurs much less com-
monly. On the other hand, I think phonological accounts which make
reference to the full range of articulatory, acoustic, and auditory factors,
supported by experiments, have a good track record in this regard (Ohala
1990a).

Comments on chapter 7
G. N. CLEMENTS
In his paper "The segment: primitive or derived?" Ohala constructs what he
calls a "plausibility argument" for the view that there is no level of
phonological representation in which features are not coordinated with each
other in a strict one-to-one fashion.* In contrast to many phoneticians who
have called attention to the high degree of overlap and slippage in speech
production, Ohala argues that the optimal condition for speech perception
requires an alternating sequence of precisely coordinated rapid transitions
and steady-states. From this observation, he concludes that the phonological

representations belonging to the mental grammars of speakers must consist
of segments defined by sets of coordinated features. Ohala claims to find no
phonetic or phonological motivation for theories (such as autosegmental
phonology) that allow segments to be decomposed into sets of unordered
features.

* Research for this paper was supported in part by grant no. INT-8807437 from the National
Science Foundation.
We can agree that there is something right about the view that precise
coordination of certain articulatory events is conducive to the optimal
transmission of speech. This view is expressed, in autosegmental phonology,
in the association conventions whose primary function is to align features in
surface representation that are not aligned in underlying representation. In
fact, autosegmental phonology goes a step further than this and claims that
linear feature alignment is a default condition on underlying representation
as well. It has been argued that a phonological system is more highly valued
to the extent that its underlying representations consist of uniformly linear
sequences in which segments and features are aligned in a linear, one-to-one
fashion (Clements and Goldsmith 1984). Departures from this type of
regularity come at a cost, in the sense that the learner must identify each type
of nonlinearity and enter it as a special statement in the grammar. Far from
advocating a thoroughgoing "desegmentalization of speech," the position of
autosegmental phonology has been a conservative one in this respect,
compared to other nonlinear theories such as Firthian prosodic analysis,
which have taken desegmentalization a good deal further.11
While we can agree that the coordination of features represents a default
condition on phonological representation perhaps at all levels, this is not the
whole story. The study of phonological systems shows clearly that features
do not always align themselves into segments, and it is exactly for this reason
that nonlinear representational systems such as autosegmental phonology
have been developed. The principal empirical claim of autosegmental phono-
logy is that phonological rules treat certain features and feature sets
independently of others, in ways suggesting that features and segments are
not always aligned in one-to-one fashion. The assignment of such features to
independent tiers represents a direct and nonarbitrary way of expressing this
functional independence. Crucial evidence for such feature autonomy comes
from the operation of rules which map one structure into another: it is only
when phonological features are "in motion," so to speak, that we can
determine which features act together as units.
Ohala responds to this claim by arguing that the rules which have been
proposed in support of such functional feature independence belong to
physics or history, not grammar. In his view, natural languages do not have
synchronic assimilation rules, though constraints on segment sequences may
reflect the reinterpretation (or "misanalysis") of phonetic processes operat-
ing at earlier historical periods. If true, this argument seriously undermines
the theory at its basis. But is it true? Do we have any criteria for determining
when a detectable linguistic generalization is a synchronic rule?

11. See, for example, Local (this volume). Recently, Archangeli (1988) has argued for full
desegmentalization of underlying representations within a version of underspecification
theory.
This issue has been raised elsewhere by Ohala himself. He has frequently
expressed the view that a grammar which aims at proposing a model of
speaker competence must distinguish between regularities which the speaker
is "aware" of, in the sense that they are used productively, and those that are
present only for historical reasons, and which do not form part of the
speaker's grammar (see e.g. Ohala 1974; Ohala and Jaeger 1986). In this
view, the mere existence of a detectable regularity is not by itself evidence
that it is incorporated into the mental grammar as a synchronic rule. If a
regularity has the status of a rule, we expect it to meet what we might call the
productivity standard: the rule should apply to new forms which the speaker
has not previously encountered, or which cannot have been plausibly
memorized.
Since its beginnings, autosegmental phonology has been based on the
study of productive rules in just this sense, and has based its major theoretical
findings on such rules. Thus, for example, in some of the earliest work, Leben
(1973) showed that when Bambara words combine into larger phrases, their
surface tone melody varies in regular ways. This result has been confirmed
and extended in more recent work showing that the surface tone pattern of
any word depends on tonal and grammatical information contributed by the
sentence as a whole (Rialland and Badjime 1989). Unless all such phrases are
memorized, we must assume that unlinked "tone melodies" constitute an
autonomous functional unit in the phonological composition of Bambara
words.
Many other studies in autosegmental phonology are based on productive
rules of this sort. The rules involved in the Igbo and Kikuyu tone systems, for
example, apply across word boundaries and, in the case of Kikuyu, can affect
multiple sequences of words at once. Unless we are willing to believe that
entire sentences are listed in the lexicon, we are forced to conclude that the
rules are productive, and part of the synchronic grammar.12 Such evidence is
not restricted to tonal phenomena. In Icelandic, preaspirated stops are
created by the deletion of the supralaryngeal features of the first member of a
geminate unaspirated stop and of the laryngeal features of the second. In his
study of this phenomenon, Thrainsson (1978) shows at considerable length
that it satisfies a variety of productivity criteria, and must be part of a
synchronic grammar. In Luganda, the rules whose operation brings to light
the striking phenomenon of "mora stability" apply not only within morpho-
logically complex words but also across word boundaries. The independence
of the CV skeleton in this language is confirmed not only by regular
alternations, but also by the children's play language called Ludikya, in
which the segmental content of syllables is reversed while length and tone
remain constant (Clements 1986). In English, the rule of intrusive stop
formation which inserts a brief [t] in words like prince provides evidence for
treating the features characterizing oral occlusion as an autosegmental node
in hierarchical feature representation (see Clements 1987 for discussion); the
productivity of this rule has never been questioned, and is experimentally
demonstrated in Ohala (1981a). In sum, the argumentation upon which
autosegmental phonology is based has regularly met the productivity stan-
dard as Ohala and others have characterized it. We are a long way from the
days when to show that a regularity represented a synchronic rule, it was
considered sufficient just to note that it existed.

12. For Igbo see Goldsmith (1976), Clark (1990); for Kikuyu see Clements (1984) and references
therein.
But even if we agree that autosegmental rules constitute a synchronic
reality, a fundamental question still remains: if, as Ohala argues, linear
coordination of features represents the optimal condition for speech percep-
tion, why do we find feature asynchrony at all? The reasons for this lie, at
least in part, in the fact that phonological structure involves not only
perceptually motivated constraints, but also articulatorily motivated con-
straints, as well as higher-order grammatical considerations. Phonology (in
the large sense, including much of what is traditionally viewed as phonetics)
is concerned with the mapping between abstract lexicosyntactic represen-
tations and their primary medium of expression, articulated speech. At one
end of this mapping we find linguistic structures whose formal organization
is hierarchical rather than linear, and at the other end we find complex
interactions of articulators involving various degrees of neural and muscular
synergy and inertia. Neither type of structure lends itself readily or insight-
fully to expression in terms of linear sequences of primitives (segments) or
stacks of primitives (feature bundles).
In many cases, we find that feature asynchrony is regularly characteristic
of phonological systems in which features and feature sets larger and smaller
than the segment have a grammatical or morphological function. For
instance, in languages where tone typically serves a grammatical function (as
in Bantu languages, or many West African languages), we find a greater
mismatch between underlying and surface representations than in languages
where its function is largely lexical (as in Chinese). In Bambara, to take an
example cited earlier, the floating low tone represents the definite article and
the floating high tone is a phrasal-boundary marker; while in Igbo, the
floating tone is the associative-construction marker. The well-known non-
linearities found in Semitic verb morphology are due to the fact that
consonants, vowels, and templates all play a separate grammatical role in the
make-up of the word (McCarthy 1981). In many further cases, autosegmen-
talized features have the status of morpheme-level features, rather than
segment-level features (to use terminology first suggested by Robert Vago).
Thus in many vowel-harmony systems, the harmonic feature (palatality,
ATR (Advanced Tongue Root), etc.) commutes at the level of the root or
morpheme, not the segment. In Japanese, as Pierrehumbert and Beckman
point out (1988), some tones characterize the morpheme while others
characterize the phrase, a fact which these authors represent by linking each
tone to the node it characterizes. What these and other examples suggest is
that nonlinearities tend to arise in a system to the extent that certain subsets
of features have a morphological or syntactic role independent of others.
Other types of asynchronies between features appear to have articulatory
motivation, reflecting the relative sluggishness of some articulators with
respect to others (cf. intrusive stop formation, nasal harmonies, etc.), while
others may have functional or perceptual motivation (the Icelandic pre-
aspiration rule preserves the distinction between underlying aspirated and
unaspirated geminate stops from the effects of a potentially neutralizing
deaspiration rule, but it translates this distinction into one between preaspir-
ated and unaspirated geminates). If all such asynchronies represent depar-
tures from the optimal or "default" state and in this way add to the formal
complexity of a phonological representation, then many of the rules and
principles of autosegmental phonology can be viewed as motivated by the
general, overriding principle: reduce complexity.
Ohala argues that "feature geometry," a model which uses evidence from
phonological rules to specify a hierarchical organization among features
(Clements 1985; Sagey 1986a; McCarthy 1988), does not capture all observ-
able phonetic dependencies among features, and is therefore incomplete.
However, feature geometry captures a number of significant cross-linguistic
generalizations that could not be captured in less structured feature systems,
such as the fact that the features defining place of articulation commonly
function as a unit in assimilation rules. True, it does not and cannot express
certain further dependencies, such as the fact that labiality combines less
optimally with stop production (as in [p]) than does apical or velar closure (as
in [t] or [k]). But not all such generalizations form part of phonological
grammars. Thus, phonologists have not discovered any tendency for rules to
refer to the set of all stops except [p] as a natural class. On the contrary, in
spite of its less efficient exploitation of vocal-tract mechanics, [p] consistently
patterns with [t] and [k] in rules referring to oral stops, reflecting the general
tendency of phonological systems to impose a symmetrical classification
upon speech sounds sharing linguistically significant properties. If feature
geometry attempted to derive all phonetic as well as phonological dependen-
cies from its formalism, it would fail to make correct predictions about cross-
linguistically favored rule types, in this and many other cases.
Ohala rejects autosegmental phonology (and indeed all formal approaches
to phonology) on the grounds that its formal principles may have an ultimate
explanation in physics and psychoacoustics, and should therefore be super-
fluous. But this argument overlooks the fact that physics and psychology are
extremely complex sciences, which are subject to multiple (and often conflict-
ing) interpretations of phenomena in almost every area. Physics and psy-
chology in their present state can shed light on some aspects of phonological
systems, but, taken together, they are far from being able to offer the hard,
falsifiable predictions that Ohala's reductionist program requires if it is to
acquire the status of a predictive empirical theory. In particular, it is often
difficult to determine on a priori grounds whether articulatory, aerodynamic,
acoustic, or perceptual considerations play the predominant role in any given
case, and these different perspectives often lead to conflicting expectations. It
is just here that the necessity for formal models becomes apparent. The
advantage of using formal models in linguistics (and other sciences) is that
they can help us to formulate and test hypotheses within the domain of study
even when we do not yet know what their ultimate explanation might be. If
we abandoned our models on the grounds that we cannot yet explain and
interpret them in terms of higher-level principles, as Ohala's program
requires, we would make many types of discovery impossible. To take one
familiar example: Newton could not explain the Law of Gravity to the
satisfaction of his contemporaries, but he could give a mathematical state-
ment of it - and this statement proved to have considerable explanatory
power.
It is quite possible that, ultimately, all linguistic (and other cognitive)
phenomena will be shown to be grounded in physical, biological, and
psychological principles in the largest sense, and that what is specific to the
language faculty may itself, as some have argued, have an evolutionary
explanation. But this possibility does not commit us to a reductionist
philosophy of linguistics. Indeed, it is only by constructing explicit, predictive
formal or mathematical models that we can identify generalizations whose
relations to language-external phenomena (if they exist) may one day become
clear. This is the procedure of modern science, which has been described as
follows by one theoretical physicist (Hawking 1988: 10): "A theory is a good
theory if it satisfies two requirements: It must ultimately describe a large class
of observations on the basis of a model that contains only a few arbitrary
elements, and it must make definite predictions about the results of future
observations." This view is just as applicable to linguistics and phonetics as it
is to physics.
Ohala's paper contains many challenging and useful ideas, but it over-
states its case by a considerable margin. We can agree that temporal
coordination plays an important role in speech production and perception
without concluding that phonological representations are uniformly segmen-
tal. On the contrary, the evidence from a wide and typologically diverse
number of languages involving both "suprasegmental" and traditionally
segmental features demonstrates massively and convincingly that phonologi-
cal systems tolerate asynchronic relations among features at all levels of
representation. This fact of phonological structure has both linguistic and
phonetic motivation. Although phonetics and phonology follow partly
different methodologies and may (as in this case) generate different hypoth-
eses about the nature of phonological structure, the results of each approach
help to illuminate the other, and take us further towards our goal of
providing a complete theory of the relationship between discrete linguistic
structure and the biophysical continuum which serves as its medium.

8
Modeling assimilation in nonsegmental,
rule-free synthesis

JOHN LOCAL

8.1 Introduction
Only relatively recently have phonologists begun the task of seriously testing
and evaluating their claims in a rigorous fashion.1 In this paper, I attempt
to sustain this task by discussing a computationally explicit version of one
kind of structured phonology, based on the Firthian prosodic approach to
phonological interpretation, which is implemented as an intelligent know-
ledge-based "front-end" to a laboratory formant speech synthesizer (Klatt
1980). My purpose here is to report on how the superiority of a structured
monostratal approach to phonology over catenative segmental approaches
can be demonstrated in practice. The approach discussed here compels new
standards of formal explicitness in the phonological domain as well as a need
to pay serious attention to parametric and temporal detail in the phonetic
domain (see Browman and Goldstein 1985, 1986). The paper falls into two
parts: the first outlines the nonsegmental approach to phonological interpre-
tation and representation; the second gives an overview of the way "process"
phenomena are treated within this rule-free approach and presents an
analysis of some assimilatory phenomena in English and the implementation
of that analysis within the synthesis model. Although the treatment of
assimilation presented here is similar to some current proposals (e.g. Lodge
1984), it is, to the best of my knowledge, unique in having been given an
explicit computational implementation.

1. The approach to phonological analysis presented here is discussed at length in Kelly and
Local (1989), where a wide range and variety of languages are considered. The synthesis of
English which forms part of our work in phonological theory is supported by a grant from
British Telecom PLC. The work is collaborative and is being carried out by John Coleman
and myself. Though my name appears as the author of this paper I owe a great debt to John
Coleman, without whom this work could not have taken the shape it does.


8.2 Problems with segmental, rewrite-rule phonologies


In order to give some sense to the terms "nonsegmental" and "rule-free"
which appear in the title, it is necessary to sketch the broad outlines of our
approach to phonological interpretation and representation. As a prelude to
this, it is useful to begin by considering some of the problems inherent in rule-
based, segmental-type phonologies. The most explicit version of such a
segmental phonology, transformational generative phonology (TGP), is a
sophisticated device for deriving strings of surface phonetic segments from
nonredundant lexical strings of phonemes. TGP rewrite rules simply map
well-ordered "phonological" strings onto well-ordered "phonetic strings."
For some time TGP has been the dominant phonological framework for text-
to-speech synthesis research.
There have been a number of general computational, practical, and
empirical arguments directed against weaknesses in the TGP approach (see,
for example, Botha 1971; Ladefoged 1971; Johnson 1972; Koutsoudas,
Sanders, and Noll 1974; Pullum 1978; Linnell 1979; see also Peters and
Ritchie 1972; Peters 1973; Lapointe 1977; King 1983, for more general
critiques of the computational aspects of TG). For instance, because of the
richness and complexity of the class of rules which may be admitted by a
TGP model, it may be impossible in practice to derive an optimal TGP. This
problem is exacerbated by rule interaction and rule-ordering paradoxes.
Moreover, standard TGP models do not explicitly recognize structural
domains such as foot, syllable, and onset. This means that structure-
dependent information cannot be felicitously represented and some
additional mechanism(s) must be sought to handle "allophony," "coarticu-
lation," and the like; notice that the frequent appeals to "boundary" features
or to "stress," for instance, tacitly trade on structural dependence. The
deletion and insertion rules exploited in TGP permit arbitrarily long portions
of strings to be removed or added. This is highly problematic. Leaving aside
the computational issues (that by admitting deletion rules TGP may define
nonrecursive languages) the empirical basis for deletion processes has never
been demonstrated. Indeed, the whole excessively procedural, "process"-
oriented approach (of which "deletion" and "insertion" are but two
exemplars) embodied by TGP has never been seriously warranted, defended,
or motivated.
Numerous criticisms have also been leveled against the TGP model in
recent years by the proponents of phonological theories such as autosegmen-
tal phonology, dependency phonology, and metrical phonology. These
criticisms have been largely directed at the "linearity" and "segmentality" of
TGP and the range and complexity of the rules allowed in TGP. An attempt
has been made to reduce the number and baroqueness of "rules" required,
and much simpler, more general operations and constraints have been
proposed. However, despite the claims of their proponents, these approaches
embody certain of the entrenched problematic facets of TGP. Transform-
ation rewrite rules (especially deletion and insertion rules) are still employed
whenever it is convenient (e.g. Anderson 1974; Clements 1976; McCarthy
1981; Goldsmith 1984). Such rules are only required because strings continue
to be central to phonological representation and because phonological and
phonetic representations are cast in the same terms and treated as if they
make reference to the same kinds of categories. It is true that the use of more
sophisticated representations in these approaches has gone some way
towards remediating the lack of appropriate structural domains in TGP, but
"nonlinear" analyses are still, in essence, segment-oriented and typically
treat long-domain phonological units as being merely extended or spread
short-domain units. The reasons for treating long-domain units in this way
have never been explicated or justified. Nor do any of these "nonlinear"
approaches seriously question the long-standing hypothesis that strings of
concatenated segments are appropriate phonological representations. De-
pendency graphs and metrical trees are merely graphical ways of representing
structured strings, and autosegmental graphs simply consist of well-ordered,
parallel strings linked by synchronizing "association lines." Rather than
engaging in a formal investigation of phonological interpretation, these three
frameworks have slipped into the trap of investigating the properties of
diagrams on paper (see Coleman and Local 1991).
The problems I have identified and sketched here have important conse-
quences for both phonology in general and for synthesis-by-rule systems in
particular. It is possible, however, to avoid all these problems by identifying
and removing their causes; in doing this we aim to determine a more
restrictive phonological model.

8.3 Nonsegmental, declarative phonology


Our attempt to construct a more restrictive theory of phonology than those
currently available has two main sources of inspiration: Firthian prosodic
phonology (Firth 1957) and the work on unification-grammar (UG) formal-
ism (Shieber 1986). The main characteristics of our approach to phonology
are that
it is abstract;
it is monostratal;
it is structured;
it is monotonic.

8.3.1 Abstractness: phonology and phonetics demarcation
One of the central aspects of the Firthian approach to phonology,2 and one
that still distinguishes it from much current work, is the insistence on a strict
distinction between phonetics and phonology (see also Pierrehumbert and
Beckman 1988). This is a central commitment in our work. We take seriously
Trubetzkoy's dictum that:
The data for the study of the articulatory as well as the acoustic aspects of speech
sounds can only be gathered from concrete speech events. In contrast, the linguistic
values of sounds to be examined by phonology are abstract in nature. They are above
all relations, oppositions, etc., quite intangible things, which can be neither perceived
nor studied with the aid of the sense of hearing or touch. (Trubetzkoy 1939 [1969]: 13)
Like the Firthians and Trubetzkoy, we take phonology to be relational: it is a
study whose descriptions are constructed in terms of structural and systemic
contrast; in terms of distribution, alternation, opposition, and composition.
Our formal approach, then, treats phonology as abstract; this has a number
of important consequences. For example, this enables us (like the Firthians)
to employ a restrictive, monostratal phonological representation. There is
only one level of phonological representation and one level of phonetic
representation; there are no derivational steps. This means that for us it
would be incoherent to say such things as "a process that converts a high
tone into a rising tone following a low tone" (Kaye 1988: 1) or "a striking
feature of many Canadian dialects of English is the implementation of the
diphthongs [ay] and [aw] as [ʌy] and [ʌw]" (Bromberger and Halle 1989: 58);
formulations such as these simply confuse phonological categories with their
phonetic exponents. Because phonological descriptions and representations
encode relational information they are abstract, algebraic objects appropri-
ately formulated in the domain of set theory.
In contrast, phonetic representations are descriptions of physical,
temporal events formulated in a physical domain. This being the case, it
makes no sense to talk of structure or systems in phonetics: there may be
differences between portions of utterance, but in the phonetics there can be
no "distinctions." The relationship between phonology and phonetics is
arbitrary (in the sense of Saussure) but systematic; I know of no evidence that
suggests otherwise (see also Lindau and Ladefoged 1986). The precise form
of phonetic representations therefore has no bearing on the form of the
phonological representations. Phonetics is, then, to be seen as interpretive:
phonetic representations are the denotations of phonological represen-
tations. Thus, like the Firthians, we talk of phonetic exponents of the
phonological representations, and like them we do not countenance rules
that manipulate these phonetic exponents. This phonetic interpretation of
the phonological representations is compositional and couched para-
metrically. By "compositional interpretation" of phonological represen-
tations (expressions) I mean simply that the phonetic interpretation (in terms
of values and times) of an expression is a function of the interpretation of the
component parts of that expression (Dowty, Wall, and Peters 1981). By
"parametric" I mean to indicate that it is essential not to restrict phonetic
interpretation to the segmented domains and categories superficially imposed
by, say, the shapes of an impressionistic record on the language material.
This principle leads to the formulation of phonetic representations in terms
of component parameters and their synchronization in time.

2. One important aspect of our approach which I will not discuss here (but see Kelly and Local
1989) is the way we conduct phonological interpretation. (The consequences of the kinds of
phonological interpretation we do can, to some extent, be discerned via our chosen mode of
representation.) Nor does space permit a proper discussion of how a declarative approach to
phonology deals with morphophonological alternations. However, declarative treatment of
such alternations poses no particular problems and requires no additional formal machinery.
Firthian prosodic phonology provides a nonprocess model of such alternations (see, e.g.,
Sprigg 1963).
Within this approach, parametric phonetic interpretation (exponency) is a
relation between phonological features at temporally interpreted nodes in the
syllable tree and sets of parameter sections (in our implementation Klatt
parameters). Exponency is thus a function from phonological features and
their structural contexts to parameter values. "Structural contexts" is here
taken to mean the set of features which accompany the feature under
consideration at its place in structure along with that place in structure in the
graph. Phonetic interpretation of the syllable graph is performed in a head-
first fashion (constituent heads are interpreted before their dependents; see
Coleman 1989), and parameter sections are simply sequences of ordered
pairs, where each pair denotes the value of a particular parameter at a
particular (linguistically relevant) time. Thus: {node(Category, Tstart, Tend),
parameter.section} (see Ladefoged [1980:495] for a similar formulation of
the mapping from phonological categories to phonetic parameters). The
times may be absolute or relative. So, for example, given a phonological
analysis of the Firthian prosodic kind, which has abstracted at least two
independent V-systems (e.g. "short" and "long"; see Albrow 1975) we can
establish that only three "height" values need to be systematically dis-
tinguished. Given, in addition, some analysis in the acoustic domain such as
that presented by Kewley-Port (1982) or Klatt (1980: 986) we can begin to
provide a phonetic interpretation for the [height1] feature for syllables such
as pit and put thus:
{syllable([height1], T1, T2), (F1:425; B1:65), (F1:485; B1:65)}
(Interpolation between the values can be modeled by a damped sinusoid [see
Browman and Goldstein 1985].) As indicated above, however, any given
phonological feature is to be interpreted in the context of the set of features
which accompany the particular feature under consideration along with its
place in structure. This means that a feature such as [height1] will not always
receive the same phonetic interpretation. Employing again the data from
Klatt (1980) we can provide a phonetic interpretation for the [height1]
feature for syllables such as see and sue thus:
{syllable([height1], T1, T2, T3, T4), (F1:330; B1:55), (F1:330; B1:55),
(F1:350; B1:60), (F1:350; B1:60)}
Ladefoged (1977) discusses such a structure-dependent interpretation of
phonological features. He writes: "the matrix [of synthesizer values (and
presumably of some foregoing acoustic analysis)] has different values for F2
(which correlates with place of articulation) for [s] and [t], although these
segments are both [ + alveolar]. The feature Alveolar has to be given a
different interpretation for fricatives and plosives" (1977: 231). We should
note, however, that the domain of these phonetic parameters - whether they
are formulated in articulatory or acoustic terms, say, has to do with the task
at hand - is implementation-specific; it has not, and cannot have, any
implications whatsoever for the phonological theory.
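The shape of this exponency mapping can be sketched informally in Python. This is a purely illustrative sketch, not the knowledge-based implementation referred to above; the context labels are invented here, while the F1/B1 values are those just quoted:

    # Illustrative sketch only: exponency as a function from a phonological
    # feature plus its structural context to a "parameter section", i.e. a
    # sequence of (time, {parameter: value}) pairs.  The context labels are
    # hypothetical; the F1/B1 values are those cited in the text.
    EXPONENCY = {
        ("height1", "short-V syllable"): [("T1", {"F1": 425, "B1": 65}),
                                          ("T2", {"F1": 485, "B1": 65})],
        ("height1", "long-V syllable"):  [("T1", {"F1": 330, "B1": 55}),
                                          ("T2", {"F1": 330, "B1": 55}),
                                          ("T3", {"F1": 350, "B1": 60}),
                                          ("T4", {"F1": 350, "B1": 60})],
    }

    def exponents(feature, context):
        """Look up the parameter section for a feature in a structural context."""
        return EXPONENCY[(feature, context)]

    # The same phonological feature receives different phonetic interpretations
    # in different structural contexts (cf. pit/put versus see/sue above).
    print(exponents("height1", "short-V syllable"))
    print(exponents("height1", "long-V syllable"))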
The position I have just outlined is not universally accepted. Many
linguists view phonology and phonetics as forming a continuum with
phonological descriptions presented in the same terms as phonetic descrip-
tions; these are often formulated in terms of a supposedly universal set of
distinctive phonetic properties. This can be seen in the way phonological
features are employed in most contemporary approaches. Phonological
categories are typically constructed from the names of phonetic (or quasi-
phonetic) categories, such as "close, back, vowel," or "voiced, noncoronal,
obstruent." By doing this the impression of a phonology-phonetics conti-
nuum is maintained. "Phonetic" representations in generative phonologies
are merely the end result of a process of mappings from strings to strings; the
phonological representations are constructed from features taking binary
values, the phonetic representations employing the same features usually
taking scalar values. Chomsky and Halle explicitly assert this phonetics-
phonology continuum when they write: "We take 'distinctive features' to be
the minimal elements of which phonetic, lexical, and phonological transcrip-
tions are composed, by combination and concatenation" (1968: 64). One
reason that this kind of position is possible at all, of course, is that in such
approaches there is nothing like an explicit phonetic representation.
Typically, all that is provided are a few feature names, or segment symbols;
rarely is there any indication of what would be involved in constructing the
algorithm that would allow us to test such claims. The impression of a
phonology-phonetics continuum is, of course, quite illusory. In large part,
the illusion is sustained by the entirely erroneous belief that the phonological
categories have some kind of implicit, naive phonetic denotation (this seems
to be part of what underlies the search for invariant phonetic correlates of
phonological categories and the obsession with feature names [Keating
1988b]). Despite some explicit claims to this effect (see, for example,
Chomsky and Halle 1968; Kaye, Lowenstamm, and Vergnaud 1985; Brom-
berger and Halle 1989), phonological features do not have implicit deno-
tations and it is irresponsible of phonologists to behave as if they had when
they present uninterpreted notations in their work.
One of the advantages that accrues from having a parametric phonetic
interpretation distinguished from phonological representation is that the
arbitrary separation of "segmental" from "supra-" or "non-"segmental
features can be dispensed with. After all, the exponents of so-called "segmen-
tal" phonetic aspects are no different from those of "nonsegmental" ones -
they are all parameters having various extents. Indeed, any coherent account
of the exponents of "nonsegmental" components will find it necessary to
refer to "segmental" features: for instance, as pointed out by Adrian
Simpson (pers. comm.), lip-rounding and glottality are amongst the phonetic
exponents of accentuation (stress). Compare to him, and for him, for
example, in versions where to and for are either stressed or nonstressed.
Local (1990) also shows that the particular quality differences found in final-
open syllable vocoids in words like city and seedy are exponents of the
metrical structure of the words (see also the experimental findings of
Beckman 1986).

8.3.2 Structured representation


Within our approach, phonological representations are structured and are
treated as labeled graph-objects, not strings of symbols. These graphs are
unordered. It makes no sense to talk about linear ordering of phonological
elements - such a formulation can only make sense in terms of temporal
phonetic interpretation. (Compare Carnochan:
It is perhaps appropriate to emphasise here that order and place in structure do not
correlate with sequence in time ... The symbols with which a phonological structure
is written appear on the printed page in a sequence; in vDEvhBasry ... structure of
jefaffe, the symbol h precedes the symbol B, but one must guard against the
assumption that the exponent of the element of structure h precedes the exponent of
the element of structure B, in the pronunciation of jefaffe. There is no time in
structure, there is no sequence in structure; time and sequence are with reference to
the utterance, order and place are with reference to structure [1952:158].)
Indeed, it is largely because phonologists have misinterpreted phonological
representations as well ordered, with an order that supposedly maps straight-
forwardly to the phonetics, that "processes" have been postulated. Syntag-
matic structure, which shows how smaller representations are built up into
larger ones, is represented by graphs. The graphs are familiar from the
syllable-tree notation ubiquitous in phonology (see Fudge 1987 for copious
references). The graphs we employ, however, are directed acyclic graphs
(DAGs) rather than trees, since we admit the possibility of multidominated
nodes ("reentrant" structures; see Shieber, 1986), for instance, in the
representation of ambisyllabicity and larger structures built with feature
sharing, e.g. coda-onset "assimilation":

(1) [graph diagram, not reproduced in this text version: syllable, rime, onset, and coda nodes]

As will become apparent later, our interpretation of the phonetic exponents
of such constituent relationships is rather different from that currently found
in the phonological literature. Such phonological structures are built by
means of (phonotactic) phrase-structure rules.
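As a purely illustrative sketch (the class and node labels below are invented; only the idea of a re-entrant, multidominated node is taken from the discussion above), such a graph might be modeled as follows in Python:

    # Illustrative sketch only: syllable structure as a directed acyclic graph
    # in which one node may be dominated by more than one parent ("re-entrant"
    # structure), e.g. a consonantal node shared by a coda and the next onset.
    class Node:
        def __init__(self, label, features=None):
            self.label = label
            self.features = features or {}
            self.children = []

        def add(self, *children):
            self.children.extend(children)
            return self

    shared = Node("C")                              # a single token of structure ...
    coda1 = Node("coda").add(shared)                # ... dominated by this coda
    syll1 = Node("syllable").add(Node("onset"),
                                 Node("rime").add(Node("nucleus"), coda1))
    onset2 = Node("onset").add(shared)              # ... and by the following onset
    syll2 = Node("syllable").add(onset2, Node("rime"))

    # No "spreading" or copying rule is needed: both parents contain the very
    # same node object.
    assert coda1.children[0] is onset2.children[0]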

8.3.3 Feature structures


Paradigmatic structure, which shows how informational differences between
representations are encoded, is represented using feature structures (in fact
these, too, are graph structures). Feature structures are partial functions
from their features to values. Within a category each feature may take
only one value. The value of a feature may be atomic (e.g. [height: close],
[cmp: — ]) or may itself be structured. This means that the value of some
feature may itself be specified by another feature structure rather than by an
atomic value. For instance (2a) might be one component of a larger structure
associated with an onset node, and (2b) might be a component of a structure
associated with a nucleus node:

(2) a. [cons: [grv: -, cmp: +]]     b. [voc: [height: close, grv: +]]


(Hierarchical phonological feature structures have also been proposed by
Lass [1984b], Clements [1985], and Sagey [1986a], though their interpretation
and use is different from that proposed here.)
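A purely illustrative sketch, with plain Python dictionaries standing in for the graph objects of the formalism (the helper function is invented), shows how structures like (2a) and (2b) can be read as nested partial mappings:

    # Illustrative sketch only: feature structures as partial functions from
    # features to values, where a value may itself be a feature structure.
    onset_component   = {"cons": {"grv": "-", "cmp": "+"}}          # cf. (2a)
    nucleus_component = {"voc": {"height": "close", "grv": "+"}}    # cf. (2b)

    def value_of(structure, path):
        """Follow a path of feature names; return None where the partial
        function is undefined."""
        for feature in path:
            if not isinstance(structure, dict) or feature not in structure:
                return None
            structure = structure[feature]
        return structure

    print(value_of(onset_component, ["cons", "grv"]))    # "-"
    print(value_of(nucleus_component, ["voc", "rnd"]))   # None: left unspecified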
The primary motivation for adopting a graph-theoretic view of phono-
logical representations is that this enables us to formulate our proposals
within a mathematically explicit and well-understood formalism. By assign-
ing an independent node to every phonological unit of whatever extent, we
do away with the need to recognize "segments" at any level of representation
as well as with the need to postulate such things as "spreading" rules. That is,
we employ a purely declarative mode of representation. (Although the term
"declarative" was not in current usage during the period when Firthian
prosodic analysis was in development, it is clear that the nonprocess
orientation of prosodic phonology was declarative in spirit. For further
discussion of this claim see the reinterpretation by Broe [1988] of Allen's
[prosodic] treatment of aspiration in Harauti within a declarative unification
framework.) It is particularly important to note that within this approach no
primacy is given to any particular size of node domain. If we consider the
phonological specification of the contrasts in the words pit ~ put, bat ~ bad,
and bent ~ bend, we can get a feel for some of the implications of structured
representations. In a typical segmental (or, for that matter, "nonlinear")
account words such as pit and put are likely to have their vowels specified as
[ + rnd] (either by explicit feature specification or as a result of some default
or underspecification mechanism) which is then copied, or spread, to
adjacent consonants in order to reflect the facts of coarticulation of lip
rounding. In an approach which employs structured representations, how-
ever, all that is necessary is to specify the domain of [ + rnd] as being the
whole syllable. Thus:

(3) [tree diagram, not reproduced in this text version: a syllable node bearing [±rnd], with onset and coda among its constituents]

Onset and coda are in the domain of [ ± rnd] (though coda does not bear the
feature distinctively) by virtue of their occurrence as syllable constituents.
Once the structural domain of oppositions is established there is no need to
employ process modeling such as "copying" or "spreading." In a similar
fashion nonstructured phonologies will typically specify the value of the
contrast feature [± voice] for consonants, whereas vowels will typically be
specified (explicitly, or default-specified) as being [+ voi]. A structured
phonological representation, on the other hand, will explicitly recognize that
the opposition of voicing holds over onsets and rimes, not over consonants
and vowels (see Browman and Goldstein 1986: 227; Sprigg 1972). Thus
vowels will be left unspecified for voicing and the phonological voicing
distinction between bat ~ bad and bent ~ bend, rather than being assigned
to a coda domain, will be assigned to a rime domain:

(4) [tree diagram, not reproduced in this text version: syllable dominating onset and rime, with [+voi] marked on the rime, which dominates nucleus and coda]

If this is done, then the differences in voice quality and duration of the vowel
(and, in the case of bent ~ bend, the differences in the quality and duration of
nasality), the differences in the nature of the transitions into the closure and
the release of that closure can be treated in a coherent and unified fashion as
the exponents of rime-domain voicing opposition. (The similarity between
these kinds of claims and representations and the sorts of representations
found in prosodic analysis should be obvious.3)
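A purely illustrative sketch of this domain-based interpretation follows (the node representation and helper below are invented; only the idea that a feature stated once on a dominating node holds throughout its domain is taken from the discussion above):

    # Illustrative sketch only: a feature specified on a dominating node (e.g.
    # [+rnd] at the syllable, or [+voi] at the rime) holds of every constituent
    # in that node's domain; nothing is copied or "spread".
    def in_domain_of(node, feature, parent_of):
        """True if `feature` is specified on `node` or on a dominating node."""
        while node is not None:
            if feature in node["features"]:
                return True
            node = parent_of[node["label"]]
        return False

    syllable = {"label": "syllable", "features": {"rnd": "+"}}
    onset    = {"label": "onset",    "features": {}}
    coda     = {"label": "coda",     "features": {}}
    parent_of = {"syllable": None, "onset": syllable, "coda": syllable}

    # Onset and coda are in the domain of [+rnd] simply by being syllable
    # constituents:
    print(in_domain_of(onset, "rnd", parent_of), in_domain_of(coda, "rnd", parent_of))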
The illustrative representations I have given have involved the use of rather
conventional phonological features. The feature names that we use in the
construction of phonological representations are, in the main, taken from the
Jakobson, Fant, and Halle (1952) set, though nothing of particular interest
hangs on this. However, they do differ from the "distinctive features"
described by Jakobson, Fant, and Halle in that, as I indicated earlier, they
are purely phonological; they have no implicit phonetic interpretation. They
differ from the Jakobson, Fant, and Halle features in two other respects.
First, when they are interpreted (because of the compositionality principle)
they do not have the same interpretation wherever they occur in structure.


3. For example, it is not uncommon (e.g. Albrow 1975) to find "formulae" like yw(CViCh/fi) as
partial representations of the systemic (syntagmatic and paradigmatic) contrasts instantiated
by words such as pit and put.

So, for instance, the feature [ + voi] at onset will not receive the same
interpretation as [+ voi] at rime. Second, employing hierarchical,
category-valued features, such as [cons] and [voc] as in (5) below, enables
the same feature names to be used within the same feature structure but with
different values. Here, in the partial feature structure relating to apicality
with open approximation and velarity at onset, [grv] with different values is
employed to encode appropriate aspects of the primary and secondary
articulation of the consonantal portion beginning a word such as red.

(5) [feature-structure diagram, only partly legible in this text version: an onset structure with a cons component ([grv, cmp]), a voc component ([grv: +, height: 0, rnd: +]), and an src component ([nas: -]), in which [grv] takes different values under cons and voc]

Once the representations in the phonology are treated as structured and
strictly demarcated from parametric phonetic representations, then notions
such as "rewrite rule" are redundant. Nor do we need to have recourse to
operations such as deletion and insertion. We can treat such "processes" as
different kinds of parameter synchronization in the phonetic interpretation
of the phonological representation. This position is, of course, not new or
unique to the approach I am sketching here (see e.g. Browman and Goldstein
1986). Even within more mainstream generative phonologies similar pro-
posals have been made (Anderson 1974; Mohanan 1986), though not
implemented in the thoroughgoing way I suggest here. As far as I can see,
however, none of these accounts construe "phonology" in the sense
employed here and all trade in well-ordered concatenative relations.
One result of adopting this position is that in our approach phonological
combinators are formulated as nondestructive - phonological operations are
monotonic. The combinatorial operations with which phonological struc-
tures are constructed can only add information to representations and
cannot remove anything - "information about the properties of an expres-
sion once gained is never lost" (Klein 1987:5). As I will show in the remainder
of the paper, this nondestructive orientation has implications for the
underspecification of feature values in phonological representations.
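A purely illustrative sketch of such a monotonic combinator follows (a drastic simplification of a unification-grammar formalism; the function and the example structures are invented):

    # Illustrative sketch only: unification can only add information; conflicting
    # values cause failure rather than being overwritten or deleted.
    class UnificationFailure(Exception):
        pass

    def unify(a, b):
        if isinstance(a, dict) and isinstance(b, dict):
            result = dict(a)
            for feature, value in b.items():
                result[feature] = unify(result[feature], value) if feature in result else value
            return result
        if a == b:
            return a
        raise UnificationFailure(f"{a!r} conflicts with {b!r}")

    print(unify({"cons": {"grv": "-"}}, {"cons": {"cmp": "+"}, "src": {"nas": "-"}}))

    try:
        unify({"cons": {"grv": "-"}}, {"cons": {"grv": "+"}})
    except UnificationFailure as exc:
        print("fails:", exc)   # information is never destroyed, only rejected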

8.4 Temporal interpretation: dealing with "processes"
Consider the following syllable graph:
(6) [syllable graph, not reproduced in this text version: syllable, rime, onset, and coda nodes]

A simple temporal interpretation of this might be schematically represented
thus:
(7) [box diagram, not reproduced in this text version: Syllable exponents over Onset exponents followed by Rime exponents, the latter containing Nucleus exponents and Coda exponents, aligned with a C V row]

Although this temporal interpretation reflects some of the hierarchical
aspects of the graph structure (the exponents of "smaller" constituents overlap
those of "larger" ones), the interpretation is still essentially concatenative. It
treats phonetic exponency as if it consisted of nothing more than well-
ordered concatenated sequences of "phonetic objects" of some kind (as in
SPE-type approaches or as suggested by the X-slot type of analysis found in
autosegmental work). As is well known, however, from extensive instrumen-
tal studies, a concatenative-grouping view of phonetic realization simply
does not accord with observation - no matter how much we try to fiddle our
interpretation of "segment." Clearly, we need a more refined view of the
temporal interpretation of exponency. Within the experimental/instrumental
phonetics literature just such a view can be found, though as yet it has not
found widespread acceptance in the phonological domain. Work by Fowler
(1977), Gay (1977), Ohman (1966a), and Perkell (1969), although all
conducted with a phonemic orientation, proposes in various ways a "co-
production" (Fowler) or an "overlaying" view of speech organization. When
we give the graph above just such an overlaying/cocatenative rather than
concatenative interpretation we can begin to see how phenomena such as
coarticulation, deletion, and insertion can be given a nonprocess, declarative
representation:
(8) [box diagram, not reproduced in this text version: overlaid exponents - Syllable exponents spanning the whole, with Rime, Nucleus, Onset, and Coda exponents overlapping one another - aligned with a C V C row]
(The "box notation" is employed in a purely illustrative fashion and has no
formal status. So, for instance, the vertical lines indicating the ends of
exponents of constituents should not be taken to indicate an absolute cross-
parametric temporal synchrony.)
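A purely illustrative sketch of the overlaying ("cocatenative") interpretation, with exponents modeled as time intervals, follows (all times below are invented and schematic):

    # Illustrative sketch only: exponents of constituents as time intervals that
    # may overlap ("cocatenation") rather than abutting end-to-end.
    exponent_spans = {
        "syllable": (0, 100),   # spans the whole
        "onset":    (0, 40),
        "rime":     (20, 100),
        "nucleus":  (25, 80),
        "coda":     (70, 100),
    }

    def overlap(a, b):
        (s1, e1), (s2, e2) = a, b
        return max(s1, s2) < min(e1, e2)

    # "Consonant-vowel coarticulation" is then simply the overlap of onset and
    # nucleus exponents; nothing is copied or spread.
    print(overlap(exponent_spans["onset"], exponent_spans["nucleus"]))   # True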

8.4.1 Cocatenation, "coarticulation," and "deletion"


With this model of temporal interpretation of the exponents of phonological
constituents we can now see very simple possibilities for the reconstrual of
process phenomena. For example, so-called initial consonant-vowel coarticu-
lation can be viewed as an "overlaying" of the exponents of phonological
constituents rather than as the usual segment concatenation with a copying or
spreading rule to ensure that the consonant "takes on" the appropriate
characteristics of the vowel. Employing the box notation used above, this can
be shown for the English words keep, cart, and coot, where the observed details
of the initial occlusive portion involve (in part) respectively, tongue-fronted
articulation, tongue-retracted articulation, and lip-rounded articulation
(Greek symbols are used to index the constituents whose phonetic exponents
we are considering):
(9) [box diagrams for keep, cart, and coot, not reproduced in this text version: in each word the exponents of the onset constituent overlap those of the nucleus constituent, the constituents being indexed with Greek symbols]
Notice that the temporal-overlay account (when interpreted parametrically)
will result in just the right kind of "vocalic" nucleus characteristics through-
out the initial occlusive part of the syllable even though onset and nucleus are
not sisters in the phonological structure. Notice, too, that we consider such
"coarticulation" as phonological and not simply some phonetic-mechanical
effect, since it is easy to demonstrate that it varies from language to language,
and, for that matter, within English, from dialect to dialect. A similar
interpretation can be given to the onsets of words such as split and sprit,
where the initial periods of friction are qualitatively different as are the
vocalic portions (this is particularly noticeable in those accents of English
which realize the period of friction in spr- words with noticeably lip-rounded
tongue-tip retracted palato-alveolarity). In these cases what we want to say is
that the liquid constituent dominates the initial cluster so its exponents are
coextensive with both the initial friction and occlusion and with the early
portion of the exponents of the nucleus. One aspect of this account that I
have not made explicit is the tantalizing possibility that only overlaying/
cocatenation need be postulated and that apparently concatenative pheno-
mena are simply a product of different temporal gluings; only permitting one
combinatorial operation is a step towards making the model genuinely more
restrictive.
In this context consider now the typical putative process phenomenon of
deletion. As an example consider versions of the words tyrannical and
torrential as produced by one (British English) speaker. Conventional
accounts (e.g. Gimson 1970; Lass 1985; Dalby 1986; Mohanan 1986) of the
tempo/stylistic reduced/elided pronunciations of the first, unstressed syll-
ables of tyrannical and torrential as
(10) [phonetic transcriptions of the reduced forms, illegible in this text version]
would argue simply that the vowel segment had been deleted. But where does
this deletion take place? In the phonology? Or the phonetics? Or both? The
notion of phonological features "changing" or being "deleted" is, as I
indicated earlier, highly problematical. However, if one observes carefully
the phonetic detail of such purportedly "elided" utterances, it is actually
difficult to find evidence that things have been deleted. The phonetic detail
suggests, rather, that the same material has simply been temporally redistri-
buted (i.e. these things are not phonologically different; they differ merely in
terms of their temporal phonetic interpretation). Even a cursory listening
reveals that the beginnings of the "elided" forms of tyrannical and torrential
do not sound the same. They differ, for instance, in the extent of their
liprounding, and in terms of their resonances. In the "elided" form of tyr the
liprounding is coincident with the period of friction, whereas in tor it is
observable from the beginning of the closure; tor has markedly back
resonance throughout compared with tyr, which has front of central reso-
nance throughout. (This case is not unusual: compare the "elided" forms of
suppose, secure, prepose, propose, and the do ~ dew and cologne ~ clone
cases discussed by Kelly and Local 1989: part 4.) A deletion-process account
of such material would appear to be a codification of a not particularly
attentive observation of the material.
By contrast, a cocatenation account of such phenomena obviates the need
to postulate destructive rules and allows us to take account of the observed
phonetics and argue that the phonological representation and ingredients of
elided and nonelided forms are the same; all that is different is the temporal
phonetic interpretation of the constituents. The "unreduced" forms have a
temporal organization schematically represented as follows:
(11) [box diagram, not reproduced in this text version: the "unreduced" forms, with the nucleus exponents extending beyond the exponents of the onset constituents]

while the reduced forms may have the exponents of nucleus of such a
duration that their end is coincident with the end of the onset exponents:
(12) [box diagram, not reproduced in this text version: the "reduced" forms, with the nucleus exponents ending together with the exponents of the onset constituents]

In the "reduced" forms, an overlaying interpretation reflects exactly the fact


that although there may be no apparently "sequenced" "vowel" element the
appropriate vocalic resonance is audible during the initial portion.
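A purely illustrative sketch of this last point follows (the times below are invented; only the relation between the ends of the onset and nucleus exponents is at issue):

    # Illustrative sketch only: "unreduced" and "reduced" forms contain the same
    # constituents; they differ only in the temporal synchronization of their
    # exponents.  In the reduced form the nucleus exponents end with the onset's.
    unreduced = {"onset": (0, 40), "nucleus": (30, 90)}
    reduced   = {"onset": (0, 40), "nucleus": (10, 40)}

    def nucleus_ends_with_onset(form):
        return form["nucleus"][1] == form["onset"][1]

    print(nucleus_ends_with_onset(unreduced), nucleus_ends_with_onset(reduced))  # False True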

8.4.2 Assimilation
Having provided a basic outline of some of the major features of a "non-
segmental," "nonprocess" approach to phonology, I will now examine the
ways such an approach can deal with the phenomenon of "assimilation."
Specifically, I want to consider those "assimilations" occurring between the
end of one word and the beginning of the next, usually described as involving
"alveolar consonants." The standard story about assimilation, so-called, in
English can be found in Sweet (1877), Jones (1940), and Gimson (1970),
often illustrated with the same examples. Roach (1983:14) provides a recent
formulation:
For example, the final consonant in "that" dæt is alveolar t. In rapid casual speech the
t will become p before a bilabial consonant... Before a dental consonant, t changes

to a dental plosive... Before a velar consonant, the t will become k... In similar
contexts d would become b, d̪, g, respectively, and n would become m, n̪ and ŋ... s
becomes ʃ, and z becomes ʒ when followed by ʃ or j.
I want to suggest that this, and formulations like it (see below), give a
somewhat misleading description of the phenomenon under consideration.
The reasons for this are manifold: in part, the problem lies in not drawing a
clear distinction between phonology and phonetics; in part, a tacit assump-
tion that there are "phonological segments" and/or "phonetic segments"; in
part, not having very much of interest to say in the phonetic domain; in part,
perhaps, a lack of thoroughgoing phonetic observation. Consider the follow-
ing (reasonably arbitrary) selection of quotes and analyses centered on
"assimilation":
1 "the underlying consonant is /t/ and that we have rules to the effect:
t ->k/ [ + cons
— ant
— cor]"
— sonorant
' [ + nasal ] -• [ a place ]/ — continuant (Mohanan 1986:106).
a place
"The change from /s/ to [s] . . . takes place in the postlexical module in I mi[s]
you" (Mohanan 1986:7).
' T h e [mp] sequence derived from np [ten pounds] is phonetically identical to
the [mp] sequence derived from mp" (Mohanan 1986:178).
"the alveolars . . . are particularly apt to undergo neutralization as redund-
ant oppositions in connected speech" (Gimson 1970:295).
"Assimilation of voicing may take place when a word ends in a voiced
consonant before a word beginning with a voiceless consonant" (Barry
1984:5).
"We need to discover whether or not the phonological processes discernible
in fast speech are fundamentally different from those of careful slow
speech... as you in rapid speech can be pronounced [9z ja] or [339]" (Lodge
1984:2).
d b b b

son — son
cnt - cnt

+ ant + ant
+ cor — cor

0 (Nathan, 1988:311)
The flavor of these quotes and analyses should be familiar. I think they are a
fair reflection of the generally received approach to handling assimilation
phenomena in English. They are remarkably uniform not only in what they
say but in the ways they say it. Despite some differences in formulation and in
representation, they tell the same old story: when particular things come into
contact with each other one of those things changes into, or accommodates
in shape to, the other. However, this position does beg a number of
questions, for example: what are these "things"? (variously denoted: t ~ k ~
[+cons, -ant, -cor] [1]; /s/ ~ [ʃ] [3]; [mp] ~ np ~ [mp] ~ mp [4]; alveolars
[5]; voiced ~ voiceless consonant [6]). Where are these "things"? - in the
phonology or the phonetics or both or somewhere in between? Let us
examine some of the assumptions that underlie what I have been referring to
as "the standard story about assimilation." The following features represent
the commonalities revealed by the quotations above:
Concatenative-segmental: the accounts all assume strings of segments at
some level of organization (phonological and/or phonetic).
Punctual-contrastive: the accounts embody the notion that phonological
distinctions and relationships, as realized in their phonetic exponents,
are organized in terms of a single, unique, cross-parametric time slice.
Procedural-destructive: as construed in these accounts, assimilation involves
something which starts out as one thing, is changed/changes and ends
up being another. These accounts are typical in being systematically
equivocal with respect to the level(s) of description involved: no
serious attempt is made to distinguish the phonetic from the
phonological.
Homophony-producing: these accounts all make the claim (explicitly or
implicitly) that assimilated forms engender neutralization of opposit-
ions. That is, in assimilation contexts, we find the merger of a contrast
such that forms which in other contexts are distinct become phonet-
ically identical.

8.4.3 Some acoustic and EPG data


Although I have presented these commonalities as if they were separate and
independent, it is easy to see how they might all be said to flow from the basic
assumption that the necessary and sufficient form of representation is strings
of segments (this clearly holds even for the autosegmental type of represen-
tation pictured here). But, as we have seen, there is no need to postulate
segments at any level of organization: in the phonology we are dealing with
unordered distinctions and oppositions at various places in structure; in the
phonetics with parametric exponents of those oppositions. Thus the
language used here in these descriptions of assimilation is at odds with what
really needs to be said: expressions of the kind "alveolar consonants,"
"bilabial consonant," "voiced/voiceless consonant" are not labels for


phonological contrasts nor are they phonetic descriptors - at best they
are arbitrary, cross-parametric classificatory labels for IPA-type segmental
reference categories. There are no "alveolar consonants" or "voiceless
consonants" in the phonology of English; by positing such entities we simply
embroil ourselves in a procedural-destructive-type account of assimilation
because we have to change the "alveolar consonant" into something else (see
Repp's discussion of appropriate levels of terminology in speech research
[1981]). Nor is it clear that where these descriptions do make reference to the
phonetics, the claims therein are always appropriate characterisations of the
phenomena.
I will consider two illustrative cases where spectrographic and electro-
palatographic (EPG) records (figures 8.1-8.4) suggest that, at least for some
speakers, assimilated forms do not have the same characteristics as other
"similar" articulatory/acoustic complexes. There is nothing special about
these data. Nor would I wish to claim that the records presented here show
the only possible pronunciation of the utterances under consideration. Many
other things can and do happen. However, the claims I make and the analytic
account I offer are in no way undermined by the different kinds of phonetic
material one can find in other versions of the same kinds of utterance. The
utterance from which these records were taken was In that case I'll take the
black case.4 The data for these speakers is drawn from an extensive
collection of connected speech gathered in both experimental and nonexperi-
mental contexts at the University of York. In auditory impressionistic terms,
the junction of the words that and case and the junction of black and case
exhibit "stretch cohesion" marked by velarity for both speakers (K and L).
However, there are also noticeable differences in the quality of the vocalic
portion of that as opposed to black. These auditory differences find reflection
in the formant values and trajectories in the two words. For K we see an
overall higher F1, F2, and F3 for that case as opposed to black case and a
different timing/trajectory for the F2/F3 "endpoint." For L we see a similar
difference in F1, F2, and F3 frequencies. That is, the vocalic portion in the
syllable closed with the "assimilatory" velarity does not lose its identity (it
does not sound like "the same vowel" in the black syllable - it is still different
from that in the syllable with the lexical velarity). It is, as it were, just the
right kind of vocalic portion to have occurred as an exponent of the nucleus
in a syllable where the phonological opposition at its end did not involve
velarity.
4 I am indebted to the late Eileen Whitley (unpublished lecture notes, SOAS, and personal
communication) for drawing my attention to this sentence. The observations I make derive
directly from her important work. In these utterances that and black were produced as
accented syllables.


Figure 8.1 Spectrograms of (a) that case and (b) black case produced by speaker K and (c) that
case and (d) black case produced by speaker L


Figure 8.2 Electropalatographic records for the utterances that case and black case produced by
speaker L shown in figure 8.1

Examination of the EPG records (for L - see fig. 8.2) reveals that for these
same two pairs, although there is indeed velarity at the junction of that case
and black case, the nature and location of the contact of tongue to back of
the roof of the mouth is different. In black case we can see that the tongue
contact is restricted to the back two rows of electrodes on the palate. By
contrast, in that case the contact extends forward to include the third from
back row (frames 43-5, 49-52) and for three frames (46-8) the fourth from
back row; there is generally more contact, too, on the second from back row
(frames 45-9) through the holding of the closure. Put in general terms, there
is an overall fronter articulation for the junction of that case as compared
with that for black case. Such auditory, spectrographic, and EPG findings are
routine and typical for a range of speakers producing this and similar
utterances5 (though the precise frontness/backness relations differ [see
Kelly and Local 1986]).
Consider now the spectrograms in figure 8.3 of three speakers (A, L, W)
producing the continuous utterance This shop's a fish shop.6 The portions of
interest here are the periods of friction at the junction of this shop and fish
shop. A routine claim found in assimilation studies is that presented earlier
from Roach: "s becomes ʃ, and z becomes ʒ when followed by ʃ or j."
Auditory impressionistic observation of these three speakers' utterances
reveals that though there are indeed similarities between the friction at the
junction of the word pairs, and though in the versions of this shop produced
by these speakers it does not sound like canonical apico-alveolar friction, the
portions are not identical and a reasonably careful phonetician would feel
uneasy about employing the symbol ʃ for the observed "palatality" in the this
shop case. The spectrograms again reflect these observed differences. In each
case the overall center of gravity for the period of friction in this shop is
higher than that in fish shop (see Shattuck-Hufnagel, Zue, and Bernstein
1978; and Zue and Shattuck-Hufnagel 1980).
These three speakers could also be seen to be doing different things at the
junction of the word pairs in terms of the synchrony of local maxima of lip
rounding relative to other articulatory components. For this shop the onset
of lip rounding for all three speakers begins later in the period of friction than
in fish shop, and it gets progressively closer through to the beginning of shop.
In fish shop for these speakers lip rounding is noticeably present throughout
the period of final friction in shop. (For a number of speakers that we have
observed, the lip-rounding details discussed here are highlighted by the
noticeable lack of lip rounding in assimilated forms of this year.)
Notice too that, as with the that case and black case examples discussed
earlier, there are observable differences in the F1 and F3 frequencies in the
vocalic portions of this and fish. Most obviously, in this F1 is lower than in
fish and F3 is higher (these differences show consistency over a number of
productions of the same utterance by any given speaker). The vocalic portion
in this even with "assimilated" palatality is not that of a syllable where the
palatality is the exponent of a different (lexically relevant) systemic oppo-
sition. While the EPG records for one of these speakers, L (see fig. 8.4), offer
us no insight into the articulatory characteristics corresponding to the
impressionistic and acoustic differences of the vocalic portions they do show
5 Kelly and Local (1989: part 4.7) discuss a range of phenomena where different forms are said
to be homophonous and neutralization of contrast is said to take place. They show that
appropriate attention to parametric phonetic detail forces a retraction of this view. See also
Dinnsen (1983).
6 In these utterances 'this' and 'fish' were produced as accented syllables.


Figure 8.3 Spectrograms of the utterances This shop's a fish shop produced by speakers W, A,
and L


Figure 8.4 Electropalatographic records of the utterances this shop and fish shop produced by
speaker L shown in figure 8.3

that the tongue-palate relationships are different in the two cases. The
palatographic record of this shop shows that while there is tongue contact
around the sides of the palate front-to-back throughout the period corres-
ponding to the friction portion, the bulk of the tongue-palate contact is
oriented towards the front of the palate. Compare this with the equivalent
period taken from fish shop. The patterning of tongue-palate contacts here is
very different. Predominantly, we find that the main part of the contact is

oriented towards the back of the palate, with no contact occurring in the first
three rows. This is in sharp contrast to the EPG record for this shop where the
bulk of the frames corresponding to the period of friction show tongue-
palate contact which includes the front three rows. Moreover, where this shop
exhibits back contact it is different in kind from that seen for fish shop in
terms of the overall area involved.7 The difference here reflects the
difference in phonological status between the ("assimilatory") palatality at
the end of this, an exponent of a particular kind of word juncture, and that of
fish, which is an exponent of lexically relevant palatality.
What are we to make of these facts? First, they raise questions about the
appropriacy of the received phonetic descriptions of assimilation. Second,
they lead us to question the claims about "neutralization" of phonological
oppositions. (Where is this neutralization to be found? Why should vocalic
portions as exponents of "the same vowel" be different if the closures they
cooccur with were exponents of "the same consonant"?) Third, even if we
were to subscribe to the concatenative-segment view, these phonetic differ-
ences suggest that whatever is going on in assimilation, it is not simply the
replacement of one object by another.

8.5 A nonprocedural interpretation


Can we then give a more appropriate formulation of what is entailed by the
notion "assimilation"? Clearly, there is some phenomenon to account for,
and the quotations I cited earlier do have some kind of a ring of common
sense about them. The least theory-committed thing we can say is that in the
"assimilated" and "nonassimilated" forms (of the same lexical item) we have
context-dependent exponents of the same phonological opposition (if it were
not the same then presumably we would be dealing with a different lexical
item). Notice, too, that this description fits neatly "consonant-vowel"
coarticulation phenomena described earlier. And as with "coarticulation,"
we can see that, at least in the cases under consideration, we are dealing here
not simply with some kind of context-dependent adjacency relationship
defined over strings, but rather with a structure-dependent relationship;
"assimilation" involves some constraint holding over codas and onsets,
typically where these constituents are themselves constituents of a larger
structure (see Cooper and Paccia-Cooper 1980; Scott and Cutler 1984; Local
and Kelly 1986). Notice that the phonetic details of the "assimilated"
consonant are specific to its assimilation environment in terms of the precise
place of articulation, closure, and duration characteristics, and its relations
with the exponents of the syllable nucleus. As I have suggested, the precise
7 Similar EPG results for "assimilations" have been presented by Barry (1985), Kerswill and
Wright (1989), and Nolan (this volume).

cluster of phonetic features which characterize "assimilated consonants" does
not, in fact, appear to be found elsewhere. And just as with coarticulation, we
can give assimilation a nonprocedural interpretation. In order to recast the
canonical procedural account of assimilation in rather more "theory-
neutral" nonprocedural fashion we need minimally:
to distinguish phonology from phonetics - we do not want to talk of the
phonology changing. Rather we want to be able to talk about
different structure-dependent phonetic interpretations of particular
coda-onset relations;
a way of representing and accessing constituent-structure information about
domains over which contrasts operate: in assimilation, rime expo-
nents are maintained; what differences there are are to be related to
the phonetic interpretation of coda;
a way of interpreting parameters (in temporal and quality terms) for
particular feature values and constellations of feature values.
The approach to phonology and phonetics which I described earlier (section
8.3) has enabled us at York to produce a computer program which embodies
all these prerequisites. It generates the acoustic parameters of English
monosyllables (and some two-syllable structures) from structured phono-
logical representations. The system is implemented in Prolog - a program-
ming language particularly suitable for handling relational problems. This
allows successful integration of, for example, feature structures, operations
on graphs, and unification with specification of temporal relations obtaining
between the exponents of structural units. In the terms sketched earlier, the
program formally and explicitly represents a phonology which is non-
segmental, abstract, structured, monostratal, and monotonic. Each statement
in the program reflects a commitment to a particular theoretical position. So-
called "process" phenomena such as "consonant-vowel" coarticulation are
modeled in exactly the way described earlier; exponents of onset constituents
are overlaid on exponents of syllable-rime-nucleus constituents. In order to
do this it is necessary to have a formal and explicit representation of
phonological relations and their parametric phonetic exponents. Having
such parametric phonetic representations means that (apart from not operat-
ing with cross-parametric segmentation) it becomes possible to achieve
precise control over the interpretation of different feature structures and
phonetic generalizations across feature structures.
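As a very rough analogue of this architecture (the York system itself is written in Prolog and is far richer), the following sketch reads parametric exponents off a structured representation without ever rewriting it; every structure, feature name, and parameter value here is an assumption made up for the example.

```python
# A rough analogue only: exponency statements indexed by place in structure
# and by feature values; interpretation is declarative, and the phonological
# representation is never changed. All names and values below are invented.

representation = {
    "onset": {"cons": {"grv": "+", "cmp": "+"}},
    "rime": {
        "nucleus": {"voc": {"grv": "-"}},
        "coda": {"cons": {}},   # nothing statable for place at this place in structure
    },
}

# (place in structure, feature constellation) -> parameter settings
EXPONENCY = [
    ("onset",   {"grv": "+", "cmp": "+"}, {"closure_ms": 60, "burst_centre_hz": 2500}),
    ("nucleus", {"grv": "-"},             {"duration_ms": 120, "F2_hz": 1800}),
]

def interpret(rep):
    """Collect the parametric exponents licensed by the structure."""
    places = {
        "onset":   rep["onset"]["cons"],
        "nucleus": rep["rime"]["nucleus"]["voc"],
        "coda":    rep["rime"]["coda"]["cons"],
    }
    return [(place, params)
            for place, feats, params in EXPONENCY
            if all(places[place].get(k) == v for k, v in feats.items())]

print(interpret(representation))
```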
As yet, I have had nothing to say about how we cope with the destructive-
process (feature-changing) orientation of the accounts I have considered so
far. I will deal with the "process" aspect first. A moment's consideration
reveals a very simple declarative, rather than procedural, solution. What we
want to say is something very akin to the overlaying in coarticulation
discussed above. Just as the onset and rime are deemed to share particular

vocalic characteristics so, in order to be able to do the appropriate phonetic
interpretation of assimilation, we want to say that the coda and onset share
particular features. Schematically we can picture the "sharing" in an
"assimilation context" such as that case as follows:

(13)
     σ1 exponents                  σ2 exponents
          |                             |
     coda1 exponents               onset2 exponents
               \                       /
                velarity exponents

We can ensure such a sharing by expressing the necessary constraint equation


over the coda-onset constituents, e.g.:

(14)
     syllable1                     syllable2
        |                             |
       rime                        onset2
        |
      coda1

     σ1 exponents                  σ2 exponents

            coda1 = onset2
           velarity exponents

While this goes some way to removing the process orientation by stipulating
the appropriate structural domain over which velarity exponents operate,
there is still the problem of specifying the appropriate feature structures.
How can we achieve this sharing without destructively changing the feature
structure associated with coda1? At first glance, this looks like a problem, for
change the structure we must if we are going to share values. Pairs of feature
structures cannot be unified if they have conflicting information - this would
certainly be the case if the coda1 has associated with it a feature structure with
information about (alveolar) place of articulation. However, if we consider
the phonological oppositions operating in system at this point in structure we
can see that alveolarity is not relevantly statable. If we examine the phonetic

exponents of the coda in various pronunciations of that, for example, we can


see that the things which can happen for "place of articulation" are very
different from the things which can occur as exponents of the onset in words
such as tot, dot, and not. In the onsets of tot-type words we do not find
variant, related pronunciations with
glottality (e.g. ?h ?-) or
labiality (e.g. p~: ?w~) or
velarity (e.g. k~ ?k- :).
The reason for this is that at this (onset) place in structure alveolarity versus
bilabiality versus velarity are systemically in opposition. The obvious way out
of the apparent dilemma raised by assimilation, then, is to treat the (alveolar)
place of articulation as not playing any systemically distinctive role in the
phonology at this (coda) place in structure. This is in fact what Lodge (1984)
proposes, although he casts his formulation within a process model. He writes:
"In the case of the alveolars, so-called, the place of articulation varies
considerably, as we have already seen. We can reflect this fact by leaving /t/, /d/
and /n/ unspecified for place and having process and realization rules supply
the appropriate feature" (1984: 123). And he gives the following rewrite-rule
formulation under the section headed "The process rules":
1 If [stop]
     [∅place]
  then [αplace] / ___ [αplace]

2 If [stop]
     [∅place]
  then [alv]

Although Lodge (1984: 131) talks of "realization rules" (rule 2 here), the
formulation here, like the earlier formulations, does not distinguish between
the phonetic and the phonological (it again trades on a putative, naive
interpretation of the phonological features). "Realization" is not being
employed here in a "phonetic-interpretation" sense; rather, it is employed in
the sense of "feature-specification defaults" (see Chomsky and Halle 1968;
Gazdar et al 1985).
The idea of not specifying some parts of phonological representations has
recently found renewed favor under the label of "underspecification theory"
(UT) (Archangeli 1988). However, the monolithic asystemic approach sub-
sumed under the UT label is at present too coarse an instrument to handle
the phenomena under consideration here (because of its across-the-board
principles, treating the codas of words like bit, bid and bin as not being
specified for "place of articulation" would entail that all such "alveolars" be
so specified; as I will show, though the same principle of not specifying some
part of the phonological representation of codas of words such as this and his
is appropriate to account for palatality assimilation as described earlier, it is
not "place" of articulation that is involved in these cases).
Within the phonological model described earlier, we can give "un(der)-
specification of place" a straightforward interpretation. One important
characteristic of unification-based formalisms employing feature structures
of the kind I described earlier (and which are implemented in our model) is
their ability to share structure. That is, two or more features in a structure
may share one value (Shieber 1986; Pollard and Sag 1987). When two
features share one value, then any increment in information about that value
automatically provides further specification for both features. In the present
case, what we require is that coda-constituent feature structures for bit, bid,
and bin, for instance, should have "place" features undefined. A feature
whose value is undefined is denoted: [feat: 0]. Thus we state, in part, for coda
exponents under consideration:
(15)
grv: 0
cmp: 0
This, taken with the coda-onset constraint equation above, allows for the
appropriate sharing of information. We can illustrate this by means of a
partial feature structure corresponding to it cut:
(16)
coda:    src:
         cons: [1]
         voc:

onset:   src:
         cons: [1] [ grv: +
                     cmp: + ]
         voc:
The sharing of structure is indicated by using integer coindexing in feature-
value notation. In this feature structure [1] indexes the category [grv: +,
cmp: +], and its occurrence in [coda: [cons: [1]]] indicates that the same value
is here shared. Note that this coindexing indicating sharing of category
values at different places in structure is not equivalent to multiple occur-
rences of the same values at different places in structure. Compare the
following feature structure:

(17)
coda:    src:
         cons: [ grv: +
                 cmp: + ]
         voc:

onset:   src:
         cons: [ grv: +
                 cmp: + ]
         voc:

This is not the same as the preceding feature structure. In the first case we
have one shared token of the same category [cons: [grv: +, cmp: +]],
whereas in the second there are two different tokens. While the first of
these partial descriptions corresponds to the coda-onset aspects of it cut,
the second relates to the coda-onset aspects of an utterance such as
black case. By sharing, then, we mean that the attributes have the same token
as their value rather than two tokens of the same type. This crucial
difference allows us to give an appropriate phonetic interpretation in the
two different cases (using the extant exponency data in the synthesis
program).
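The token/type distinction can be mimicked with ordinary object identity. The sketch below is only an illustration of the point, not the York implementation; the attribute names follow the text, and the final "frc" increment is invented.

```python
# Sharing one token versus having two tokens of the same type.

cat = {"grv": "+", "cmp": "+"}                       # one category token

it_cut = {                                           # as in (16): coda and onset share it
    "coda":  {"cons": cat},
    "onset": {"cons": cat},
}

black_case = {                                       # as in (17): two tokens, same type
    "coda":  {"cons": {"grv": "+", "cmp": "+"}},
    "onset": {"cons": {"grv": "+", "cmp": "+"}},
}

# equal as types in both structures ...
assert it_cut["coda"]["cons"] == it_cut["onset"]["cons"]
assert black_case["coda"]["cons"] == black_case["onset"]["cons"]
# ... but only the first has one shared token:
assert it_cut["coda"]["cons"] is it_cut["onset"]["cons"]
assert black_case["coda"]["cons"] is not black_case["onset"]["cons"]

# Any increment in information about a shared value specifies both places at
# once ("frc" is an attribute invented purely for this illustration).
it_cut["onset"]["cons"]["frc"] = "+"
assert it_cut["coda"]["cons"]["frc"] == "+"
```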
As the standard assimilation story reflects, the phonological coda opposit-
ions which have, in various places, t, d, n as their exponents share certain
commonalities: their exponents can exhibit other than alveolarity as place of
articulation; in traditional phonetic terms their exponents across different
assimilation contexts have manner, voicing, and source features in common.
Thus the exponents of the coda of ten as in ten peas, ten teas, ten keys all have
as part of their make-up voice, closure, and nasality; the exponents of the
coda of it in it tore, it bit, it cut all have voicelessness, closure, and non-
nasality as part of theirs. In addition, the "coarticulation" of the assimilated
coda exponents binds them to the exponents of the nucleus constituent in the
same syllable. At first glance, the situation with codas of words such as this
and his might appear to be the same as that just described. That is, we require
underspecification of "place of articulation." In conventional accounts (e.g.
Roach above and Gimson 1970) they are lumped for all practical purposes
with the "alveolars" as an "assimilation class." In some respects they are
similar, but in one important respect the exponents of these codas behave
differently. Crucially, we do not find the same place of articulation pheno-
mena here that wefindwith the bit, bid, and bin codas. Whereas in those cases
we find variants with alveolarity, labiality, velarity, and dentality, for
example, we do not find the same possibilities for this and his type of codas.
For instance, in our collection of material, we have not observed (to give
broad systemic representations) forms such as:
(18) 6iffif difman digdsn diOOir)
for this fish, this man, this then, and this thing. However, we do find
"assimilated" versions of words such as this and his with palatality (as
described above, e.g. this shop, this year). In these cases we do not want to say
that the values of the [cons] part of the coda feature-structure are undefined in
their entirety. Rather, we want to say that we have sharing of "palatality."
(Although the production of palatality as an exponent of the assimilated
coda constituent of this may appear to be a "place-of-articulation" matter, it
is not [see Lass 1976: 4.3, for a related, though segmental, analysis of
alveolarity with friction, on the one hand, and palato-alveolarity with
friction, on the other]). The allocation of one kind of "underspecification" to
the codas of bit, bid, and bin forms and another to the codas of this and his
forms represents part of an explicit implementation of the Firthian poly-
systemic approach to phonological interpretation.
Given the way we have constructed our feature descriptions, the partial
representation for the relevant aspects of such feature structures will look
like this:
(19)
cons: [ grv:          cons: [ grv:
        cmp: ]   U            cmp: ]
U here denotes the unification of the two structures. This will give us the
large structure corresponding to the "assimilated" version of this shop:

(20)
coda:    cons: [ grv:
                 cmp: + ]
         voc:

onset:   cons: [ grv: +
                 cmp: + ]

Once we have done this, it is possible to give again an appropriate kind of


phonetic interpretation such that the synthesized tokens of this shop and fish
shop have appropriate kinds of friction portion associated with them (for a
given particular version a particular kind of "assimilation" piece). Because
we have a componential parametric phonetic representation, sharing of
"compactness" across the exponents of structure, as described here, is a
trivial task. Compare the spectrograms of the synthesis output below of
tokens of this with fish, and this shop ("unassimilated" and "assimilated")
with this and with fish shop. (The unassimilated version of this shop was
generated by not allowing the sharing of compactness across coda-onset
constituents in the description of the structural input to the synthesis
program.) A number of points are worth comment. First, the periods of
friction corresponding to the "assimilated" coda in this and the coda of fish
are not identical, nor are the portions corresponding to the vocalic part of
these syllables. By contrast there are clear similarities between the portions
corresponding to the vocalic part of the "unassimilated" and "assimilated"
versions of this and the isolated form of this. Compare these spectrograms
with the natural-speech versions in figure 8.3 above. The synthesis output,


Figure 8.5 Spectrograms of synthetic versions of (a) this, (b) fish, (c) this shop
(unassimilated), (d) this shop (assimilated), and (e) fish shop

while differing in many details from the natural versions, nonetheless has in it
just the same kinds of relevant similarities and differences.

8.6 Conclusion
I have briefly sketched an approach to phonological representation and
parametric phonetic interpretation which provides an explicit and computa-
tionally tractable model for nonsegmental, nonprocedural, (rewrite-)rule-
free speech synthesis. I have tried to demonstrate that it is possible to model
"process" phenomena within this approach in a nonderivational way that
provides a consistent and coherent account of particular assimilatory aspects
of speech. The success of this approach in high-quality speech synthesis
suggests that this model of phonological organization will repay extensive
further examination.

Comments on chapter 8
KLAUS KOHLER
Summary of Local's position
The main points of Local's paper are:
1 His phonological approach does not admit of segmental entities at any level
of representation.
2 It distinguishes strictly between phonology and phonetics.
Phonology is abstract.
It is monostratal, i.e. there is only one level of phonological representation.
There are no derivational steps, therefore no processes (e.g. the
conversion of a high tone into a rising tone following a low tone).
Instead, phonological categories have phonetic exponents, which are
descriptions of physical, temporal events formulated in the physical
domain, i.e. phonetic representations in terms of component para-
meters and their synchronization in time. The precise form of
phonetic representations has no bearing on the form of phonological
representations.
Phonological representations are structured; there is no time or sequence in
them, but only places in structure. Phonology deals with unordered
labeled graph-objects instead of linearly ordered strings of symbols.
Talking about sequence only makes sense in terms of temporal
phonetic interpretation.
Feature structures must be distinguished from parametric phonetic rep-
resentations in time. Deletion and insertion are treated as different
kinds of parameter synchronization in the phonetic interpretation of
the phonological representation. Phonological operations are mono-
tonic; i.e. the combinatorial operations with which phonological
structures are constructed can only add information to represen-
tations and cannot remove anything.
3 As a corollary to the preceding, there are no rewrite rules. The assimilations
of, for example, alveolars to following labials and velars at word boundaries
in English are not treated as changes in phonology, because there are no
"alveolar consonants" in the phonology of English; "by positing such
entities we simply embroil ourselves in a procedural-destructive-type
account of assimilation because we have to change the 'alveolar consonant'
into something else" (p. 207). Quite apart from this there is no homophony
in, e.g., that case and black case or this shop andfishshop, so the empirical
facts are not reported correctly.

Critical comments
My reply to Local's paper follows the line of arguments set out above.
1 If phonological elements are nonsegmental and unordered, only struc-
tured, and if phonetic exponency shows ordering in time, then Local has to
demonstrate how the one is transformed into the other. He has not done this.
To say that component parameters are synchronized in time in phonetic
representations is not enough, because we require explicit statements as to
the points in sequence where this synchronization occurs, where certain
parameters take on certain values.
Local does not comprehensively state how the generation of acoustic
parameters from structured phonological representations is achieved. What
is, for instance the input into the computer that activates these exponency
files, e.g. in black case and that easel Surely Local types in the sequence of
alphabetic symbols of English spelling, which at the same time reflects a
phonetic order: k and t come before c of case. This orthographic sequence is
then, presumably, transformed into Local's phonological structures, which,
therefore, implicitly contain segmental-order information because the struc-
tures are derived from a sequential input. So the sequential information is
never lost, and, consequently, not introduced specially by the exponency files
activated in turn by the structural transforms. When the exponency files are
called upon to provide parametric values and time extensions the segmental
order is already there. Even if Local denies the existence of segments and
sequence in phonology, his application to synthesis by rule in his computer
program must implicitly rely on it.
2 If phonology is strictly separated from phonetics, and abstract, how can
timeless, unordered, abstract structures be turned into ordered, concrete time
courses of physical parameters? Features must bear very direct relations to
parameters, at least in a number of cases at focal points, and there must be
order in the phonological representations to indicate whether parameter
values occur sooner or later, before or after particular values in some other
parameter. Moreover, action theory (e.g. Fowler 1980) has demonstrated
very convincingly that temporal information must be incorporated in
phonological representations. This is also the assumption Browman and
Goldstein (1986, this volume) work on. And many papers in this volume (like
Firth himself, e.g. Firth 1957) want to take phonology into the laboratory
and incorporate phonetics into phonology. The precise form of phonetic
representations does indeed have a bearing on the form of phonological
representations (cf. Ohala 1983). For example, the question as to why
alveolars are assimilated to following labials and velars, not vice versa, and
why labials and velars are not assimilated to each other in English or

German, finds its answer in the phonetics of speech production and
perception, and the statement of this restriction of assimilation must be part
of phonology (Kohler 1990).
If the strict separation of phonetics and phonology, advocated by Local, is
given up, phonology cannot be monostratal, and representations have to be
changed. It is certainly not an empirical fact that information can only be
added to phonological representations, never removed. The examples that
case/black case and this shop/fish shop do not prove the generality of Local's
assertion. We can ask a number of questions with regard to them:
1 How general is the distinction? The fact that it can be maintained is no proof
that it has to be maintained. The anecdotal reference to these phrases is not
sufficient. We want statistical evaluations of a large data base, not a few
laboratory productions in constructed sentences by L and K, which may
even stand for John Local and John Kelly.
2 Even if we grant that the distinction is maintained in stressed that case vs.
stressed black case, what happens in It isn't in the bag, it's in that case?
3 What happens in the case of nasals, e.g. in You can get it or pancake?
4 What happens in other vowel contexts, e.g. in hot cooking?
5 In German mitkommen, mitgehen the traces of /t/ are definitely removable,
resulting in a coalescence with the clusters in zurückkommen, zurückgehen;
similarly, ankommen/langkommen, angehen/langgehen.
6 Even if these assimilations were such that there are always phonetic traces of
the unassimilated structures left, this cannot be upheld in all cases of
articulatory adjustments and reductions. For instance, German mit dem
Auto can be reduced to a greater or lesser extent in the function words mit
and dem. Two realizations at different ends of the reduction scale are:
[mɪt deːm ˈʔaotoː]
[mɪm ˈʔaotoː]
There is no sense in saying that the syllable onset and nucleus of [deːm] are
still contained in [m], because the second utterance has one syllable less. If,
however, the two utterances are related to the same lexical items and a
uniform phonological representation, which is certainly a sensible thing to
do, then there has to be derivation and change. And this derivation has to
explain phonologically and phonetically why it, rather than any other
derivation, occurs. Allowing derivations can give these insights, e.g. for the
set of German weak form reductions [mɪt deːm], [mɪt dəm], [mɪtm], [mɪpm],
[mɪbm], [mɪmm], [mɪm] along a scale from least reduced and most formal to
most reduced and least formal, which can be accounted for by a set of
ordered rules explaining these changes with reference to general phonetic
principles, and excluding all others (Kohler 1979).

3 Rewrite rules are thus not only inevitable, they also add to the explanatory
power of our phonological descriptions. This leads me to a further question.

What are phonologies of the type Local proposes useful for? He does not
deal with this issue, but I think it is fair to say that he is not basically
concerned with explanatory adequacy, i.e. with the question as to why things
are the way they are. His main concern is with descriptive adequacy, and it is
in this respect that the acuteness of phonetic observations on prosodic lines
can contribute a lot, and has definitely done so. This is the area where
prosodic analysis should continue to excel, at the expense of the theorizing
presented by Local.

Comments on chapter 8
MARIO ROSSI
The phonology-phonetics distinction
I agree with Local on the necessity of a clear distinction between the two
levels of matter (substance) and form. But it seems to me that his conception
is an old one derived from a misinterpretation of De Saussure. Many of the
structuralists thought that the concept of language in De Saussure was
defined as a pure form; but an accurate reading of De Saussure, whose ideas
were mostly influenced by Aristotle's Metaphysics, shows that the "langue"
concept is a compound of matter and form, and that the matter is organized
by the form. So it is the organization of acoustic cues in the matter, which is a
reflection of the form, that allows in some way the perception and decoding
of the linguistic form. Consequently, Local's assumption, "it makes no sense
to talk of structure or systems in phonetics" (p. 193), is a misconception of
the relationship between matter and form. Matter is not an "amorphous"
substance. The arbitrary relationship between matter and form (I prefer
"matter and form" to "phonetics and phonology") means that the para-
meters of the matter and the way in which they are organized are not
necessarily linked with the form; that is, the form imposes an organization on
the parameters of the matter according to the constraints and the specific
modes of the matter. So we are justified in looking for traces of form values in
the matter.
At the same time, we have to remember that the organization of matter is
not isomorphic with the structure of the form. In other words, the acoustic/
articulatory cues are not structured as linguistic features, as implied in
Jakobson, Fant, and Halle (1952). In that sense, Local is right when he says
"phonological features do not have implicit denotations" (p. 196). In reality,
the type of reasoning Local uses in the discussion of the assimilation process
in this shop and fish shop demonstrates that he is looking at the acoustic
parameters as organized parameters, parameters that reflect the formal
structure.
I agree in part with the assumption that the search for invariant phonetic
correlates of phonological categories is implied by the erroneous belief that
the phonological categories have some kind of implicit "naive phonetic
denotation." However, I think that the search for invariants by some
phoneticians is not "naive," but more complexly tied to two factors:
1 The structuralist conception of language as pure form, and the phonology
derived from this conception. In this conception a phonological unit does
not change: "once a phoneme always a phoneme."
2 The lack of a clear distinction in American structuralism between cues and
features, derived from the misconception that emphasizes the omnipotence
of the form embedded in the matter.
Finally, to say that "the relationship between phonology and phonetics is
arbitrary" (p. 193) is to overlook the theory of natural phonology. The
concept of the arbitrary relationship needs to be explained and clarified.
Consider Hooper's (1976) "likelihood condition"; if this condition did not
hold, the phonological unit would have phonetic exponents that would be
totally random.

Underspecification
How can an "unspecified" coda affect the onset exponents of the structure
(p. 215)? Local defines some codas as "underspecified"; he says that phono-
logical descriptions are algebraic objects and the precise form of phonetic
representation has no bearing on the form of the phonological represen-
tations. I see a contradiction in this reasoning: underspecification is posited
in order to account for phonetic assimilation processes. So phonetic
representations have bearing on form! I see no difference between under-
specification and the neutralization that Local wants to avoid.

9
Lexical processing and phonological representation

ADITI LAHIRI AND WILLIAM MARSLEN-WILSON

9.1 Introduction
In this paper, we are concerned with the mental representation of lexical
items and the way in which the acoustic signal is mapped onto these
representations during the process of recognition. We propose here a
psycholinguistic model of these processes, integrating a theory of processing
with a theory of representation. We take the cohort model of spoken word-
recognition (Marslen-Wilson 1984, 1987) as the basis for our assumptions
about the processing environment in which lexical processing takes place,
and we take fundamental phonological assumptions about abstractness as
the basis for our theory of representation. Specifically, we assume that the
abstract properties that phonological theory assigns to underlying represen-
tations of lexical form correspond, in some significant way, to the listener's
mental representations of lexical form in the "recognition lexicon," and that
these representations have direct consequences for the way in which the
listener interprets the incoming acoustic-phonetic information, as the speech
signal is mapped into the lexicon.
The paper is organized as follows. We first lay out our basic assumptions
about the processing and representation of lexical form. We then turn to two
experimental investigations of the resulting psycholinguistic model: the first
involves the representation and the spreading of a melodic feature (the
feature [nasal]); and the second concerns the representation of quantity,
specifically geminate consonants. In each case we show that the listener's
performance is best understood in terms of very abstract perceptual represen-
tations, rather than representations which simply reflect the surface forms of
words in the language.


9.2 Outline of a theory of lexical processing and representation


In the following two sections we will outline our basic assumptions about the
abstract nature of mental representations in the recognition lexicon, and
about the processing environment within which these representations have
their perceptual function.

9.2.1 Assumptions about representation


The psycholinguistically relevant representations of lexical form must be
abstract in nature - that is, they must in some way abstract away from the
variabilities in the surface realization of lexical form. This means that our
account of the properties of the recognition lexicon must be an account in
terms of some set of underlying mental representations. The only systematic
hypotheses about the properties of these underlying representations are those
that derive from phonological theory, and it is these that we take as our
starting point here. We do not assume that there is a literal and direct
translation from phonological analysis to claims about mental represen-
tation, but we do assume that there is a functional isomorphism between the
general properties of these mental representations and the general properties
of lexical-form representations as established by phonological analysis.
From the consensus of current opinion about phonological represen-
tations we extract three main assumptions. First, we assume hierarchical
representation of features, such that underlying segments are not defined as
unordered lists. Such a featural organization attributes features to different
tiers and allows the relationships between two features or sets of features to
be expressed in terms of links between the tiers (Clements 1985). An
advantage of hierarchical organization of features is that it is possible to
express the fact that certain groups of features consistently behave as
functional units. From the perspective of lexical access and the recognition
lexicon, this means that the input to the lexicon will need to be in terms of
features rather than segments, since there is no independent level of represen-
tation corresponding to segments which could function as the access route to
the lexicon.
The second assumption concerns the representation of quantity. The
autosegmental theory of length has assumed that the feature content of the
segment is represented on a different level (the melody tier) from the quantity
of the segment. Long segments, both vowels and consonants, are single
melody units containing all featural information doubly linked to two
abstract timing units on the skeletal tier (see Hayes 1986). Researchers differ
as to the exact nature of the unit of representation (see McCarthy and Prince
1986; Lahiri and Koreman 1988; Hayes 1989), but for our purposes, it is
sufficient to assume that in the representation of lexical items in the mental
lexicon, the featural content is separate from its quantity. This has impli-
cations for the perceptual processing of quantity as opposed to quality
information, as we discuss in more detail in section 9.4 below.
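A toy rendering of this separation (my own notation, not the authors'): the featural content lives in a single melody unit, and length is encoded only by how many timing slots that unit is linked to.

```python
# Toy illustration: featural content (melody) kept apart from quantity (skeleton).

melody_t = {"cor": "+", "cont": "-"}                  # one melody unit

singleton = {"skeleton": ["X"],      "links": [(0, melody_t)]}
geminate  = {"skeleton": ["X", "X"], "links": [(0, melody_t), (1, melody_t)]}

# Identical featural content; only the skeletal tier and the double linking
# encode the quantity difference.
assert singleton["links"][0][1] == geminate["links"][0][1]
assert geminate["links"][0][1] is geminate["links"][1][1]   # doubly linked melody unit
assert len(geminate["skeleton"]) == 2 and len(singleton["skeleton"]) == 1
```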
The third assumption concerns the amount of featural content present in
the underlying representation. A theory of feature specification must deter-
mine what features are present underlyingly and which values of these
features are specified. We assume here an underspecified lexicon, where only
unpredictable information is represented. First, only distinctive features
which crucially differentiate at least two segments are present, and second,
only one value of a feature (the marked value) is specified. These claims are
by no means uncontroversial. However, a full discussion of the different
views on underspecification is beyond the scope of this paper. Since we are
primarily concerned with the feature [nasal] in the research we present here,
we will discuss later in the paper the representation of this feature in the
relevant languages.
From the perspective of the processing model and the recognition lexicon,
these assumptions mean that the lexical representations deployed in speech
recognition will contain only distinctive and marked information. The
crucial consequence of this, which we will explore in the research described
below, is that the process of lexical access will assign a different status to
information about nondistinctive, unmarked properties of the signal than it
will to information that is directly represented in the recognition lexicon.
This means that neutralized elements (at least, those that arise from
postlexical feature-spreading rules) are interpreted with respect to their
underlying representation and not their surface forms.

9.2.2 Assumptions about processing


To be able to evaluate claims about representation in a psycholinguistic
framework, they need to be interpreted in the context of a model of the
processing environment for lexical access. The theory that we assume here is
the cohort model of spoken-word recognition (Marslen-Wilson 1984, 1987).
The salient features of this model are as follows.
The cohort model distinguishes an initial, autonomous process of lexical
access and selection, responsible for the mapping of the speech signal onto
the representations of word forms in the mental lexicon. The model assumes
that there is a discrete, computationally independent recognition element for
each lexical unit. This unit represents the functional coordination of the
bundle of phonological, morphological, syntactic, and semantic properties
defining a given lexical entry. Here we are concerned only with the phono-
logical aspects of the representation. The "recognition lexicon," therefore, is
constituted by the entire array of such elements.

A second property of the system is that it allows for the simultaneous,


parallel activation of each lexical element by the appropriate input from the
analysis of the acoustic signal. This is coupled with the further assumption
that the level of activation of each element reflects the goodness of fit of the
input to the form specifications for each element. As more matching input
accumulates, the level of activation will increase. When the input pattern fails
to match, the level of activation starts to decay.
These assumptions lead to the characteristic cohort view of the form-based
access and selection process. The process begins with the multiple access of
word candidates as the beginning of the word is heard. All of the words in the
listener's mental lexicon that share this onset sequence are assumed to be
activated. This initial pool of active word candidates forms the "word-initial
cohort," from among which the correct candidate will subsequently be
selected. The selection decision itself is based on a process of successive
reduction of the active membership of the cohort of competitors. As more of
the word is heard, the accumulating input pattern will diverge from the
form specifications of an increasingly higher proportion of the cohort
membership.
This process of reduction continues until only one candidate remains still
matching the speech input - in activation terms, until the level of activation
of one recognition element becomes sufficiently distinct from the level of
activation of its competitors. At this point the form-based selection process is
complete, and the word form that best matches the speech input can be
identified.
For our current concerns, the most important feature of this processing
model is that it is based on the concept of competition among alternative
word candidates. Perceptual choice, in the cohort approach, is a contingent
choice. The identification of any given word does not depend simply on the
information that this word is present. It also depends on the information that
other words are not present, since it is only at the point in the word where no
other words fit the sensory input - known as the "recognition point" - that
the unique candidate emerges from among its competitors.
The recognition of a word does not depend on the perceptual availability
of a complete specification of that word in the sensory input, either where
individual segments or features are concerned, or where the whole word is
concerned. The information has to be sufficient to discriminate the word
from its competitors, but this is a relative concept. This makes the basic mode
of operation of the recognition system compatible with the claim that the
representations in the recognition lexicon contain only distinctive infor-
mation. This will be sufficient, in a contingent, competition-based recogni-
tion process, to enable the correct item to be recognized.
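The following deliberately simplified sketch shows how such a contingent, competition-based selection over featural input might be set up. The toy lexicon, feature labels, and matching rule are assumptions made for the example, not the authors' model or experimental materials; the point is only that entries carrying just distinctive, marked information are never excluded by predictable surface properties.

```python
# Toy cohort-style selection: candidates drop out only when the featural
# input conflicts with something their entry actually specifies.

LEXICON = {
    # one feature bundle per position; only "distinctive"/marked values listed
    "ban": [{"cons": "+"}, {"voc": "+"}, {"cons": "+", "nasal": "+"}],
    "bad": [{"cons": "+"}, {"voc": "+"}, {"cons": "+"}],
    "pad": [{"cons": "+", "voiceless": "+"}, {"voc": "+"}, {"cons": "+"}],
}

def compatible(entry_pos, input_feats):
    """No specified value may conflict with the input; properties the entry
    leaves unspecified never cause a mismatch."""
    return all(input_feats.get(feat, val) == val for feat, val in entry_pos.items())

def cohort(input_sequence):
    """Track the still-matching candidates as featural input accumulates."""
    candidates = set(LEXICON)
    for i, feats in enumerate(input_sequence):
        candidates = {w for w in candidates
                      if i < len(LEXICON[w]) and compatible(LEXICON[w][i], feats)}
        print(f"after position {i}: {sorted(candidates)}")
    return candidates

# A voiced initial excludes "pad"; a surface-nasalized vowel excludes nothing,
# since no entry specifies the vowel for nasality; an oral final consonant
# then conflicts with the specified [nasal] of "ban".
cohort([{"cons": "+", "voiceless": "-"},
        {"voc": "+", "nasal": "+"},
        {"cons": "+", "nasal": "-"}])
```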
The second important aspect of a cohort approach to form-based process-
ing is its emphasis on the continuous and sequential nature of the access and
selection process. The speech signal is based on a continuous sequence of
articulatory gestures, which result in a continuous modulation of the signal.
Recent research (Warren and Marslen-Wilson 1987, 1988) shows that this
continuous modulation of the speech signal is faithfully tracked by the
processes responsible for lexical access and selection. As information
becomes available in the signal, its consequences are immediately felt at the
lexical level.
This means that there is no segment-by-segment structuring of the
relationship between the prelexical analysis of the speech signal and the
interpretation of this analysis at the level of lexical choice. The system does
not wait until segmental labels can be assigned before communicating with
the lexicon. Featural cues start to affect lexical choice as soon as they become
available in the speech input. The listener uses featural information to select
words that are compatible with these cues, even though the final segment
cannot yet be uniquely identified (Warren and Marslen-Wilson 1987, 1988).
On a number of counts, then, the cohort view of lexical access is
compatible with the phonological view of lexical representation that we
outlined earlier. It allows for an on-line process of competition between
minimally specified elements, where this minimal specification is still suffi-
cient to maintain distinctiveness; and second, it allows for this competition to
be conducted, with maximal on-line efficiency, in terms of a continuous
stream of information about the cues that the speech signal provides to
lexical identity, where these cues are defined in featural terms.1
Given this preliminary sketch of our claims about lexical representation in
the context of a model of lexical processing, we now turn to a series of
experimental investigations of the psycholinguistic model that has emerged.

9.3 Processing and representation of a melodic feature


Our fundamental claim is that the processes of lexical access and selection are
conducted with respect to abstract phonological representations of lexical
form. This means that listeners do not have available to them, as they process
the speech input, a representation of the surface phonetic realization of a
given word form. Instead, what determines their performance is the underly-
ing mental representation with respect to which this surface string is being
interpreted.
We will test this claim here in two ways, in each case asking whether
information in the speech signal is interpreted in ways that follow from the
1 The Trace model (McClelland and Elman 1986) is an example of a computationally
implemented model with essentially these processing properties - although, of course, it
makes very different assumptions about representation.


claims that we have made about the properties of the underlying represen-
tations. In the first of these tests, to which we now turn, we investigate the
interpretation of the same surface feature as its underlying phonological
status varies across different languages. If it is the underlying representation
that controls performance, then the interpretation of the surface feature
should change as its underlying phonological status changes. In the second
test, presented in section 9.4, we investigate the processing and represen-
tation of quantity.

9.3.1 The oral/nasal contrast in English and Bengali


The feature that we chose to concentrate on was the oral/nasal contrast for
vowels. This was largely because of the uncontroversial status of the feature
[nasal] in natural languages. Nasal vowels usually occur in languages which
have the oral counterpart, and are thus considered to be marked. In terms of
underlying feature specification, distinctive nasal vowels are assumed to be
marked underlyingly as [ + nasal], whereas the oral vowels are left unspecified
(e.g. Archangeli 1984).
The second reason for choosing the feature [nasal] was because the
presence of vowel nasalization does not necessarily depend on the existence
of an underlyingly nasal vowel. A phonetically nasal vowel can reflect an
underlying nasal vowel, or it can be derived, by a process of assimilation,
from a following nasal consonant. Such regressive nasal-assimilation pro-
cesses are widespread cross-linguistically.
This derived nasalization gives us the basic contrasts that we need, where
the same surface feature can have a varying phonological status, contrasting
both within a given language and between different languages. This leads to
the third reason for choosing the feature [nasal]: namely, the availability of
two languages (English and Bengali) which allowed us to realize these
contrasts in the appropriate stimulus sets. The relevant facts are summarized
in table 9.1.
English has only underlying oral vowel segments. In the two cases
illustrated (CVC and CVN) the vowel is underlyingly oral. An allophonic
rule of vowel nasalization nasalizes all oral vowels when followed by a nasal
consonant, giving surface contrasts like ban [bæ̃n] and bad [bæd]. In
autosegmental terms, assimilation is described as spreading where an asso-
ciation line is added. This is illustrated in the differing surface realization of
the oral vowels in the CVNs and the CVCs in table 9.1. The assumption that
the vowel in the CVN is underlyingly oral follows from the fact that surface
nasalization is always predictable, and therefore need not be specified in the
abstract representation.
Bengali has both underlyingly oral and nasal vowel segments. Each of the
Table 9.1 Underlying and surface representation in Bengali and English

Bengali
                 CVN          CVC          CṼC

Underlying      V   C        V   C        V   C
                    |                     |
                 [+nas]                [+nas]

                 CṼN          CVC          CṼC

Surface         V   C        V   C        V   C
                 \  |                     |
                 [+nas]                [+nas]

English
                 CVN          CVC

Underlying      V   C        V   C
                    |
                 [+nas]

                 CṼN          CVC

Surface         V   C        V   C
                 \  |
                 [+nas]

seven oral vowels in the language has a corresponding nasal vowel, as in the
minimal pairs [pak] "slime" and [pãk] "cooking" (Ferguson and Chowdhury
1960). A postlexical process of regressive assimilation is an additional source
of surface nasalization applying to both monomorphemic and heteromor-
phemic words.2

/kha + o/ -> [khao]  "you (familiar) eat"
/kha + n/ -> [khãn]  "you (honorific) eat"
/kha + s/ -> [khas]  "you (familiar, younger) eat"
/kan/     -> [kãn]   "ear"
We make two assumptions regarding the specification of the [nasal] feature in
the underlying representation of Bengali. First, only one value is specified, in
this instance [ +nasal]; second, the vowels in monomorphemic CVN words,
which always surface as CṼN, are underlyingly oral and are therefore not
specified for nasality. The surface nasality of these vowels is entirely
2 Our experiments on Bengali nasalization in fact only use monomorphemic words; we illustrate
here the application of the nasal-assimilation rule on heteromorphemic words to show how it
works in the language in general.


predictable, and since the rule of nasal assimilation is independently needed
for heteromorphemic words (as illustrated above), our assumptions about
underspecified underlying forms (especially with unmarked values [Kiparsky
1985: 92]) suggest that for all monomorphemic VN sequences, the underlying
representation of the vowel segment should be an unmarked oral vowel.
Note that the experiment reported below uses only monomorphemic CVN
sequences of this type.
Thus, assuming that the vowels in CVNs are underlyingly oral, and given
the nasal-assimilation rule, this leads to the situation shown in the upper half
of table 9.1. Surface nasality is ambiguous in Bengali since the vowels in both
CVNs and in CṼCs are realized as nasal. Unlike English, therefore, the
nasal-assimilation rule in Bengali is neutralizing, and creates potential
ambiguity at the lexical level.
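The derivational side of this picture can be illustrated with a small sketch (again not the authors' formalism; the segment and feature labels are simplified stand-ins, and the forms are the /kap/, /kam/, /kãp/ triplet used later in the study). Adding the association line amounts to copying [+nasal] from a nasal consonant onto the immediately preceding vowel, and this neutralizes the CVN/CṼC distinction on the surface.

# Toy regressive nasal spreading over underspecified forms (invented labels).
def spread_nasal(underlying):
    # Return a surface form in which a vowel directly before a [+nasal]
    # consonant acquires [+nasal] - the added association line.
    surface = [(seg, dict(feats)) for seg, feats in underlying]
    for i in range(len(surface) - 1):
        feats, next_feats = surface[i][1], surface[i + 1][1]
        if feats.get("syllabic") and next_feats.get("nasal"):
            feats["nasal"] = True
    return surface

kam       = [("k", {}), ("a", {"syllabic": True}), ("m", {"nasal": True})]   # CVN
kap       = [("k", {}), ("a", {"syllabic": True}), ("p", {})]                # CVC
kap_nasal = [("k", {}), ("a", {"syllabic": True, "nasal": True}), ("p", {})] # CṼC

for form in (kam, kap, kap_nasal):
    print(spread_nasal(form))
# The CVN and CṼC forms both surface with a [+nasal] vowel - the
# neutralization that creates the lexical ambiguity just described.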

9.3.2 Experimental predictions


The pattern of surface and underlying forms laid out in table 9.1 allows us to
discriminate directly the predictions of a theory of the type we are advocating
- where underlying mental representations are abstract and underspecified -
from alternative views, where the recognition process is conducted in terms
of some representation of the surface form of the word in question.3 Note
that the surface-representation theory that we are assuming here will itself
have to abstract away, at least to some degree, from the phonetic detail of a
word's realization as a spoken form. We interpret "surface representation,"
therefore, as being equivalent to the representation of a word's phonological
form after all phonological rules have applied - in the case of assimilatory
processes like derived nasalization, for example, after the feature [nasal] has
spread to the preceding consonant. Surface representation means, then, the
complete specification of a word's phonetic form, but without any of the
details of its realization by a given speaker in a given phonetic environment.
The differences in the predictions of the surface and the underlying
hypotheses only apply, however, to the period during which the listener is
hearing the oral or nasal vowel (in monosyllables of the type illustrated in
table 9.1). Once the listener hears the following consonant, then the
interpretation of the preceding vowel becomes unambiguous.
The experimental task that we will use allows us to establish how the
listener is responding to the vowel before the consonant is heard. This is the
gating task (Grosjean 1980; Tyler and Wessels 1983), in which listeners are
presented, at successive trials, with gradually incrementing information

3 This view is so taken for granted in current research on lexical access that it would be
invidious to single out any single exponent of it.


about the word being heard. At each increment they are asked to say what
word they think they are hearing, and this enables the experimenter to
determine how the listener is interpreting the sensory information presented
up to the point at which the current gate terminates. Previous research
(Marslen-Wilson 1984, 1987) shows that performance in this task correlates
well with recognition performance under more normal listening conditions.
Other research (Warren and Marslen-Wilson 1987) shows that gating
responses are sensitive to the presence of phonetic cues such as vowel
nasalization, as they become available in the speech input.4
We will use the gating task to investigate listeners' interpretations of
phonetically oral and phonetically nasal vowels for three different stimulus
sets. Two sets reflect the structure laid out in table 9.1: a set of CVC, CVN,
and CṼC triplets in Bengali, and a set of CVC and CVN doublets in English.
To allow a more direct comparison between English and Bengali, we will also
include a set of Bengali doublets, consisting of CVN and CVC pairs, where
the lexicon of the language does not contain a CṼC beginning with the same
consonant and vowel as the CVN/CVC pair. This will place the Bengali
listeners, as they hear the CVN stimuli, in the same position, in principle, as
the English listeners exposed to an English CVN. In each case, the item is
lexically unambiguous, since there are no CṼC words lexically available.
As we will lay out in more detail in section 9.3.4 below, the underlying-
representation hypothesis predicts a quite different pattern of performance
than any view of representation which includes redundant information (such
as derived nasalization). In particular, it predicts that phonetically oral
vowels will be ambiguous between CVNs and CVCs for both Bengali and
English, whereas phonetically nasal vowels will be unambiguous for both
languages but in different ways - in Bengali vowel nasalization will be
interpreted as reflecting an underlying nasal vowel followed by an oral
consonant, while in English it will be interpreted as reflecting an underlying
oral vowel followed by a nasal consonant.
Notice that these predictions for nasalized vowels also follow directly from
the cohort model's claims about the immediate uptake of information in the
speech signal - as interpreted in the context of this specific set of claims about
the content of lexical-form representations. For the Bengali case, vowel
nasalization should immediately begin to be interpreted as information
about the vowel currently being heard. For the English case, where the

4 Ohala (this volume) raises the issue of "hysteresis" effects in gating: namely, that because the
stimulus is repeated several times in small increments, listeners become locked in to particular
perceptual hypotheses and are reluctant to give them up in the face of disconfirming
information. Previous research by Cotton and Grosjean (1984) and Salasoo and Pisoni (1985)
shows that such effects are negligible, and can be dismissed as a possible factor in the current
research.

presence of nasalization cannot, ex hypothesi, be interpreted as information
about vowels, it is interpreted as constraining the class of consonants that
can follow the vowel being heard.5 This is consistent with earlier research
(Warren and Marslen-Wilson 1987) showing that listeners will select from
the class of nasal consonants in making their responses, even before the place
of articulation of this consonant is known.
Turning to a surface-representation hypothesis, this makes the same
predictions for nasalized vowels in English, but diverges for all other cases. If
the representation of CVNs in the recognition lexicon codes the fact that the
vowel is nasalized, then phonetically oral vowels should be unambiguous
(listeners should never interpret CVCs as potential CVNs) in both Bengali
and English, while phonetically nasal vowels should now be ambiguous in
Bengali - ceteris paribus, listeners should be as willing to interpret vowel
nasalization as evidence for a CVN as for a CṼC.

9.3.3 Method
9.3.3.1 Materials and design
Two sets of materials were constructed, for the Bengali and for the English
parts of the study. We will describe first the Bengali stimuli.
The primary set of Bengali stimuli consisted of twenty-one triplets of
Bengali words, each containing a CVC, a CVN, and a CṼC, where each
member of the triplet shared the same initial oral consonant (or consonant
cluster) and the same vowel (oral or nasal) but differed in the final consonant
(which was either oral or nasal). An example set is the triplet /kap/, /kam/,
/kãp/. As far as possible the place of articulation of the word-final consonant
was kept constant. The vowels [a, o, ɔ, æ, e] and their nasal counterparts were
used.
We also attempted to match the members of each triplet for frequency of
occurrence in the language. Since there are no published frequency norms for
Bengali, it was necessary to rely on the subjective familiarity judgments of a
native speaker (the first author). Judgments of this type correlate well with
objective measures of frequency (e.g. Segui et al. 1982).
The second set of Bengali stimuli consisted of twenty doublets, containing
matched CVCs and CVNs, where there was no word in the language
beginning with the same consonant and vowel, but where the vowel was a
nasal. An example is the pair /lom/, /lop/, where there is no lexical item in the
language beginning with the sequence /lõ/. The absence of lexical items with
the appropriate nasal vowels was checked in a standard Bengali dictionary
5 Ohala (this volume) unfortunately misses this point, which leads him to claim, quite wrongly,
that there is an incompatibility between the general claims of the cohort model and the
assumptions being made here about the processing of vowel nasalization in English.

(Dev 1973). As before, place of articulation of the final consonant in each
doublet was kept constant. We used the same vowels as for the triplets, with
the addition of [i] and [u].
Given the absence of nasal vowels in English, only one set of stimuli was
constructed. This was a set of twenty doublets, matched as closely as possible
to the Bengali doublets in phonetic structure. The pairs were matched for
frequency, using the Kucera and Francis (1967) norms, with a mean
frequency for the CVNs of 18.2 and for the CVCs of 23.4.
All of the stimuli were prepared in the same way for use in the gating task.
The Bengali and English stimuli were recorded by native speakers of the
respective languages. They were then digitized at a sampling rate of 20 kHz
for editing and manipulation in the Max-Planck speech laboratory.
Each gating sequence was organized as follows. All gates were set at a zero
crossing. We wanted to be able to look systematically at responses relative
both to vowel onset and to vowel offset. The first gate was therefore set, for
all stimuli, at the end of the fourth glottal pulse after vowel onset. This first
gate was variable in length. The gating sequence then continued through the
vowel in approximately 40 msec. increments until the offset of the vowel was
encountered. A gate boundary was always set at vowel offset, with the result
that the last gate before vowel offset also varied in length for different stimuli.
If the interval between the end of the last preceding gate and the offset of the
vowel was less than 10 msec. (i.e., not more than one glottal pulse), then this
last gate was simply increased in length by the necessary amount. If the
interval to vowel offset was more than 10 msec., then an extra gate of variable
length was inserted. After vowel offset the gating sequence then continued in
steady 40 msec. increments until the end of the word. Figure 9.1 illustrates
the complete gating sequence computed for one of the English stimuli.
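For concreteness, the gate-placement scheme just described can be expressed as a short routine. This is a sketch only: times are in msec, the names (place_gates, fourth_pulse_end, vowel_offset, word_end) are ours rather than anything used in the study, and the snapping of every boundary to a zero crossing is omitted.

# Sketch of the gate-placement scheme (times in msec; zero-crossing snapping omitted).
def place_gates(fourth_pulse_end, vowel_offset, word_end, step=40, min_gap=10):
    # First gate ends at the fourth glottal pulse after vowel onset.
    gates = [fourth_pulse_end]
    t = fourth_pulse_end
    while t + step < vowel_offset:        # ~40 msec steps through the vowel
        t += step
        gates.append(t)
    if vowel_offset - gates[-1] <= min_gap:
        gates[-1] = vowel_offset          # stretch the last vowel gate slightly
    else:
        gates.append(vowel_offset)        # or insert an extra, shorter gate
    t = vowel_offset                      # a boundary always falls at vowel offset
    while t + step < word_end:            # steady 40 msec steps after the vowel
        t += step
        gates.append(t)
    gates.append(word_end)                # final gate runs to the end of the word
    return gates

# e.g. place_gates(55, 230, 420) -> [55, 95, 135, 175, 215, 230, 270, 310, 350, 390, 420]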
The location of the gates for the stimuli was determined using a high-
resolution visual display, assisted by auditory playback. When gates had
been assigned to all of the stimuli, seven different experimental tapes were
then constructed. Three of these were for the Bengali triplets, and each
consisted of three practice items followed by twenty-one test items. The tapes
were organized so that each tape contained an equal number of CVCs,
CVNs, and CṼCs, but only one item from each triplet, so that each subject
heard a given initial CV combination only once during the experiment. A
further two tapes were constructed for the Bengali doublets, again with three
practice items followed by twenty test items, with members of each doublet
assigned one to each tape. The final two tapes, for the English doublets,
followed in structure the Bengali doublet tapes.
On each tape, the successive gates were recorded at six-second intervals. A
short warning tone preceded each gate, and a double tone marked the
beginning of a new gating sequence.


Figure 9.1 The complete gating sequence for the English word grade. Gate 0 marks the offset of
the vowel

9.3.3.2 Subjects and procedure


For the English materials, twenty-eight subjects were tested, fourteen for
each of the two experimental tapes. All subjects were native speakers of
British English and were paid for their participation. For the Bengali
materials, a total of sixty subjects were tested, thirty-six for the three triplet
tapes, and twenty-four for the two doublet tapes. No subject heard more
than one tape. The subjects were literate native speakers of Bengali, tested in
Calcutta. They were paid, as appropriate, for their participation.
The same testing procedure was followed throughout. The subjects were
tested in groups of two to four, seated in a quiet room. They heard the stimuli
over closed-ear headphones (Sennheiser HD222), as a binaural monophonic
signal. They made their responses by writing down their word choices (each
in their own script), with accompanying confidence rating, in the response
booklets on the desk in front of them. The booklets were organized to allow
for one gating sequence per page, and consisted of a series of numbered
blank lines, with each line terminating in the numbers 1 to 10. The number 1
was labeled "Complete Guess" and the number 10 was labeled "Very
Certain" (or the equivalent in Bengali).
The subjects began the testing session by reading a set of written
instructions, which explained the task, stressing the importance of (a) making
a response to every gating fragment and of (b) writing down a complete word
as a response every time and not just a representation of the sounds they
thought they heard - even if they felt that their response was a complete
guess. They were then questioned to make sure that they had understood the
task. The three practice items then followed. The subjects' performance was
checked after each practice sequence, to determine whether they were
performing correctly. The main test sequence then followed, lasting for 30-35
minutes.

9.3.4 Results and discussion


The subjects' gating responses were analyzed so as to provide a breakdown,
for each item, of the responses at each gate. All scoreable responses were
classified either as CVCs, CVNs, or CṼCs.
The results are very clear, and follow exactly the pattern predicted by the
underlying-representation hypothesis. Whether they are listening to phoneti-
cally nasal vowels or to oral vowels, listeners behave as if they are matching
the incoming signal against a lexical representation which codes the vowels in
CVCs and CVNs as oral and only the vowels in CṼCs as nasal. We will
consider first the Bengali triplets.

9.3.4.1 Bengali triplets


Figure 9.2 gives the mean number of different types of response (CVC, CṼC,
or CVN) to each type of stimulus, plotted across the five gates up to the offset
of the vowel (gate 0 in the figure) and continuing for five gates into the final
consonant. The top panel of figure 9.2 shows the responses to CVN stimuli,
the middle panel the responses to CṼC stimuli, and the bottom panel the
responses to CVC stimuli.
The crucial data here are for the first five gates. Once the listeners receive
unambiguous information about the final consonant (starting from gate
+ 1), then their responses no longer discriminate between alternative theories
of representation. What is important is how listeners respond to the vowel
before the final consonant. To aid in the assessment of this we also include a
statistical summary of the responses over the first five gates. Table 9.2 gives
the overall mean percentage of responses of different types to each of the
three stimulus types.
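The summary in table 9.2 (and in tables 9.3 and 9.4 below) is a straightforward tabulation; a sketch of the computation, with invented response data, is:

# Tally responses by category at each gate, then average the per-gate
# percentages over the gates up to vowel offset (demo data are invented).
from collections import Counter

def percent_by_gate(responses):
    # responses: {gate: [category, ...]} -> {gate: {category: percent}}
    out = {}
    for gate, categories in responses.items():
        counts = Counter(categories)
        total = sum(counts.values())
        out[gate] = {c: 100.0 * n / total for c, n in counts.items()}
    return out

def mean_over_gates(by_gate, gates, category):
    return sum(by_gate[g].get(category, 0.0) for g in gates) / len(gates)

demo = {-5: ["CVC"] * 8 + ["CVN"] * 2,
        -4: ["CVC"] * 7 + ["CVN"] * 3}
by_gate = percent_by_gate(demo)
print(mean_over_gates(by_gate, [-5, -4], "CVN"))   # 25.0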
Consider first the listeners' responses to the nasalized vowels in the CVNs
and CṼCs. For both stimulus types, the listeners show a strong bias towards
CṼC responses. They interpret the presence of nasalization as a cue that they
are hearing a nasal vowel followed by an oral consonant (a CṼC) and not as
signaling the nasality of the following consonant. This is very clear both from
the response distributions in figure 9.2 and from the summary statistics in
table 9.2.
This lack of CVN responses to CVN stimuli (up to gate 0) cannot be
attributed to any lack of nasalization of the vowel in these materials. The
presence of the CṼC responses shows that the vowel was indeed perceived as
nasalized, and the close parallel between the CṼC response curves for the

Figure 9.2 Bengali triplets: mean percentage of different types of response (CVC, CṼC, or
CVN) to each type of stimulus, plotted across the five gates up to offset of the vowel (gate 0) and
continuing for five gates into the consonant. The top panel gives the responses to CVN stimuli,
the middle panel the responses to CṼC stimuli, and the bottom panel plots the responses to CVC
stimuli

Table 9.2 Bengali triplets: percent responses up to
vowel offset

                   Type of response
Stimulus        CVC      CṼC      CVN

CVC             80.3      0.7     13.4
CṼC             33.2     56.8      5.2
CVN             23.5     63.0      7.9

CVN and CṼC stimuli shows that the degree of perceived nasalization was
approximately equal for both types of stimulus.
These results are problematic for a surface-representation hypothesis. On
such an account, vowel nasalization is perceptually ambiguous, and res-
ponses should be more or less evenly divided between CVNs and CṼCs. To
explain the imbalance in favor of CṼC responses, this account would have to
postulate an additional source of bias, operating postperceptually to push
the listener towards the nasal-vowel interpretation rather than the oral-
vowel/nasal-consonant reading. This becomes implausible as soon as we look
at the pattern of responses to CVC stimuli, where oral vowels are followed by
oral consonants.
Performance here is dominated by CVC responses. Already at gate -5 the
proportion of CVC responses is higher than for the CVN or CṼC stimuli,
and remains fairly steady, at around 80 percent, for the next five gates.
Consistent with this, there are essentially no CṼC responses at all. In
contrast, there is a relatively high frequency of CVN responses over the first
five gates. Listeners produce more than twice as many CVN responses, on
average, to CVC stimuli than they do to either CVN or CṼC stimuli.
This is difficult to explain on a surface-representation account. If CVNs
are represented in the recognition lexicon as containing a nasalized vowel
followed by a nasal consonant, then there should be no more reason to
produce CVNs as responses to CVCs than there is to produce CṼCs. And
certainly, there should be no reason to expect CVN responses to be
significantly more frequent to oral vowels than to nasalized vowels. On a
surface-representation hypothesis these responses are simply mistakes -
which leaves unexplained why listeners do not make the same mistake with
CṼC responses.
On the underlying-representation hypothesis, the pattern of results for
CVC stimuli follows directly. The recognition lexicon represents CṼCs as

having a nasal vowel. There is therefore no reason to make a CṼC response
when an oral, non-nasalized vowel is being heard. Both CVCs and CVNs,
however, are represented as having an oral vowel (followed in the one case by
an oral consonant and in the other by a nasal consonant). As far as the
listener's recognition lexicon is concerned, therefore, it is just as appropriate
to give CVNs as responses to oral vowels as it is to give CVCs. The exact
distribution of CVC and CVN responses to CVCs (a ratio of roughly 4 to 1)
presumably reflects the distributional facts of the language, with CVCs being
far more frequent than CVNs.

9.3.4.2 Bengali doublets


The second set of results involves the Bengali doublets. These were the
stimulus sets composed of CVCs and CVNs, where there was no CṼC in the
language that shared the same initial consonant and vowel. Figure 9.3 gives the
results across gates, showing the number of responses of different types to the
two sets of stimuli, with the CVN stimuli in the upper panel and the CVC
stimuli in the lower panel. Table 9.3 summarizes the overall mean percentage
of responses of each type for the five gates leading up to vowel closure.
Again, the results follow straightforwardly from the underlying-represen-
tation hypothesis and are difficult to explain on a surface-representation
hypothesis.
The CVC stimuli elicit the same response pattern as we found for the
triplets. There are no nasal-vowel responses, an average of over 80 percent
CVC responses, and the same percentage as before of CVN responses,
reaching nearly 15 percent. The CVN stimuli elicit a quite different response
pattern. Although the listeners again produced some CVN responses (aver-
aging 16 percent over the first five gates), they also produce a surprising
number of CṼC responses (averaging 17 percent). In fact, for gates -3 to 0,
they produce more CṼC responses than they do CVN responses. The way
they do this is, effectively, by inventing new stimuli. Instead of producing the
CVN that is lexically available, the listeners produce as responses CṼCs that
are closely related, phonetically, to the consonant-vowel sequence they are
hearing. They either produce real words, whose initial consonant or medial
vowel deviates minimally from the actual stimulus, or they invent nonsense
words.
This striking reluctance to produce a CVN response, even when the input
is apparently unambiguous, cannot be explained on a surface-representation
hypothesis. If the listener knows what a CVN sounds like, then why does he
not produce one as a response when he hears one - and when, indeed, the
lexicon of the language does not permit it to be anything else? In contrast,
this difficulty in producing a CVN response follows directly from the
underlying-representation hypothesis, where nasalization on the surface is


Figure 9.3 Bengali doublets: mean percentage of different types of response (CVC, CṼC, or
CVN) plotted across gates, gate 0 marking the offset of the vowel. The upper panel gives
responses to CVN stimuli and the lower panel the responses to CVC stimuli

Table 9.3 Bengali doublets: percent responses up to
vowel offset

                   Type of response
Stimulus        CVC      CṼC      CVN

CVC             82.6      0.0     14.7
CVN             64.7     17.0     15.6



Figure 9.4 Mean percentage of different types of responses to the two English stimulus sets
across gates, gate 0 marking the vowel offset. Responses to CVN stimuli are plotted on the upper
panel and to CVC stimuli on the lower panel

interpreted as a cue to an underlyingly nasal vowel. For the doublets, this will
mean that the listener will not be able to find any perfect lexical match, since
the CVN being heard is represented in the recognition lexicon as underly-
ingly oral, and there is no lexically available CṼC. This predicts, as we
observed here, that there should not be a large increase in CVN responses
even when the CVN is lexically unambiguous. A CṼC which diverges in some
other feature from the input will be just as good a match as the CVN, at least
until the nasal consonant is heard.

9.3.4.3 English doublets


English has no underlying nasal-vowel segments so that vowel nasalization

Table 9.4 English doublets: percent responses up to
vowel offset

                   Type of response
Stimulus        CVC      CVN

CVC             83.4     16.6
CVN             59.3     40.7

appears only as an allophonic process, with the nasal feature spreading to an
oral vowel from the following nasal consonant. This means that the
interpretation of vowel nasalization in a CVN should be completely unambi-
guous. Figure 9.4 plots the responses across gates, showing the number of
responses of different types to the two stimulus sets, with CVN stimuli in the
upper panel and CVC stimuli in the lower. Table 9.4 summarizes the overall
percentage of responses of each type for the five gates up to vowel offset.
The responses to the CVN stimuli straightforwardly follow from the
phonological and phonetic conditions under which the stimuli are being
interpreted. There is already a relatively high proportion of CVN responses
at gate -5, indicating an early onset of nasalization, and the proportion of
CVN responses increases steadily to vowel offset.6 The overall proportion
of CVN responses for these first five gates (at 41 percent) is quite similar to
the combined total of nonoral responses (33 percent) for the Bengali CVNs.
This suggests that the stimulus sets in the two languages are effectively
equivalent in degree of vowel nasalization.
The overall pattern of responses to the English CVC stimuli closely
parallels the pattern of responses for the Bengali doublets. There is the same
overall proportion of CVC responses, and the number of CVN responses, at
17 percent, is very close to the 15 percent found for the Bengali doublet
CVCs. This is, again, a pattern which follows much more naturally from an
underlying-representation hypothesis than from a surface-representation
account. If English listeners construct their representation of CVNs on the
basis of their sensory experience with the phonetic realization of CVNs, then
this representation should code the fact that these words are produced with
nasalized vowels. And if this was captured in the representation in the
recognition lexicon, then the CVN responses to phonetically oral vowels can
only be explained as mistakes.
6 Note that, contrary to Ohala (this volume), the total amount of nasalization, either at vowel
onset or vowel offset, is not at issue here. What is important is how nasalization is interpreted,
not whether or not it occurs.

Table 9.5 CVN responses to CVC stimuli: place effects across gates (percent
response)

                            Gates
                  -5    -4    -3    -2    -1     0    +1    +2

Correct place    12.0  14.5  15.5  11.5  13.5  10.0  21.0   1.5
Incorrect place   9.5   5.0   6.0   5.0   2.5   1.5   0.0   0.0

In contrast, on the underlying-representation story, the listener simply has
no basis for discriminating CVCs from CVNs when hearing an oral vowel.
The underlying representation is unspecified in terms of the feature [ +nasal].
This leads to a basic asymmetry in the information value of the presence as
opposed to the absence of nasality. When an English vowel is nasalized, this
is an unambiguous cue to the manner of articulation of the following
consonant, and as more of the vowel is heard the cue gets stronger, leading to
an increased proportion of CVN responses, as we see in figure 9.4. But this is
not because the cue of nasalization is interpreted in terms of the properties of
the vowel. It is interpreted in terms of the properties of the following
consonant, which is specified underlyingly as [ + nasal]. The processing
system seems well able to separate out simultaneous cues belonging to
different segments, and here we see a basis for this capacity in terms of the
properties of the representation onto which the speech input is being
mapped.
The absence of nasality is not informative in the same way. Hearing more
of an oral vowel does not significantly increase the number of CVC responses
or decrease the number of CVN responses. The slight drop-off we see in
figure 9.4 up to gate 0 reflects the appearance of cues to the place of
articulation of the following consonant, rather than the accumulation of cues
to orality. Some of the CVN responses produced to CVC stimuli at the earlier
gates do not share place of articulation with the CVC being heard (for
example, giving bang as a response to bad). It is these responses that drop out
as vowel offset approaches, as table 9.5 illustrates.
Table 9.5 gives the CVN responses to CVC stimuli, listed according to the
correctness of the place of articulation of the CVN response. The lack of
change in correct place responses over the five gates to vowel offset (gate 0)
emphasizes the uninformativeness of the absence of nasality. This follows
directly from the underlying-representation hypothesis, and from the proper-
ties of this representation as suggested by phonological theory. Only unpre-
dictable information is specified in the underlying representation, and since

orality is the universally unmarked case for vowels, oral vowels have no
specification along the oral/nasal dimension. The underlying representation
of the vowel in CVCs and CVNs is therefore blind to the fact that the vowel is
oral. The listener will only stop producing CVNs as responses when it
becomes clear that the following consonant is also oral.
An important aspect, finally, of the results for the English doublets is that
they provide evidence for the generality of the claims we are making here.
Despite the contrasting phonological status of nasality in the Bengali vowel
system as opposed to the English, both languages treat oral vowels in the
same way, and with similar consequences for the ways in which the speakers
of these languages are able to interpret the absence of nasality in a vowel.
Although vowel nasalization has a very different interpretation in Bengali
than in English, leading to exactly opposite perceptual consequences, the
presence of an oral vowel leads to very similar ambiguities for listeners in
both languages. In both Bengali and English, an oral vowel does not
discriminate between CVC and CVN responses.

9.4 Processing and representation of length


In the preceding section, we investigated the processing of the feature [nasal]
in two different languages, showing that the interpretation of phonetically
nasal and oral vowels patterned in the ways predicted by a phonologically
based theory of the recognition lexicon. In this section we consider the
processing and representation of a different type of segmental contrast -
length. We will be concerned specifically with consonantal length and the
contrast between single and geminate consonants. Length is not represented
by any feature [long] (see section 9.2.1); rather, the featural specifications of a
geminate and a single consonant are exactly the same. The difference lies in
the linking between the melody and the skeletal representations - dual for
geminates and single for nongeminates.
The language we will be studying is again Bengali, where nongeminate and
geminate consonants contrast underlyingly in intervocalic position, as in
/pala/ "turn" and /palla/ "scale, competition". The predominant acoustic cue
marking geminate consonants is the duration of the consonantal closure
(Lahiri and Hankamer 1988). This is approximately twice as long in geminate
as opposed to nongeminate consonants (251 vs. 129 msec. in the materials
studied by Lahiri and Hankamer, and 188 vs. 76 msec. in the sonorant liquids
and nasals used in this study). As we discussed earlier, spectral information
like nasality is interpreted with respect to the appropriate representation as
soon as it is perceived. For geminates, the involvement of the skeleton along
with the melody predicts a different pattern of listener responses. Unlike
nasality, interpretation of surface duration will depend not on the lexically


marked status of the feature but on the listener's assessment of the segment
slots and therefore of the prosodic structure.
This means that information about duration will not be informative in the
same way as spectral information. For example, when the listener hears
nasal-murmur information during the closure of the geminate consonant in a
word like [kanna], the qualitative information that the consonant is a nasal
can be immediately inferred. But the quantity of the consonant (the fact that
it is geminated) will not be deduced - even after approximately 180 msec. of
the nasal murmur - until after the point of release of the consonant, where
this also includes some information about the following vowel.
Since geminates in Bengali are intervocalic, duration can only function as a
cue when it is structurally (or prosodically) plausible - in other words, when
there is a following vowel. If, in contrast, length had been marked as a feature
on the consonant in the same way as standard melodic features, then
duration information ought to function as a cue just like nasality. Instead, if
quantity is independently represented, then double-consonant responses to
consonantal quantity will only emerge when there is positive evidence that
the appropriate structural environment is present.
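The structural claim can be made concrete with a toy rendering of the melody-to-skeleton linking (our own illustration, not the authors' notation; the forms are the /pala/-/palla/ pair cited above). The two words share one melody, and only the number of timing slots linked to the medial consonant differs, so nothing like a feature [long] is involved.

# Toy melody-to-skeleton linkings: "links" lists, for each melodic segment,
# the skeletal (X) slots it occupies. Forms and slot indices are illustrative.
pala  = {"melody": ["p", "a", "l", "a"],
         "links":  [[0], [1], [2], [3]]}       # CVCV: singly linked [l]
palla = {"melody": ["p", "a", "l", "a"],
         "links":  [[0], [1], [2, 3], [4]]}    # CVC:V: [l] linked to two slots

def skeleton_length(form):
    # Number of timing slots implied by the linking.
    return 1 + max(slot for slots in form["links"] for slot in slots)

print(skeleton_length(pala), skeleton_length(palla))   # 4 5
# Identical melodies, different skeletons: closure duration can be assigned
# to the second slot only once the following vowel shows that the bisyllabic
# CVC:V structure is in fact present.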

9.4.1 Method
The stimuli consisted of sixteen matched bisyllabic geminate/
nongeminate pairs, differing only in their intervocalic consonant (e.g. [pala]
and [palla]). The contrasting consonants consisted of eight [l]s, seven [n]s,
and one [m]. The primary reason for choosing these sonorant consonants was
that the period of consonantal closure was not silent (as, for example, in
unvoiced stops), but contained a continuous voiced murmur, indicating to
the listener the duration of the closure at any gate preceding the consonantal
release at the end of the closure. The geminates in these pairs were all
underlying geminates. We attempted to match the members of each pair for
frequency, again relying on subjective familiarity judgments.
The stimuli were prepared for use in the gating task in the same way as in
the first experiment. The gating sequence was, however, organized differ-
ently. Each word had six gates (as illustrated in figure 9.5). The first gate
consisted of the CV plus 20 msec. of the closure. The second gate included an
extra 40 msec. of consonantal-closure information. The third gate was set at the
end of the closure, before any release information was present. This gate
differed in length for geminate and nongeminates, since the closure was more
than twice as long for the geminates. It is at this gate, where listeners have
heard a closure whose duration far exceeds any closure that could be
associated with a nongeminated consonant, that we would expect geminate
responses if listeners were sensitive to duration alone. The fourth gate


[Waveform panels for kana and kanna with their gate boundaries marked; viewpoint width 550 msec.; 50 msec. time bar]

Figure 9.5 The complete gating sequence for the Bengali pair kana and kanna

included the release plus two glottal pulses - enough information to indicate
that there was a vowel even though the vowel quality was not yet clear. The
fifth gate contained another four glottal pulses - making the identity of the
vowel quite clear. The sixth and last gate included the whole word.
Two tapes were constructed, with each tape containing one member of
each geminate/nongeminate pair, for a total of eight geminates and eight
nongeminates. Three practice items preceded the test items. A total of
twenty-eight subjects were tested, fourteen for each tape. No subject heard
more than one tape. The subjects were literate native speakers of Bengali,
tested in Calcutta. The testing procedure was the same as before (see section
9.3.3), except for the fact that here the subjects were instructed to respond
with bisyllabic words. Bisyllabic responses allowed us to determine whether


listeners interpreted quantity information at each gate as geminate or
nongeminate.

9.4.2 Results and discussion


The questions that we are asking in this research concern the nature of the
representation onto which durational information is mapped. Do listeners
interpret consonant length as soon as the acoustic information is available to
them (like nasality or other quality cues) or do they wait till they have
appropriate information about the skeletal structure? To answer this ques-
tion, all scoreable responses for the geminate/nongeminate stimuli pairs were
classified as CVCV (nongeminate) or CVC:V (geminate). Two pairs of words
were discarded at this point. In one pair, the third gate edge had been
incorrectly placed for one item, and for the second pair, the initial vowel of
the nongeminate member turned out to be different for some of the subjects.
The subsequent analyses are based on fourteen geminate/nongeminate pairs.
In figure 9.6, we give the mean percentage of geminate responses to the
geminate and nongeminate pairs plotted across the six gates. The crucial data
here are for the third and fourth gates. At the third gate, the entire closure
information was present both for the CVCV (nongeminate) words as well as
the CVC:V words (geminates). We know from earlier experiments (Lahiri
and Hankamer 1988) that cross-splicing the closure from geminates to
nongeminates and vice versa, with all other cues remaining constant, can
change the listeners' percept. The duration of closure overrides any other cue
in the perception of geminates. One might expect that at the third gate the
listeners may use this cue and respond with geminates. But that is not the
case. There is still a strong preference for nongeminate responses (83 percent
nongeminate vs. 17 percent geminate) even though listeners at this gate are
hearing closures averaging 170 msec. in duration, which could never be
associated with a nongeminate consonant. Instead, durational information is
not reliably interpreted till after the fourth gate, where they hear the release
and two glottal pulses of the vowel, indicating that this is a bisyllabic word.
Note that although listeners were reluctant to give bisyllabic geminated
words as responses until gate 4, the same restrictions did not apply to
bisyllabic words which had intervocalic consonant clusters, with the second
member being an obstruent. Four of the geminate/nongeminate pairs used
here had competitors of this type - for example, the word [palki], sharing its
initial syllable with [palla] and [pala]. For this subset of the stimuli, we find
that in gate 3 of the geminates (with all the closure information available),
CVCCV responses average 45 percent - as opposed to only 16 percent
CVC:V responses. Listeners prefer to interpret the speech input up to this


Figure 9.6 Mean percentage of geminate responses for the geminate and nongeminate stimuli
plotted across the six gates

point as a single linked C either followed by a vowel (CVCV) or by a
consonant (CVCCV). Dually linked Cs (i.e. geminates) are reliably given as
responses only after there is information to indicate that there is a following
vowel - at which point, of course, CVCCV responses drop out completely.
The infrequency of geminated responses at gate 3 - and the frequency of
CVCV and CVCCV (when available) responses - is compatible with the
hypothesis that the speech input is interpreted correctly with respect to the
melodic information, and that bisyllabic words given in response contain a
separate melodic segment. The [pal] part of [palla] is matched to all sequences
with [pal] including the geminate. But the information available to the
listener at that point is that it is an [1], and the word candidates that are more
highly activated are those that can follow a single [1], namely a vowel or
another consonant - that is, another singly linked segment like [palki] or
[pala], leading to the observed preference for nongeminate words as res-
ponses. Even if the closure is long and ought to trigger geminate responses,
the melodic input available is incompatible with the available structural
information - a geminate [l] evidently cannot be interpreted as such unless
both the melodic and the structural information are made available. But as
soon as it is clear that there is a following vowel, then the acoustic input can
be correctly interpreted.


9.5 Conclusions
In this paper we have sketched the outline of a psycholinguistic model of the
representation and processing of lexical form, combining the cohort model of
lexical access and selection with a set of assumptions about the contents of
the recognition lexicon that derive from current phonological theory. The
joint predictions of this model, specifying how information in the speech
signal should be interpreted relative to abstract, multilevel, underspecified,
lexical representations, were evaluated in two experiments. The first of these
investigated the processing of a melodic feature, under conditions where its
phonological status varied cross-linguistically. The second investigated the
interpretation of quantity information in the signal, as a function of cues to
the structural organization of the word being heard. In each case, the
responses patterned in a way that was predicted by the model, suggesting that
the perceptually relevant representations of lexical form are indeed functio-
nally isomorphic to the kinds of representations specified in phonological
theory.
If this conclusion is correct, then there are two major sets of implications.
Where psycholinguistic models are concerned, this provides, perhaps for the
first time, the possibility of a coherent basis for processing models of lexical
access and selection. These models cannot be made sufficiently precise
without a proper specification of the perceptual targets of these processes,
where these targets are the mental representations of lexical form. Our
research suggests that phonological concepts can provide the basis for a
solution to this problem.
The second set of implications are for phonological theory. If it is indeed
the case that the contents of the mental-recognition lexicon can, in broad
outline, be characterized in phonological terms, then this suggests a much
closer link between phonological theory and experimental research into
language processing than has normally been considered by phonologists (or
indeed by psychologists). Certainly, if a phonological theory has as its goal
the making of statements about mental representations of linguistic know-
ledge, then the evaluation of its claims about the properties of these
representations will have to take into account experimental research into the
properties of the recognition lexicon.

Comments on chapter 9
JOHN J. OHALA
I applaud Lahiri and Marslen-Wilson for putting underspecification and
other phonological hypotheses into the empirical arena.* I share their view
that phonological hypotheses that purport to explain how language is
represented in the brain of the speaker have to be evaluated via psycho-
linguistic tests. It is a reasonable proposal that the way speakers recognize
heard words should be influenced by the form of their lexical representation,
about which underspecification has made some very specific claims.
Still, this paper leaves me very confused. Among its working hypotheses
are two that seem to me to be absolutely contradictory. The first of these is
that word recognition will be performed by making matches between the
redundant input form and an underlying lexical form which contains no
predictable information. The second is that the match is made between the
input form and a derived form which includes surface redundancies. Previous
work by Marslen-Wilson provides convincing support for the second
hypothesis. He showed that as soon as a lexically distinguishing feature
appeared in a word, listeners would be able to make use of it in a word-
recognition task; e.g., the nasalization of the vowel in drown could be used to
identify the word well before the actual nasal consonant appeared. Similar
results had earlier been obtained by Ali et al. (1971). Such vowel nasalization
is regarded by Lahiri and Marslen-Wilson as predictable and therefore not
present in the underlying representation.
In the Bengali triplet experiment the higher number of CVN responses
(than CṼC responses) to CVC stimuli is attributed to the vowel in CVN's
underlying representation being unspecified for [nasal], i.e., that to the
listener the oral vowel in CVC could be confused with the underlying "oral"
vowel in CVN. A similar account is given for the results of the Bengali
doublet experiment (the results of which are odd since they include responses
for CṼC even though there were supposedly no words of this sort in the
language). On the other hand, the progressive increase in the CVN responses
to English CVN stimuli is interpreted as due to listeners being able to take
advantage of the unambiguous cue which the redundant vowel nasalization
offers to the nature of the postvocalic consonant. Whatever the results, it
seems, one could invoke the listener having or not having access to
predictable surface features of words to explain them.
Actually, given the earlier results of Ali et al. that a majority of listeners,
hearing only the first half of a CVC or CVN syllable, could nevertheless
discriminate them, Lahiri and Marslen-Wilson's results in the English

*I thank Anne Cutler for bibliographic leads.


experiment (summarized in fig. 9.4) are puzzling. Why did CVN responses
exceed the CVC responses only at the truncation point made at the VN
junction? One would have expected listeners to be able to tell the vowel was
nasalized well before this point. I suspect that part of the explanation lies in
the way the stimuli were presented, i.e. the shortest stimulus followed by the
next shortest, and so on through to the longest and least truncated one. What
one is likely to get in such a case is a kind of "hysteresis" effect, where
subjects' judgments on a given stimulus in the series is influenced by their
judgment on the preceding one. This effect has been studied by Frederiksen
(1967), who also reviews prior work. In the present case, since a vowel in a
CVN syllable is least nasalized at the beginning of the vowel, listeners' initial
judgments that the syllable is not one with a nasalized vowel are reasonable,
but then the hysteresis effect would make them retain that judgment even
though subsequent stimuli present more auditory evidence to the contrary.
Whether this is the explanation for the relatively low CVN responses is easily
checked by doing this experiment with the stimuli randomized and
unblocked, although this would require other changes in the test design so
that presentation of the full stimulus would not bias judgments on the
truncated stimuli. This could be done in a number of ways, e.g. by presenting
stimuli with different truncations to different sets of listeners.
Lahiri and Marslen-Wilson remark that the greater number of CṼC over
CVN responses to the CVN stimuli in the Bengali triplet study
are problematic for a surface-representation hypothesis. On such an account, vowel
nasalization is perceptually ambiguous, and responses should be more or less evenly
divided between CVNs and CṼCs. To explain the imbalance in favor of CṼC
responses, this account would have to postulate an additional source of bias,
operating postperceptually.

Such a perceptual bias has been previously discussed: Ohala (1981b, 1985a,
1986, forthcoming) and Ohala and Feder (1987) have presented phonological
and phonetic evidence that listeners are aware of predictable cooccurrences
of phonetic events in speech and react differently to predictable events than
unpredictable ones. If nasalization is a predictable feature of vowels adjacent
to nasal consonants, it is discounted in that it is noticed less than that on
vowels not adjacent to nasal consonants. Kawasaki (1986) has presented
experimental evidence supporting this. With a nasal consonant present at the
end of a word, the nasalization on the vowel is camouflaged. When Lahiri
and Marslen-Wilson's subjects heard the CVN stimuli without the final N
they heard nasalized vowels uncamouflaged and would thus find the nasali-
zation more salient and distinctive (since the vowel nasalization cannot be
"blamed" on any nearby nasal). They would therefore think of CṼC stimuli
first. This constitutes an extension to a very fine-grained phonetic level of the

same kind of predictability that Warren (1970) demonstrated at lexical,
syntactic, and pragmatic levels. Even though listeners may show some bias in
interpreting incoming signals depending on the redundancies that exist
between various elements of the signal, it does not seem warranted to jump to
the conclusion that all predictable elements are unrepresented at the deepest
levels.
The result in the geminate experiment, that listeners favored responses
with underlying geminates over derived geminates, was interpreted by Lahiri
and Marslen-Wilson as due to listeners opting preferentially for a word with
a lexical specification of gemination (and thus supporting the notion that
derived geminates, those that arose from original transmorphemic heterorga-
nic clusters, e.g. -rl- > -ll-, have a different underlying representation).
However, there is another possible interpretation for these data that relies
more on morphology than phonology. Taft (1978, 1979) found that upon
hearing [deiz] subjects tended to identify the word as daze rather than days,
even though the latter was a far more common word. He suggested that
uninflected words are the preferred candidates for word identification when
there is some ambiguity between inflected and uninflected choices. Giinther
(1988) later demonstrated that, rather than being simply a matter of inflected
vs. uninflected, it was more that base forms are preferred over oblique forms.
By either interpretation, a Bengali word such as /palla/ "scale" would be the
preferred response over the morphologically complex or oblique form /pallo/
"to be able to (third-person past)."
At present, then, I think the claims made by underspecification theory and
metrical phonology about the lexical representation of words are unproven.
Commendably, Lahiri and Marslen-Wilson have shown us in a general way
how perceptual studies can be brought to bear on such phonological issues;
with refinements, I am sure that the evidential value of such experiments can
be improved.

Comments on chapter 9
CATHERINE P. BROWMAN
Lahiri and Marslen-Wilson's study explores whether the processes of lexical
access and selection proceed with respect to underlying abstract phonological
representations or with respect to surface forms. Using a gating task, two
contrasts are explored: the oral/nasal contrast in English and Bengali, and
the single/geminate contrast in Bengali. Lahiri and Marslen-Wilson argue
that for both contrasts, it is the underlying form (rather than the surface
form) that is important in explaining the results of the gating task, where the

underlying forms are assumed to differ for the two contrasts, as follows
(replacing their CV notation with X notation):
(1) Bengali oral/nasal contrast:

        V N        V C        Ṽ C
        X X        X X        X X
          |                   |
       [+nas]              [+nas]

(2) Bengali single/geminate contrast:

        C:         C
        / \        |
        X X        X

The English contrast is like the Bengali, but is only two-way (there are no
lexically distinctive nasalized vowels in English). On the surface, the vowel
preceding a nasal consonant is nasalized in both Bengali and English.
The underlying relation between the x-tier and the featural tiers is the same
for the nasalized vowel and the singleton consonant - in both cases, the
relevant featural information is associated with a single x (timing unit). This
relation differs for the nasal consonant and the geminate consonant - the
relevant featural information ([nas]) for the nasal consonant is associated
with a single x, but the featural information for the geminate is associated
with two timing units. However, this latter assumption about underlying
structure misses an important generalization about similarities between the
behaviour, on the gating task, of the nasals and the geminates. These
similarities can be captured if the nasalization for the nasal consonant is
assumed to be underlyingly associated with two timing units, as in (3), rather
than with a single timing unit, as in (1).
(3) Proposed Bengali oral/nasal contrast:

        V N        V C        Ṽ C
        X X        X X        X X
         \|                   |
       [+nas]              [+nas]

There is some articulatory evidence that is suggestive as to similarity
between syllable-final nasals and (oral) geminates. Krakow (1989) demon-


[Panels (a) and (b): schematic velum, tongue, and speech-envelope traces for a nasal consonant (N) and a geminate consonant (C:), respectively]

Figure 9.7 Schematic relations among articulations and speech envelope, assuming a consonant
that employs the tongue: (a) nasal consonant preceded by nasalized vowel (schematic tongue
articulation for consonant only); (b) geminate consonant

strates that, in English, the velopharyngeal port opens as rapidly for syllable-
final nasals as for syllable-initial nasals, but is held open much longer
(throughout the preceding vowel). This is analogous to the articulatory
difference between Finnish singleton and geminate labial consonants, cur-
rently being investigated at Yale by Margaret Dunn.
If the nasal in VN sequences is considered to be an underlying geminate,
then the similar behavior of the nasals and oral geminates on the gating task
can be explained in the same way: lexical geminates are not accessed until
their complete structural description is met. The difference between nasal


(VN) and oral (C:) geminates then lies in when the structural description is
met, rather than in their underlying representation. In both cases, it is the
next acoustic event that is important. As can be seen in figure 9.7, this event
occurs earlier for VN than for Ci. For VN, it is clear that the nasal is a
geminate when the acoustic signal changes from the nasalized vowel to the
nasal consonant, whereas for C:, it is not until the following vowel that the
acoustic signal changes.
The preference for oral responses during the vowel of the VN in the
doublets experiment would follow from the structural description for the
nasal geminate not yet being met, combined with a possible tendency for
"oral" vowels to be slightly nasalized (Henderson 1984, for Hindi and
English). This interpretation would also predict that the vowel quality should
differ in oral responses to VN stimuli, since the acoustic information
associated with the nasalization should be (partially) interpreted as vowel-
quality information rather than as nasalization.

10

The descriptive role of segments:


evidence from assimilation

FRANCIS NOLAN

10.1 Introduction
Millennia of alphabetic writing can leave little doubt as to the utility of
phoneme-sized segments in linguistic description.* Western thought is so
dominated by this so successful way of representing language visually that
the linguistic sciences have tended to incorporate the phoneme-sized segment
(henceforth "segment") axiomatically. But, as participants in a laboratory-
phonology conference will be the last to need reminding, the descriptive
domain of the segment is limited.
All work examining the detail of speech performance has worried about
the relation between the discreteness of a segmental representation, on the
one hand, and, on the other, the physical speech event, which is more nearly
continuous and where such discrete events as can be discerned may corres-
pond poorly with segments. One response has been to seek principles
governing a process of translation1 between a presumed symbolic, segmen-
tal representation as input to the speech-production mechanism, and the
overlapping or blended activities observable in the speech event.
In phonology, too, the recognition of patternings which involve domains
other than the segment, and the apparent potential for a phonetic component
to behave autonomously from the segment(s) over which it stretches, have
been part of the motivation for sporadic attempts to free phonological
description from a purely segmental cast. Harris (1944), and Firth (1948),
may be seen as forerunners to the current very extensive exploration of the
effects of loosening the segmental constraints on phonological description
under the general heading of autosegmental phonology. Indeed, so successful
has this current phonological paradigm been at providing insights into
certain phonological patterns that its proponents might jib at its being called
a "sporadic attempt."

*The work on place assimilation reported here, with the exception of that done by Martin Barry,
was funded as part of grants C00232227 and R000231056 from the Economic and Social
Research Council, and carried out by Paul Kerswill, Susan Wright, and Howard Cobb,
successively. I am very grateful to the above-named for their work and ideas; they may not, of
course, be fully in agreement with the interpretations given in this paper.
1 The term "translation" was popularized by Fowler (e.g. Fowler et al. 1980). More traditionally,
the process is called "(phonetic) implementation."
Assimilation is one aspect of segmental phonology which has been
revealingly represented within autosegmental notation. Assimilation is taken
to be where two distinct underlying segments abut, and one "adopts"
characteristics of the other to become more similar, or even identical, to it, as
in cases such as [gri:m peint] green paint, [reg ka:] red car, [bæd̪ θɔ:ts] bad
thoughts. A purely segmental model would have to treat this as a substitution
of a complete segment. Most modern phonology, of course, would treat the
process in terms of features. Conventionally this would mean an assimilation
rule of the following type:2

(1)   [ +coronal    ]       [ α coronal  ]           [ α coronal  ]
      [ +anterior   ]  →    [ β anterior ]   /  ___  [ β anterior ]
      [ -continuant ]       [ γ distr    ]           [ γ distr    ]

Such a notation fails to show why certain subsets of features, and not other
subsets, seem to operate in unison in such assimilations, thus in this example
failing to capture the traditional insight that these changes involve assim-
ilation of place of articulation. The notation also implies an active matching
of feature values, and (to the extent that such phonological representations
can be thought of as having relevance to production) a repeated issuing of
identical articulatory "commands."
A more attractive representation of assimilation can be offered within an
essentially autosegmental framework. Figure 10.1, adapted from Clements
(1985), shows that if features are represented as being hierarchically
organized on the basis of functional groupings, and if each segmental "slot"
in the time course of an utterance is associated with nodes at the different
levels of the hierarchy, then an assimilation can be represented as the deletion
of an association to one or more lower nodes and a reassociation to the
equivalent node of the following time slot. The hierarchical organization of
the features captures notions such as "place of articulation"; and the
autosegmental mechanism of deletion and reassociation seems more in tune
with an intuitive conception of assimilation as a kind of programmed "short-
cut" in the phonetic plan to save the articulators the bother of making one
part of a complex gesture.

Figure 10.1 Multitiered, autosegmental representation of assimilation of place of articulation
(after Clements 1985: 237). The tiers shown are the timing, root, laryngeal, supralaryngeal, and
place tiers.

2 This formulation is for exemplification only, and ignores the question of the degree of
optionality of place assimilation in different contexts.
But even this notation, although it breaks away from a strict linear
sequence of segments, bears the hallmark of the segmental tradition. In
particular, it still portrays assimilation as a discrete switch from one subset of
segment values to another. How well does this fit the facts of assimilation?
Unfortunately, it is not clear how much reliance we can place on the "facts"
as known, since much of what is assumed is based on a framework of
phonetic description which itself is dominated by the discrete segment.
This paper presents findings from work carried out in Cambridge aimed at
investigating assimilation experimentally. In particular, the experiments
address the following questions:
1 Does articulation mirror the discrete change implied by phonetic and
phonological representations of assimilation?
2 If assimilation turns out to be a gradual process, how is the articulatory
continuum of forms responded to perceptually?
In the light of the experimental findings, the status of the segment in phonetic
description is reconsidered, and the nature of representation input to the
production mechanism discussed.

10.2 The articulation of assimilation


The main tool used in studying the articulatory detail of assimilation has
been electropalatography (EPG). This involves a subject wearing an artificial
palate in which are embedded sixty-two electrodes. Tongue contact with any
electrode is registered by a computer, which stores data continuously as the
subject speaks. "Frames" of data can then be displayed in the form of a plan
of the palate with areas of contact marked. Each frame shows the pattern of
contact for a 1/100 second interval. For more details of the technique see
Hardcastle (1972).
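
For readers unfamiliar with the form such data take, the sketch below gives a rough illustration
in Python. It is not part of Nolan's description: the eight-row layout (six electrodes in the
frontmost row, sixty-two in all) is an assumption about a standard artificial palate, and the
display conventions are invented.

# Each 1/100 s frame is a small boolean grid; '0' marks a contacted electrode.
ROW_SIZES = [6, 8, 8, 8, 8, 8, 8, 8]   # assumed layout: alveolar ridge on top

def print_frame(frame, label=""):
    """Display one frame as a plan of the palate ('0' = contact, '.' = none)."""
    if label:
        print(label)
    for row, size in zip(frame, ROW_SIZES):
        print("".join("0" if contact else "." for contact in row[:size]).center(8))

# A toy frame with a seal across the alveolar ridge and lateral contact down
# the sides, roughly what a canonical [t] or [d] closure would register.
toy_frame = [[True] * ROW_SIZES[0]] + [
    [i in (0, n - 1) for i in range(n)] for n in ROW_SIZES[1:]
]
print_frame(toy_frame, "frame 0001")
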
The experimental method in the early EPG work on assimilation, as
reported for instance in Kerswill (1985) and Barry (1985), exploited (near)
minimal pairs such as . . . maid couldn't... and . . . Craig couldn't... In these
the test item (maid couldn't) contains, lexically, an alveolar at a potential
place-of-articulation assimilation site, that is, before a velar or labial, and the
control item contains lexically the relevant all-velar (or all-labial) sequence.
It immediately became obvious from the EPG data that the answer to
question 1 above, whether articulation mirrors the discrete change implied by
phonetic and phonological representations of assimilation, is "no." For
tokens with lexical alveolars, a continuum of contact patterns was found,
ranging from complete occlusion at the alveolar ridge to patterns which were
indistinguishable from those of the relevant lexical nonalveolar sequence.
Figure 10.2 shows, for the utterance . . . late calls... spoken by subject WJ,
a range of degrees of accomplishment of the alveolar occlusion, and for
comparison a token by WJ of the control utterance . . . make calls . . . In all
cases just the medial consonant sequence and a short part of the abutting
vowels are shown. In each numbered frame of EPG data the alveolar ridge is
at the top of the schematic plan of the palate, and the bottom row of
electrodes corresponds approximately to the back of the hard palate. Panel
(a) shows a complete alveolar occlusion (frames 0159 to 0163). Panels (b) and
(c) show tokens where tongue contact extends well forward along the sides of
the palate, but closure across the alveolar ridge is lacking. Panel (d) shows the
lexical all-velar sequence in make calls.
Figure 10.3 presents tokens spoken by KG of . . . boat covered... and
. . . oak cupboard... In (a) there is a complete alveolar closure; almost
certainly the lack of contact at the left side of the palate just means that the
stop is sealed by this speaker, in this vowel context, rather below the leftmost
line of electrodes, perhaps against the teeth. The pattern in (b) is very similar,
except that the gesture towards the alveolar ridge has not completed the
closure. In (c), however, although it is a third token of boat covered, the
pattern is almost identical to that for oak cupboard in (d). In particular, in
both (c) and (d), at no point in the stop sequence is there contact further
forward than row 4.
Both Barry (1985) and Kerswill (1985) were interested in the effect of
speaking rate on connected-speech processes (CSPs) such as assimilation,
and their subjects were asked to produce their tokens at a variety of rates. In
general, there was a tendency to make less alveolar contact in faster tokens,
but some evidence also emerged that when asked to speak fast but "care-
fully," speakers could override this tendency. However, the main point to be

264
10 Francis Nolan

>166 0167 0168

(a) ...latecalls...

.00 0 00 00. .

.00 0000.000 0000

0378 0379 0380 0381 0382 0383 0384 0385 0386 0387 0388 0389

(b) ...latecalls...

0 00...0 0 00...0

.. .0 0. . . 0 0 0. o " " • •

0 0 0 0 0. . . . . .0 0. . . . . .0 0 ....00 0. . . . . 0 0 0. . . . . 0 0 0 0 . . . . 0 00. . . . o o O : : : :

0
.
°
0
°
0 00 0

0570 0571 0572

0 0

(c) ...latecalls...
.180 1181 1182

(d) ...make calls...

Figure 10.2 EPG patterns for consonant sequences (subject WJ)

265
Segment

(a) ...boatjpvered..

(b) ...boat_covered...

101 0902 0903

(c) ...boat_cpvered...

0 00 0 0 . . . . 0 0

(d) ...oakcupboard...
Figure 10.3 EPG patterns for consonant sequences (subject KG)

266
10 Francis Nolan

made here is that a continuum of degrees of realization exists for an alveolar


in a place-assimilation context.
The problem arises of how to quantify the palatographic data. The
solution has been to define four articulation types. In this paper these will be
referred to as full alveolar, residual alveolar, zero alveolar, and nonalveolar.
The first three terms apply to tokens with a potentially assimilable alveolar in
their lexical form, and the last term indicates tokens of a control item with an
all-velar or all-labial sequence in its lexical form. A full-alveolar token shows
a complete closure across one or more rows near the top of the display. A
residual-alveolar token lacks median closure at the top of the display, but
shows contact along the sides of the palate a minimum of one row further
forward than in the control non-alveolar token for the same rate by the same
speaker. A zero-alveolar token shows no contact further forward than the
appropriate nonalveolar control.3
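
These criteria amount to a simple decision procedure, sketched below in Python. The sketch is
illustrative only: frames are taken to be lists of rows of booleans with the alveolar ridge as row 0,
and the choice of the top two rows as "near the top of the display" is an assumption rather than
the authors' operational definition.

def frontmost_contact_row(frames):
    """Most forward row (0 = alveolar ridge) showing any contact anywhere in
    the token; None if the token shows no contact at all."""
    rows = [r for frame in frames for r, row in enumerate(frame) if any(row)]
    return min(rows) if rows else None

def has_full_closure(frames, top_rows=2):
    """True if some frame shows a complete seal across one of the top rows."""
    return any(all(frame[r]) for frame in frames for r in range(top_rows))

def classify_token(test_frames, control_frames):
    """'full', 'residual' or 'zero' alveolar for a lexical-alveolar token,
    judged against the matching nonalveolar control token.  (The requirement
    that residual contact lie along the sides of the palate is not encoded.)"""
    if has_full_closure(test_frames):
        return "full alveolar"
    test_front = frontmost_contact_row(test_frames)
    control_front = frontmost_contact_row(control_frames)
    if (test_front is not None and control_front is not None
            and test_front <= control_front - 1):
        return "residual alveolar"   # contact at least one row further forward
    return "zero alveolar"
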
To give an indication of the distribution of these articulation types in
different speaking styles, Table 10.1 summarizes data from Barry (1985) and
Kerswill (1985). On the left are shown the number of occurrences in normal
conversational reading and in fast reading of the three alveolar articulation-
types. The data is here averaged over three EPG subjects who exhibited
rather different "centers of gravity" on the assimilatory continuum, but all of
whom varied with style. Each speaker produced two tokens of each of eight
sentences containing place-assimilation sites, plus controls. A trend towards
more assimilation in the fast style is evident. On the right are shown the
number of occurrences of the three articulation-types for Kerswill's speaker
ML, who read in four styles: slowly and carefully, normally, fast, and then
fast but carefully. The speaker produced in each style twenty assimilable
forms. A clear pattern emerges of increasing assimilation as the style moves
away from slow careful, which was confirmed statistically by Kerswill using
Kendall's tau nonparametric correlation test (tau = 0.456, p< 0.001).
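
The correlation Kerswill reports can be reproduced in outline with standard tools. The sketch
below is hypothetical Python using scipy; the integer coding of styles and articulation types,
and the toy token list, are assumptions of mine, not ML's data.

# Kendall's tau between speaking style (ordered from careful to fast) and
# articulation type (ordered from full to zero alveolar), over individual tokens.
from scipy.stats import kendalltau

STYLES = {"slow and careful": 0, "normal": 1, "fast and careful": 2, "fast": 3}
TYPES = {"full alveolar": 0, "residual alveolar": 1, "zero alveolar": 2}

tokens = [
    ("slow and careful", "full alveolar"),
    ("slow and careful", "residual alveolar"),
    ("normal", "residual alveolar"),
    ("normal", "zero alveolar"),
    ("fast and careful", "zero alveolar"),
    ("fast", "zero alveolar"),
]

style_codes = [STYLES[s] for s, _ in tokens]
type_codes = [TYPES[t] for _, t in tokens]

tau, p_value = kendalltau(style_codes, type_codes)
print(f"tau = {tau:.3f}, p = {p_value:.3f}")
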
There are, of course, limits on what can be inferred from EPG data.
Because it records only tongue contact, it gives no information on possible
differences in tongue position or contour short of contact. This means, in
particular, that it is premature at this stage to conclude that even zero-
alveolar tokens represent a complete assimilation of the tongue gesture of the
first consonant to that of the second. X-ray filming, perhaps supplemented by
electromyography of tongue muscles, might shed light on whether, even in
EPG-determined zero alveolars, there remains a trace of the lexical alveolar.
3
I have in the past (e.g. Nolan 1986) used the term "partial assimilation" to refer to cases where
residual-alveolar contact is evident. I now realize that phonologists use "partial assimilation"
to indicate e.g. [gk] as the output of the assimilation of [dk], in contrast to the "completely"
assimilated form [kk]. I shall therefore avoid using the term "partial assimilation" in this
paper.

Table 10.1 EPG analysis of place assimilation: number of occurrences of
articulation types in different speaking styles. On the left, the average of three
speakers (Barry 1985: table 2) speaking at normal conversational speed and
fast; and on the right speaker ML (Kerswill 1985: table 2) speaking slowly and
carefully, normally, fast but carefully, and fast

                       Barry (3 speakers)   Kerswill speaker ML
                       Normal   Fast        Slow and   Normal   Fast and   Fast
                                            careful             careful
Full alveolar            23      15            10         2         3        0
Residual alveolar        14      15             5         8         3        2
Zero alveolar            11      18             5        10        14       18

These techniques are outside the scope of the present projects; but the
possibility of alveolar traces remaining in zero alveolars is approached via an
alternative, perceptual, route in section 10.3.
Despite these limitations, these experiments make clear that the answer to
question 1 is that assimilation of place of articulation is, at least in part, a
gradual process rather than a discrete change.

10.3 The perception of assimilation


Since assimilation of place of articulation has been shown, in articulatory
terms, to be gradual, question 2 above becomes relevant: how is the
articulatory continuum of forms responded to perceptually? For instance,
when does the road collapsed begin consistently to be perceived, if ever, as the
rogue collapsed? Is a residual alveolar sufficient to cue perception of a lexical
alveolar? Can alveolars be identified even in the absence of any signs of them
on the EPG pattern?
An ideal way to address this issue would be to take a fully perfected,
comprehensive, and tractable articulatory synthesizer; create acoustic stimuli
corresponding to the kinds of articulation noted in the EPG data; and elicit
responses from listeners. Unfortunately, as far as is known, articulatory
synthesizers are not yet at a stage of development where they could be relied
on to produce such stimuli with adequate realism; nor, furthermore, is the
nature of place assimilation well enough understood yet at the articulatory
level for it to be possible to provide the data to control the synthesizer's
parameters with confidence that human activity was being simulated accur-
ately. An alternative strategy was therefore needed.
The strategy was to use stimuli produced by a human speaker. All
variables other than the consonant sequence of interest needed to be
controlled, and, since there is a strong tendency for the range of forms of
interest to be associated with rate changes, it was decided to use a phonetician rather than a naive speaker to produce the data. The phonetician was
recorded reading a set of "minimal-pair" sentences. The discussion below
focuses on the sentences containing a lexical /d/ at the assimilation site, and
these are given in table 10.2 with their paired nonalveolar controls. Each
sentence was read many times with perceptually identical rate, rhythm, and
prosody, but with a variety of degrees of assimilation (in the case of the
lexical alveolar sentences). In no case was the first consonant of a sequence
released orally. The utterances were recorded not only acoustically, but also
electropalatographically.

Table 10.2 Sentence pairs used in the identification and analysis tests

  d+k   The road collapsed.
  g+k   The rogue collapsed.
  d+k   Did you go to the Byrd concert on Monday?
  g+k   Did you go to the Berg concert on Monday?
  d+k   Was the lead covered, did you notice?
  g+k   Was the leg covered, did you notice?
  d+k   Did that new fad catch on?
  g+k   Did that new fag catch on?*
  d+g   They did gardens for rich people.
  g+g   They dig gardens for rich people.
  d+g   It's improper to bed girls for money.
  g+g   It's improper to beg girls for money.
  d+m   A generous bride must be popular.
  b+m   A generous bribe must be popular.

Note: *Fag is colloquial British for cigarette.
The EPG traces were used to select, for each lexical alveolar sentence, a
token archetypical of each of the categories full alveolar, residual alveolar,
and zero alveolar as observed previously in EPG recordings. A nonalveolar
control token was also chosen. The tokens selected were analyzed spectro-
graphically to confirm, in terms of a good duration match, the similarity of
rate of the tokens for a given sentence.
A test tape was constructed to contain, in pseudo-randomized order
(adjacent tokens of the same sentence, and adjacent occurrences of the same
articulation type, were avoided), four repetitions of the nonalveolar token,
and two repetitions of each of the full-alveolar, residual-alveolar, and zero-
alveolar version of the lexical alveolar sentence.
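
The ordering constraint (no adjacent tokens of the same sentence and no adjacent tokens of the
same articulation type) is easy to satisfy by greedy randomization with restarts. A minimal
sketch in Python, with invented sentence labels and the repetition counts described above:

import random

def build_tape(sentences, max_tries=1000):
    """Return a presentation order in which adjacent items never share a
    sentence or an articulation type (greedy choice with random restarts)."""
    pool = []
    for sentence in sentences:
        pool += [(sentence, "nonalveolar")] * 4
        for art_type in ("full alveolar", "residual alveolar", "zero alveolar"):
            pool += [(sentence, art_type)] * 2
    for _ in range(max_tries):
        remaining = pool[:]
        random.shuffle(remaining)
        order = []
        while remaining:
            ok = [i for i, item in enumerate(remaining)
                  if not order
                  or (item[0] != order[-1][0] and item[1] != order[-1][1])]
            if not ok:
                break                     # dead end; restart with a new shuffle
            order.append(remaining.pop(random.choice(ok)))
        if not remaining:
            return order
    raise RuntimeError("no valid ordering found")

for trial in build_tape(["road/rogue", "lead/leg", "bed/beg"])[:5]:
    print(trial)
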

Two groups of subjects were used in the identification test: fourteen
professional phoneticians from the universities of Cambridge, Leeds, Read-
ing, and London; and thirty students (some with a limited amount of
phonetic training). All subjects were naive as to the exact purpose of the
experiment.
In the identification task, subjects heard each token once and had to
choose on a response sheet, on which was printed the sentence context for
each token, which of the two possible words shown at the critical point they
had heard. The phoneticians subsequently performed a more arduous task,
an analysis test, which involved listening to each sentence as often as
necessary, making a narrow transcription of the relevant word, performing a
"considered" lexical identification, and attempting to assign its final con-
sonant to one of the four EPG articulation types, these being explained in the
instructions for the analysis test. The results for the group of phoneticians are
discussed in detail in Kerswill and Wright (1989).
Results for the identification test are summarized in figure 10.4. Perform-
ance of phoneticians (filled squares) and students (open squares) is very
similar, which suggests that even if the phoneticians' skills enable them to
hear more acutely, the conditions of this experiment did not allow scope for
the application of those skills. As might be expected, articulation type 1 (full
alveolar) allows the alveolar to be identified with almost complete reliability.
With type 2 (residual alveolar), less than half the tokens are correctly
identified as words with lexical alveolars. With type 3 (zero alveolar),
responses are not appreciably different from those for nonalveolars (type 4).
The main reason that not all nonalveolar tokens are judged correctly (which
would result in a 0 percent value in fig. 10.4) is presumably that subjects are
aware, from both natural-language experience and the ambiguous nature of
the stimuli in the experiment, that assimilation can take place, and are
therefore willing to "undo" its effect and report alveolars even where there is
no evidence for them.
So far, the results seem to represent a fairly simple picture. Tokens of
lexical alveolars which are indistinguishable in EPG traces from control
nonalveolar sequences also appear to sound like the relevant nonalveolar
tokens. Tokens of lexical alveolars with residual-alveolar gestures are at least
ambiguous enough not to sound clearly like nonalveolars, and are sometimes
judged "correctly" to be alveolar words.
But there are hints in the responses as a whole that the complete picture
may be more complicated. Table 10.3 shows the responses to types 3 (zero
alveolar) and 4 (nonalveolar) broken down word pair by word pair. The
responses are shown for the student group and for the phoneticians identify-
ing under the same "one-pass" condition, and then for the phoneticians
making their "considered" identifications when they were allowed to listen
repeatedly. The "difference" columns show the difference (in percentage)
between "alveolar" responses to type 3 and type 4 stimuli, and a large
positive value is thus an indication of more accurate identification of the
members of a particular pair. It is striking that in the phoneticians'
"considered" identifications, two pairs stand out as being more successfully
identified, namely leg-lead (/led/) and beg-bed, with difference values of 40
and 36 respectively.

Figure 10.4 Identification test: percentage of /d/ responses (by phoneticians and students
separately) broken down by articulation type. The articulation type of a token was determined
by EPG. The articulation types are (1) full alveolar, (2) residual alveolar, (3) zero alveolar, (4)
nonalveolar
Is there anything about these pairs of stimuli which might explain their
greater identifiability? Examination of their EPG patterns suggests a possible
answer. The patterns for these pairs are shown in figure 10.5. The normal
expectation from the earlier EPG studies would be that the lexical alveolar
member of a pair can have contact further forward on the palate, particularly
at the sides, than its nonalveolar counterpart. Of course, this possibility is
excluded here by design, otherwise the lexical alveolar token would not be
"zero alveolar" but "residual alveolar." But the interesting feature is that the
nonalveolar words (leg, beg) show evidence of contact further forward on the
palate than their zero-alveolar counterparts. This could, perhaps, reflect
random variation, exposed by the criteria used to ensure that zero-alveolar
tokens genuinely were such, and not residual alveolars. But if that were the
explanation for the patterns, it is unlikely that any enhanced identification
would have emerged for these pairs.

Table 10.3 Percentage of "lexical alveolar" identifications for zero-alveolar
(z-a) and nonalveolar (n-a) tokens by students and phoneticians in one-pass
listening, and by phoneticians in making "considered" judgments, broken down
by minimal pairs. The difference column (diff.) shows the difference in
percentage between the two articulation types. The percentage values for
students derive from 60 observations (2 responses to each token by each of
30 students), and for phoneticians in each condition from 28 observations
(2 responses to each token by each of 14 phoneticians). The pairs are ordered
by the phoneticians' "considered" identifications in descending order

                   Students                Phoneticians
                   one-pass                one-pass                considered
                   identification          identification          identification
                   z-a   n-a   diff.       z-a   n-a   diff.       z-a   n-a   diff.

lead/leg            18     8    10          18    18     0          54    14    40
bed/beg             10     5     5          21     7    14          57    21    36
fad/fag             28    19     9           7    11    -4          29    14    15
did/dig             18    16     2          29    25     4          57    45    12
bride/bribe         25    23     2          32    27     5          54    43    11
road/rogue          18    12     6          39    25    14          57    50     7
Byrd/Berg           15    10     5          14    27   -13          43    48    -5
A more plausible and interesting hypothesis is that the tongue con-
figuration in realizing lexical /dg/ sequences, regardless of the extent to which
the alveolar closure is achieved, is subtly different from that for /gg/
sequences. Specifically, it may be that English alveolars involve, in conjunc-
tion with the raising of the tongue tip, a certain hollowing of the tongue front
which is incompatible with as fronted a velar closure as might be usual in a
velar following a front vowel. The suggestion, then, is that the underlying
alveolar specification is still leaving a trace in the overall articulatory gesture,
even though the target normally thought of as primary for an alveolar is not
being achieved.
Auditorily, it seems that the vowel allophone before the lexical velar is
slightly closer than before the lexical alveolar. Spectrographic analysis of the
tokens, however, has not immediately revealed an obvious acoustic differ-
ence, but more quantitative methods may prove more fruitful.
Figure 10.5 EPG patterns for the tokens of leg/lead and beg/bed used in the identification and
rating experiments: nonalveolar (leg), zero alveolar (lead), nonalveolar (beg), zero alveolar (bed)

The strong hypothesis which is suggested by these preliminary findings is
that differences in lexical phonological form will always result in distinct
articulatory gestures. This is essentially the same as the constraint on
phonological representations suggested, in the context of a discussion of
neutralization, by Dinnsen (1985: 276), to the effect that "every genuine
phonological distinction has some phonetic reflex, though not necessarily in
the segments which are the seat of the distinction." Testing this hypothesis in
the domain of assimilation will require a substantial amount of data from
naive subjects, and a combination of techniques, including perhaps radio-
graphy and electromyography. Careful thought will also have to be given to
whether an independently motivated distinction can be justified between two
classes of assimilation: the type under discussion here, which is in essence
optional in its operation; and a type which represents an obligatory morpho-
phonological process. Lahiri and Hankamer (1988), for instance, discuss the
process in Bengali whereby underlying /r + t/ (where + is a morpheme
boundary) undergoes total assimilation to create geminate [tt]. This sequence
is apparently phonetically identical to the realization of underlying /t + t/,
suggesting that there is no barrier to true neutralization in the course of the
morphophonologization of an assimilation process.
First, however, as a further step in investigating optional assimilations
short of more sophisticated instrumental techniques, an extension of the
present experiment was tried which could indirectly have a bearing on the
issue. It was reasoned that any evidence that listeners perceived more
alveolars for zero-alveolar tokens than for nonalveolar tokens would indi-
cate the presence of utilizable traces of the lexical alveolar. So far, there has
only been very slight evidence of this, in the "considered" judgments of the
phoneticians. But the identification task in the preceding experiment actually
poses quite a stiff task. Suppose the "trace" in effect consists of a fine
difference in vowel height. Whilst the direction of such an effect might be
expected to be consistent across speakers, such is the variation in vowel
height across different individuals that it might be very difficult to judge a
token in isolation.
It was therefore proposed to carry out the identification test again (on new
subjects), but structured as a comparison task. A test tape was constructed
using the stimuli as in the first identification experiment, but this time
arranged so that in each trial a "lexical alveolar" sentence token of whatever
articulation type was always heard paired one second before or after the
relevant "nonalveolar" control. In this way any consistent discriminable
trace would have the best chance of being picked up and anchored relative to
the speaker's phonetic system. The subjects' task was then to listen for a
target word (e.g. leg), and note on a response sheet whether it occurred in the
first or the second sentence heard in the trial. The design was balanced, so
that half the time the target was, e.g., leg and half the time lead; and half the
time the "correct" answer was thefirstsentence, and half the time the second.

Figure 10.6 "Comparison" identification test: percentage correct identification of target alveolar/
nonalveolar words from minimal-pair sentences. Each nonalveolar sentence token was paired
with tokens classified as (1) full alveolar, (2) residual alveolar, (3) zero alveolar

In all, ten sentence pairs were used (the seven in table 10.2 plus three with
final nasals), and with each of three "alveolar" articulation-types paired with
the nonalveolar control and presented in four conditions (to achieve balance
as described above), the test tape contained 120 trials.
The results, for twenty naive listeners, are summarized in figure 10.6. The
percentage of correct identifications as either an alveolar or a nonalveolar
word is shown for stimuli consisting of, respectively, a full-alveolar token (1),
a residual-alveolar token (2), and a zero-alveolar token (3) each paired with
the relevant nonalveolar control. It can be seen that even in the case of pairs
containing zero alveolars, correct identification, at 66 percent, is well above
chance (50 percent) - significantly so (p< 0.0001) according to preliminary
statistical analysis using a chi-squared test. Inspection of individual lexical
items reveals, as found in the phoneticians' "considered" identifications, that
some pairs are harder to identify than others.
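
The claim of better-than-chance performance can be checked against the 50 percent baseline
with the kind of test mentioned above. The sketch below is hypothetical Python using scipy; the
observation count assumes that all twenty listeners responded to all forty zero-alveolar trials,
which is not stated explicitly in the text.

# Is 66% correct reliably above the 50% chance level?
from scipy.stats import chisquare

n_obs = 20 * 40                      # assumed: 20 listeners x 40 zero-alveolar trials
n_correct = round(0.66 * n_obs)      # 66 percent correct, as reported
observed = [n_correct, n_obs - n_correct]
expected = [n_obs / 2, n_obs / 2]    # 50 percent chance baseline

chi2, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p:.2g}")   # p far below 0.0001 on these counts
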
The finding of better-than-chance performance in zero-alveolar pairs is
hard to explain unless the lexical distinction of alveolar versus nonalveolar
does leave traces in the articulatory gesture, even when there is no EPG
evidence of an alveolar gesture. More generally, the finding is in accord with
the hypothesis that differences in lexical phonological form will always result
in distinct articulatory gestures. Notice, furthermore, that listeners can not
only discriminate the zero-alveolar/nonalveolar pairs, but are able to relate

the nature of difference to the intended lexical form. This does not, of course,
conflict with the apparent inability of listeners to identify zero alveolars
"correctly" in isolated one-pass listening; the active use of such fine phonetic
detail may only be possible at all when "anchoring" within the speaker's
system is available.

10.4 Summary and discussion


It is clear that the work reported only scratches the phonetic surface of the
phenomenon of assimilation of place of articulation, and, if it is admitted
under the heading of "experimental phonology," it is experimental not only
in the usual sense, but also in the sense that it tries out new methods whose
legitimacy may turn out to be open to question in the light of future work on
a grander scale. Nevertheless, the work has served to define what seem to be
fruitful questions, and provide provisional answers. To recap: the specific
questions, and their provisional answers, were:
1 Does articulation mirror the discrete change implied by phonetic and
phonological representations of assimilation?
No, (optional) place assimilation is, in articulatory terms, very clearly a
gradual process. Viewed electropalatographically, intermediate forms ("resi-
dual alveolars") exist in which the tongue appears to make the supporting
lateral gesture towards the alveolar ridge in varying degrees, but no median
closure is achieved at the alveolar ridge. Other tokens are indistinguishable
from the realizations of nonalveolar sequences, suggesting complete assimila-
tion; but with the proviso that the EPG data may not capture residual
configurational differences falling short of contact. In a few tokens the EPG
traces themselves give evidence of such configurational differences. Because
of the limitations of the technique, it has not been possible to show whether
phonologically distinct lexical forms are ever realized totally identically by
virtue of optional place-assimilation.
2 If assimilation turns out to be a gradual process, how is the articulatory
continuum of forms responded to perceptually?
As might be expected, identification of lexical alveolars where there is
complete alveolar closure is highly reliable. Residual alveolars appear to be
ambiguous; they are perceived "correctly" rather under half the time. As far
as zero-alveolar forms are concerned, there is no evidence that alveolar cues,
if any, can be utilized in normal "one-pass" listening, but in certain vowel
environments phoneticians may have some success under repeated listening.
On the other hand naive listeners were able to achieve better-than-chance
identification of zero alveolars when presented directly with the zero-
alveolar/nonalveolar contrast, showing both that residual cues to the place
distinction survive even here, and that listeners are at some level aware of the
nature of these cues.
How, then, should place assimilation be modeled? It is certain that the
phonetic facts are more complicated than the discrete switch in node
association implied by figure 10.1. This would only, presumably, account for
any cases where the realization becomes identical with the equivalent
underlying nonalveolar.
One straightforward solution presents itself within the notational frame-
work of autosegmental phonology. The node on the supralaryngeal tier for
the first consonant could be associated to the place node of the second
consonant without losing its association to its "own" place features. This
could be interpreted as meaning, for the first consonant, "achieve the place
target for the second segment, without entirely losing the original features of
the first," thus giving rise to residual alveolars (stage [b] below). A further
process would complete the place assimilation (stage [c]):

(2)  Timing tier         C    C          C    C          C    C
                         |    |          |    |          |    |
     Supralaryngeal      •    •          •    •          •    •
     tier                |    |          |  \ |             \ |
     Place tier          •    •          •    •          •    •
                           (a)             (b)             (c)

Unfortunately, there is no reason other than goodwill why that notation
should be so interpreted. Since simultaneous double articulations (alveolar
plus velar, alveolar plus labial) are perfectly feasible, and indeed are the norm
for at least part of the duration of stop clusters in English when assimilation
has not taken place, there is no reason for either of the place specifications to
be downgraded in priority. Association lines do not come in different
strengths - they are association lines pure and simple; and there is no
independently justified principle that a node associated to only one node
higher in the hierarchy be given less priority in implementation than one
associated to two higher nodes.
It seems, then, that a phonological notation of this kind is still too bound
to the notions of discreteness and segmentality to be appropriate for
modeling the detail of assimilation. Given the gradual nature of assimilation,
it is therefore natural to turn to an account at the implementational level of
speech production; that is, the level at which effects arise because of
mechanical and physiological constraints within the vocal mechanism, and
inherent characteristics of its "programming."
Prime candidate for modeling assimilation here might be the notion of
"co-production" (e.g. Fowler 1985: 254ff.). This conceptualization of co-
articulation sees a segment as having constituent gestures associated with it
which extend over a given timespan. The constituent gestures do not
necessarily all start and stop at the same time, and they will certainly overlap
with gestures for abutting segments. Typically, vowels are spoken of as being
coproduced with consonants, but presumably in a similar way adjacent
consonants may be coproduced. The dentality of the (normally alveolar)
lateral, and the velarization of the dental fricative, in a word such as filth,
might be seen as the result of coproduction, presumably as a result of those
characteristics of tongue-tip, and tongue-body, gestures having respectively a
more extensive domain.
Problems with such an account, based on characteristics of the vocal
mechanism and its functioning, soon emerge when it is applied to what is
known about place assimilation. For one thing, the same question arises as
with the phonological solution rejected above: namely, that true coproduc-
tion would lead not to assimilation, but to the simultaneous achievement of
both place targets - double stops again. To predict the occurrence of
residual-alveolar and zero-alveolar forms it might be possible to enhance the
coproduction model with distinctions between syllable-final and syllable-
initial consonants, and a convention that targets for the former are given
lower priority; but this seems to be creeping away from the spirit of an
account based solely in the vocal mechanism. Even then, unless it were found
that the degree of loss of the first target were rigidly correlated with rate of
speech, it is not clear how the continuum of articulation types observed could
be explained mechanically.
More serious problems arise from the fact that place-assimilation behavior
is far from universal. This observation runs counter to what would be
expected if the behavior were the result of the vocal mechanism. Evidence of
variation in how adjacent stops of different places of articulation are treated
is far from extensive, but then it has probably not been widely sought up to
now. However, there are already indications of such variation. Kerswill
(1987: 42, 44) notes, on the basis of an auditory study, an absence or near
absence in Durham English of place assimilation where it would be expected
in many varieties of English. And in Russian, it has been traditionally noted
that dental/alveolar to velar place assimilation is much less extensive than in
many other languages, an observation which has been provisionally con-
firmed by Barry (1988) using electropalatography.
If it becomes firmly established that place assimilation is variable across
languages, it will mean that it is a phenomenon over which speakers have
control. This will provide further evidence that a greater amount of phonetic
detail is specified in the speaker's phonetic representation or phonetic plan than
is often assumed. Compare, for instance, similar arguments recently used in
connection with stop epenthesis (Fourakis and Port 1986) and the micro-
timing of voicing in obstruents (Docherty 1989).
It may, or may not, have been striking that so far this discussion has tacitly
accepted a view which has been commonplace among phonologists particu-
larly since the "generative" era: namely, that the "performance" domain
relevant to phonology is production. This, of course, has not always been the
case. Jakobson, Fant, and Halle (1952: 12) argued very cogently that the
perceptual domain is the one most relevant to phonology:
The closer we are in our investigation to the destination of the message (i.e. its
perception by the receiver), the more accurately can we gage the information
conveyed by its sound shape ... Each of the consecutive stages, from articulation to
perception, may be predicted from the preceding stage. Since with each subsequent
stage the selectivity increases, this predictability is irreversible and some variables of
any antecedent stage are irrelevant for the subsequent stage.

Does this then provide the key to the phonological treatment of assimilation?
Suppose for a moment that residual-alveolar forms were, like zero-alveolar
forms, indistinguishable from lexical nonalveolars in one-pass listening.
Would it then, on the "perceptual-primacy" view of the relation of phono-
logy to performance, be legitimate to revert to the treatment of assimilation
shown in figure 10.1 - saying in effect that as far as the domain most relevant
to the sound patterning of language is concerned, assimilation happens quite
discretely?
This is an intriguing possibility for that hypothetical state of affairs, but
out of step with the actual findings of the identification experiment, in which
residual-alveolar forms allowed a substantial degree of correct lexical identi-
fication. For a given stimulus, of course, the structure of the experiment
forces a discrete response (one word or the other); but the overall picture is
one in which up to a certain point perception is able to make use of partial
cues to alveolarity.
Whatever the correct answer may be in this case, the facts of assimilation
highlight a problem which will become more and more acute as phonologists
penetrate further towards the level of phonetic detail: that is, the problem of
what precisely a phonology is modeling.
One type of difficulty emerged from the fact that the level of phonetic
detail constitutes the interface between linguistic structure and speech
performance, and therefore between a discrete symbol system and an
essentially continuous event. It is often difficult to tell, from speech-
performance data itself (such as that presented here on assimilation), which
effects are appropriately modeled symbolically and which treated as
continuous, more-or-less effects.
A further difficulty emerged from the fact that phonetic detail does not
present itself for analysis directly and unambiguously. Phonetic detail can
only be gleaned by examining speech performance, and speech performance
has different facets: production, the acoustic signal, and perception, at least.
Perhaps surprisingly, these facets are not always isomorphic. For instance,
experimental investigation of phonetic detail appears to throw up cases
where produced distinctions are not perceived (see work on sound changes in
progress, such as the experiment of Costa and Mattingly [1981] on New
England [USA] English, which revealed a surviving measurable vowel-
duration difference in otherwise minimal pairs such as cod-card, which their
listeners were unable to exploit perceptually). In such cases, what is the
reality of the linguistic structure the phonologist is trying to model?

10.5 Conclusions
This paper has aimed to show that place assimilation is a fruitful topic of
study at the interface of phonology and experimental phonetics. It has been
found that place assimilation happens gradually, rather than discretely, in
production; and that residual cues to alveolars can be exploited with some
degree of success in perception.
It is argued that the facts of place assimilation can be neither modeled
adequately at a symbolic, phonological level, nor left to be accounted for by
the mechanics of the speech mechanism. Instead, they must be treated as one
of those areas of subcontrastive phonetic detail over which speakers have
control. The representation of such phenomena is likely to require a more
radical break from traditional segmental notions than witnessed in recent
phonological developments.
Clearly, much remains to be done, both in terms of better establishing the
facts of assimilation of place of articulation, and, of course, of other aspects
of production, and in terms of modeling them. It is hoped that this paper will
provoke others to consider applying their own techniques and talents to this
enterprise.

Comments on chapter 10
BRUCE HAYES
The research reported by Nolan is of potential importance for both phonetic
and phonological theory. The central claim is as follows. The rule in (3),
which is often taught to beginning phonology students, derives incorrect
outputs. In fluent speech, the /t/ of late calls usually does not become a /k/,
but rather becomes a doubly articulated stop, with both a velar and an
alveolar closure: [leɪ{t}kɔːlz]. The alveolar closure varies greatly in its
strength, from full to undetectable.

(3)   alveolar stop  →  [α place]  /  ___  [ C, α place ]

Nolan takes care to point out that this phenomenon is linguistic and not
physiological - other dialects of English, and other languages, do not show
the same behavior. This means that a full account of English phonology and
phonetics must provide an explicit description of what is going on.
Nolan also argues that current views of phonological structure are
inadequate to account for the data, suggesting that "The representation of
such phenomena is likely to require a more radical break from traditional
segmental notions than witnessed in recent phonological developments."
As a phonologist, I would like to begin to take up this challenge: to suggest
formal mechanisms by which Nolan's observations can be described with
explicit phonological and phonetic derivations. In fact, I think that ideas
already in the literature can bring us a fair distance towards an explicit
account. In particular, I want to show first that an improved phonological
analysis can bring us closer to the phonetic facts; and second, by adopting an
explicit phonetic representation, we can arrive at least at a tentative account
of Nolan's data.
Consider first Nolan's reasons for rejecting phonological accounts of the
facts. In his paper, he assumes the model of segment structure due to
Clements (1985), in which features are grouped within the segment in a
hierarchical structure. For Clements, the place features are grouped together
under a single PLACE node, as in (4a). Regressive place assimilation would be
expressed by spreading the PLACE node leftward, as in (4b).

(4) a.              PLACE
                  /   |   \
            [ant]  [cor]  [distr]  etc.

    b.     C       C                     C       C
           |       |                     |       |
         SUPRA   SUPRA                 SUPRA   SUPRA
           |       |           →          |  \    |
         PLACE   PLACE                 PLACE   PLACE
           |       |                      |       |
        [+cor]  [-cor]                 [+cor]  [-cor]


A difficulty with this account, as Nolan points out, is that it fails to indicate
that the articulation of the coronal segment is weak and variable, whereas
that of the following noncoronal is robust. However, it should be remem-
bered that (4b) is meant as a phonological representation. There are good
reasons why such representations should not contain quantitative inform-
ation. The proper level at which to describe variability of closure is actually
the phonetic level.
I think that there is something more fundamentally wrong with the rule in
(4b): it derives outputs that are qualitatively incorrect. If we follow standard
phonological assumptions, (4b) would not derive a doubly articulated
segment, but rather a contour segment. The rule is completely analogous to
the tonal rule in (5), which derives a contour falling tone from a High tone by
spreading.
(5)    V    V
       | \  |
       H    L          =  falling tone + low tone

Following this analogy, the output of rule (4b) would be a contour segment,
which would shift rapidly from one place of articulation to another.
If we are going to develop an adequate formal account of Nolan's findings,
we will need phonological and phonetic representations that can depict
articulation in greater detail. In fact, just such representations have been
proposed in work by Sagey (1986a), Ladefoged and Maddieson (1989), and
others. The crucial idea is shown in (6): rather than simply dominating a set
of features, the PLACE node dominates intermediate nodes, corresponding to
the three main oral articulators: LABIAL for the lips, CORONAL for the tongue
blade, and DORSAL for the tongue body.
(6)                     PLACE
              /           |           \
         LABIAL        CORONAL        DORSAL
            |           /    \       /   |   \
        [round]     [ant]  [distr] [back] [high] [low]


The articulator nodes are not mutually exclusive; when more than one is
present, we get a complex segment. One example is the labiovelar stop,
depicted as the copresence of a LABIAL and a DORSAL node under the same
PLACE node.

(7)  [g͡b]:       PLACE
                /      \
           LABIAL     DORSAL
282
10 Comments
Note that the LABIAL and DORSAL nodes are intended to be simultaneous, not
sequenced.
Representations like these are obviously relevant to Nolan's findings,
because he shows that at the surface, English has complex segments: for
example, the boldface segment in late calls [leɪ{t}kɔːlz] is a coronovelar, and
the boldface segment in good batch [gʊ{d}bætʃ] is a labiocoronal.
A rule to derive the complex segments of English is stated in (8). The rule
says that if a syllable-final coronal stop is followed by an obstruent, then the
articulator node of the following obstruent is spread leftward, sharing the
PLACE node with the CORONAL node.

(8) Place Assimilation
    Spread the articulator node of a following obstruent leftward, onto a
    syllable-final [-continuant] segment.4
In (9) is an illustration of how the rule works. If syllable-final /t/ is
followed by /k/, as in late calls, then the DORSAL articulator node of the /k/ is
spread leftward. In the output, it simultaneously occupies a PLACE node with
the original CORONAL node of the /t/. The output of the rule is therefore a
corono-dorsal complex segment.

(9)  late calls: /t k/ → [{t}k]

         C          C
         |          |
       PLACE      PLACE
        |    \    /
       COR    DORS
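
Read as a procedure, rule (8) simply links the following obstruent's articulator node under the
preceding coronal stop's PLACE node. The sketch below is a hypothetical Python rendering; the
segment encoding, flags, and function name are mine, not Hayes's notation.

# Rule (8) as a data-structure operation: each segment's PLACE node is a set
# of articulator-node labels ("COR", "DORS", "LAB").
def place_assimilation(segments):
    """segments: list of dicts with 'symbol', 'place' (set of articulator
    nodes), 'stop', 'syllable_final' and 'obstruent' flags. Returns a new
    list with the leftward spreading applied."""
    out = [dict(seg, place=set(seg["place"])) for seg in segments]
    for i in range(len(out) - 1):
        left, right = out[i], out[i + 1]
        is_coronal_stop = left["place"] == {"COR"} and left.get("stop", False)
        if left["syllable_final"] and is_coronal_stop and right["obstruent"]:
            left["place"] |= right["place"]   # shared articulator node
    return out

late_calls = [
    {"symbol": "t", "place": {"COR"}, "stop": True,
     "syllable_final": True, "obstruent": True},
    {"symbol": "k", "place": {"DORS"}, "stop": True,
     "syllable_final": False, "obstruent": True},
]

print([seg["place"] for seg in place_assimilation(late_calls)])
# first segment now carries both COR and DORS: a corono-dorsal complex segment
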

To complete this analysis, we have to provide a way of varying the degree
of closure made by the tongue blade. Since phonological representations are
discrete rather than quantitative, they are inappropriate for carrying out this
task. I assume then, following work by Pierrehumbert (1980), Keating (1985,
1988a), and others, that the grammar of English contains a phonetic
component, which translates the autosegments of the phonology into quanti-
tative physical targets. The rule responsible for weakening alveolar closures
is a phonetic rule, and as such it manipulates quantitative values.

4 There is an additional issue involved in the expression of (8): the rule must generalize over the
class of articulator nodes without actually spreading the PLACE node that dominates them.
Choi (1989), based on evidence from Kabardian, suggests that this may in fact be the normal
way in which class nodes operate: they define sets of terminal nodes that may spread, but do
not actually spread themselves.

The form of rules in the phonetic component is an almost completely
unsettled issue. For this reason, and for lack of data, I have stated the
phonetic rule of Alveolar Weakening schematically as in (10):
(10) Alveolar Weakening
Depending on rate and casualness of speech, lessen the degree of closure for
a COR autosegment, if it is [ — continuant] and syllable-final.
In (11) is a sketchy derivation showing how the rule would apply. We start
with the output of the phonology applying to underlying /tk/, taken from (9).
Next, the phonetic component assigns degree-of-closure targets to the
CORONAL and DORSAL autosegments. Notice that the target for the DORSAL
autosegment extends across two C positions, since this autosegment has
undergone spreading. Finally, the rule of Alveolar Weakening lessens the
degree of closure for the CORONAL target. It applies variably, depending on
speech style and rate, but I have shown just one possible output.

(11)  a. Output of phonology     b. Translation into          c. Alveolar Weakening
                                    quantitative targets

      [Panel (a) repeats the structure in (9). Panels (b) and (c) plot degree-of-closure
      targets (scaled from 0 to 1) for the ALVEOLAR and DORSAL autosegments over the two
      C positions; the DORSAL target extends across both Cs, and in (c) the ALVEOLAR
      target is lessened.]
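
The division of labor in (10) and (11), in which discrete autosegmental structure is translated
into quantitative closure targets that are then weakened gradiently, can also be sketched
numerically. The following hypothetical Python illustration invents a 0-to-1 closure scale and a
single "casualness" factor; neither is a value proposed in the commentary.

def closure_targets(slots, autosegments):
    """autosegments: list of (articulator, first_slot, last_slot) spans.
    Returns {articulator: [target per slot]} with 1.0 = full closure."""
    targets = {}
    for articulator, start, end in autosegments:
        track = targets.setdefault(articulator, [0.0] * slots)
        for i in range(start, end + 1):
            track[i] = 1.0
    return targets

def alveolar_weakening(targets, casualness):
    """Lessen the COR closure target in proportion to casualness (0 = slow
    and careful, 1 = fast casual); other articulators are untouched."""
    weakened = dict(targets)
    weakened["COR"] = [t * (1.0 - casualness) for t in targets.get("COR", [])]
    return weakened

# late calls /tk/ after rule (8): COR on slot 0 only, DORS spread over both slots.
targets = closure_targets(2, [("COR", 0, 0), ("DORS", 0, 1)])
for casualness in (0.0, 0.5, 1.0):
    print(casualness, alveolar_weakening(targets, casualness))
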

This analysis is surely incomplete, and indeed it may turn out to be entirely
wrong. But it does have the virtue of leading us to questions for further
research, especially along the lines of how the rules might be generalized to
other contexts.
To give one example, I have split up what Nolan treats as a single rule into
two distinct processes: a phonological spreading rule, Place Assimilation (8);
plus a phonetic rule, Alveolar Weakening (10). This predicts that in principle,
one rule might apply in the absence of the other. I believe that this is in fact
true. For example, the segment /t/ is often weakened in its articulation even
when no other segment follows. In such cases, the weakened /t/ is usually
"covered" with a simultaneous glottal closure. Just as in Nolan's data, the
degree of weakening is variable, so that with increasingly casual speech we
can get a continuum like the following for what: [wʌt], [wʌtʔ], [wʌʔᵗ], [wʌʔ].
It is not clear yet how Alveolar Weakening should be stated in its full
generality. One possibility is that the alveolar closures that can be weakened
are those that are "covered" by another articulation. A full, accurate
formulation of Alveolar Weakening would require a systematic investigation
of the behavior of syllable-final alveolars in all contexts.
My analysis also raises a question about whether Nolan is right in claiming
that Place Assimilation is a "gradual process." It is clear from his work that
the part of the process I have called Alveolar Weakening is gradual. But what
about the other part, which we might call "Place Assimilation Proper"? In
my analysis, Place Assimilation Proper is predicted to be discrete, since it is
carried out by a phonological rule. The tokens that appear in Nolan's paper
appear to confirm this prediction, which I think would be worth checking
systematically. If the prediction is not confirmed, it is clear that the theory of
phonetic representation will need to be enriched in ways I have not touched
on here.
Another area that deserves investigation is what happens when the
segment that triggers Place Assimilation is itself coronal, as in the dental
fricatives in get Thelma, said three, and ten things. Here, it is impossible to
form a complex segment, since the trigger and the target are on the same tier.
According to my analysis, there are two possible outcomes. If the CORONAL
autosegment on the left is deleted, then the output would be a static dental
target, extending over both segments, as in (12b). But if there is no delinking,
as in (12c), then we would expect the /t/ to become a contour segment, with
the tongue blade sliding from alveolar to dental position.

(12)  a. Applying place            b. Output with            c. Output without
         assimilation to /tθ/         delinking: [t̪θ]           delinking: [tt̪θ]

       PLACE    PLACE              PLACE    PLACE            PLACE    PLACE
         |     /  |                    \    /                  |  \      |
        COR      COR                    COR                   COR      COR
         |        |                      |                     |        |
      [-distr] [+distr]              [+distr]              [-distr] [+distr]
My intuitions are that both outcomes are possible in my speech, but it is clear
that experimental work is needed.
To sum up, I think Nolan's work is important for what it contributes to
the eventual development of a substantial theory of phonetic rules. I have
also tried to show that by adopting the right phonological representation as
the input to the phonetic component (i.e. an autosegmental one, with
articulator nodes), the task of expressing the phonetic rules can be simplified.

Comments on chapter 10
JOHN J. OHALA
Nolan's electropalatographic study of lingual assimilation has given us new
insight into the complexities of assimilation, a process which phonologists
thought they knew well but which, the more one delves into it, turns out to
have completely unexpected aspects.
To understand fully place assimilation in heterorganic medial clusters I
think it is necessary to be clear about what kind of process assimilation is. I
presume all phonologists would acknowledge that variation appears in
languages due to "on-line" phonetic processes and due to sound changes. The
former may at one extreme be purely the result of vocal-tract constraints; for
example, Lindblom (1963) made a convincing case for certain vowel-reduction
effects being due to inertial constraints of the vocal mechanism. Sound change,
at the other extreme, may leave purely fossilized variant pronunciations in the
language, e.g., cow and bovine, both from Proto-Indo-European *gwous.
Phonetically caused variation may be continuous and not represent a change
in the instructions for pronunciation driving the vocal tract. Sound change, on
the other hand, yields discrete variants which appear in speech due to one or
the other variant form having different instructions for pronunciation. There
are at least two complicating factors which obscure the picture, however. First,
it is clear that most sound changes develop out of low-level phonetic variation
(Ohala 1974, 1983), so it may be difficult in many cases to differentiate
continuous phonetic variation from discrete variation due to sound change.
Second, although sound changes may no longer be completely active, they can
exhibit varying degrees of productivity if they are extrapolated by speakers to
novel lexical items, derivations, phrases, etc. It was a rather ancient sound
change which gave us the k > s change evident in skeptic ~ skepticism but this
does not prevent some speakers from extending it in novel derivations like
domesticism (with a stem-final [s]) (Ohala 1974).
I think place assimilation of medial heterorganic clusters in English may
very well be present in the language due to a sound change, but one which is
potentially much more productive than velar softening. Nevertheless, its full
implementation could still be discrete. In other words, I think there may be a
huge gap between the faintest version of an alveolar stop in red car and the
fully assimilated version [reg ka:].
Naturally, the same phonetic processes which originally gave rise to the
sound change can still be found in the language, i.e. imperfect articulation of
C1 and thus the weakening of the place cues for that consonant vis-à-vis the
place cues for C2. In the first Laboratory Phonology Conference (Ohala
1990a) I presented results of an experiment that showed that artificial
heterorganic stop + stop and nasal + stop clusters may be heard as
homorganic if the duration of the closures is less than a certain threshold
value; in these cases it was the place of C2 which dominated the percept. In
that case there was no question of the C1's being imperfectly articulated:
rather simply that the place cues for C2 overshadowed those of C1. The
misinterpretation of a heterorganic cluster C1C2 as a homorganic one was
discrete and, I argued, a duplication of the kind of phonetic event which led
to such sound changes as Late Latin okto > Italian otto. I think a similar
reading can be given to the results of Nolan's perceptual test (summarized in
his fig. 10.4). For the residual-alveolar tokens, slightly less than half of all
listeners (linguists and nonlinguists combined) identified the word as having
an alveolar Cl. I am not saying that this discrete identification change is the
discrete phonological process underlying English place assimilation, just that
it mirrors the process that gave rise to it.
Another example may clarify my point. No doubt, most would agree that
the alternation between /t/ and /tʃ/ as in act and actual ([ækt] ~ [æktʃuəl]) is due to
a sound change. (That most speakers do not "derive" actual from act is
suggested by their surprise or amusement on being told that the two are
historically related.) Nevertheless, we can still find the purely phonetic
processes which gave rise to such a change by an examination of the acoustic
properties of [t]s released before palatal glides or even palatal vowels: in
comparison to the releases before other, more open, vowels, these releases are
intense and noisy in a way that mimics a sibilant fricative. No doubt the t → tʃ
sound change arose due to listeners misinterpreting and thus rearticulating
these noisily released [t]s as sibilant affricates (Ohala 1989). What we find in a
synchronic phonetic analysis may very well be the "seeds" of sound change but
it was the past "germination" of such seeds which gave rise to the current
discrete alternations in the language; the presence of the affricate in actual is
not there due to any change per se being perpetrated by today's speakers or
listeners.

Comments on chapter 10
CATHERINE P. BROWMAN
Nolan's study presents articulatory evidence indicating that assimilation is
not an all-or-none process. It also presents perceptual evidence indicating
that lexical items containing alveolars can be distinguished from control
utterances (with no alveolars) even when there is no electropalatographic
evidence for tongue-tip articulations. The study addresses the question of
how such articulations differ from the control utterances - an important
point. However, it is difficult to evaluate the evidence presented, since the
articulation type that is critically important to this type of assimilation -
"zero alveolar" - is not clearly defined, largely due to the lack of temporal
information.
In figures 10.7a-c, the top panels portray three possible definitions in terms
of articulatory gestures (see e.g. Browman and Goldstein 1986, 1989). The
top panel in figure 10.7a shows the simplest possibility, and the one suggested
by the name "zero alveolar": gestural deletion, effectively the end of the
continuum of reduction. However, testing this possibility requires that the
behavior of the velar gestures in the control utterances (e.g. ... make
calls...) be clearly understood. What is the nature of the velar-velar
configuration to which the assimilation is being compared? Are there two
partially overlapping velar gestures, as in the top panel of figure 10.7, or have
the velar gestures slid so as to be completely superimposed, as in the bottom
panel? In the former case, there would be a clear durational difference
between zero-alveolar assimilations, defined as deleted gestures, and the
control utterances; in the latter, the two would not be clearly distinguishable
in terms of duration.
The top panel in figure 10.7b portrays a second possible definition of "zero
alveolar," one discussed by Nolan: a deleted tongue-tip articulation, with a
secondary tongue-front articulation remaining. This case would be clearly
distinguishable from both velar-velar configurations. The top panel in figure
10.7c portrays a third possibility, suggested by the notation of autosegmental
phonology: the tongue-tip gesture is replaced by a tongue-body gesture
(relinking on the place tier). As in the case of gestural deletion (figure 10.7a),
this case of gestural replacement can be evaluated only if the nature of the
control velar-velar configuration is known. Only if the two velars are
completely overlapping would there be an obvious difference between
gestural replacement and the control velar-velar configuration.
It should be noted that there is another possible source of perceptual
assimilation: increased gestural overlap. This is portrayed in the bottom
panels of figures 10.7a-c, which show full articulations for both the tongue
tip and the tongue body, but completely overlapping. Zsiga and Byrd (1988)
provided a preliminary indication of the validity of a similar analysis of
assimilation, using GEST, a computational gestural model being developed
at Haskins Laboratories (e.g. Browman et al. 1986; Saltzman et al. 1988a).
Zsiga and Byrd created a continuum of increasing overlap between the
gestures in the two words in the utterance bed ban, and measured the values
of F2 and F3 at the last point before the closure for the [d]. As the bilabial
Figure 10.7 Three possible definitions of "zero-alveolar" in terms of articulatory
gestures, comparing the alveolar-velar sequence with the velar-velar control under
partial and complete overlap of the tongue-tip (TT) and tongue-body (TB) gestures,
across the reduction continuum (full, residual, zero): (a) deletion of tongue-tip
gesture; (b) deletion of tongue-tip gesture, with secondary tongue-front articulation
remaining; (c) replacement of tongue-tip gesture by tongue-body gesture
gesture (beginning ban) overlapped the alveolar gesture (ending bed) more
and more, the formant values at the measurement point moved away from
the values for the utterance bed dan, and towards the values of the utterance
beb ban, even prior to complete overlap. Most importantly, when the two
gestures were completely synchronous, the formant values were closer to
those for [b] than those for [d]. Thus, in this case the acoustic consequences of
complete overlap were compatible with the hypothesis that assimilation
could be the result of increasing overlap between articulatory gestures.
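The comparison logic just described can be sketched in a few lines. The function
below is only an illustrative restatement, not part of Zsiga and Byrd's GEST
simulations; its name and the idea of passing in reference measurements are
assumptions. Given F2 and F3 measured at the last point before closure, it reports
whether the token lies closer to the reference values from bed dan or from beb ban.

    import math

    def closer_to(f2_f3, d_reference, b_reference):
        """Decide whether a measured (F2, F3) pair, taken at the last point before
        the closure, lies nearer (Euclidean distance) to the alveolar reference
        (e.g. from bed dan) or to the bilabial reference (e.g. from beb ban)."""
        return 'd' if math.dist(f2_f3, d_reference) <= math.dist(f2_f3, b_reference) else 'b'

Scanned across the overlap continuum, such a comparison would register the shift
toward the [b] values that Zsiga and Byrd report before the two gestures become
fully synchronous.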

11
Psychology and the segment

ANNE CUTLER

Something very like the segment must be involved in the mental operations
by which human language users speak and understand.* Both processes-
production and perception - involve translation between stored mental rep-
resentations and peripheral processes. The stored representations must be
both abstract and discrete.
The necessity for abstractness arises from the extreme variability to which
speech signals are subject, combined with the finite storage capacity of
human memory systems. The problem is perhaps worst on the perceiver's
side; it is no exaggeration to say that even two productions of the same
utterance by the same speaker speaking on the same occasion at the same
rate will not be completely identical. And within-speaker variability is tiny
compared to the enormous variability across speakers and across situations.
Speakers differ widely in the length and shape of their vocal tracts, as a
function of age, sex, and other physical characteristics; productions of a
given sound by a large adult male and by a small child have little in common.
Situation-specific variations include the speaker's current physiological state;
the voice can change when the speaker is tired, for instance, or as a result of
temporary changes in vocal-tract shape such as a swollen or anaesthetized
mouth, a pipe clenched between the teeth, or a mouthful of food. Other
situational variables include distance between speaker and hearer, interven-
ing barriers, and background noise. On top of this there is also the variability
due to speakers' accents or dialects; and finally, yet more variability arises
due to speech style, or register, and (often related to this) speech rate.
But the variability problem also exists in speech production; we all vary
our speech style and rate, we can choose to whisper or to shout, and the

"This paper was prepared as an overall commentary on the contributions dealing with segmental
representation and assimilation, and was presented in that form at the conference.

accomplished actors among us can mimic accents and dialects and even
vocal-tract parameters which are not our own. All such variation means that
the peripheral processes of articulation stand in a many-to-one relationship
to what is uttered in just the same way as the peripheral processes of
perception do.
If the lexicon were to store an exact acoustic and articulatory represen-
tation for every possible form in which a given lexical unit might be heard or
spoken, it would need infinite storage capacity. But our brains simply do not
have infinite storage capacity. It is clear, therefore, that the memory
representations of language which we engage when we hear or produce
speech must be in a relatively abstract (or normalized) form.
The necessity for discreteness also arises from the finite storage capacity of
our processing systems. Quite apart from the infinite range of situational and
speaker-related variables affecting how an utterance is spoken, the set of
potential complete utterances themselves is also infinite. A lexicon - that is,
the stored set of meaning representations-just cannot include every utter-
ance a language user might some day speak or hear; what is in the lexicon
must be discrete units which are smaller than whole utterances. Roughly, but
not necessarily exactly, lexical representations will be equivalent to words.
Speech production and perception involve a process of translation between
these lexical units and the peripheral input and output representations.
Whether this process of translation in turn involves a level of representation
in terms of discrete sublexical units is an issue which psycholinguists have
long debated.
Arguments in favor of sublexical representations have been made on the
basis of evidence both from perception and from production. In speech
perception, it is primarily the problem of segmentation which has motivated
the argument that prelexical classification of speech signals into some sub-
word-level representation would be advantageous. Understanding a spoken
utterance requires locating in the lexicon the individual discrete lexical units
which make up the utterance, but the boundaries between such units - i.e. the
boundaries between words-are not reliably signaled in most utterances;
continuous speech is just that-continuous. There is no doubt that a
sublexical representation would help with this problem, because, instead of
being faced with an infinity of points at which a new word might potentially
commence, a recognizer can deal with a string of discrete units which offer
the possibility of a new word beginning only at those points where a new
member of this set of sublexical units begins.
Secondly, arguments from speech perception have pointed out that the
greatest advantage of a sublexical representation is that the set of potential
units can be very much smaller than the set of units in the lexicon. However
large and heterogeneous the lexical stock (and adult vocabularies run into
many tens if not hundreds of thousands of items), with sublexical represen-
tations any lexical item could be decomposed into a selection from a small
and finite set of units. Since a translation process between the lexicon and
more peripheral processes is necessary in any case, translation into a small set
of possibilities will be far easier than translation into a large set of
possibilities.
Exactly similar arguments have been made for the speech-production
process. If all the words in the lexicon can be thought of as being made up of
a finite number of building blocks in various permutations, then the
translation process from lexical representation to the representation for
articulation need only know about how the members of the set of building
blocks get articulated, not how all the thousands of entries in the lexicon are
spoken.
Obvious though these motivating arguments seem, the point they lead to is
far from obvious. Disagreement begins when we attempt to answer the next
question: what is the nature of the building blocks, i.e. the units of sublexical
representation? With excellent reason, the most obvious candidates for the
building-block role have been the units of analysis used by linguists. The
phoneme has been the most popular choice because (by definition) it is the
smallest unit into which speech can be sequentially decomposed. I wish that it
were possible to say at this point: the psycholinguistic evidence relating to the
segment is unequivocal. Unsurprisingly, however, equivocality reigns here as
much as it does in phonology.
On the one hand, language users undoubtedly have the ability to manipu-
late speech at the segmental level, and some researchers have used such
performance data as evidence that the phoneme is a level of representation in
speech processing. For instance, language games such as Pig Latin frequently
involve movement of phoneme-sized units within an utterance (so that in one
version of the game, pig latin becomes ig-pay atin-lay). At a less conscious
level, slips of the tongue similarly involve movement of phoneme-sized
units - "by far the largest percentage of speech errors of all kinds" says
Fromkin (1971: 30), involve units of this size: substitution, exchange,
anticipation, perseveration, omission, addition - all occur more often with
single phonemes than with any other linguistic unit. The on-line study of
speech recognition has made great use of the phoneme-monitoring task
devised by Foss (1969), which requires listeners to monitor speech for a
designated target phoneme and press a response key as soon as the target is
detected; listeners have no problem performing this task (although as a
caveat it should be pointed out that the task has been commonly used only
with listeners who are literate in a language with alphabetic orthography).
Foss himself has provided (Foss and Blank 1980; Foss, Harwood and Blank
1980; Foss and Gernsbacher 1983) the strongest recent statements in favor of
the phoneme as a unit of representation in speech processing: "the speech
perception mechanisms compute a representation of the input in terms of
phoneme-sized units" (Foss, Harwood, and Blank 1980: 185). This argument
is based on the fact that phoneme targets can be detected in heard speech
prior to contact with lexical representations, as evidenced by the absence of
frequency effects and other lexical influences in certain types of phoneme-
monitoring experiment.
Doubt was for a time cast on the validity of phoneme-monitoring
performance as evidence for phonemic representations in processing, because
it was reported that listeners can detect syllable-sized targets faster than
phoneme-sized targets in the same speech material (for English, Savin and
Bever 1970; Foss and Swinney, 1973; Mills 1980; Swinney and Prather 1980;
for French, Segui, Frauenfelder, and Mehler 1981). However, Norris and
Cutler (1988) noted that most studies comparing phoneme- and syllable-
monitoring speed had inadvertently allowed the syllable-detection task to be
easier than the phoneme-detection task - when the target was a syllable, no
nontarget items were presented which were highly similar to the target, but
when the target was a phoneme, nontarget items with very similar phonemes
did occur. To take an example from Foss and Swinney's (1973) experiment,
the list giant valley amoral country private middle bitter protect extra was
presented for syllable monitoring with the target bit-, and for phoneme
monitoring with the target b-. The list contains no nontarget items such as
pitcher, battle, or bicker which would be very similar to the syllable target,
but it does contain nontarget items beginning with p-, which is very similar to
the phoneme target. In effect, this design flaw allowed listeners to respond in
the syllable-target case on the basis of only partial analysis of the input.
Norris and Cutler found that when the presence of items similar to targets of
either kind was controlled, so that listeners had to perform both phoneme
detection and syllable detection on the basis of equally complete analyses of
the input, phoneme-detection times were faster than syllable-detection times.
Thus there is a substantial body of opinion favoring phonemic units as
processing units.
On the other hand, there are other psycholinguists who have been
reluctant even to consider the phoneme as a candidate for a unit of sublexical
representation in production and perception because of the variability
problem. To what degree can it be said that acoustic cues to phonemes
possess constant, invariant properties which are necessarily present whenever
the phoneme is uttered? If there are no such invariant cues, they have argued,
how can a phonemic segmentation process possibly contribute to processing
efficiency, since surely it would simply prove enormously difficult in its own
right? Moreover, at the phoneme level the variability discussed above is
further compounded by coarticulation, which makes a phoneme's spoken
form sensitive to the surrounding phonetic context-and the context in
question is not limited to immediately adjacent segments, but can include
several segments both forwards and backwards in the utterance. This has all
added up to what seemed to some researchers like insuperable problems for
phonemic representations.
As alternatives, both units above the phonemic level, such as syllables,
demisyllables, or diphones, and those below it, such as featural represen-
tations or spectral templates, have been proposed. (In general, though,
nonlinguistic units such as diphones or demisyllables have only been
proposed by researchers who are concerned more with machine implemen-
tation than with psychological modeling. An exception is Samuel's [1989]
recent defense of the demisyllable in speech perception.) The most popular
alternative unit has been the syllable (Huggins 1964; Massaro 1972; Mehler
1981; Segui 1984), and there is a good deal of experimental evidence in its
favor. Moreover, this evidence is very similar to the evidence which appar-
ently favors the phoneme; thus language games often use syllabically defined
rules (Scherzer 1982), slips of the tongue are sensitive to syllabic constraints
(MacKay 1972), and on-line studies of speech recognition have shown that
listeners can divide speech into syllables (Mehler et al. 1981).
Thus there is no unanimity at all in the psycholinguistic literature, with
some researchers favoring phonemic representations and some syllabic
representations, while others (e.g. Crompton 1982) favor a combination of
both, and yet others (e.g. Samuel 1989) opt for some more esoteric alterna-
tive. A consensus may be reached only upon the lack of consensus. Recent
developments in the field, moreover, have served only to sow further
confusion. It turns out that intermediate levels of representation in speech
perception can be language-specific, as has been shown by experiments
following up the finding of Mehler et al. (1981) that listeners divide speech up
into syllables as they hear it. Mehler et al.'s study was carried out in French;
in English, as Cutler et al. (1986) subsequently found, its results proved
unreplicable. Cutler et al. pointed out that syllable boundaries are relatively
clearer in French than in English, and that this difference would make it
inherently more likely that using the syllable as a sublexical representation
would work better in French than in English. However, they discovered in
further experiments that English listeners could not divide speech up into
syllables even when they were listening to French, which apparently encour-
ages such a division; while French listeners even divided English up into
syllables wherever they could, despite the fact that the English language fails
to encourage such division. Thus it appears that the French listeners, having
grown up with a language which encourages syllabic segmentation, learnt to
use the syllable as an intermediate representation, whereas English listeners,
who had grown up with a hard-to-syllabify language, had learnt not to use it.
In other words, speakers' use of intermediate units of representation is
determined by their native language.
The reason that this finding muddied the theoretical waters is, of course,
that it means that the human language-processing system may have available
to it a range of sublexical units of representation. In such a case, there can be
no warrant for claims that any one candidate sublexical representation is
more basic, more "natural" for the human language-processing system, than
any other for which comparable evidence may be found.
What relevance does this have to phonology (in general, and laboratory
phonology in particular)? Rather little, perhaps, and that entirely negative: it
suggests, if anything, that the psychological literature is not going to assist at
all in providing an answer to the question of the segment's theoretical status
in phonology.
This orthogonality of existing psycholinguistic research to phonological
issues should not, in fact, be surprising. Psychology has concluded that while
the units of sublexical representation in language perception and production
must in terms of abstractness and discreteness resemble the segment, they
may be many and varied in nature, and may differ from language community
to language community, and this leaves phonology with no advance at all as
far as the theoretical status of the segment is concerned. But a psychological
laboratory is not, after all, the place to look for answers to phonological
questions. As I have argued elsewhere (Cutler 1987), an experiment can only
properly answer the question it is designed to answer, so studies designed to
answer questions about sublexical representations in speech processing are
ipso facto unlikely to provide answers of relevance to phonology. When the
question is phonological, it is in the phonology laboratory that the answer is
more likely to be found.

12
Trading relations in the perception of stops and their
implications for a phonological theory

LIESELOTTE SCHIEFER

12.1 Introduction

All feature sets used in the description of various languages are either
phonetically or phonemically motivated. The phonetically motivated fea-
tures are based on acoustic, articulatory, or physiological facts, whereas
phonemically motivated features take into account, for example, the compar-
ability of speech sounds with respect to certain phonemic and/or phonotactic
rules. But even if the authors of feature sets agree on the importance of
having phonetically adequate features, they disagree on the selection of
features to be used in the description of a given speech sample.
This is the situation with the description of Hindi stops. Hindi has a
complicated system of four stop classes (voiceless unaspirated, voice-
less aspirated, voiced, and breathy voiced, traditionally called voiced aspir-
ated) in four places of articulation (labial, dental, retroflex, and velar),
plus a full set of four affricates in the palatal region. Since Chomsky
and Halle (1968), several feature sets have been put forward in order
to account for the complexity of the Hindi stop system (Halle and
Stevens 1971; Ladefoged 1971; M. Ohala 1979; Schiefer 1984). The feature
sets proposed by these authors have in common that they make use only of
physiologically motivated features such as "slack vocal cords" (see table
12.1).
In what follows, we concentrate on the features proposed by Ladefoged,
Ohala, and Schiefer. These authors differ not only according to their feature
sets but also with respect to the way these features are applied to the stop
classes. Ladefoged groups together (a) the voiceless unaspirated and voice-
less aspirated stops by assigning them the value "0" of the feature "glottal
stricture," and (b) the breathy voiced and voiceless aspirated stops by giving
them the values "2" of "voice-onset time." But he does not consider voiced
Table 12.1 Feature sets proposed by Chomsky and Halle (1968), Halle and
Stevens (1971), M. Ohala (1979), and Schiefer (1984)

                                        p    ph    b    bh
1 Chomsky and Halle (1968)
  heightened subglottal air pressure
  voice
  tense
2 Halle and Stevens (1971)
  spread glottis
  constricted glottis
  stiff cords
  slack cords
3 Ladefoged (1971)
  glottal stricture                     0    0     2    1
  voice-onset time                      1    2     0    2
                                        p    ph    b    bh    c
4 M. Ohala (1979)
  distinctive release
  delayed release
  voice-onset time                      1    2     0    0
  glottal stricture                     0    0     2    1
  vocal-cord tension
5 Schiefer (1984)
  distinctive release
  delayed release
  voice-onset time                      1    2     0    2
  vocal-cord tension

and breathy voiced stops to form a natural class, since he assigns "0" for the
feature "voice-onset time" to the voiced stop.
Ohala uses four features (plus "delayed release"). The feature "distinctive
release" is used to distinguish the "nonaspirates," voiced and voiceless
unaspirated, from voiceless aspirated, breathy voiced, and the affricates. The
feature "voice-onset time" is used to build a natural class with the voiced and
breathy voiced stops, whereas "glottal stricture" allows her both to build a
class with voiceless unaspirated and voiceless aspirated stops, and to account
for the difference in the mode of vibration between voiced and breathy voiced
stops. Finally, she needs the feature "vocal-cord tension" "to distinguish the
voiced aspirates from all other stop types also," as "It is obvious that where
the voiced aspirate stop in Punjabi has become deaspirated, a low rising tone
on the following vowel has developed" (1979: 80).

Schiefer (1984) used the features "distinctive release," "delayed release,"
and "vocal-cord tension" in the same way as Ohala did, but followed
Ladefoged in the use of the feature "voice-onset time." She rejected the
feature "glottal stricture" on grounds which will be discussed later.
It is apparent that Ohala's analysis is based on phonetic as well as
phonemic arguments. She uses phonetic arguments in order to reject the
feature "heightened subglottal air pressure" of Chomsky and Halle (1968)
and to favor the features "glottal stricture" and "voice-onset time," which
she takes over from Ladefoged (1971). On the other hand, she makes use of a
phonemic argument in favor of the feature "delayed release," as "in Maithili
there is a rule which involves the de-aspiration of aspirates when they occur
in non-utterance-initial syllables, followed by a short vowel and either a
voiceless aspirate or a voiceless fricative" (1979: 80). Moreover, in contrast to
Ladefoged and Schiefer, Ohala assigns "onset of voicing" to the feature
"voice-onset time." This implies that the feature "voice-onset time" is
applied to two different acoustic (and physiological) portions of the breathy
voiced stop. That is, since Ohala treats the breathy voiced stop in the same
way as the voiced one, her feature applies to the prevoicing of the stop,
whereas Ladefoged and Schiefer apply the feature to the release of the stop.
As already mentioned, Ladefoged, Ohala, and Schiefer all use physiologi-
cal features, and their investigations are based on relatively small amounts of
physiological or acoustic data. None of these authors relies on perceptual
results - something which is characteristic of other phonological work as well
(see Anderson 1978). The aim of the present paper is therefore to use both an
acoustic analysis (based on the productions of four informants) and results
from perceptual tests as a source of evidence for specific features. Since I lack
physiological data of my own, I will concentrate especially on the feature
"voice-onset time." In doing so, I start from the following considerations, (a)
If Ohala is right in grouping the voiced and breathy voiced stops together
into one natural class by the feature "voice-onset time," then this phonetic
feature, namely prevoicing, is a necessary feature in the production as well as
in the perception of these stops. Otherwise the function of prevoicing differs
between the two stop classes. (b) If the main acoustic (as well as perceptual)
information about a breathy voiced stop is located in the release portion, the
influence of prevoicing in the perception of this stop should be less
important.

12.2 Acoustic properties of Hindi stops


12.2.1 Material and informants
The material consisted of 150 different Hindi words containing the breathy
voiced stops /bh, dh, dh, gh/ in word-initial position followed by the vowels /a,
e, o, i, u/. Each stop-vowel combination occurred about ten times in the


material. The data were organized in several lists and read in citation form.
The first and last member of each list was excluded from analysis in order to
control for list effects. Due to this procedure and some mispronunciations
(i.e. reading [dh] instead of [gh]) as well as repetitions caused by mistakes, the
number of examples which could be used in the acoustic analysis differed
between subjects. Four native speakers of Hindi served as informants:
S.W.A. (female, 35 years), born in Simla (Himachal Pradesh), raised in Simla
and New Delhi; M.A.N. (female, 23 years), born in Piprai (Uttar Pradesh);
R.P.J. (male, 40 years), born in New Delhi, and P.U.N. (male, 22 years),
born in Mirzapur (Uttar Pradesh). The informants thus come from different
dialect areas of Hindi, but it is usually assumed that those dialects belong to
the western Hindi dialect group (see Mehrota 1980). All informants speak
English and German fluently. Except for S.W.A., who speaks Punjabi as well,
none of the informants reported a thorough knowledge of any other Indian
language. S.W.A. was taped in the sound-proofed room of our institute on a
Telefunken M15 tape recorder using a Neumann U87 microphone. The
recordings of M.A.N., R.P.J., and P.U.N. were made in the language
laboratory of the Centre of German Studies, School of Languages of the
Jawaharlal Nehru University in New Delhi using a Uher Report 4002 and a
Sennheiser microphone. The microphone was placed in front of the inform-
ants at a distance of 50 cm. The material was digitized on a PDP11/50,
filtered at 8 kHz, and segmented into single acoustic portions from which the
duration values were calculated (see for detail Schiefer 1986, 1988, 1989).

12.2.2 Results
Only those results which are relevant for this paper will be presented here;
further details can be found in Schiefer (1986, 1987, 1988, 1989).
Hindi breathy voiced stops can be defined as consisting of three acoustic
portions: the prevoicing during the stop closure ("voicing lead"), the release
of the stop or the burst, and a breathy voiced vowel following the burst,
traditionally called "voiced aspiration." This pattern can be modified under
various conditions: the voicing lead can be missing, and/or the breathy
voiced vowel can be replaced by a voiceless one (or voiceless aspiration).
From this it follows that four different realizations of Hindi breathy voiced
stops occur: the "lead" type, which is the regular one, having voiced during
the stop closure; and the "lag" type, which lacks the voicing lead. Both types
have two different subtypes: the burst can be followed either by a breathy
voiced vowel or by voiceless aspiration, which might be called the "voiceless"
type of breathy voiced stop.
Figure 12.1 displays three typical examples from R.P.J. Figures 12.1a
Figure 12.1 Oscillograms of three realizations of breathy voiced stops from R.P.J.:
(a) /bhogna/, (b) /bhav/, (c) /ghiya/

(/bhogna/) and 12.1b (/bhav/) represent the "lead" type of stop having regular
prevoicing throughout the stop closure, a more (fig. 12.1b) or less
(fig. 12.1a) pronounced burst, and a breathy voiced vowel portion following
the burst. Note that, notwithstanding the difference in the degree of
"aspiration" (to borrow the term preferred by Dixit 1987) between both
examples, the vocal cords remain vibrating throughout. Figure 12.1c (/ghiya/)
gives an example of the "voiceless" type. Here again the vocal cords are
vibrating throughout the stop closure. But unlike the regular "lead" type, the
burst is followed by a period of voiceless instead of voiced aspiration.1
The actual realization of the stop depends on the speaker and on
articulatory facts. The "lag" type of stop is both speaker-dependent and
articulatorily motivated, whereas the "voiceless" type is solely articulatorily
determined. My informants differed extremely with respect to the voicing
1 The results just mentioned caused me in a previous paper (Schiefer 1984) to reject the feature
"glottal stricture," as it is applicable to the regular type of the breathy voiced stop only.

Table 12.2 Occurrence of the voicing lead in breathy
voiced stops (in percent)

a e i o u X

(a) P.U.N.
Labial 100 93 87 82 71 87
Dental 85 — 93 100 94 93
Retroflex 100 100 80 100 100 96
Velar 100 100 100 88 92 96
X 96 97 90 92 89
(b) S.W.A.
Labial 60 78 78 96 70 76
Dental 85 — 100 100 95 95
Retroflex 91 100 93 100 100 96
Velar 93 100 100 90 90 94
X 82 92 93 96 89

lead. Out of four informants only two had overwhelmingly "lead" realiza-
tions: M.A.N., who produced "lead" stops throughout, and R.P.J., in whose
productions the voicing lead was absent only twice. The data for the other
two informants, P.U.N. and S.W.A., are presented in table 12.2 for the
different stop-vowel sequences, as well as the mean values for the places of
articulation and the vowels. The values are rounded to the nearest integer.
P.U.N. and especially S.W.A. show a severe influence of place of articulation
and vowel on the voicing lead, which is omitted especially in the labial place
of articulation by both of them and when followed by the vowel /a/ in
S.W.A.'s productions. This is interesting, as it is usually believed that
prevoicing is most easily sustained in these two phonetic conditions (Ohala
and Riordan 1980). The "voiceless" type usually occurs in the velar place of
articulation and/or before high vowels, especially before /i/ (see for detail
Schiefer 1984).
The average duration values for the acoustic portions for all informants
are given in figure 12.2 for the different places of articulation. It is interesting
to note that, notwithstanding the articulatorily conditioned differences
within the single acoustic portions, the duration of all portions together (i.e.
voicing lead + burst + breathy voiced vowel) differs only minimally, except
the values for /bh/ in S.W.A.'s productions. This points to a tendency in all
subjects to keep the overall duration of all portions the same or nearly the
same in all environments (see for detail Schiefer 1989). As the words were
read in citation form and not within a sentence frame, it is uncertain whether
these results mirror a general articulatory behavior of the subjects or whether
they are artifacts of the recording procedure.
Figure 12.2 Durations of voicing lead (VLD), burst, and breathy voiced vowel (BRED),
by place of articulation (labial, dental, retroflex, velar) for informants S.W.A.,
M.A.N., R.P.J., and P.U.N.; horizontal scale 20-160 msec.

Figure 12.3 (a) Spectrogram of /bhalu/; the beginning of the breathy voiced vowel
portion is marked by a cursor; (b) oscillogram for the vowel immediately following
the cursor position in (a); (c) power spectrum, calculated over approximately
67.5 msec. from the cursor position to the right

Figure 12.4 (a) The beginning of the steady vowel is marked by a cursor;
(b) oscillogram from the cursor to the right; (c) power spectrum calculated over
67.5 msec. of the clear vowel

12.3 Perception of Hindi stops


12.3.1 Experiment 1
12.3.1.1 Material, method, and subjects
A naturally produced /bhalu/ (voicing lead = 79.55 msec., burst = 5.2 msec.,
breathy voiced vowel = 94.00 msec) produced by S.W.A. was selected as
point of departure for the manipulation of the test items. This item was used
for several reasons: (a) the item belongs to the "lead" type, having prevoicing
throughout the stop closure (see fig. 12.3a and b) and a breathy voiced vowel
portion following the stop release; (b) there is only minimal degrading of the
voicing lead towards the end of the closure; (c) the prevoicing is of a sufficient
duration; (d) the breathy voiced portion is long enough to allow the
generation of seven different test items; (e) the degree of "aspiration" is
less; and (f) there is a remarkable difference between the amplitude of
the fundamental and that of the second harmonic in the breathy

Table 12.3 Scheme for the reduction of the breathy voiced portion ("PP" =
pitch period; "—" = pitch periods deleted; "+" = remaining pitch periods)

Stimulus 1 original; PP1-PP18


Stimulus 2 - (PP1, PP8, PP15)
Stimulus 3 - (PP1, PP4, PP8, PP11, PP15, PP18)
Stimulus 4 - (PP1, PP3, PP5, PP7, PP9, PP11, PP13, PP15, PP17)
Stimulus 5 - (PP1, PP2, PP4, PP6, PP7, PP8, PP10, PP12, PP13, PP14, PP16, PP18)
Stimulus 6 + (PP6, PP12, PP18)
Stimulus 7 none

voiced vowel portion (see fig. 12.3c) compared with the steady one (see fig.
12.4c), which is one of the most efficient acoustic features in the perception of
breathy voiced phonation (Bickley 1982; Ladefoged and Antonanzas-Baroso
1985; Schiefer 1988).
The method used for manipulation was that of speech-editing. The first
syllable was separated from the rest of the word in order to avoid uncontroll-
able influences from context. The syllable was delimited before the transition
into /I/ was audible, and cut into single pitch periods. The first pitch period,
which showed clear frication, was defined as "burst" and was not subjected
to manipulation. The breathy voiced portion of the vowel was separated
from the clear vowel portion by inspection of the oscillogram combined with
an auditory check. The boundary between both portions is marked in figure
12.4a. Note that the difference between the amplitude of the fundamental
and that of the second harmonic in the clear vowel is small and thus
resembles that of modal voice (fig. 12.4c). The following portion of eighteen
pitch periods was divided into three parts containing six pitch periods each.
A so-called "basic-continuum," consisting of seven stimuli, was generated by
reducing the duration of the breathy voiced vowel portion in steps of three
pitch periods (approximately 15 msec). The pitch periods were chosen from
the three subportions of the breathy voiced vowel by applying the scheme
shown in table 12.3.
From this basic continuum eight tests were derived, where, in addition to
the manipulation of the breathy voiced portion, the duration of the voicing
lead was reduced by approximately 10 msec. (two pitch periods) each. Tests
1-8 thus covered the range of 79.55 to 0 msec. voicing lead.
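A minimal sketch of this editing scheme is given below, assuming the recording has
already been cut into one sample array per pitch period and that the voicing lead
is shortened from its onset; both implementation details, like all the names used,
are assumptions rather than part of the original procedure.

    import numpy as np

    # Pitch periods of the breathy voiced portion deleted for each stimulus of
    # the basic continuum (1-based indices PP1-PP18, following table 12.3).
    DELETED = {
        1: set(),                                    # original, PP1-PP18 kept
        2: {1, 8, 15},
        3: {1, 4, 8, 11, 15, 18},
        4: {1, 3, 5, 7, 9, 11, 13, 15, 17},
        5: {1, 2, 4, 6, 7, 8, 10, 12, 13, 14, 16, 18},
        6: set(range(1, 19)) - {6, 12, 18},          # only PP6, PP12, PP18 remain
        7: set(range(1, 19)),                        # breathy portion removed entirely
    }

    def make_stimulus(lead_pps, burst, breathy_pps, clear_vowel,
                      stimulus_no, lead_pps_kept):
        """Assemble one test item: shortened voicing lead + burst + reduced
        breathy voiced portion + clear vowel.  lead_pps and breathy_pps are
        lists of 1-D sample arrays (one per pitch period); burst and
        clear_vowel are 1-D sample arrays."""
        lead = lead_pps[-lead_pps_kept:] if lead_pps_kept > 0 else []
        breathy = [pp for i, pp in enumerate(breathy_pps, start=1)
                   if i not in DELETED[stimulus_no]]
        return np.concatenate(list(lead) + [burst] + breathy + [clear_vowel])

Each of the eight tests would then fix lead_pps_kept (dropping roughly two pitch
periods, about 10 msec., from test to test) and run stimulus_no over 1-7.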
The continua were tested in an identification task in which every stimulus
was repeated five times and presented in randomized order. All stimuli were
followed by a pause of 3.5 sec. while blocks of ten stimuli were separated by
10 sec. Answer sheets (in ordinary Hindi script) required listeners to choose
whether each stimulus was voiceless unaspirated, voiceless aspirated, voiced,
or breathy voiced (forced choice).
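The presentation schedule can likewise be sketched in a few lines; the function
below simply restates the procedure (five randomized repetitions per stimulus,
3.5 sec. after each stimulus, 10 sec. after every block of ten), and its name and
structure are invented for the sketch.

    import random

    def presentation_schedule(stimuli, repetitions=5, block_size=10,
                              pause=3.5, block_pause=10.0, seed=None):
        """Return a randomized list of (stimulus, pause_in_seconds) pairs: each
        stimulus repeated `repetitions` times, with `pause` sec. after every
        stimulus and `block_pause` sec. after each block of `block_size` items."""
        rng = random.Random(seed)
        order = [s for s in stimuli for _ in range(repetitions)]
        rng.shuffle(order)
        return [(stim, block_pause if i % block_size == 0 else pause)
                for i, stim in enumerate(order, start=1)]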

All tests were carried out in the Telefunken language laboratory of the
Centre of German Studies, School of Languages, of the Jawaharlal Nehru
University in New Delhi and were presented over headphones at a comfor-
table level. The twelve subjects were either students or staff members of the
Centre. They were paid for their participation (see for details Schiefer 1986).

12.3.1.2 Results
The results are plotted separately for the different response categories in
figures 12.5-12.8. The ordinate displays the identification ratios in percent,
the abscissa the duration of burst and breathy voiced vowel portion in
milliseconds for the single stimuli. Figure 12.5 shows that stimuli 1—4 elicit
breathy voiced responses in tests 1-5 (voicing lead (VLD) = 79.55 msec, to
32.9 msec). In test 6 (VLD = 22.5 msec.) only the first three stimuli of the
continuum elicited breathy voiced answers, whereas in tests 7 (VLD = 10.4
msec.) and 8 (no VLD) none of the stimuli was identified unambiguously as
breathy voiced. Thus the shortening of the voicing lead does not affect the
breathy voiced responses until the duration of this portion drops below
20 msec, in duration.
Stimuli 5-7 of tests 1-4 (VLD = 79.55 msec. to 41.9 msec.) were judged as
voiced (see fig. 12.6). In test 5 (VLD = 32.9 msec), on the other hand, only
stimulus 6 was unambiguously perceived as voiced, the responses to stimulus
7 were at chance level. In tests 6-8 no stimulus was perceived as voiced. This
means that the lower limit for the perception of a voiced stop lies at 32.8
msec voicing lead. If the duration of the voicing lead drops below that value,
voiceless unaspirated responses are given, as shown in figure 12.7.
In comparing the perceptual results for the voiced and breathy voiced
category it is obvious that voiced responses require a longer prevoicing
than breathy voiced ones. The shortening of both portions, voicing lead and
breathy voiced, leads to the perception of a voiceless aspirated stop (fig.
12.8).
The perception of voiceless aspirated stops is the most interesting outcome
of this experiment. One may argue that the perception of stop categories (like
voiced and aspirated) simply depends on the perceptibility of the voicing lead
and the amount of frication noise following the release. If this were true,
voiceless aspirated responses should have been given to stimuli 1-4 in tests
6-8, since in these tests the stimuli with a short breathy voiced portion,
stimuli 5-7, were judged as voiceless unaspirated, which implies that the
voicing lead was not perceptible. But it is obvious that at least in test 6
breathy voiced instead of voiceless aspirated responses were elicited. In all
tests, the voiceless aspirated function reaches its maximum in the center of
the continuum (stimulus 4), not at the beginning.
On the other hand it seems that the perception of a voiceless aspirated stop
Figure 12.5 Percent breathy responses from experiment 1 (separate curves for
VLD = 79.55, 58.65, 50.80, 41.90, 32.90, 22.50, 10.40, and 0 msec.)

Figure 12.6 Percent voiced responses from experiment 1 (separate curves for
VLD = 79.55, 58.65, 50.80, 41.90, 32.90, 22.50, 10.40, and 0 msec.)

cannot be explained by the acoustic content of the stimulus itself. In order
to achieve a shortening of the breathy voiced portion the eliminated
pitch periods were taken from different parts within the breathy voiced
vowel portion. This means that the degree of frication, which is highest
immediately after the oral release, degrades as the breathy voiced portion
shortens.
Figure 12.7 Percent voiceless unaspirated responses from experiment 1 (separate
curves for VLD = 79.55, 58.65, 50.80, 41.90, 32.90, 22.50, 10.40, and 0 msec.)

Figure 12.8 Percent voiceless aspirated responses from experiment 1 (separate
curves for VLD = 79.55, 58.65, 50.80, 41.90, 32.90, 22.50, 10.40, and 0 msec.)

12.3.2 Experiment 2
In a second experiment we tried to replicate the results of the first one by
using a rather irregular example of a bilabial breathy voiced stop for
manipulation. The original stimulus (/bhola/) consisted of voicing lead (92.75
msec.), a burst (11.1 msec.), a period of voiceless aspiration (21.9 msec.), and
a breathy voiced portion (119.9 msec) followed by the clear vowel. It should
be mentioned that the degree of aspiration in the breathy voiced portion was
less than in the first experiment. The test stimuli were derived from the
original one by deleting the period of voiceless aspiration for all stimuli and
otherwise following the same procedure as described in experiment 1. Thus, 8
test stimuli were obtained, forming the basic continuum. A series of five tests
was generated from the basic continuum by reducing the duration of the
voicing lead from 92.75 msec (test 1) to 37.4 msec (test 2), and then
eliminating two pitch periods each for the remaining three tests. The same
testing procedure was applied as in experiment 1.
The results for tests 1-3 resemble those of experiment 1. The continuum is
divided into two categories: breathy voiced responses are given to stimuli 1-5,
and voiced ones to stimuli 6-8 (see fig. 12.9). Only stimulus 8 cannot be
unambiguously assigned to the voiced or voiceless unaspirated category in test
3. Tests 4 and 5 produced comparable results to tests 7 and 8 of experiment 1:
there is an increase in voiceless aspirated responses for stimuli 4—6 (see fig.
12.10).
In comparing the results from the two experiments, it is obvious that the
main difference between them concerns stimuli 1-3 in those tests in which the
duration of the voicing lead drops below about 20 msec (tests 7 and 8 in
experiment 1 and tests 4 and 5 in experiment 2). Whereas in experiment 1 these
stimuli are ambiguous, they are clearly identified as breathy voiced in
experiment 2. This result is directly connected with the "acoustic content" of
the stimuli, i.e. the greater amount of "aspiration" in experiment 1 and a lesser
one in experiment 2. On the other hand the experiments are comparable as to
the rise of voiceless aspirated answers in the center of the continuum, which
cannot be explained by the acoustic content, as the degree of aspiration
degrades with the reduction of the breathy voiced portion. This result can be
explained when the duration of the unshortened breathy voiced portion is
taken into account: it appears that it exceeds that of the first experiment by
about 40 msec.
12.4 Discussion

12.4.1 Summary of findings


The acoustic analysis of Hindi breathy voiced stops has revealed two main
findings. First, the realization of the stop release depends to a high degree on
articulatory constraints such as velar place of articulation or high vowel.
Second, the realization of this stop category depends on subject-specific and on
articulatory facts. Two out of four subjects produced this stop class with
prevoicing (only two exceptions), whereas in the data of P.U.N. and S.W.A.

Figure 12.9 Percent breathy responses from experiment 2 (separate curves for
VLD = 92.75, 37.4, 22.15, 12.35, and 0 msec.)

Figure 12.10 Percent voiceless aspirated responses from experiment 2 (separate
curves for VLD = 92.75, 37.4, 22.15, 12.35, and 0 msec.; abscissa: stimuli)

the prevoicing is missing especially in the labial place of articulation and
before the vowel /a/ in up to 40 percent of the productions.2 These acoustic

2 Comparable results are reported by Poon and Mateer (1985) for Nepali, another Indo-Aryan
language; seven out of ten informants lacked the prevoicing. Langmeier et al. (1987) found in
the data of one speaker of Gujarati (from Ahmedabad) that prevoicing was absent from about
50 percent of the productions.


results neither support nor refute M. Ohala's (1979) view that prevoicing is a
relevant feature of Hindi breathy voiced stops.
The results from the perception tests are even more difficult to interpret.
Several outcomes have to be discussed in detail.
There is a clear tendency to divide the continua into two stop classes,
breathy voiced and voiced, if the prevoicing is of sufficient duration. If the
duration drops below 32.9 msec, (experiment 1) and the breathy voiced
portion is reduced to about 40 msec, a voiceless unaspirated instead of a
voiced stop is perceived. On the other hand, a breathy voiced stop is heard
as long as the duration of the voicing lead does not drop below 22.5 msec,
and that of the breathy voiced vowel does not become shorter than about
30-40 msec. Otherwise, a voiceless aspirated stop is perceived. These results
point to some main differences in the perception of voiced and breathy
voiced stops: (a) Hindi stops are perceived as voiced only if they are pro-
duced with voicing lead of a sufficient duration (about 30 msec). There are
no trading relations between the voicing lead and the duration of the
breathy voiced vowel portion. (b) Breathy voiced stops are perceived even if
the duration of the voicing lead approaches the threshold of percepti-
bility, as can be concluded from the perception of either voiced or voice-
less aspirated stops. If the voicing lead is eliminated totally the responses
to the stimuli depend on the duration of the breathy voiced portion: if it
is of moderate duration only (79.55 msec, experiment 1) the first two
stimuli of the continuum cannot be unambiguously identified; if it is long,
as in experiment 2 (131 msec), the stimuli are judged as breathy voiced.
When both the breathy voiced vowel and the voicing lead are short, we
find voiceless aspirated responses. These results provide evidence that
the voicing lead is less important in the perception of breathy voiced
stops than of voiced ones, and that trading relations exist between
the duration of the voicing lead and that of the breathy voiced vowel
portion.
Thus the perception of a breathy voiced stop does not depend solely
either on the perceptibility of the voicing lead or on the duration of the
breathy voiced vowel portion. It is rather subject to the overall duration
of voicing lead + burst + breathy voiced vowel portion. If the duration
of this portion drops below a given value the perceived stop category
changes.
From these puzzling results we can conclude that neither the perception of
breathy voiced stops nor of voiceless aspirated stops can be explained solely
by the acoustic content of the stimuli. Listeners seem to perceive the
underlying glottal gesture of the stimuli, which they clearly separate from a
devoicing gesture: they respond with breathy voiced if the gesture exceeds
about 60 msec, whereas they respond with voiceless aspirated if the duration
drops below that limit. This hypothesis is directly supported by results from
physiological investigations, where it was shown that the duration of the
glottal gesture for a breathy voiced stop is about double that of a voiceless
aspirated one (Benguerel and Bhatia 1980).
Finally, let us turn to the interpretation of the feature "voice-onset time." It
must be stated that our results support neither Ohala's concept nor that of
Ladefoged and Schiefer, since what is important in the perception of breathy
voiced and voiceless aspirated stops is not the voicing lead by itself but the
trading relation with the breathy voiced portion. On the other hand, from our
results it can be concluded that the Hindi stops form three natural classes. (a)
The voiced and voiceless unaspirated stops form one class, as they are
perceived only if the duration of burst and breathy voiced portion is shorter
than 30-40 msec., i.e. if the burst is immediately followed by a regularly voiced
vowel. This result is comparable with those from voice-onset-time exper-
iments, which showed that the duration of the voicing lag in voiceless
unaspirated stops rarely exceeds 30 msec. (b) Breathy voiced and voiceless
aspirated stops form one class according to the release portion of the stop,
which is either a breathy voiced or voiceless vowel, whose duration has to be
longer than about 44 msec. in breathy voiced stops and has to be 32-65 msec. in
voiceless aspirated stops. Both stops show trading relations between the
voicing lead and the breathy voiced portion. (c) Voiceless unaspirated and
voiceless aspirated stops can be grouped together with regard to their voicing
lead duration which has to be shorter than 32.9 msec. in both stops. (d)
Obviously, voiced and breathy voiced stops do not form a natural class: voiced
stops need a longer voicing lead than breathy voiced ones (32.9 vs. 22.5 msec.)
and voiced stops do not show any trading relations between voicing lead and
the breathy voiced portion, whereas breathy voiced stops do.
All the results of the experiments can be summarized in two main points:
we must distinguish between two acoustic portions, the stop closure and the
stop release; and the perception of stops depends on the duration of the
whole glottal gesture underlying the production of the stop.
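To make these points concrete, the approximate durational cut-offs reported above
can be restated as a toy decision rule; the thresholds are simply the figures from
experiment 1, and the rule is an illustration rather than a general perceptual model.

    def perceived_category(voicing_lead, burst, breathy_portion):
        """Toy restatement of the approximate thresholds (in msec.) from experiment 1."""
        long_release = (burst + breathy_portion) > 40     # roughly the 30-40 msec. limit
        if voicing_lead >= 32.9:                          # lead long enough for a voiced percept
            return "breathy voiced" if long_release else "voiced"
        if voicing_lead >= 22.5:                          # lead still long enough for breathy voiced
            return "breathy voiced" if long_release else "voiceless unaspirated"
        return "voiceless aspirated" if long_release else "voiceless unaspirated"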

12.4.2 Feature representation


Let us now consider the representation of the Hindi stop system within the
framework of distinctive features. Since we consider both stop portions,
closure and release, as important for the perception of the stops we have to
assign one (multivalued) feature to each phase (see table 12.4). We propose
the feature "lead onset time" to account for the differences in the stop closure
and "onset of regular voicing" for the differences in the stop release.
According to the feature "lead onset time" we assign "0" to the voiceless
unaspirated and voiceless aspirated stops, "2" to the voiced and "1" to the
Table 12.4 Feature specification proposed for the
description of Hindi stops

p ph b bh
Lead onset time 0 0 2 1
Onset of regular voicing 0 2 0 2

breathy voiced stop. In assigning "1" to the breathy voiced stop we take
account of the fact that the voicing lead of breathy voiced stops is shorter
than that of voiced ones (cf. Schiefer 1988) or may even be missing altogether
and that the voicing lead is less important in the perception of breathy voiced
stops than in voiced ones. The way we apply the feature "onset of regular
voicing" to the stops differs from Ladefoged, Ohala, or Schiefer (1984), as we
now group together the voiceless unaspirated and the voiced stops by
assigning them the same value, namely "0". In doing this we account for the
similarity in the perception of both stops: they are perceived if the duration of
the breathy voiced portion is shorter than about 30 msec. In assigning "2" of
the feature "onset of regular voicing" to the voiceless aspirated and breathy
voiced stops we take into consideration the similarity of these stops in
the perception experiments. On the other hand, we do not specify the
acoustic nature of this portion. Thus, this portion may be characterized
by either a voiceless or a breathy voiced vowel portion in the case of the
breathy voiced stop, i.e. may represent a regular or a "voiceless" type of the
breathy voiced stops. Therefore, our feature specification is not restricted
to the representation of the regular type of the stop but is applicable to
all stop types mentioned above. The present feature set allows us to
group the stops together in natural classes by (a) assigning the same value
of the feature "lead onset time," " 0 , " to the voiceless unaspirated and
voiceless aspirated tops; (b) assigning the same value, " 0 , " to the voice-
less unaspirated and voiced stops; and (c) assigning " 2 " of the feature
"onset of regular voicing" to the voiceless aspirated and breathy voiced
stops.
In summary: we have used the results of acoustical and perceptual
analysis, as well as the comparison of these results, to set up a feature
specification for the Hindi stops which is based on purely phonetic grounds.
Thus we are able to avoid a mixture of phonetic and phonemic arguments or
to rely on evidence from other languages. In particular, we have shown that
the addition of perceptual results helps us to form natural stop classes which
seem to be psychologically real. It is to be hoped that these results will
encourage phonologists not only to rely on articulatory, acoustic, and
physiological facts and experiments but to take into account perceptual
results as well.

Comments on chapter 12
ELISABETH SELKIRK
The Hindi stop series includes voiceless and voiced stops, e.g. /p/, /b/, and
what have been typically described (see e.g. Bloomfield 1933) as the aspirated
counterparts of these, /ph/ and /bh/. The transcription of the fourway contrast
in the Indo-Aryan stop system, as well as the descriptive terminology
accompanying it, amounts to a "naive" theory of the phonological features
involved in representing the distinctions, and hence of the natural class into
which the sounds fall. The presence/absence of the diacritic h can be taken to
stand for the presence/absence of a feature of "aspiration," and the b/p
contrast taken to indicate the presence/absence of a feature of "voice." In
other words, according to the naive theory there are two features
which cross-classify and together represent the distinctions among the Hindi
stops, as shown in (1).
(1) Naive theory of the Hindi stop features
                 /p/    /ph/    /b/    /bh/
"Aspiration"      —      +       —      +
"Voice"           —      —       +      +
For clarity's sake I have used " + " to indicate a positive specification for the
features and " — " to indicate the absence of a positive specification, though
the naive theory presumably makes no commitment on the means of
representing the binary contrast for each feature.
The object of Schiefer's study is to provide acoustic and perceptual
evidence bearing on the featural classification of the Hindi stops. Schiefer
interprets "aspiration" as a delay in the onset of normal voicing in vowels,
joining Catford (1977) in this characterization of the feature. "Voice" for
Schiefer corresponds to a glottal gesture (see Browman and Goldstein 1986)
producing voicing which typically precedes the release of the stop, and which
may carry over into the postrelease "aspirated" portion in the case of voiced
aspirated stops. The name given to the feature - "lead onset time"-is
perhaps unfortunate in failing to indicate that what is at issue is the phasing
of the "voice" gesture before, through, and after the release of the stop.
Indeed, the most interesting result of Schiefer's study concerns the phasing of
this voicing gesture. In plain unaspirated voiced stops voicing always

313
Segment
precedes the stop release. In aspirated voiced stops, by contrast, there is a
considerable variability amongst speakers in whether voicing precedes the
release at all, and by how much, and variability too in whether voicing
continues through to the end of the postrelease aspirated portion of the
sound. Schiefer observes a tendency in all subjects to keep constant the
overall duration of the voicing (prevoicing, burst, and postvoicing) in the
voiced aspirates, while the point of onset of the voicing before the stop burst
might vary. This trading relation between prevoicing and postvoicing plays a
role in perception as well: the overall length of the voiced period, indepen-
dent of the precise timing of its realization with respect to the burst, appears
to be the relevant factor in the perception of voiced aspirate stops. Schiefer
proposes, therefore, that prevoicing and postvoicing are part of the same
gesture, in the Browman and Goldstein sense. Thus, while both the voiced
aspirated and plain voiced (unaspirated) stops have in common the presence
of this voicing gesture, named by the feature "lead onset time," they are
distinguished in the details of the timing, or phasing, of the gesture with
respect to the release.
Ladefoged (1971: 13) took issue with the naive theory of Hindi stops:
"when one uses a term such as voiced aspirated, one is using neither the term
voiced nor the term aspirated in the same way as in the descriptions of the
other stops." For Ladefoged "murmured [voiced aspirated] stops represent a
third possible state of the vocal cords." Schiefer's study can be taken as a
refutation of Ladefoged, confirming the naive theory's assumption that
just two features, perfectly cross-classifying, characterize the Hindi stops.
For Schiefer the "different mode of vibration" Ladefoged claims for the
voiced aspirate stops would simply be the consequence of the different
possibilities of phasing of the voicing gesture in the voiced aspirates. The
different phasing results in (a) shorter or nonexistent prevoicing times
(thereby creating a phonetic "contrast" with plain voiced stops), and (b)
the penetration of voicing into the postrelease aspirated period (there-
by creating a phonetic "contrast" with the aspiration of voiceless
aspirated stops). These phonetic "contrasts" are a matter of detail in the
phonetic implementation of the voicing gesture, and do not motivate
postulating yet another feature to represent the distinctions among these
sounds.
The chart in (2) gives the Schiefer theory of the featural classification of
Hindi stops:

(2) Schiefer's theory of the Hindi stops


                              /p/    /ph/    /b/    /bh/
"Onset of regular voicing"     0      2       0      2
"Lead onset time"              0      0       2      1

The different values for "lead onset time" given by Schiefer to /b/ and /bh/ -
" 2 " and " 1 , " respectively - are justified by her on the basis of (a) the phasing
difference in the realization of the voicing gesture in the two cases, and (b) the
fact that voicing lead is less important in the perception of voiced aspirates
than it is with plain voiced stops. Schiefer seems to imply that the represen-
tation in (2) is appropriate as a phonological representation of the contrasts
in the Hindi stop series. This is arguably a mistake. I would like to suggest
that (2) should be construed as the representation of the phonetic realization
of the contrasts in the Hindi stop series, and that the chart in (3), a revision of
the naive theory (1) in accordance with Schiefer's characterization of the
features "aspiration" and "voice," provides the appropriate phonological
representation of the contrasts:

(3) An alternative theory of the Hindi stop series


                              /p/    /ph/    /b/    /bh/
"Onset of regular voicing"     —      +       —      +
"Lead onset time"              —      —       +      +

My proposal is that the phonological representation (3) is phonetically
implemented as something like (2).
The representation in (2) presupposes that the features involved are n-ary
valued. This n-ariness may make sense in a phonetic representation. In the
phonetic implementation of "lead onset time," for example, it is indeed
necessary to specify the difference in phasing of the feature in the two
different types of voiced stops. This could be represented by means of an n-
ary scale. But there are good reasons why (2) does not hold up as a
representation of phonemic contrasts. Phonological considerations give no
basis for making anything more than a binary distinction in the voicing
dimension: either the voicing gesture is there, or it is not.
Consider first the fact that in Hindi the difference in the phasing of the
voicing gesture is not itself contrastive, i.e. plays no role in the phonology.
Indeed, which value - 1 or 2 - a voiced sound bears for the feature lead onset
time is predictable. It is entirely a function of what the specification of the
sound is for the feature for aspiration. Given the assumption that a
phonological representation contains only feature specifications that are
contrastive, i.e. only ones which allow for phonemic contrasts in the
language, the 1/2 distinction in "lead onset time" cannot be phonological. In
other words, " 1 " and " 2 " are simply allophonic variants of a same
underlying feature specification, which is a positive specification for the
presence of the glottal voicing gesture.
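
Selkirk's predictability point can be put in a single line: for a voiced stop, the phonetic value of "lead onset time" follows entirely from its aspiration specification. A minimal sketch (hypothetical; the numbers simply mirror chart (2)):

# Hypothetical sketch: the n-ary "lead onset time" value of a voiced stop is
# fully determined by its binary aspiration specification (i.e. it is allophonic).
def lead_onset_time(voiced, aspirated):
    if not voiced:
        return 0                     # /p/ and /ph/
    return 1 if aspirated else 2     # /bh/ vs. /b/

assert lead_onset_time(voiced=True, aspirated=False) == 2   # /b/
assert lead_onset_time(voiced=True, aspirated=True) == 1    # /bh/
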
Banning the "1" vs. "2" specification for the voicing feature from
underlying phonological representation appears to be supported by more
general phonological considerations as well. The two candidates for the
phonological representation of Hindi stops, (2) and (3), presuppose two
different general theories of the features for "voicing" and "aspiration." The
two different theories make radically different predictions about what sorts
of stop systems might exist among the world's languages. The naive theory
(revised) makes the claim that no more than two phonemic distinctions are
ever made on the basis of voice (or lead onset time), namely "voiced"/
"voiceless," and that no more than two phonemic distinctions can be made
on the basis of aspiration (or delay in onset of regular voicing), namely
"aspirated"/"unaspirated." Thus, according to the theory presupposed by
(3), Hindi and the other Indo-Aryan languages exhaust the contrastive
possibilities offered by the two features. The n-ary theory that is presupposed
if (2) is understood as a phonological representation makes no such
restrictive claims. As long as the specifications 0, 1, 2, etc. are taken to
be phonological, it is predicted that languages could display a far greater
range of phonemic distinctions based on "lead onset time" and "delay
in regular voicing" than is seen in Hindi. This prediction seems not to be
borne out. A system like Hindi's is rare enough, and it is unlikely that
languages will be found that exploit a further array of distinctions based
on these phonetic dimensions alone. It is probably not premature to rule
out the n-ary theory of these two features on the grounds that it fails to
make sufficiently restrictive predictions about possible sound systems in
language.
The ability to characterize narrowly the possible systems of contrast across
the world's languages provides one criterion for evaluating alternative
phonological feature theories. Another criterion for a feature theory is that it
be able to capture generalizations about the phonological processes found in
languages. If (3) is the correct representation of the voiced aspirates in Hindi
and other languages, then there are two sorts of predictions made about how
these stops should "behave" in the sound patterning of the language. First, it
is predicted that voiced aspirates and plain voiced sounds should behave as a
natural class with respect to any rule manipulating simply the feature voice,
and it is predicted that voiced and voiceless aspirates should behave as a
natural class with respect to any rule manipulating aspiration. Hindi appears
not to exhibit any such voice- or aspiration-manipulating processes, and so
fails to provide relevant evidence with respect to this prediction (see M.
Ohala 1983). But the closely related Marathi has a word-final deaspiration
rule, and it affects both the voiceless and voiced aspirates, leaving them as
plain voiceless and voiced stops, respectively (Houlihan and Iverson 1979;
Alan Prince, p.c. - both reports are based on fieldwork, not on instrumental
studies). According to the revised naive theory, the fourway phonemic
contrast in Marathi would have the phonological representation (3). The two
types of sounds deaspirated are identified by their common feature specifica-
tion [+ (delayed) onset of regular voicing], and that specification is elimi-
nated (or changed to " —") in the operation of the deaspiration rule. The
claim made by the theory in (3) is that such a deaspiration rule will
necessarily treat voiced and voiceless aspirates as a natural class regardless of
the details of the phonetic realization of the aspiration gesture in the
language.
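
On the binary representation in (3), a deaspiration rule of the Marathi type is a single feature change, and it treats /ph/ and /bh/ as a class automatically. A schematic rendering (hypothetical code; the Boolean values transcribe chart (3)):

# Schematic word-final deaspiration over the binary representation in (3).
STOPS = {
    "p":  {"onset of regular voicing": False, "lead onset time": False},
    "ph": {"onset of regular voicing": True,  "lead onset time": False},
    "b":  {"onset of regular voicing": False, "lead onset time": True},
    "bh": {"onset of regular voicing": True,  "lead onset time": True},
}

def deaspirate(segment):
    feats = dict(STOPS[segment])
    feats["onset of regular voicing"] = False      # [-onset of regular voicing]
    # The output is whichever stop now matches the resulting feature bundle.
    return next(s for s, f in STOPS.items() if f == feats)

assert deaspirate("ph") == "p"
assert deaspirate("bh") == "b"   # absolute neutralization with plain /b/
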
There is a second difference in the predictions about phonological pro-
cesses made by the n-ary and binary theories of voicing in Hindi and other
Indo-Aryan languages. The theory in (3) predicts that if voiced aspirates are
deaspirated in some language, they will always be realized as phonetically
identical to plain voiced sounds, regardless of whether or not in that
language the voicing feature has a different phonetic realization in the two
types of voiced sounds. No such prediction is made if a representation like (2)
is taken as the phonological representation. Indeed, according to this latter
theory, if such a deaspiration were to take place in a language with phonetic
properties similar to Hindi's, it would be predicted that the deaspirated /bh/ would
be realized by a [b¹], distinct from the [b²] realizing the plain voiced one /b/.
(The superscripts 1 and 2 correspond to the putative phonological feature
values for "lead onset time" that would be predicted to remain unmodified
by the deaspiration.) Sanskrit has a rule which deaspirates aspirated
segments in reduplicated prefixes. The deaspirated voiced sounds
behave just like underlying plain voiced sounds with respect to further rules
of the phonology, in particular the interword rules of external sandhi. This is
predicted by the naive theory. Indeed, the naive theory predicts there should
always be such an absolute neutralization under deaspiration. The n-ary
theory makes no such prediction. Rather, it predicts that the deaspirated
voiced aspirate could display phonological and phonetic behavior distinct
from that of the plain voiced stop. Cross-linguistic phonological evidence of
the sort outlined above has yet to be systematically accumulated, and so it is
not at this point possible to say for sure whether it is the n-ary or the binary
theory of the features for "voice" and "aspiration" which makes the right
predictions. What is important is that it is clear what sorts of evidence would
be relevant to deciding the case. And the evidence comes from the phonologi-
cal treatment of voiced aspirates in the various languages, not from the
phonetics laboratory.
Phonological feature theory has always embraced the notion that indivi-
dual phonetic features are grounded in phonetic reality, and has looked to
phonetics for confirmation of hypotheses about the nature of particular
phonological features. Schiefer's study forms part of this phonetics-phono-
logy give and take. Her results on the nature of voiced aspirates provide the
phonetic basis for assuming, as the naive phonological theory in (1) has done,
that there are just two features involved in characterizing the fourway
contrast /p, ph, b, bh/ in Hindi and the other Indo-Aryan languages. But
phonological feature theory has not looked to phonetics for an answer to the
question of whether features are monovalent, binary, or n-ary. This is a
question that cannot be answered without examining the workings of
phonological systems, along the lines that have been suggested above. The
phonetic dimensions corresponding to phonological features are typically
gradient, not quantal. It is in phonology that one finds evidence of the
manner in which these dimensions are quantized, and so it is phonology that
must tell us how to assign values to phonological features.

Section C
Prosody

13
An introduction to intonational phonology

D. ROBERT LADD

13.1 Introduction
The assumption of phonological structure is so deeply embedded in instru-
mental phonetics that it is easy to overlook the ways in which it directs our
investigations. Imagine a study of "acoustic cues to negation" in which it is
concluded, by a comparison of spectrographic analyses of negative and
corresponding affirmative utterances, that the occurrence of nasalized for-
mants shows a statistically significant association with the expression of
negation in many European languages. It is quite conceivable that such data
could be extracted from an instrumental study, but it is most unlikely that
anyone's interpretation of such a study would resemble the summary
statement just given. Nasalized formants are acoustic cues to nasal seg-
ments-such as those that happen to occur in negative words like not,
nothing, never (or non, niente, mai or n'e, n'ikto, n'ikogda, etc.)-rather than
direct signals of meanings like "negation." The relevance of a phonological
level of description - an abstraction that mediates between meaningful units
and acoustic/articulatory parameters - is taken for granted in any useful
interpretation of instrumental findings.
Until recently, the same sort of abstraction has been all but absent from
instrumental work on intonation. Studies directly analogous to the hypothe-
tical example have dominated the experimental literature, and the expression
"intonational phonology" is likely to strike many readers as perverse or
contradictory. In the last fifteen years or so, however, a body of theoretical
work has developed, empirically grounded (in however preliminary a fash-
ion) on instrumental data, that gives a clear meaning to this term. That is the
work reviewed here.1
1
The seminal work in the tradition reviewed here is Pierrehumbert (1980), though it has
important antecedents in Liberman (1975) and especially Bruce (1977). Relevant work since
1980 includes Ladd (1983a), Liberman and Pierrehumbert (1984), Gussenhoven (1984),
Pierrehumbert and Beckman (1988), and Ladd (1990).


13.2 Linear structure


The most important tenet of the new phonological descriptions is that
fundamental frequency (Fo) is best understood as a sequence of discrete
phonological events, rather than as a continuously varying contour character-
izable by overall shape and direction. Obviously, in some very direct phonetic
sense Fo "is" a continuously varying contour (although even that statement
abstracts away from gaps associated with voicelessness) - but at that level of
abstraction the same is true of, say, the second formant. What is required -
for both Fo and F2 - is a further abstraction, in which a phonological string
(of "segments," "tones," etc.) can be mapped onto a sequence of phonetic
targets. The continuously varying contour emerges when the phonetic targets
are joined up by "low-level" transitions.2
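
As a deliberately oversimplified illustration of the target-and-transition idea, the sketch below builds a contour from a handful of targets by linear interpolation; the times and Hz values are invented, and the same machinery could serve F2 as well as Fo.

# Minimal target-and-transition sketch: sparse (time, value) targets, with
# linear interpolation supplying the "low-level" transitions in between.
def value_at(t, targets):
    """Interpolate a value at time t (sec.) from a sorted list of (time, value) targets."""
    t0, v0 = targets[0]
    if t <= t0:
        return v0
    for t1, v1 in targets[1:]:
        if t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        t0, v0 = t1, v1
    return v0

# Invented Fo targets (Hz) for a two-accent declarative, cf. figure 13.1:
targets = [(0.10, 120.0), (0.35, 180.0), (0.60, 160.0), (0.95, 80.0)]
print(value_at(0.475, targets))   # a point on the transition between two targets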

13.2.1 Basic aspects of intonational structure


13.2.1.1 Pitch accents and boundary tones
In most European languages (and many others as well) the most important
of the discrete events that make up the pitch contour are pitch accents. These
are characteristic Fo features that accompany prominent syllables - peaks,
valleys, falls, rises, etc. This use of the term "pitch accent" is due to Bolinger
(1958), though in comparison to Bolinger's original usage the current sense
shows certain differences of emphasis to which I will return in section 13.4.1.
(Bolinger's concept - though not his term - corresponds more closely to the
"prominence lending pitch movements" in the system of 't Hart and his
colleagues, e.g. Cohen and 't Hart [1967], 't Hart and Collier [1975], Collier
and 't Hart [1981].) Figure 13.1 shows an ordinary declarative utterance of
English with two pitch accents.
Besides pitch accents, the other main phonological elements that make up
Fo contours are boundary phenomena of various sorts, at least some of which
are generally known as boundary tones (see 13.2.2.2 below). The clearest cases
are abrupt final rises taking place within the last 300-500 msec. of a phrase or
utterance, generally analyzed as "high boundary tone." There are also
sometimes distinctive effects at initial boundaries, as in the RP pronunciation
of The bathroom?! shown in figure 13.2. This has a high initial boundary tone
(i.e. a distinctively high starting pitch not associated with an accented
syllable), followed by a low or low-rising pitch accent, followed by a final
boundary rise.

This is not to suggest that a target-and-transition model is necessarily the ideal phonetic
model of either Fo or spectral properties, but only that, as a first approximation, it is equally
well suited to both.


Figure 13.1 Speech wave and Fo contour for the utterance Her mother's a lawyer, spoken with
an unemphatic declarative intonation. The peaks of the two accents (arrows) are aligned near
the end of the stressed syllables mo- and law-

Figure 13.2 Speech wave and Fo contour for the sentence The bathroom?! (see text for detail).
The valley of the low(-rising) pitch accent is aligned near the end of the syllable bath-

13.2.1.2 Association of tunes to texts
The basic phonological analysis of a pitch contour is thus a string of one or
more pitch accents together with relevant boundary tones. Treating this
description as an abstract formula, we can then speak of contour types or
tunes, and understand how the same contour type is applied to utterances
with different numbers of syllables. For example, consider the tune that can
be used in English for a mildly challenging or contradicting echo question, as
in the following exchange.3

(1) A: I hear Sue got a fellowship to study physics.


B: Sue?

On the monosyllabic utterance Sue this contour rises and falls and rises
again. However, we are not dealing with a global rise-fall-rise shape that
applies to whole utterances or to individual syllables, as can be seen when we
apply the same contour to a longer utterance:

(2) A: I hear Sue's taking a course to become a computer programmer.


B: A computer programmer?

The rise-fall-rise shape that spanned the entire (one-syllable) utterance in
Sue? is not simply stretched out over the seven-syllable utterance here; nor is
it applied to the accented syllable -pu- alone. Instead, the contour is seen
to consist of at least two discrete elements, a pitch accent that rises through
the accented syllable and then falls, and a boundary rise that is confined
to the last few hundred msec. of the utterance. The Fo on the syllables -ter
program- is simply a transitional stretch between the low level reached at the
end of the pitch accent and the beginning of the final rise. Given an
appropriate utterance, such a transitional stretch could be extended even
further.
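
The same point can be made with a toy association procedure: the tune is a fixed string of events, the starred tone docks to the accented syllable, and the remaining tones dock to the utterance edge, whatever the number of syllables in between. Everything below (the syllabifications, the tune labels, the flat list representation) is our own simplification, not a claim about any particular model:

# Toy association of the tune L*+H  L-  H% to texts of different lengths.
def associate(syllables, accented_index, tune=("L*+H", "L-", "H%")):
    tones = [None] * len(syllables)       # None = no tonal specification (transition)
    tones[accented_index] = tune[0]       # pitch accent on the accented syllable
    edge = " ".join(tune[1:])             # phrase accent + boundary tone at the edge
    tones[-1] = (tones[-1] + " " + edge) if tones[-1] else edge
    return list(zip(syllables, tones))

print(associate(["Sue"], 0))
print(associate(["a", "com", "pu", "ter", "pro", "gram", "mer"], 2))
# Only -pu- and the final syllable receive tones; -ter program- is a transition.
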
As the foregoing example makes clear, one of the key assumptions of
current intonational phonology is that it is possible for syllables - sometimes
several consecutive syllables-to be phonologically unspecified for Fo. The
validity of this assumption is perhaps most clearly demonstrated by Pierre-
humbert and Beckman in their work on Japanese (1988: esp. ch. 4). They
show that the traditional analysis-in which every mora is distinctively
associated with either a H or L tone - makes phonetic predictions that are
falsified by their data. Their empirical findings are modeled much more
successfully if we assume that tones are associated only with certain points in
the segmental string.
3
In order to appreciate the force of examples (1) and (2) it is important to get the intonation
right on B's reply. In particular, one contour that is not intended here is a more or less steadily
rising one, which conveys surprise or merely a request for confirmation.


13.2.2 Pitch accents as sequences of tones


13.2.2.1 The two-level theory
The outline of intonational structure just sketched has many aspects that go
back to earlier work, such as the British nuclear tone descriptions (well
summarized in Crystal 1969), the American levels analyses of Pike (1945) and
Trager and Smith (1951), and especially Bolinger's notion of pitch accent
(Bolinger 1958, 1986). The most important innovation of the work under
review here is that pitch accents, and in some cases perhaps boundary tones,
are further analyzed as sequences or combinations of high and low tones. This
approach is based in part on the tonal phonology of many African
languages, in which it is well established that two "level" lexical tones (such
as H and L) can occur on the same syllable and yield a phonetically falling or
rising contour.
As applied to languages like English, the decomposition of pitch accents
into tones has been a point of considerable contention. The dichotomy
drawn by Bolinger (1951) between intonational analyses based on "levels"
and those based on "configurations" has sometimes been invoked (e.g.
't Hart 1981) as a basis for not analyzing pitch accents in this way. However,
as I have argued elsewhere (Ladd 1983b), Bolinger's theoretical objections
really apply only to the American "levels" analyses just mentioned. The
approach under consideration here effectively solves Bolinger's levels-vs.-
configurations issue by reducing the number of distinct levels in languages like
English or Dutch to two, and by defining them in such a way that their
phonetic realization can vary quite considerably from one pitch accent to
another. That is, there is no presumption, as there was in the traditional
levels analyses, that a given phonological abstraction like H will necessarily
correspond to a certain Fo level; the mapping from phonological tones to Fo
targets is taken to be rather complex. The two-level theory, first formulated
explicitly by Pierrehumbert (1980), constitutes the central theoretical inno-
vation on which the current approaches to intonational phonology are
based.
Theoretical issues aside, the laboratory evidence for target levels in Fo is
strong. In perhaps the clearest result of this sort, Bruce (1977) found that the
most reliable acoustic correlate of word accent in Swedish is a peak in the Fo
contour aligned very precisely in time with respect to the accented syllable.
The rise preceding the peak, and/or the fall that follows it, can be suppressed
or reduced under certain circumstances; the important thing for signaling the
accent is, in Bruce's words, "reaching a certain pitch level at a particular
point in time ..., not the movement (rise or fall) itself" (1977: 132). In
another experiment, Liberman and Pierrehumbert (1984) had speakers utter
specific contour types with wide variations of overall speaking range. They
found that the phonetic property of the contour that remained most
invariant under range variation-i.e. the thing that most reliably character-
ized the contour types-was the relationship in Fo level between the two
accent peaks of the contours; other measures (e.g. size of pitch excursion)
were substantially more variable. Finally, it has been shown repeatedly that
the endpoints of utterance-final Fo falls are quite constant for a given speaker
in a given situation (e.g. Maeda 1976; Menn and Boyce 1982; Liberman and
Pierrehumbert 1984; Ladd 1988 for English; Ladd et al. 1985 for German;
van den Berg, Gussenhoven, and Rietveld [this volume] for Dutch; Connell
and Ladd 1990 for Yoruba). It has been suggested that this constant
endpoint is (or at least, reflects) some sort of "baseline" or reference value for
the speaker's Fo range.4

13.2.2.2 Some remarks on notation


Pierrehumbert (1980) proposed a notational system for intonation that
expresses the theoretical ideas outlined in the foregoing sections, and her
system has been adopted, with a variety of modifications, by many investi-
gators. The basic points of this system, with a few of the modifications that
have been suggested, are outlined in this section. Discussion of the theoretical
issues underlying the differing versions of the notation is necessarily ex-
tremely condensed in what follows.
Pitch accents contain at least one tone, either H or L, which is associated
with the accented syllable. In addition, they may contain a preceding or
following tone, for example in cases where the pitch accent is characterized
by rapid Fo movement rather than just a peak or a valley. In Pierrehumbert's
original system, the tone associated with the accented syllable is written with
an asterisk (H* or L*), and if there is a preceding or following tone in the
pitch accent it is written with a following raised hyphen (H⁻ or L⁻); in a
bitonal pitch accent the two tones are joined with a + (e.g. L* + H). In some
systems based on Pierrehumbert (e.g. the one used by van den Berg,
Gussenhoven, and Rietveld, in this volume), both the plus and the raised
hyphen are dispensed with, and one would write simply L*H. In any case,
it is convenient to distinguish "starred tones" (T*) from "unstarred tones"
(T⁻ or just T), as the two types may exhibit certain differences in their
phonological and phonetic behavior.
Boundary tones, associated with the edges of intonational phrases, are

4
It has also been suggested (e.g. Pierrehumbert and Beckman, this volume) that the invariance
of contour endpoints has been exaggerated and/or is an artifact of Fo extraction methods, and
that the scaling of contour endpoints can be manipulated to signal discourse organization.
Whether or not this is the case, it does not affect the claim that target levels are linguistically
significant - in fact, if Pierrehumbert and Beckman are right it would in some sense strengthen
the argument.

written with the diacritic % (H% or L%). Between the last pitch accent and
the boundary tone of any phrase, in Pierrehumbert's original analysis, there
is another tone she called the "phrase accent." Since this tone does not seem
to be associated with a specific syllable but rather trails the pitch accent by a
certain interval of time, Pierrehumbert considered this to be an unstarred
tone like those that can be part of pitch accents, and therefore wrote it T⁻.
For example, the rising-falling-rising tune illustrated earlier would be
written L*+H⁻ L⁻ H%, with a low-rising pitch accent, a low phrase accent,
and a high final boundary tone, as in (3):

(3)        L*+H⁻          L⁻  H%
     a computer programmer

However, the status of phrase accents has remained a matter of some
controversy (for discussion see Ladd 1983a). In what seems to be a promising
approach to resolving these difficulties, Beckman and Pierrehumbert (1986)
have proposed that the "phrase accent" is actually the boundary tone for an
intonational domain smaller than the intonational phrase, a domain they call
the "intermediate phrase." That is, the end of an intermediate phrase is
marked only by what Pierrehumbert called a phrase accent, whereas the end
of an intonational phrase is marked by both a "phrase accent" and a full-
fledged "boundary tone."
Hayes and Lahiri (1991) have adopted this analysis in their description of
Bengali, and use it to motivate a useful notational innovation. For the
boundary tone of an intonational phrase (Pierrehumbert's T%) they write Tᵢ;
for the boundary tone of an intermediate phrase-which they call "phono-
logical phrase" in line with other work on prosodic structure - they write
Tₚ. Accent tones continue to be written as starred (or not as the case may
be), so that Hayes and Lahiri's notation clearly distinguishes accent
tones from boundary tones. The rising-falling-rising tune just illustrated
would thus, in Hayes and Lahiri's notation, be written something like
L*H Lₚ Hᵢ.

13.2.3 Intonation and lexical tone


One important consequence of the point of view just outlined is that it puts
the relationship between "tone" and "intonation" in a different light. A fairly
traditional view is that all languages have "intonation" - global Fo shapes
and trends - and in addition some languages have local Fo perturbations for
"word accent" or "tone" overlaid on the global intonation (see e.g. Lieber-
man 1967: 101f.). More generally, the traditional view is that Fo features have
extent over some domain (syllable, word, phrase, utterance), and that Fo
contours are built up by superposing smaller-domain features on larger-
domain ones, in a manner reminiscent of Fourier description of complex
waveforms. The intonational models of e.g. O'Shaughnessy and Allen (1983)
or Gronnum (this volume; see also Thorsen 1980a, 1985) are based on this
approach.
This view was convincingly challenged by Bruce (1977). Bruce showed that
in Swedish, the phonetic manifestations of at least certain phrase-level
intonational features are discrete events that can be localized in the Fo contour
(Bruce's sentence accents). That is, the relationship between lexically speci-
fied Fo features (the Swedish word accents) and intonationally specified ones
is not necessarily a matter of superposing local shapes on global ones, but
involves a simple succession of Fo events in time.
In the most restrictive versions of current intonational phonology, it is
explicitly assumed that independently chosen global shapes-e.g. a "declina-
tion component" - are not needed anywhere in the phonological description.
(This is discussed further in 13.3.2 below.) In effect, the restrictive linear view
says that all languages have tonal strings; the main difference between
languages with and without lexical tone is simply a matter of where the tonal
specifications come from. In some languages ("intonation languages") the
elements of the tonal string are chosen, as it were in their own right, to
convey pragmatic meanings, while in others ("tone languages") the phonolo-
gical form of morphemes often or always includes some tonal element, so
that the tonal string in any given utterance is largely a consequence of the
choice of lexical items.
In this view, the only tonal elements that are free to serve pragmatic
functions in a tone language are boundary tones, i.e. additional tonal
specifications added on to the lexically determined tonal string at the edge of
a phrase or utterance. This is how the theory outlined here would account for
the common observation that the final syllable of a phrase or utterance can
have its lexical tone "modified" for intonational reasons (see e.g. Chang
1958). Additionally, some tone languages also seem to modify pitch range,
either globally or locally, to express pragmatic meanings like "interrog-
ation." The possibilities open to lexical tone languages for the tonal
expression of pragmatic meanings are extensively discussed by Lindsey
(1985).
The principal phonetic difference between tone languages and intonation
languages is simply a further consequence of the functional difference. More
"happens" in Fo contours in a tone language, because the tonal specifications
occur on nearly every syllable and the transitions span only milliseconds,
whereas in a language like English most of the tonal specifications occur only
on prominent words and the transitions may span several syllables. But
the specifications are the same kind of phonological entity regardless of
their function, and transitions are the same kind of phonetic pheno-
menon irrespective of their length. There is no basis for assuming that
lexical tone involves a fundamentally different layer in the analysis of Fo
contours.

13.3 Phonetic models of F o


In order to generate Fo values from an abstract string of tonal events, some
sort of phonetic model is required; in order to formulate a useful phonetic
model, some account must be taken of the central fact that there are
conspicuous individual differences of Fo level and range. This is obviously a
problem for a phonological description based on tones that are mapped on to
Fo targets. Indeed, the ability to abstract away from individual differences of
Fo level and range is one of the principal attractions of any intonational
description based on configurations or contour shapes. A rise is a rise,
whether it moves from 80 Hz to 120 Hz or from 150 Hz to 300 Hz. A "rate"
of declination, say 1.2 semitones/sec, is a quantitative abstraction that could
in principle be applied to any speaker. Nevertheless, we know that languages
exist in which tonal phonology is based on pitch level, not contour shape.
Moreover, as noted earlier, there is growing evidence of regularities statable
in terms of pitch level even in languages without lexical tone. It seems
appropriate, therefore, to devise a phonetic model of Fo that will allow us to
express such regularities and at the same time to abstract away from
individual differences of level and range.

13.3.1 Baseline and tonal space


Intonational phonologies like those outlined in the previous section have by
and large built their phonetic models around the notion of a speaker-specific
tonal space, a band of Fo values relative to which the tonal events are scaled.5
This tonal space is somewhat above the speaker-specific baseline, a
theoretical bottom of the range which in speaking is normally reached - if at
all-only at the end of utterance-final falls. For example, a pitch accent
analyzed phonologically as a sequence of H tone and L tone might be

"Tonal space" is an ad hoc term. In various contexts the same general abstraction has been
referred to as tone-level frame (Clements 1979), grid (Garding and her co-workers, e.g.
Garding 1983), transform space (Pierrehumbert and Beckman [1988], but note that this is not
really intended as a technical term), and register (e.g. Connell and Ladd 1990). The lack of any
accepted term for this concept is indicative of uncertainty about whether it is really a construct
in its own right or simply a consequence of the interaction of various model parameters; only
for Garding does the "grid" clearly have a life of its own. See further section 13.3.2.


[Figure 13.3 graphic, annotated with the labels "register shift," "tonal space," and "baseline"]

Figure 13.3 Idealization of the two-accent Fo contour from figure 13.1, illustrating
one way in which the accents might be modeled using the concept of "tonal space"

modeled phonetically as a fall from the top of the tonal space to the bottom,
ending some distance above the baseline. This is shown in Figure 13.3.
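
One very simple way to make the tonal-space idea concrete is to scale tones linearly within a band lying above a speaker-specific baseline. The parameterization below is an illustrative invention, not any of the published models cited in this section:

# Illustrative scaling of tones within a speaker-specific tonal space (Hz).
def target_hz(tone, baseline=75.0, space_bottom=110.0, space_top=180.0):
    if tone == "H":
        return space_top           # top of the tonal space
    if tone == "L":
        return space_bottom        # bottom of the tonal space, above the baseline
    if tone == "L%":
        return baseline            # an utterance-final fall may reach the baseline
    raise ValueError(tone)

# An H L pitch accent realized as a fall through the tonal space (cf. figure 13.3):
print([target_hz(t) for t in ("H", "L")])   # [180.0, 110.0]
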
The mathematical details vary rather considerably from one model to
another, but this basic approach can be seen in the models of e.g. Bruce
(1977), Pierrehumbert (1980; Liberman and Pierrehumbert 1984; Pierrehum-
bert and Beckman 1988), Clements (1979, 1990), Ladd (1990; Ladd et al.
1985; Connell and Ladd 1990), and van den Berg, Gussenhoven, and
Rietveld (this volume); to a considerable extent it is also part of the models
used by 't Hart and his colleagues and by Garding and her colleagues (see
note 5).
Mathematical details aside, the biggest point of difference among these
various models lies in the way they deal with differences of what is loosely
called "pitch range." Pretheoretically, it can readily be observed that some
speakers have wide ranges and some have narrow ones; that pitch range at
the beginning of paragraphs is generally wide and then gets narrower; that
pitch range is widened for emphasis or interest and narrowed when the topic
is familiar or the speaker is bored or depressed. But we lack the data to decide
how these phenomena should be expressed in terms of the parameters of
phonetic models like those described here. For example, many instrumental
studies (e.g. Williams and Stevens 1972) have demonstrated that emotional
arousal - anger, surprise, etc. - is often accompanied by higher Fo level and
wider Fo range. But level is generally defined in these studies in terms of mean
Fo (sampling Fo every 10-30 msec. and thus giving equal weight to targets
and transitions); range is usually viewed statistically in terms of the variance
around the Fo mean. We do not know, in terms of a phonetic model like the
ones under consideration here, whether these crude data reductions reflect a
raising of the tonal space relative to the baseline, a widening of the tonal
space, a raising of everything including the baseline, or any of a number of
other logical possibilities. A great deal of empirical work is needed to settle
questions like these. In the meantime, different models have taken rather
different approaches to these questions.

13.3.2 Downtrends
Perhaps the best illustration of such differences is the treatment of overall Fo
downtrends ("declination" and the like). Since the work of Pierrehumbert
(1980), it is widely accepted that at least some of what has earlier been treated
as global declination is in fact downstep - a stepwise lowering of high Fo
targets at well defined points in the utterance. However, this leaves open a
large number of fairly specific questions. Does downstep lower high targets
by narrowing the tonal space or by lowering it? Is it related to the "resetting
of pitch range" that often follows prosodic boundaries (see the papers by
Kubozono, and van den Berg, Gussenhoven, and Rietveld, this volume)? Are
there residual downtrends - true declination - that downstep cannot explain?
If there are, do we model such declination as a gradual lowering of the
baseline, as a gradual lowering of the tonal space relative to the baseline, or
in some other way? Appropriately designed studies have made some progress
towards answering these questions; for example, the coexistence of downstep
and declination is rather nicely demonstrated in Pierrehumbert and Beck-
man's work on Japanese (1988: ch. 3). But we are a long way from
understanding all the interactions involved here.
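
The conceptual difference between downstep and residual declination can nonetheless be made concrete in a toy computation: successive H targets are lowered stepwise by a constant factor, and a slow time-dependent lowering is applied on top. All parameter values are invented for illustration:

# Toy separation of downstep (stepwise lowering of H targets) from residual
# declination (gradual, time-dependent lowering).  Parameters are invented.
def h_targets(times, top=180.0, bottom=110.0, downstep=0.7, declination=4.0):
    values = []
    for i, t in enumerate(times):
        stepped = bottom + (top - bottom) * (downstep ** i)   # downstep per accent
        values.append(stepped - declination * t)              # declination in Hz/sec.
    return values

print(h_targets([0.3, 1.0, 1.7]))   # [178.8, 155.0, 137.5]
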
A further theoretical issue regarding downtrends - and in some sense a
more basic one - is whether the shape and direction of the tonal space can be
chosen as an independent variable, or whether it is conditioned by other
features of phonological structure and the tonal string. In the models
proposed by Garding and her co-workers (e.g. Garding 1983; Touati 1987)
and by 't Hart and his colleagues (e.g. Cohen and 't Hart 1967; 't Hart and
Collier 1975; Cohen, Collier, and 't Hart 1982), although there is a clear
notion of describing the contour as a linear string of elements, the tonal space
can also be modified globally in a way that affects the scaling of the elements
in the string. This recalls the global intonation component in traditional
models of tone and intonation (see 13.2.3 above). In models more directly
inspired by Pierrehumbert's work, on the other hand, the tonal space has no
real life of its own; any changes to the tonal space are either due to
paralinguistic modification of range, or are - like downstep - triggered by
phonological choices in the tonal string and/or in prosodic structure.

13.4 Prosodic structure


No discussion of current work on intonational phonology would be complete
without some consideration of the relevance of metrical phonology. The
cornerstone of this theoretical development was Liberman's work on English
stress (Liberman 1975; Liberman and Prince 1977), which argued that stress
or prominence crucially involves a relation between two nodes in a binary
tree structure, e.g.:

(4) [binary metrical tree; sister nodes labeled w and s]
    (w = weak, s = strong)
A great deal of theoretical work on relational and hierarchical structures in
phonology has followed Liberman's lead, and it is well beyond the scope of
this review to trace all these developments. (The introduction to autosegmen-
tal and metrical phonology in van der Hulst and Smith [1982] remains an
excellent introduction to the rapid early developments of these theories.)
Here I wish to concentrate on the relevance of metrical phonology to
laboratory work on two fairly specific phenomena: pitch accents and higher
level phrasing.

13.4.1 The role of pitch accents in prominence


Early experimental work on the acoustic cues to perceived stress in isolated
words and short utterances (notably the classic experiments by Fry 1955,
1958) pointed to an important role for pitch movement or pitch change on
the affected syllable. This in turn cast doubt on the traditional distinction
between stress accent (or "dynamic accent") and pitch accent (or "melodic
accent"), and led Bolinger to redefine the term pitch accent to mean a local
pitch configuration that is simultaneously a cue to prominence and a building
block of the intonation contour. Bolinger's concept survives more or less
intact in the model of intonation put forth by 't Hart and his colleagues,
whose term "prominence-lending" pitch movement clearly implies that the
pitch movement is the sine qua non of prominence in utterances. Bolinger's
terminology, on the other hand, has been taken over for a modified concept
in the work of Pierrehumbert and many of the other authors under review
here, in the sense of a building block of the intonation contour that is merely
anchored to or associated with a prominent syllable. This association, and
indeed the prominence itself, are frequently described in terms of metrical
structure. That is, prominence is more abstract than pitch movement, and
can be cued in a variety of other ways.
A good deal of experimental evidence points to this conclusion. First, it
has been shown by both Huss (1978) and Nakatani and Schaffer (1978) that
certain putative differences of stress can be directly reflected in syllable
duration without any difference of pitch contour. Second, Beckman (1986)
has demonstrated clear perceptual differences between Japanese and English
with respect to accent cues: in Japanese accent really does seem to be signaled
only by pitch change, whereas in English there are effects of duration,
intensity, and vowel quality that play an important perceptual role even
when pitch change is present.6 Third, there are clear cases where pitch cues
are dependent on or orchestrated by a prominent syllable without being part
of the syllable itself. In Welsh declarative intonation, for example, a marked
pitch excursion up and down occurs on the unstressed syllable following the
major stressed syllable; the stressed syllable may be (but need not be)
distinguished durationally by the length of the consonant that closes it
(Williams 1985). This makes it clear that the pitch movement is an intona-
tional element whose distribution is governed by the (independent) occur-
rence of prominence, and suggests that pitch movement cues prominence in a
rather less direct way than that imagined by Fry and other earlier
investigators.

13.4.2 Higher-level phrasing


It is well established that differences of hierarchical structure or boundary
strength (Cooper and Paccia-Cooper 1980) can be reflected intonationally in
the height of Fo targets in the vicinity of boundaries. For example, Ladd
(1988) showed that, in structures of the sort A and B but C and A but B and C
(where A, B, and C are clauses with three accented syllables each), the accents
at the beginning of the B and C clauses are slightly higher when they follow a
but-boundary than when they follow an and-boundary. Ladd interpreted this
as reflecting a difference of hierarchical organization, as in (5):

(5)  [[A and B] but C]        [A but [B and C]]

Data on the phonetics of prominence in Tamil, reported by Soundararaj (1986), suggest that
Tamil is like Japanese in using only pitch to cue the location of the accented syllable;
Soundararaj found no evidence of differences of duration, intensity, or vowel quality (see also
the statements about Bengali in Hayes and Lahiri 1991). Interestingly, however, the function
of pitch in Tamil (or Bengali) is rather more like the function of pitch in English or Dutch, i.e.
purely intonational, not lexically specified as in Japanese. That is, phonetically speaking,
Tamil is (in Beckman's sense) a "pitch-accent" language like Japanese, but functionally
speaking it uses pitch like the "stress-accent" languages of Europe. If this is true, it provides
further evidence for excluding function from consideration in modeling F o in different
languages (see 13.2.3 above).


Related findings are reported by e.g. Cooper and Paccia-Cooper (1980) and
by Thorsen (1985, 1986).
The principal issue that these findings raise for intonational phonology
involves the nature of the hierarchical structures and their relationship to
phonetic models of Fo. Are we dealing directly with syntactic constituent
structure, as assumed for example by Cooper and Paccia-Cooper? Are we
dealing with some sort of discourse organization, in which boundary strength
is a measure of the "newness" of a discourse topic? (This has been suggested
by Beckman and Pierrehumbert [1986], following Hirschberg and Pierrehum-
bert [1986].) Or are we dealing with explicitly phonological or prosodic
constituent structure, of the sort that has been discussed within metrical
phonology by, e.g., Nespor and Vogel (1986)? This latter possibility has been
suggested by Ladd, who sees "relative height" relations like

(6) [relative height tree; sister nodes labeled h and l]
    (h = high; l = low)

as entirely analogous to the relative prominence relations that are basic to
metrical phonology. For Ladd, in other words (as for van den Berg,
Gussenhoven, and Rietveld, this volume), boundary-strength phenomena
are thus intimately related to downstep, which can be seen as the reflection of
the same kind of relational structure. For Beckman and Pierrehumbert, on
the other hand, boundary-strength phenomena are similar to paralinguistic
phenomena in which the overall range is raised or lowered to signal interest,
arousal, etc.
These theoretical issues cannot yet be adequately resolved empirically
because they are intertwined with issues of phonetic modeling. That is, in
terms of the phonetic-realization models discussed in section 13.3 above,
boundary-related differences of Fo scaling are all pretheoretically a matter of
"pitch range" and all could be expressed in a number of ways: as local
expansion of the tonal space, as local raising of the tonal space, as local
raising of the baseline, as greater prominence of individual accents, etc. One's
understanding of which phenomena are related to which others cannot at
present be separated from one's choice of how to model the phonetic detail.

14
Downstep in Dutch: implications for a model

ROB VAN DEN BERG, CARLOS GUSSENHOVEN,


and TONI RIETVELD

14.0 Introduction
In this paper we attempt to identify the main parameters which must be
included in a realistic implementation model for the intonation of Dutch.*
The inspiration for our research was provided by Liberman and Pierrehum-
bert (1984) and Ladd (1987a). A central concern in those publications is the
phonological representation and the phonetic implementation of descending
intonation contours. The emphasis in our research so far has likewise been
on these issues. More specifically, we addressed the issue of how the
interruption of downstep, henceforth referred to as reset (Maeda 1974;
Cooper and Sorensen 1981: 101) should be represented. Reset has been
viewed as (a) an upward register shift relative to the register of the preceding
accent, and (b) as a local boost which interrupts an otherwise regular
downward trend. We will argue that in Dutch, reset should be modeled as a
register shift, but not as an upward shift relative to the preceding accent, but
as a downward one relative to a preceding phrase. Accordingly, we propose
that a distinction should be made between accentual downstep (which applies
to H* relative to a preceding H* inside a phrase), and phrasal downstep,
which reduces the range of a phrase relative to a preceding phrase, and
creates the effect of reset, because the first accent of the downstepped phrase
will be higher than the last downstepped accent of the preceding phrase.
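
The proposal can be illustrated schematically: accentual downstep scales each noninitial H* within a phrase, while phrasal downstep lowers the register of each phrase relative to the preceding one, so that the first accent of a new phrase is nevertheless higher than the last accent of the old one (the reset effect). The factors and reference value below are invented for illustration; they are not the fitted values reported later in this chapter:

# Schematic accentual vs. phrasal downstep (invented parameter values).
def accent_heights(phrase_sizes, top=180.0, accentual=0.7, phrasal=0.85):
    """phrase_sizes: number of H* accents in each successive phrase."""
    heights = []
    for p, n_accents in enumerate(phrase_sizes):
        phrase_top = top * (phrasal ** p)                      # phrasal downstep
        for a in range(n_accents):
            heights.append(phrase_top * (accentual ** a))      # accentual downstep
    return heights

# Two three-accent phrases: the fourth value (start of phrase 2) is higher than
# the third (end of phrase 1), i.e. a reset.
print([round(h, 1) for h in accent_heights([3, 3])])
# [180.0, 126.0, 88.2, 153.0, 107.1, 75.0]
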
We will present the results of two fitting experiments, one dealing with
accentual downstep, the other with phrasal downstep. Both address the
question whether the two downstep factors (accentual and phrasal) are
independent of speaker (cf. men vs. women) and prominence (excursion size
of the contour). In addition, the first experiment addressed the issue whether

This paper has benefited from the comments made by an anonymous reviewer.

the accentual downstep factor depends on the number of downstepped
accents in the phrase.
Before reporting on these data, we give a partial characterization of the
intonation of Dutch in section 14.1. This will enable us to place the
phenomenon of (accentual and phrasal) downstep in a larger phonological
context, and will also provide us with a more complete picture of the
parameters to be included in an implementation model. This model, given in
section 14.2, will be seen to amount to a reorganization of a model proposed
in Ladd (1987b). Section 14.3.1 reports on the experiment on accentual
downstep and discusses the data fitting procedure in detail. The experiment
on phrasal downstep is presented in section 14.3.2.

14.1 Intonation in Dutch


14.1.1 Tonal structure
Like English and German, Dutch is an intonation language without lexical
tone. Focus distribution determines the locations of accents in the utterance,
each accent being an insertion slot for one of a number of possible pitch
accents. Two common contours are given in figure 14.1, on the utterance
Leeuwarden wil meer mannen ("Leeuwarden needs more men"). We assume
that the tone segments which these contours consist of are H*L L% in
(14.1a) and L*H H% in (14.1b). As shown in the figures, the first (starred)
tone segment associates with the accented first syllable, the second tone
segment spreads, while the last goes to the end of the utterance. The timing of
H% can be observed in (14.1b) on the final, stressless syllable of mannen,
while the preceding plateau evinces the spreading of the preceding H.1
In nonfinal position, these contours appear as illustrated in figure 14.2,
where they occur before a H*L L% contour on mannen. (The sentence is
otherwise identical to that given in figure 14.1.) In this position, the contours
can be analyzed as H*L (14.2a) and L*H (14.2b), both of which are
separated from the following pitch accent by an appreciable prosodic
boundary, between Leeuwarden and wil. Observe that after this boundary the
pitch begins low, which leads to a pitch change at the boundary in figure
14.2b. In figure 14.2a, the pitch is low on either side of the boundary, because
of the L to its left.
In figure 14.3, the two pitch accents appear in a different guise again. In
these realizations, there is no discontinuity after Leeuwarden of the kind we
find in figure 14.2. Looking at the (a) examples across figures 14.1-14.3, we
would appear to be dealing with what at some level of analysis must be
1
We assume that the presence of L% can be motivated as a trigger for "final lowering" (cf.
Poser 1984; Liberman and Pierrehumbert 1984).


Figure 14.1 Contours (H*L L%)AD and (L*H H%)AD on the sentence Lee*uwarden wil meer
mannen

considered the same unit, while the same goes for the (b) examples. Our claim
is that the explanation for the difference between the contours in figure 14.2
and those in figure 14.3 is to be found in a difference in phrasing, rather than
in a different choice of pitch accent. In a slow, explicit type of pronunciation,
as illustrated by the contours in figure 14.2, the tone segments of a pitch
accent are confined to the highest constituent that dominates it, without
dominating another pitch accent. This constituent, which we refer to as the
association domain (AD), is Leeuwarden in the case of the first pitch accent in
the contours in figure 14.2, because the next higher node also dominates the
following H*L. The AD for this H*L, obviously, is wil meer mannen. Unless
the first syllable of the AD is the accented syllable, there will be an
unaccented stretch in the AD (none in Leeuwarden, and wil meer in wil meer
mannen), which we refer to as the onset of the AD. By default, onsets are low-
pitched in Dutch (but they may also be high-pitched, to give greater
expressiveness to the contour). Turning now to the contours in figure 14.3,
we observe that the AD-boundary after Leeuwarden is lacking, which we

Figure 14.2 Contours (H*L)AD (H*L L%)AD and (L*H)AD (H*L L%)AD on the sentence
Lee*uwarden wil meer ma*nnen

suggest is the result of the restructuring of the two ADs to a single AD'. One
obvious consequence is that wil meer is no longer an onset. More specifically,
the consequence for the first pitch accent is that the spreading rule applying
to the second tone segment no longer has a right-hand boundary it can refer
to. What happens in such cases is that this segment associates with a point in
time just before the following accented syllable. The pitch of the interaccen-
tual stretch -warden wil meer is an interpolation between the shifted tone
segment and the tone segment to its left.2 The association domain is thus an
intonational domain, constructed on the basis of the independently existing
(prosodic) structure of the utterance, not as a constituent of the prosodic tree
itself. Restructuring of ADs is more likely as the prosodic boundary that
separates them is lower in rank (see Gussenhoven, forthcoming).
2
The analysis follows that given for English in Gussenhoven (1983). Note that the introduction
of a pitch accent L + H* for the second accent in the contour in figure 14.3a, as in
Pierrehumbert (1980), would unnecessarily add a term to the inventory of pitch accents.
Moreover, it would force one to give up the generalization that contours like those in figure
14.2 are more carefully pronounced versions of those in figure 14.3.


Figure 14.3 Contours (H*L H*L L%)AD' and (L*H H*L L%)AD' on the sentence Lee*uwarden wil meer ma*nnen

To relate the contours in figure 14.2 to those in figure 14.1, we assume that
the lexical representations of the two pitch accents are H*L and L*H.
Boundary Tone Assignment (see (1) below) provides the bitonal H*L and
L*H with a boundary tone which is a copy of their last tone segment. We
have no account for when this boundary tone segment appears, and
tentatively assume it is inserted when the AD ends at an intonational phrase
boundary.3
(1) Boundary Tone Assignment: ∅ → αH / αH ___ )IP

Rule (1) should be seen as a default rule that can be preempted by other tonal processes of
Dutch. These include the modifications which H*L and L*H can undergo, and the stylistic
rule NARRATION, by which the starred tone in any L*H and H*L can spread, causing the
second tone segment to be a boundary segment. In addition, there is a third contour H*L H%.
These have been described in Gussenhoven (1988), which is a critique and reanalysis of the
well-known description of the intonation of Dutch by Cohen and 't Hart (1967), 't Hart and
Collier (1975), Collier and 't Hart (1981).


Figure 14.4 Tonal structure of the contours in figures 14.1a, 14.2a, and 14.3a

In summary, we find:

End of IP:                  H*L L%   L*H H%   (second tone segment spreads)
End of AD, but inside IP:   H*L      L*H      (second tone segment spreads)
Inside AD':                 H*L      L*H      (interpolation between T* and following tone segment spans stretch between accents)
Figure 14.4 gives schematic representations of the tonal structures of the
contours in figures 14.1a, 14.2a, and 14.3a.

14.1.2 Downstepping patterns


Dutch has a number of related contours, which display the downward
stepping of consecutive H*s called "downstep" in Pierrehumbert (1980).


Two of the forms that downstepped contours may take are illustrated in
figure 14.5. Both are instances of the sequence of placenames Haa*rlem,
Rommelda*m, Harderwij*k en Den He*lder. In figure 14.5a, the H*s after a
H*L have been lowered (a lowered H* is symbolized !H*), and the H*s
before a H*L have spread. In the contour in figure 14.5b, the spreading of the
first and second H*s stopped short at the syllable before the accent. In fact,
the spreading of these H*s may be further restrained. Maximal spreading of
H* will cause the L to be squeezed between it and the next !H*, such that it is
no longer clear whether it has a phonetic realization (i.e. whether it forms
part of the slope down to !H*) or is in fact deleted. In our phonological
representations, we will include this L. It is to be noted that the inclusion of
dips before nonfinal !H* in artificially produced downstepped contours
markedly improves their naturalness, which would seem to suggest that the
segment is not in fact deleted. Lastly, a final !H* merges with the preceding L
to produce what is in effect a L* target for the final accented syllable. We will
not include this detail in (2) below.
The rule for downstep, then, contains two parts: the obligatory downstep-
ping of noninitial H*s, and the variable spreading of nonfinal H*s. The
domain for downstep is the AD': observe that the L does not respect the
boundary between Haarlem and Rommeldam, Rommeldam and Harderwijk,
etc., as shown in the contours in figure 14.5.
(2) Downstep  a. H* → !H* / T*T ___  (obligatory)
              b. Spread H* / ___ L H*L  (variable)    Domain: AD'
Following widely attested tonally triggered downstep phenomena in lan-
guages with lexical tone, Pierrehumbert (1980) and Beckman and Pierrehum-
bert (1986) propose that in English, too, downstep is tonally triggered. The
reason why we cannot make the same assumption for Dutch is that contours
like those in figure 14.3a exist. Here, no downstep has applied; yet we seem to
have a sequence of H*L pitch accents, between which, moreover, no AD
boundary occurs. Since the tonal specification as well as the phrasing
corresponds with that of the contours in figure 14.5, we must assume that
instead of being tonally triggered, downstep is morphologically triggered.
That is, the rule in (2) is the phonological instantiation of the morpheme
[downstep], which can be affixed to AD'-domains, as an annotation on the
node of the constituent within which downstep takes place.
The characterization of downstep as a morpheme has the advantage that
all downstepped contours can be characterized as a natural class. That is,
regardless of the extent to which H* is allowed to spread, and regardless of
whether a modification like DELAY is applied to H*Ls (see Gussenhoven
1988), downstepped contours are all characterized as having undergone rule
(2). This advantage over an analysis in which downstep is viewed as a


Figure 14.5 Downstepped contours H*L !H*L !H*L !H*L on the sentence Haa*rlem, Rommelda*m, Harderwij*k en Den He*lder

phonetic implementation rule, triggered by particular configurations of tone segments, was claimed by Ladd (1983a) for his proposal to describe
downstep with the help of a phonological feature [±DOWNSTEP]. The
explanatory power of a feature analysis appears to be low, as recognized in
Ladd (1990). Beckman and Pierrehumbert (1986) point out that the feature
analysis leaves unexplained why the distribution of [+downstep] is restricted
to noninitial H (i.e. why not !L, or why not !H initially?). Secondly, the
nonlocal effect of downstep on the scaling of all following tone segments
inside the AD' would need a separate explanation in a feature analysis. The
postulation of a domain-based rule like (2) accounts for these properties in
the same way as does a rule of downstep which is conceived of as a phonetic
implementation rule. Moreover, it avoids what we consider to be two
drawbacks of the approach taken by Beckman and Pierrehumbert (1986).
First, the domain-dependency of downstep, the fact that it "chains" within
the AD, needs a separate explanation. Second, the phonetic-implementation
analysis requires that all mid tones are grouped as a natural class. The second


mid pitch of the vocative chant, for example, a contour with a very different
meaning from the downstepped contours illustrated in figure 14.5, is
obtained by the same implementation rule that produces downstepped
accents in Pierrehumbert (1980), Pierrehumbert and Beckman (1986). A
disadvantage of our solution is that it is not immediately clear what a
nondiacritic representation of our morpheme would look like. We will not
pursue this question here.

14.1.3 Reset
As in other languages, non-final AD's in which downstepping occurs may be
followed by a new AD' which begins with an accent peak which is higher
than the last downstepped H* of the preceding AD'. In theory, there are a
number of ways in which this phenomenon, henceforth referred to as reset,
could be effected. First, reset could be local or global: that is, it could consist
of a raising of the pitch of the accent after the boundary, or it could be a
raising of the register for the entire following phrase, such that the pitch of all
following tone segments would be affected.
Reset as local boost. The idea of a local boost is tentatively entertained by
Clements (1990). In this option, the downward count goes on from left to
right disregarding boundaries, while an Fo boost raises the first part or the
first H* in the new phrase, without affecting following H*s. Although local
boosting may well be possible in certain intonational contexts, the contour in
figure 14.6 shows that reset involves a register shift. It shows a sequence of
two AD's, the first with four proper names and the second with five, all of
them having H*L. The fifth proper name, Nelie, which starts the new AD', is
higher than the fourth, Remy. The second and third accent peaks of the
second AD', however, are also higher in pitch than the final accent of the
preceding AD'. Clearly, the scaling of more than just the first H* of the new
AD' has been affected by the Reset.
Reset as register shift using accentual downstep factor. If reset does indeed
involve a register shift, what is the mechanism that effects it? This issue is
closely related to that of the size of the shift. Adapting a proposal for the
description of tonally triggered downstep in African languages made by
Clements (1981), Ladd (1990) proposes that pitch accents are the terminal
nodes of a binary tree, whose branches are labeled [h-l] or [l-h], and whose
constituency is derived from syntax in a way parallel to the syntax-
phonology mapping in prosodic/metrical phonology. Every terminal node is
a potential location for a register shift. Whether the register is shifted or not,
and if so, whether it is shifted upward or downward, is determined by the
Relative Height Projection Rule, reproduced in (3).


Figure 14.6 Contour (H*L !H*L !H*L !H*L) !(H*L !H*L !H*L !H*L !H*L) on the sentence
(Merel, Nora, Leo, Remy), en (Nelie, Mary, Leendert, Mona en Lorna)

(3) Relative Height Projection Rule: In any metrical tree or constituent, the
highest terminal element (HTE) of the subconstituent dominated by l is one
register step lower than the HTE of the subconstituent dominated by h, iff
the l is on a right branch.

(4) (a)  h  l  h  l        (b)  h  h  l  h  l        (c)  h  h  h  l  h  l
         0  1  1  2             0  1  2  1  2             0  1  2  3  1  2

If we take (3) as an exhaustive description, the convention predicts that in
[l-h] labeled structures there is neither downstep nor reset. In consistently left-
branching [h-l] labeled trees, there is downstep for the second accent only,
and no reset. In consistently right-branching [h-l] labeled trees, there is a
chain of downstepped accents, which can be followed by a reset for a right-
hand sister terminal node. As shown in (4), where the numbers below the
terminal nodes indicate the number of register steps required by (3), there is
no reset after a two-accent structure (see (4a)), but after three or more accents


there is reset, whose size depends on the number of accents that precede in
the left-hand constituent (see (4b, c)).
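For concreteness, the register-step counts in (4) can be computed mechanically from one reading of rule (3). The following is a minimal Python sketch under our own assumptions (strictly binary constituents, and sister HTEs at the same step when the l is on a left branch); the tree encoding, function name, and accent labels are ours, not Ladd's.

def register_steps(tree):
    """Return [(terminal, register_step), ...] for a labeled metrical tree.
    A terminal is a string; a constituent is ((label, subtree), (label, subtree)),
    where each branch carries its own 'h' or 'l' label."""
    if isinstance(tree, str):                       # a single accent
        return [(tree, 0)]
    (left_label, left), (right_label, right) = tree
    ls, rs = register_steps(left), register_steps(right)
    # Rule (3): the HTE (lowest step number) of an l-labeled subconstituent is one
    # step lower than the HTE of its sister iff the l is on a right branch.
    shift = min(s for _, s in ls) + (1 if right_label == "l" else 0) - min(s for _, s in rs)
    return ls + [(t, s + shift) for t, s in rs]

# (4b): three accents in the left-hand constituent, two in the right-hand one.
tree_4b = (("h", (("h", "A1"), ("l", (("h", "A2"), ("l", "A3"))))),
           ("l", (("h", "A4"), ("l", "A5"))))
print(register_steps(tree_4b))   # register steps 0 1 2 1 2, as in (4b)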
We doubt whether [h-l] labeled trees provide the appropriate represen-
tation for handling downstep. For one thing, within the AD', H*Ls are
downstepped regardless of the configuration of the constituents in which
they appear (see also Pierrehumbert and Beckman 1988: 168). What we are
concerned with at this point, however, is the claim implicit in Ladd's
proposal that the size of the reset is the same as, or a multiple of, the size of
downstep inside an AD'. The contour in figure 14.7a suggests they are not
equal in size. In this contour, the third H*, the one beginning the second AD,
does not go up from the second H* by the same distance that the second H*
was lowered from the first. Neither is the third H* scaled at the same pitch as
the second. It in fact falls somewhere between the first and the second, a
situation which does not obviously follow from the algorithm in (3).

Reset as register shift using factor independent of accentual downstep. If we
assume that reset involves a register shift whose step size is not the same as
the downstep factor, the question arises whether reset involves a register shift
upwards with reference to the preceding accent, or a register shift downward
with reference to the preceding phrase. In the first option, a peak scaling
algorithm would apply a raising factor to the Fo of the last H* of a phrase so
as to scale the first H* of a following phrase. This new value would then be
used for the calculation of the value for the second, downstepped, H*, and so
on. This option resembles the boost option discussed above, except that the
effect of the boost will make itself felt on the scaling of all subsequent tone
segments. Kubozono (1988c) describes the interruption of downstep in
Japanese in this way. In the second option, a register reset is obtained by
means of a lowering with reference to the preceding phrase. Notice that the
computation of the reset could still be argued to take place on the basis of the
pitch value of an immediately preceding target, in the sense that the scaling of
the new phrase takes place with reference to the preceding phrase.
These two theories of downstep-independent reset make different predic-
tions about the size of the second and following resets in an utterance. If we
assume that the reset factor is independent of the number of times it has been
applied in an utterance, then, in an accent-relative model, we should expect
multiple resets in an utterance to be a constant fraction of the preceding H*
(the last accent in the preceding phrase). In view of the fact that AD'-final
!H*s gravitate towards the same low target, the prediction is that chains of
resets should not very much decrease in size. By contrast, a phrase-relative
model would lead one to expect an unambiguously downstepping series of
resets, since every new reset is calculated on the basis of the lowered value of
the preceding one, just as AD'-internal downstepped accents form a descend-

Figure 14.7 Scaling domain of three ADs: (a) with phrasal downstep on the sentence
(Merel, Nora, Leo), (Remy, Nelie, Mary), (en Mona en Lorna); (b) without phrasal
downstep on the sentence de mooiste kleren, de duurste schoenen, de beste school...
("the finest clothes, the fanciest shoes, the best school . . . ")

ing series. Thorsen's (1984a, b) discussion of "textual downdrift" in Danish
suggests that a phrase-relative model would be the better choice for those
data, and we hypothesize that the same is true for Dutch. Figure 14.7a
illustrates a contour containing three ADs, of which the first two have H*L
!H*L L*H (all accented items are proper names). Phrasal downstep has
occurred twice, once on the second AD' and once on the third: observe that the
first H*s of the three AD's form a downstepping pattern. This view of reset as
a wheels-within-wheels model of downstep would allow the sizes of the two
downstep factors to be different. Conceivably, it could also account for
Thorsen's (1980b) finding that the size of accentual downstep (or equiva-
lently in her description, the slope of the declination; see also Ladd [1984]) is
smaller for medial phrases than for initial or final ones. Medial phrases have
a reduced register relative to preceding phrases, and the absolute size of the
accentual downstep will therefore be smaller. And the utterance-final phrase
will have a relatively steeper declination slope because of final lowering at the
level of the utterance.
Just as in the case of accentual downstep, we assume that phrasal


downstep is morphologically triggered. That is, sequences of ADs do not
have to be downstepped. Figure 14.7b is an example of a series of nondown-
stepped ADs. The phonological coherence of the three ADs in contour 14.7b
derives from the fact that all three AD's begin at the same pitch level, even
though within each AD' accentual downstep has taken place. (Phonetically,
because there are only two accents in the AD, this manifests itself as a timing
of the final fall before the peak of the accented syllable.) Crucially, the scaling
of the first H* determines that of the first H* of each following AD' in
contour 14.7b just as much as it does in contour 14.7a. That is, we are not
dealing with a sequence of "unconnected" phrases in the latter, as opposed to
a "connected" series in the former. We will call the constituent over which
the speaker's pitch range choice remains in force the scaling domain. Phrasal
downstep, then, is a morpheme that optionally attaches to the scaling
domain: if it is there, a contour like 14.7a results, if it is not, one like 14.7b.

14.1.4 Some issues concerning L* scaling


There are two senses in which Downstep could be a register shift. In one
interpretation, Downstep affects the H* and the L of a series of H*L's inside
an AD', but would leave the scaling of any following L*H pitch accent
unaffected. In the other interpretation Downstep causes both H* and L*
targets to be scaled down (Ladd 1987, 1990). In our model, accentual
downstep and phrasal downstep are mathematically independent, and it
could therefore incorporate the effect of either or both types of downstep on
L* scaling. So far, we have not addressed the issue of L* scaling, but for the
time being we take the position that the scaling of L* is only affected by
phrasal downstep. In other words, the L* target does not depend on the
number of (accentually) downstepped H*s preceding it within the same AD',
but L*s will be lower as more downstepped AD's precede them.
A further issue bearing on the scaling of L* targets is the effect of
prominence. A contour can retain its identity while being realized with
different degrees of prominence. More prominent accent peaks have higher
peak values, ceteris paribus, than less prominent ones (Rietveld and Gussen-
hoven 1985). While the effect of increased prominence on H* would thus
appear to be uncontroversial, the effect of increased prominence on L* is less
clear. Is L* lowered (see Liberman and Pierrehumbert 1984), does it remain
fixed (see Gussenhoven 1984), or is it raised (Steele 1986)? Since prominence
effects on L* scaling in Dutch await investigation, we will keep this question
open.

14.2 An implementation model for Dutch intonation
On the basis of the above discussion, we conclude that, minimally, a model
for implementing Dutch intonation contours will need to include the
following five parameters:
1 one parameter for speaker-dependent differences in general pitch range;
2 one parameter to model the effect of overall (contour-wide) prominence;
3 one to control the distance between the targets for H* and L*;
4 one to model the effect of accentual downstep within the phrase;
5 one to model the effect of phrasal downstep.
Our work on implementation was informed by Ladd (1987a), which sets out
the issues in F0-implementation and proposes a sophisticated model of target
scaling. His model, inspired by three sources (Pierrehumbert 1980; Garding
1983; Fujisaki and Hirose 1984) is given in (5).
(5) F0(n) = Fr * N^(f(Pn) * f(An))
F0(n) is the pitch value of the nth accent in Hz;
Fr is the reference line at the bottom of the speaker's range (lowest pitch reached);
N defines the current range (N > 1.0);
f(Pn) is the phrase function with f(Pn) = f(Pn-1) * d^s and f(P1) = 1; this phrase function scales the register down or up;
d is the downstep factor (0 < d < 1);
s is +1 for downstep or -1 for upstep;
f(An) is the accent function of the form W^(E*T), which scales H* and L* targets;
W dictates register width, i.e. the distance between H* and L* targets;
T represents the linguistic tone (T = +1 for H*, T = -1 for L*);
E is an emphasis factor; its normal value is 1.0; values > 1.0 result in more emphasis, i.e. the higher scaling of H*s and lower scaling of L*s.
We distinguish the actual scaling parameters Fr, N, W, d, and E from the
binary choice variables T and s, which latter two are specified as +1 or -1
by the phonological representation. (In fact, s may be -2, -3, to effect
double, treble, etc. upsteps: cf. (4)). Of the scaling parameters, Fr is assumed
to be a speaker constant, while N, W, and E are situation-dependent. Notice
that increasing W has the same effect as increasing E.4 Figure 14.8 gives the
model in pictorial form. It should be emphasized that in the representation
given here N, W, and d do not represent absolute differences in F0-height.
When comparing this model with our list of five parameters, we see that Fr,
N, and W correspond to parameters 1, 2, and 3, respectively. Ladd's d is used
4
Parameter E is no longer included in the version of the model presented in Ladd (1990). The
difference was apparently intended to be global vs. local: E can be used to scale locally defined
prominence, as it is freely specifiable for each target.


Figure 14.8 Ladd's model for scaling accent targets

for both downward and upward scaling, which, in our model, is taken care of
by the two downstep parameters 4 and 5. However, an important aspect of
Ladd's formula is that it allows for the independent modeling of intraphrasal
and interphrasal effects. Although our conception of downstep and reset
differs from that of Ladd, we can retain his formula in a modified form. We
include two downstep factors, which makes s superfluous. We also exclude E
(see note 4). The model we propose scales all H* and L* targets within a
scaling domain, i.e. the domain over which N and W remain constant. It uses
the same mathematical formula to scale targets in scaling domains with and
without phrasal downstep, as well as targets in AD's with and without
accentual downstep.
(6) F0(m,n) = Fr * N^(f(Pm) * f(An))
F0(m,n) is the pitch value for the nth accent in the mth phrase;
Fr is the reference frequency determining general pitch range;
N defines the current range (N > 1.0);
f(Pm) = dp^(Sp*(m-1)), the phrase function, scales the phrase reference lines for the mth phrase;
dp is the phrasal downstep factor (0 < dp < 1);
Sp indicates whether phrasal downstep takes place or not; Sp = +1 if it does and Sp = 0 if it does not;
f(An) = W^T * da^(1/2*Sa*(1+T)*(n-1)), the accent function, scales H* and L* targets;
W determines register width, the distance between H* and L* targets;
da is the downstep factor (0 < da < 1) for downstepping H* targets within the phrase;
T represents the linguistic tone (T = +1 for H*, T = -1 for L*);
Sa indicates whether accentual downstep takes place in that AD'; Sa = +1 if it does and Sa = 0 if it does not;
the inclusion of the weighting factor 1/2 in the accent function ensures that the exponent of da is 1 when n = 2, 2 when n = 3, and so on.

Figure 14.9 Our own model for scaling accent targets

Again we distinguish the actual scaling parameters, Fr, N, W, da, and dp,
corresponding to the parameters mentioned under 1 to 5 above, respectively,
from the binary choice variables, T, Sa, and Sp. The latter serve to
distinguish between H* and L* targets, between AD's with and without
accentual downstep, and between scaling domains with and without phrasal
downstep, respectively. A pictorial representation of this model is given in
figure 14.9. Again, we emphasize that N, W, da, and dp should not be taken to
represent absolute Fo differences.
In figure 14.9, the high reference line represents the Fo for the H* targets
not subject to accentual downstep, while the low reference line represents the
Fo for all L* targets. If phrasal downstep is in force (Sp= 1), we assume, at
least for the time being, that its effect amounts to a downward scaling of both
reference lines. If it is not (Sp = 0), both reference lines are scaled equally high
for all AD's within the scaling domain. The phrase function we propose
accomplishes both: f(Pm) = dp^(m-1) if phrasal downstep takes place, and
f(Pm) = 1 if it does not.
The scaling of the targets within the AD' is taken care of by the accent
function given above. For an AD' without accentual downstep (Sa = 0), the
accent function reduces to f(An) = W^T. Consequently, all H* targets
(T = +1) are scaled on the phrase high-line (7) and all L* targets (T = -1) on
the phrase low-line (8).
(7) Fr * N^(f(Pm) * W)
(8) Fr * N^(f(Pm) / W)

If accentual downstep takes place (Sa = 1), the accent function becomes



f(An) = W^T * da^(1/2*(1+T)*(n-1)). The pitch value for the nth H* target (T = +1)
is given by (9), which scales the first H* target (n = 1) on the high-line. L*
targets (T = -1) are scaled on the low-line (10). Note that the scaling of L* is
not affected by accentual downstep, but that it is by phrasal downstep (see
14.1.4).
(9) Fr * N^(f(Pm) * W * da^(n-1))
(10) Fr * N^(f(Pm) / W)

The model assumes that the parameters Fr, da, and dp are constants, with
(speaker-specific) values which are constant across utterances. In fact, the
downstep factors dp and da may also be constant across utterances and
speakers. (This is one of the questions to be dealt with below.) If these
assumptions hold, all target values in a scaling domain are fully determined
by the speaker's choice of N and W for that particular scaling domain.
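By way of illustration, the scaling defined by (6)-(10) can be sketched in a few lines of Python. The function names and the parameter defaults below (Fr = 100, N = 2.0, W = 1.2, da = 0.8, dp = 0.9) are our own illustrative choices, not values prescribed by the model; in the model itself N and W are freely set per scaling domain.

def f_phrase(m, dp, Sp):
    """Phrase function f(Pm) = dp ** (Sp * (m - 1)); phrases are counted from 1."""
    return dp ** (Sp * (m - 1))

def f_accent(n, T, W, da, Sa):
    """Accent function f(An) = W**T * da ** (0.5 * Sa * (1 + T) * (n - 1));
    accents are counted from 1 within the AD', T = +1 for H* and -1 for L*."""
    return (W ** T) * da ** (0.5 * Sa * (1 + T) * (n - 1))

def f0_target(m, n, T, Fr=100.0, N=2.0, W=1.2, da=0.8, dp=0.9, Sa=1, Sp=1):
    """Formula (6): F0(m,n) = Fr * N ** (f(Pm) * f(An)), in Hz."""
    return Fr * N ** (f_phrase(m, dp, Sp) * f_accent(n, T, W, da, Sa))

# A downstepping AD' of three H* targets in the first phrase of a scaling domain.
print([round(f0_target(1, n, T=+1), 1) for n in (1, 2, 3)])
# An L* target in the same phrase sits on the low reference line, cf. (10).
print(round(f0_target(1, 1, T=-1), 1))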

14.3 Fitting experiments


The aim of the first experiment was to assess whether the accentual downstep
factor da is independent of speaker, prominence, and/or the number of
accentually downstepped accents within the AD', while the second exper-
iment addressed the issue whether the phrasal downstep factor dp is
independent of speaker and/or prominence. To this effect, we collected data
from two male and two female speakers,5 all staff members of the
University of Nijmegen, who produced the required contours with varying
degrees of prominence. The contours used in the second experiment also
allowed us to assess the relative values of the two downstep factors, da and
dp.

14.3.1 The accentual downstep factor da


14.3.1.1 Method
To preclude consonantal perturbations and effects of intrinsic pitch, the
downstepping contours (of the type H*L !H*L !H*L etc.) were produced on
sentences consisting of three or more instances of the nonsense word
maaMAAmaa, the longest sentence consisting of six of them. The speakers
were instructed to produce the sentences with four different degrees of
prominence. For each speaker, a total of ten sets of 16 sentences (4 lengths
5
Originally, we had more speakers in the first experiment. However, some speakers were not
able to produce downstepping contours on reiterant speech while still retaining the impression
that they were producing normal sentences. We therefore discarded their utterances.


times 4 prominence levels) were recorded in two sessions, which were one
week apart. Subsequently, some utterances were judged to be inadequate,
and discarded. In order to obtain an equal number of utterances for each of
the four speakers, we randomly discarded as many utterances as necessary to
arrive at a total of thirty-two sentences per length for each subject. The
speech material was digitized with a sampling rate of 10 kHz. The resulting
speech files were analyzed with the LPC procedures supplied by the LVS
package (Vogten 1985) (analysis window 25 msec, prediction order 10 and
pre-emphasis 0.95). For all nonfinal H* targets we measured the highest pitch
reached in the accented vowel. This procedure resulted in Fo values which in
our opinion form a fair representation of the height of the plateau that could
visually be distinguished in the Fo contour. It is somewhat more problematic
to establish an intuitively acceptable norm for measuring the final H* target.
Visual inspection showed that the Fo contour sloped down from the plateau
for the prefinal target to a lower plateau. The slope between them was
generally situated in the first half of the final accented syllable. Since the
contour did not offer clear cues as to where the last target should be
measured, we arbitrarily decided to take a measurement at the vowel
midpoint. Where the pitch-extraction program failed because vocal fry had
set in, we measured the pitch period at the vowel midpoint and derived an Fo
value from it. Because our model does not (yet) include a final-lowering
factor, these final values will be excluded from the fitting experiment.
However, we will use them to obtain an estimate for the speaker's general
pitch range.

14.3.1.2 Data fitting


For each speaker, the data to be fitted are thirty-two series (differing in
prominence) of two, three, four, and five values (measured in utterances of
three, four, five, and six accents). The 448 data points for each speaker will be
referred to as Mksn, where M stands for "measured," the subscript k gives the
number of accent values in the series (2 ≤ k ≤ 5), s indicates the series number
within the group of a particular length (1 ≤ s ≤ 32), and n gives the accent
position within the series (1 ≤ n ≤ k). The Fo targets predicted by the model
will be referred to as Pksn, where P stands for "predicted." The prediction
equation for Pksn is derived as follows. With only one phrase the phrase
function f(P) equals 1. Accentual downstep takes place, so Sa = 1. Because
we are dealing with H* targets only (T = +1), the distinction between N and
W need not be retained: the two can be considered a single range parameter
(R) with R = N^W, which varies from utterance to utterance as indicated in the
subscripts, Rks. To test the hypothesis that the value of da might depend on
the number of accents, we fitted the data with two different models, for each
of the four speakers separately. The first model incorporates one Fr value for

all the speaker's accent series and four da values, one for each length. This is
indicated in the subscript, dak. The second model incorporates one Fr value
and one da value, irrespective of series length, forcing the four daks to have
the same value. With this provision, the same mathematical formula can be
used for both models. Applying the above to our model's general formula,
the predicted value for the nth accent in a series of k accent values thus
becomes (11).
(11) Pksn = Fr * Rks^(dak^(n-1))

In an optimizing procedure, the parameters to be optimized (here the daks)
are assigned initial settings, while for the other parameters in the model (here
Fr and the Rks) fixed values are chosen. The model's predictions are then
calculated (here the Pksn), as well as a distance measure between observed and
predicted data (here the Mksn and Pksn) (see below for the measure used here).
Subsequently, this distance measure is minimized by an optimizing pro-
cedure. (We used iterative hill-climbing techniques [Whittle 1982]). The dak
values that go with this minimum are taken as the optimized values.
The (fixed) value of Fr was chosen on the assumption that the range is set
by the height of the first accent in a series. We equated Mks1 with Pks1, and
because Pks1 = Fr * Rks, Rks is calculated as Mks1/Fr. Subsequently, we set the
daks at the values found in the optimizing procedure and, in a second
procedure, optimized the Rks estimates. Our definition of the initial Rks
estimates entails a perfect fit for the first accent in each series. We therefore
expected this secondary optimizing to result in a (somewhat) closer fit, at
least for those series where the first accent is "out of line." All values for
parameters and the goodness-of-fit index given in this paper are as calculated
after this secondary optimizing.
In our distance measure we weighted the P minus M distances with the
absolute M values, which gives us a percentual measure of the distance. We
further wanted the larger (and thus more serious) percentual deviations to
carry more weight, and therefore squared the percentual distances. Thus, the
measure (D) used in the optimizing procedure is the sum (over all data points,
i.e. with summation over k, s, and n) of the squared percentual distances (12).
(12) D = Σk Σs Σn (Pksn - Mksn)^2 / (Mksn)^2
The same distance measure was used in optimizing the Rks, but applied here
to all data points within a series, i.e. with summation over n only.
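To make the fitting procedure concrete, here is a simplified sketch of the length-independent (single-da) case. It substitutes a generic bounded scalar optimizer for the iterative hill-climbing routine used in the study, omits the secondary re-optimization of the Rks, and uses invented toy data; none of it is the authors' own software.

import numpy as np
from scipy.optimize import minimize_scalar

def predict(Fr, R, da, k):
    """Predicted H* targets for one series, following (11): Fr * R ** (da ** (n - 1))."""
    n = np.arange(k)                      # n - 1 = 0 .. k-1
    return Fr * R ** (da ** n)

def distance(da, series, Fr):
    """Summed squared percentual distance D over all series, cf. (12)."""
    D = 0.0
    for M in series:                      # M: measured values of one accent series
        R = M[0] / Fr                     # range estimate from the first accent
        P = predict(Fr, R, da, len(M))
        D += float(np.sum((P - M) ** 2 / M ** 2))
    return D

# Toy data: two downstepping series of measured H* values in Hz, Fr fixed beforehand.
series = [np.array([230.0, 195.0, 170.0]), np.array([210.0, 180.0, 160.0])]
fit = minimize_scalar(distance, bounds=(0.01, 0.99), args=(series, 100.0), method="bounded")
print("optimized da:", round(fit.x, 2))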

14.3.1.3 Results
To obtain an idea of the best possible fit the model allows, we first ran, for
each speaker separately, optimizing analyses for a wide range of Fr values.
(Of course, different Fr values lead to different da values because of the


Table 14.1 Optimal combination of Fr and da values for a length-dependent-da model and for a length-independent-da model, and goodness-of-fit indexes for the four speakers separately (for further details, see text)

                 Length-dependent                     Length-independent
Speaker   Fr    da2   da3   da4   da5   Index        Fr    da    Index
F1        152   0.62  0.64  0.65  0.68  6.03         156   0.64  6.57
F2        146   0.60  0.60  0.63  0.63  5.66         151   0.60  5.94
M1        36    0.78  0.81  0.83  0.84  9.37         48    0.79  10.80
M2        43    0.77  0.80  0.83  0.84  7.07         56    0.79  9.27

mathematical interdependence of the two.) We thus established the optimal
combination of values for Fr and the four daks for what will be termed the
length-dependent-da model, and for Fr and da for the length-independent-da
model. These values are listed in table 14.1 for the four speakers separately
(F1 and F2 are the female, M1 and M2 the male speakers).
Although the distance measure is quite effective in the optimization of dak,
it is somewhat opaque if one wants to assess how well a model predicts the
measured data. We defined our goodness-of-fit index as the greatest differ-
ence between the predicted and measured values to be found in any set of 95
percent of these 448 differences. We express this difference as an absolute
percentage of the measured data, i.e. as |(P-M)/M|. For instance, a goodness-
of-fit index of 8.2 percent indicates that the absolute difference between P
and M is larger than 8.2 percent of M for only 22 (or 5 percent) of the 448
data points. These indexes are listed in table 14.1.
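On our reading, this index is simply the 95th percentile of the absolute percentual deviations. A minimal sketch (the array contents below are invented, not the study's data):

import numpy as np

def fit_index(P, M):
    """Goodness-of-fit index as we read it: 95th percentile of |(P - M) / M| in percent."""
    return np.percentile(np.abs((P - M) / M) * 100.0, 95)

# P: model predictions, M: measured values, pooled over all data points of a speaker.
P = np.array([230.0, 196.0, 168.0, 150.0])
M = np.array([228.0, 190.0, 175.0, 148.0])
print(round(fit_index(P, M), 1))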
Although the indexes give a general idea of the model's adequacy, they do
not reveal any consistent overshooting or undershooting for particular
accent positions. We therefore calculated the mean of the 32 percentual
residuals, for each accent position for each length. For both models we
observed a slight tendency in the residuals for the last accent in the series with
four and five accent values to be negative (meaning the model undershoots
these accent positions). This tendency may result from an attenuation of the
downstep rate towards the end of the longer series, possibly to ensure that
enough space is left for final lowering. However, the effect was very small
indeed, and not consistent across speakers.
In addition to a gradual attenuation of da, it is conceivable that for larger
series of accents speakers use a smaller step down (i.e. a larger da) in order to
have enough space for all the H* targets in the downstepping contour.
Alternatively, the speakers could increase their range, i.e. start at a higher Fo
value. Both alternatives would be instances of what Liberman and Pierre-
Table 14.2 Fr estimates as "endpoint average" and optimal da values for a length-dependent-da model and for a length-independent-da model, and the indexes, for the four speakers separately

                 Length-dependent                     Length-independent
Speaker   Fr    da2   da3   da4   da5   Index        Fr    da    Index
F1        148   0.63  0.65  0.67  0.70  5.96         148   0.68  6.55
F2        154   0.57  0.57  0.59  0.59  6.01         154   0.58  6.10
M1        77    0.61  0.65  0.66  0.66  14.22        77    0.65  14.28
M2        77    0.64  0.69  0.72  0.73  9.60         77    0.71  10.52

humbert (1984: 220) call "soft" preplanning, i.e. behavioral common sense,
as opposed to "hard" preplanning, which would involve right-to-left compu-
tation of the contour. From the dak values for the length-dependent-da model
it is clear that all speakers do to some extent have higher values for da with
more targets. However, as was also the case in the data collected by
Liberman and Pierrehumbert (1984), this trend is small. Indeed, a compari-
son of the indexes in table 14.1 shows that the inclusion of a length-
dependent da in the model results in only a limited gain. To test whether
some speakers adjusted the pitch height of the first accent to the number of
accents, we subjected the Fo values for the initial accents to an analysis of
variance. The effect of the factor series length was not significant, F(3,496) =
1.36(p>0.10).
While the values for Fr in table 14.1 are optimal, they are in no way theory-
based. In order to assess how realistic it would be to assume a language-
specific da, we ran optimizing procedures for both models with an externally
motivated estimate of the general pitch-range parameter. To this end, we
adopted Ladd's operational definition of Fr as the average of endpoint values
of utterance-final falling contours. This corresponds with the mean Fo of the
utterance-final plateaus. Table 14.2 gives the results. The indexes appear to
be higher than the optimal values. (The one exception must probably be
attributed to the procedurally inevitable two-step optimizing.) They are
nearly optimal for F1 and F2, somewhat higher for M2, and quite a bit
higher for M1. The increase in da values with increasing number of targets
appears to be independent of Fr and, again, only a small improvement is
obtained with a length-dependent da. Finally, observe that this particular Fr
estimate does not result in speaker-independent da values.
Since the mathematical interdependence of Fr and da is of a nonlinear
nature, a different Fr might give speaker-independent da values. If we take a
value of approximately two-thirds of the endpoint average, the da obtained
Table 14.3 Fr estimates as "intonational locus" and optimal da values for a length-dependent-da model and for a length-independent-da model, and the indexes for the four speakers separately

                 Length-dependent                     Length-independent
Speaker   Fr    da2   da3   da4   da5   Index        Fr    da    Index
F1        100   0.76  0.79  0.81  0.83  7.42         100   0.81  9.09
F2        100   0.72  0.74  0.77  0.79  7.55         100   0.77  9.03
M1        50    0.73  0.77  0.78  0.79  10.21        50    0.78  10.99
M2        50    0.74  0.78  0.81  0.82  7.26         50    0.81  9.21

appears to be speaker-independent. This particular Fr estimate may be
viewed as the "intonational locus," i.e. a theoretical target which is never
actually reached. The results are given in table 14.3. For M1 and M2 the
index values do not differ greatly from the maximally attainable, for F1 and
F2 they are somewhat higher. Again we observe a slight length dependence
for da.
We therefore conclude that (1) accentual downstep can be modeled with a
single downstep factor, which is independent of prominence and the number
of downstepped accents, but that (2) the answer to the question whether da is
a speaker-independent parameter is determined by the way in which the
general pitch-range parameter Fr is defined.

14.3.2 The phrasal downstep factor dp


14.3.2.1 Method
Because it was felt that it might be too strenuous for our four speakers to
produce the required contours on reiterant speech, the speech material this
time consisted of a meaningful sentence. The accented syllables (all part of
proper names) had either /e/ or /o/ in the peak to preclude effects of intrinsic
vowel pitch, and a sonorant consonant in the onset to preclude pitch
perturbations. The sentence was MEREL zag LENIE en REMY, LEO zag NORA en
NEELTJE, MARY zag MONA en LEENDERT, en KAREL zag ANNE, and the contour
was (H*L !H*L L*H) !(H*L !H*L L*H) !(H*L !H*L L*H) !(H*L !H*L).
This contour is a single scaling domain, annotated with the morpheme
[DOWNSTEP], so that the four AD's it dominates form a downstepping series.
The last AD' was included to finish the contour naturally and to provide
measurements of the final plateau to determine the general pitch-range
parameter (see 14.3.1.1). The accentually downstepped second H* in the first
three AD's allows us to also assess an optimal value for da and thus to


Table 14.4 Values for Fr, dp, da, and the goodness-of-fit index for (a) the optimal parameter combination, (b) the "endpoint average" Fr estimate, and (c) the "intonational locus" Fr estimate, for four speakers separately

          (a) Optimum                   (b) "Endpoint average"          (c) "Intonational locus"
Speaker   Fr    dp    da    Index       Fr    dp    da    Index        Fr    dp    da    Index
F1        156   0.73  0.35  6.64        135   0.82  0.54  6.91         100   0.89  0.72  7.61
F2        160   0.78  0.29  6.07        150   0.81  0.37  6.16         100   0.90  0.63  7.82
M1        70    0.84  0.61  7.85        74    0.83  0.59  7.79         50    0.89  0.71  8.09
M2        90    0.83  0.09  10.13       65    0.88  0.34  11.13        50    0.90  0.46  12.72

compare the two downstep factors, dp and da. The speakers produced the
utterance at four prominence levels with five replications for each level,
reading them from a stock of twenty cards with a different random order for
each speaker.
The same procedure for digitizing, speech-file analysis, and Fo measure-
ment was followed. Since we needed to fit the H* targets in the first three
AD's, we collected 120 data points for each speaker. These are referred to as
Msmn, M standing for "measured," s indicating the replication (1 ≤ s ≤ 20), m
the AD' within the scaling domain (1 ≤ m ≤ 3), and n the target within the
AD' (1 ≤ n ≤ 2). With only H* targets (T = +1), N and W merge into a single
range parameter R (= N^W). For a given (fixed) value of Fr, the Rs for a
particular replication is estimated as Ms11/Fr. Sp = 1 and Sa = 1, so Psmn, the
predicted (P) pitch value for the nth H* target in the mth AD' in the sth
replication, is given by (13).
(13) Psmn = Fr * Rs^(dp^(m-1) * da^(n-1))

With Fr and the Rs fixed, we optimized dp and da using the same distance
measure as before. In a subsequent procedure, we set dp and da at the
optimal values obtained and optimized the Rs. We then calculated the
goodness-of-fit index.

14.3.2.2 Results
As before, we ran optimizing procedures for a number of different Fr values
to obtain the best possible fit the model allows. We also optimized dp and da
for both the "endpoint average" and the "intonational locus" estimates of
Fr. Table 14.4 gives the results.
The increase in the index values (compared to the best attainable in table
14.4a) is somewhat smaller for the "endpoint average" estimate of Fr than

for the "international locus" estimate. The latter estimate has the advantage
that dp is apparently speaker-independent. However, for both estimates the
da values vary considerably across speakers. In fact, no Fr estimate led to
speaker-independent values for both dp and da, but a combination of the
"intonational locus" Fr, a dp of 0.90, and a da of 0.70 (as observed for Fl and
Ml) would seem to be a reasonable choice to be used in a computer
implementation of our model.
For both Fr estimates and for all speakers, dp is larger than da, which
supports our view that phrasal downstep entails a smaller step down than
accentual downstep. Comparing the da values for the "locus" estimate across
the two experiments, we observe that here da is lower (and sometimes
considerably so) than in our study of accentual downstep. It would appear
that speakers tend to use a smaller da, so a larger step down, if they have to
realize accentual downstep in combination with phrasal downstep, possibly
in an attempt to keep the two as distinct as possible.
We conclude that (1) phrasal downstep can be modeled with a single
downstep factor which is independent of prominence, but that (2) the answer
to the question whether it is speaker-independent is determined by the way in
which the general pitch-range parameter is defined.

14.4 Concluding remarks


In this paper we argued that the intonational phenomena of downstep and
reset in Dutch should be modeled as the composite effect of phrasal
downstep, the downward scaling of a phrase's (AD') register relative to a
preceding AD', and accentual downstep, the downward scaling of H* targets
within the AD'. We presented data from four speakers (two male and two
female) showing that both downstep factors are independent of prominence
and that the accentual downstep factor only marginally depends on the
number of (accentually) downstepped H*s. We further demonstrated that
the values of both downstep factors depend on the value of the general pitch
range parameter Fr and that an "intonational locus" estimate of this
parameter gives speaker-independent values for the phrasal downstep factor
and for the accentual downstep factor in a single AD'. With more AD's, no
single speaker-independent da was found, but a value lower than the one for
a single AD' appears to be appropriate.
We know of no independent theoretical reasons to choose a particular
estimate of general pitch range. For the implementation of the model, we
therefore prefer the "intonational locus" estimate of the general pitch-range
parameter (i.e. Fr is 100 for women and 50 for men), because this allows us to
implement a single speaker-independent value for the phrasal downstep

factor (dp is 0.90) as well as for the accentual downstep factor (da is 0.80 in a
single AD' and 0.70 if there are more AD's).

Comments on chapter 14
NINA GRØNNUM (formerly THORSEN)
Introduction
My comments concern only van den Berg, Gussenhoven, and Rietveld's
proposed analysis and model of Dutch intonation. As I am not a Dutch
speaker, and I do not have first-hand knowledge of data on Dutch inton-
ation, my comments are questions and suggestions which I would like
readers and the authors to consider, rather than outright denials of the
proposals. Nevertheless, it will be apparent from what follows that I think
van den Berg, Gussenhoven, and Rietveld's description obscures the most
important fact about accentuation in Dutch, and that it tends to misrepresent
the relevant difference between contours in some instances because it
disregards linguistic function (in a narrower as well as a wider sense). The
purported phonological analysis thus nearly reduces to a phonetic transcrip-
tion (though a broad one) and not always an adequate one at that, as far as I
can judge. To mute a likely protest from the outset: I am not generally
against trying to reduce rich phonetic detail to smaller inventories of
segments, features or parameters: it is the nature of van den Berg, Gussen-
hoven, and Rietveld's description and its relevance to a functional descrip-
tion that I question.
I begin with some general criticisms which bear upon the concrete
examples below. First, it is difficult to evaluate the adequacy of a description
which is based on examples, in the literal sense of the word, i.e. sample
utterances recorded once, by one speaker. Second, we are led to understand
that the various contours accompanying the same utterance are meaningfully
different, but we are not told in which way they are different, what kind of
difference in meaning is expressed, and whether or not Dutch listeners would
agree with the interpretation. Third, I miss an account of the perceived
prominence of the accented syllables in the examples, which might have been
relevant to the treatment of downstep. And fourth, in that connection I miss
some reflections about what accentual and phrasal downstep are for, what
functions they serve.

Accentuation
From the publications of Cohen, Collier, and 't Hart (e.g. 't Hart and Cohen
1973; 't Hart and Collier 1975; Collier and 't Hart 1981), I have understood
Dutch to epitomize the nature of accentuation: accented syllables are stressed
syllables which are affiliated with a pitch change.1 Beyond that - as far as I
can see - there are few restrictions, i.e. the change may be in either direction,
it may be quick or slow, it may be coordinated with either onset or offset of
the stressed syllable, and it may be bidirectional. Not all the logical
combinations of directions, timings, and slopes occur, I suppose, but many
of them do, as witnessed also by van den Berg, Gussenhoven, and Rietveld's
examples. Nor does a pitch change necessarily evoke an accent, as for
example when it is associated with unstressed syllables at certain boundaries.
From this freedom in the manifestation of accent, a rich variety of patterns
across multiaccented phrases and utterances arises.2
I would therefore like to enquire how far one could take a suggestion that
the underlying mechanism behind accentuation in Dutch is pitch change, and
that the particular choice of how pitch is going to change is a combination of
(1) restrictions at phrase and utterance boundaries, connected with utterance
function and pragmatics, (2) focus distribution, (3) degree and type of
emphasis, (4) syntagmatic restrictions (i.e. certain pitch patterns cannot
precede or follow certain other ones if pitch changes are to be effected and
boundary conditions met), and (5) speech style, i.e. factors outside the realm
of phonology/lexical representation. I realize, of course, that some of these
factors (especially speech style and pragmatics) are universally poorly
understood and I cannot blame van den Berg, Gussenhoven, and Rietveld
for not incorporating them in their model. I do think, however, that they
could have exploited to greater advantage the possibility of separating out
effects from utterance function and boundary conditions, and they could at
least have hinted at the possible effects of the two different elocutionary styles
involved in their examples (complete utterances versus lists of place and
1
Here and throughout I will use "pitch" to refer to both Fo and perceived pitch, unless I need to
be more specific.
2
Lest it be thought that I am implicitly comparing van den Berg, Gussenhoven, and Rietveld's
analysis to Cohen, Collier, and 't Hart's, and preferring the latter: this is not so. Even if the
latter's description in terms of movements is phonetically accurate, it would gain from being
further reduced to a specification of target values, pinning down the perceived pitch (pitch
level) of the accented syllables: many rapid local Fo changes are likely to be produced and
perceived according to their onsets or offsets, although we can (be trained to) perceive them as
movements, especially when they are isolated. However, Cohen, Collier, and 't Hart's purpose
and methodology (to establish melodic identity, similarity, categorization) apparently have
not motivated such a reduction. A further difficulty with them is that they have not committed
themselves to a specification of the functional/pragmatic circumstances under which specific
combinations of accent manifestations are chosen. (Collier [1989] discusses the reasons why.)
This latter is, however, equally true of van den Berg, Gussenhoven, and Rietveld.

proper names). As it is, they represent every contour in their examples as a
sequence of L*H or H*L pitch accents, with utterance final L% and H%
tones added, with conventions for representing slow and level movements,
and a downstep morpheme to characterize series of descending H*s.3

Tonal representation of accents


Let me first question the adequacy of some of the tonal representations that
van den Berg, Gussenhoven, and Rietveld assign to accentual pitch move-
ments, and thereby also challenge their claim that their model is "realistic."
(By realistic I take them to mean a model which is relevant to speech
production and perception, which models both speaker and listener beha-
vior, and one which is not unduly abstract.)

Low-pitched final accents
The final movement in e.g. figures 14.2a and 14.3b is rendered as H*L, i.e. the
stressed syllable is associated with a H in both instances. My own experience
with listening to speech while looking at Fo traces made me strongly doubt
that the two are perceptually equivalent. I would expect the one in figure
14.3b to be perceived with a low-pitched stressed syllable. To check this, I
recorded two utterances which were to resemble as closely as possible the one
in figure 14.3b, globally and segmentally, but where one was produced and
perceived with a high-pitched, the other with a low-pitched, stressed syllable at
the end. The result is shown in figure 14.10. (Note that neither of these
contours is an acceptable Standard Danish contour; intonationally they are
nonsense.) The two traces look superficially alike, but the fall is timed
differently with respect to the stressed vowel, corresponding to the high
(figure 14.10a) and low (figure 14.10b) pitch level of the stressed syllable.
Figure 14.10b rather resembles figure 14.3b, from which I infer that it should
be given a L* representation. This would introduce a monotonal L* (or a
L*L) to the system.
The same suspicion attaches to the final !H* in figures 14.5a, b, and 14.7a. I
suspect that since van den Berg, Gussenhoven, and Rietveld have decided
that utterances can only terminate in H*LL% or L*HH% (or H*LH%), and
since figure 14.3b (and figures 14.5, 14.7a) is clearly not L*HH%, it must be
H*LL%. I return to this on pages 363-4 below.
I am, likewise, very skeptical about the reality of the assignment of !H*L to
kleren, schoenen, and school in figure 14.7b. I would expect the perceptually
salient part of such steep movements to correspond to the Fo offset rather
3
They do not mention tritonal pitch accents, and in the following I will assume that those
H*LHs which appear in Gussenhoven (1988) have been reinterpreted as L*LH%, which is
reasonable, considering his examples.


Figure 14.10 Two Dutch intonation contours imposed on a Danish sentence, with a final
stressed syllable that sounds high (upper plot) or low (lower plot). The Danish text is Lille
Morten vil have mere manna ("Little Morten wants more manna")

than the onset of the stressed vowel. Is that not what led to the assignment of
H*(L), rather than L*H, to mooiste, duurste (together with the acknowledg-
ment that the Fo rise is a consequence of the preceding low value on de and of
the syllabic structure with a voiced initial consonant and a long vowel, cf.
beste in the last phrase)? In other words, if the Fo rises in mooiste, duurste are
assigned tones in accordance with the offset of their Fo movements, why are
the falls in kleren, schoenen, school not treated in the same way? Even though
phonological assignment need not and should not exactly mirror phonetic
detail, I think these assignments are contradictory. (An utterance where all
the stressed syllables had short vowels in unvoiced obstruent surroundings,
like beste, might have been revealing.) What are the criteria according to
which, e.g. mooiste and kleren are variants of the same underlying pattern,
rather than categorically different?

Accentuation and boundaries


Now let me proceed to selected examples where I think van den Berg,
Gussenhoven, and Rietveld's analysis may obscure the real issue. They state
that Figure 14.2a has an appreciable prosodic boundary. What is so
appreciable about it when none is present in the same location in figure
14.1a? Are we to understand that if the last word was spliced off from 14.1a
and 14.2a, then on replaying them listeners would hear a prosodic boundary
in 14.2a but not in 14.1a? If not, then the boundary is not a prosodic
boundary per se, but a "rationalization after the fact" (of the presence of a
succeeding accent), i.e. something which is only appreciated when its
conditioning factor is present. Indeed, it may be asked whether Dutch
listeners would agree that such a boundary is present even in the full version
of figure 14.2a. If not, the boundary is merely a descriptive device, invoked in
order to account for the steep fall to L in figure 14.2a versus the slow decline
in figure 14.3a. Would it be possible, instead, to consider the quick versus
slow fall over -warde wil meer to be a property of the accent itself, connected
to emphasis or speech style? (For example, do quick falls lend slightly more
prominence to the accent? Are they characteristic of more assertive speech
styles?)
The boundary assignment in figure 14.2b is much more convincing, since it
is associated with an extensive and rapid movement, affiliated with
unstressed syllables. With such a boundary in 14.2b, is there any choice at all
of pitch accent on Leeuwarden? That is, if a boundary is to be signaled
unambiguously prosodically by a pitch change to the succeeding phrase, and
if the succeeding phrase is specified to start low (whether by default or by a
L%), then the only bitonal pitch accent Leeuwarden can carry is L*H. This is
then not a choice between different lexical representations: the L*H manifes-
tation is a choice forced by circumstances elsewhere.
Finally, if mannen is more adequately represented as L* in figure 14.3b (see
page 361 above), then the preceding unstressed syllable(s) must be higher,
since the accent must be signaled by a change to the low man-. (Note that it
cannot be signaled by a change to a high man-, since this would violate
demands for a prosodically terminal contour.) Thus, again, Leeuwarden
cannot be H*L, it must be L*H.

Summary
I fundamentally agree with van den Berg, Gussenhoven, and Rietveld that
there is a sameness about the first pitch accent across the (a)s in figures 14.1-
3, and likewise across the (b)s. I suggest, however, that this sameness applies
to all the accented syllables, and that the different physical manifestations are
forced by circumstances beyond the precincts of accentuation, and presuma-
bly also by pragmatic and stylistic factors which are not in the realm of
phonology at all.

Boundary tones
Van den Berg, Gussenhoven, and Rietveld state that they are uncertain
about the occurrence of the L%, but they suppose it to characterize the
termination of IPs (identical to utterances, in their examples). I find the
concept of boundary tone somewhat suspect, when it can lead them to state

that a T% copies the second T of the last pitch accent (if it is not pushed off
the board by the preceding T), except in attested cases of H*LH%. A
boundary tone proper should be an independently motivated and specified
characteristic, associated with demands for intonation-contour termination
connected to utterance function, in a broad syntactic and pragmatic sense. It
cannot be dictated by a preceding T*T; on the contrary, it can be conceived
of as being able to affect the manifestation of the preceding accent, as
suggested above.

Low boundary tone


It may be useful to discuss L% and H% separately, because I think the L%
can be done away with entirely. In figure 14.1a the authors admit that the
(H*)L and the L% are not clearly different, but the presence of L% can be
motivated on the basis of the final lowering. Now, if 14.1a justifies a L%, so
does 14.1b - which is counterintuitive and nonsensical. Perhaps both of the
slight final falls in 14.1a and b should be ascribed to weak edge vibrations
before a pause, something which might have been left out of consider-
ation with different segmentation criteria.
Furthermore, if there is any special final lowering command involved at
the end of prosodically terminal utterances in Dutch, I would expect final
falls from high to low to be of greater extent than nonfinal falls from high to
low - and there is not a single such instance in the examples. For comparison,
figure 14.11 shows some cases in which "final lowering" seems a meaningful
description. I suggest that low termination in Dutch be considered the
unmarked or default case, which characterizes prosodically terminal utter-
ances. It does not require a special representation, but can be stated in terms
of restrictions on the manifestation of the last pitch accent, H*L or L*. In the
same way, low unstressed onset of phrases and utterances can also be
considered unmarked.

High boundary tone


High termination seems to require a H%, and is not solely a matter of
manifestation of the last pitch accent. This can be seen in figure 14.1b, where
a rise is superposed on the relatively high level stretch, and in the attested
cases of H*LH% mentioned by van den Berg, Gussenhoven, and Rietveld.
But why should H% a priori be restricted to utterance boundaries? In this
connection, compare the magnitude of the L*H interval in 14.1b to the rise in
Leo, and Mary in figure 14.7a - where it seems that a H% might be necessary
to account for the higher rises.

[Three Fo contours, plotted in semitones against time in centiseconds, for speakers HP, PBP, and JOW. The Danish utterances are "Kofoed og Thorsen skal med rutebilen fra Gudhjem til Snogebæk klokken fire på tirsdag" and "Kofoed og Thorsen skal med rutebilen fra Tingler til Tønder klokken fire på tirsdag"; the German utterance is "Hanna und Markus werden am Donnerstag Nachmittag mit dem Autobus von Hamburg nach Kassel fahren".]

Figure 14.11 Three contours illustrating "final lowering" in the sense introduced in the text. In each case the final fall from High to Low spans a considerably wider pitch range than the preceding non-final falls. The top two contours are from regional varieties of standard Danish, while the bottom contour is (northern) standard German; more detail can be found in Gronnum (forthcoming), from which the figures are taken


Downstep
Accentual downstep
Could the lack of downstep in figures 14.2a and 14.3a be due to the second
accented syllable being produced and perceived as more prominent than the
first one? (See Gussenhoven and Rietveld's [1988] own experiment on the
perception of prominence of later accents with identical Fo to earlier ones.) If
uneven prominence does indeed account for an apparent lack of downstep
within prosodic phrases, there would be no need to introduce a downstep
morpheme: successive lowering of H*s would characterize any succession of
evenly prominent H*s in a phrase. The disruption of downstep would be a
matter of pragmatic and speech style effects (which I suspect to be at work in
figure 14.7b), not of morphology. In my opinion, the authors owe the reader
an account of the meaning of [!]: when does it and when does it not appear?
Van den Berg, Gussenhoven, and Rietveld find no evidence of downstep-
ping L*s. In fact, it appears from their examples that low turning points, be
they associated with stressed syllables or not, are constantly low during the
utterance, and against this low background the highs are thrown in relief and
develop their downward trend. This makes sense if the Ls are produced so as
to form a reference line against which the upper line (smooth or bumpy as the
case may be) can be evaluated - which is the concept of Gårding's (1983)
model for Swedish, by whom van den Berg, Gussenhoven, and Rietveld claim
to have been inspired. With such a view of the function of low tones, one
would not expect L*s to downstep, and one would look for the manifestation
of unequal prominence among L*s in the interval to a preceding or
succeeding H.
Phrasal downstep
In the discussion of how to conceive of and model resets, I think the various
options could have been seen in the light of the function resetting may serve.
If it is there for purely phonetic reasons, i.e. to avoid going below the bottom
of the speaking range in a long series of downstepping H*s, then it is
reasonable to expect that H*s after the first reset one would also be higher. If
resetting is there for syntactic or pragmatic reasons - to signal a boundary -
it would be logically necessary only to do something about the first
succeeding H*. However, I imagine that if consecutive H*s did not accom-
pany the reset one upwards, then the reset one would be perceived as being
relatively more prominent. To avoid that, the only way to handle phrasing in
this context is to adjust all the H*s in the unit. Yet it is doubtful whether the
phenomenon is properly labeled a shift of register, since the L*s do not
appear to be affected (cf. the phrase terminations in figures 14.6 and 14.7a). It
seems to be the upper lines only which are shifted.

Otherwise, I entirely endorse van den Berg, Gussenhoven, and Rietveld's
concept of a "wheels-within-wheels" model where phrases relate to each
other as wholes and to the utterance, because it brings hierarchies and
subordination back into the representation of intonation. In fact, their
"lowering model" (i.e. one which lowers successive phrase-initial pitch
accents [to which accentual downstep then refers], rather than one which
treats phrasal onsets as a step up from the last accent in the preceding phrase)
is exactly what I suggested a "tone sequence" representation of intonation
would require in order to accommodate the Danish data on sentences in a
text: "Consecutive lowering of individual clauses/sentences could be handled
by a rule which downsteps the first pitch accent in each component relative to
the first pitch accent in the preceding one" (Thorsen 1984b: 307). However,
data from earlier studies (Thorsen 1980b, 1981, 1983) make it abundantly
clear that the facts of accentual downstep within individual phrases in an
utterance or text are not as simple as van den Berg, Gussenhoven, and
Rietveld represent them.
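For concreteness, the arithmetic of such a "lowering model" can be sketched in a few lines of code. The sketch below is purely illustrative and is not drawn from any of the papers under discussion: the constants, the multiplicative implementation, and the function name are all invented for the example. The only point is the shape of the rule, namely that the first pitch accent of each phrase is downstepped relative to the first pitch accent of the preceding phrase, with accentual downstep then applying phrase-internally.

# A minimal sketch of a "lowering model" of phrasal downstep.
# All numbers are hypothetical; only the architecture of the rule matters here.

def lowering_model(accents_per_phrase, accentual_step=0.85, phrasal_step=0.9):
    """Return relative peak heights for a sequence of phrases.
    The first accent of each phrase is downstepped relative to the first
    accent of the preceding phrase; accentual downstep applies within phrases."""
    peaks = []
    phrase_initial = 1.0
    for i, n_accents in enumerate(accents_per_phrase):
        if i > 0:
            phrase_initial *= phrasal_step      # phrasal downstep of the phrase-initial accent
        level = phrase_initial
        for _ in range(n_accents):
            peaks.append(round(level, 3))
            level *= accentual_step             # accentual downstep within the phrase
    return peaks

print(lowering_model([3, 3, 2]))   # e.g. three phrases carrying 3, 3, and 2 accents

Under these (arbitrary) constants the first peak of each new phrase is lower than the first peak of the preceding phrase, yet may still be higher than that phrase's final peak, without any separate resetting rule.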

Conclusion
I have been suggesting that perhaps Dutch should not be treated along the
same lines as tone languages or word-accent languages, or even like English.
Perhaps the theoretical framework does not easily accommodate Dutch data;
forcing Dutch intonation into a description in terms of sequences of
categorically different, noninteracting pitch accents can be done only at the
expense of phonetic (speaker/listener) reality. But even accepting van den
Berg, Gussenhoven, and Rietveld's premise, I think their particular solution
can be questioned on the grounds that it brings into the realm of prosodic
phonology phenomena that should properly be treated as modifications due
to functions at other levels of description.

Modeling syntactic effects on downstep in Japanese

HARUO KUBOZONO

15.1 Introduction
One of the most significant findings about Japanese intonation in the past
decade or so has been the existence of downstep*. At least since the 1960s, the
most widely accepted view had been that pitch downtrend is essentially a
phonetic process which occurs as a function of time, more or less indepen-
dently of the linguistic structure of utterances (see e.g. Fujisaki and Sudo
1971a). Against this view, Poser (1984) showed that downtrend in Japanese is
primarily due to a downward pitch register shift ("catathesis" or "down-
step"), which is triggered by (lexically given) accents of minor intonational
phrases, and which occurs iteratively within the larger domain of the so-
called major phrase. The validity of this phonological account of downtrend
has subsequently been confirmed by Beckman and Pierrehumbert (Beckman
and Pierrehumbert 1986; Pierrehumbert and Beckman 1988) and myself
(Kubozono 1988a, 1989).1
Consider first the pair of examples in (1).
(1) a. uma'i nomi'mono "tasty drink"
b. amai nomi'mono "sweet drink"
The phrase in (la) consists of two lexically accented words while (lb)
*The research reported on in this paper was supported in part by a research grant from the
Japanese Ministry of Education, Science, and Culture (no. 01642004), Nanzan University Pache
Research Grant IA (1989) and travel grants from the Japan Foundation and the Daiko
Foundation. I am grateful to the participants in the Second Conference on Laboratory
Phonology, especially the discussants, whose valuable comments led to the improvement of this
paper. Responsibility for the views expressed is, of course, mine alone.
1 It must be added in fairness to Fujisaki and his colleagues that they now account for
phenomena analogous to downstep by positing an "accent-level rule," which resembles
McCawley's (1968) "accent reduction rule" (see Hirose et al. 1984; Hirose, Fujisaki, and
Kawai 1985).

consists of an unaccented word (of which Tokyo Japanese has many) and an
accented word. Downstep in Japanese looks like figures 15.1a and 15.2 (solid
line), where an accented phrase causes the lowering of pitch register for
subsequent phrases, accented and unaccented alike, in comparison with the
sequences in which the first phrase is unaccented (i.e. figures 15.1b and 15.2,
dotted line). The effect of downstep can also be seen from figure 15.3, which
shows the peak values of the second phrase as a function of those of the first.
The reader is referred to Kubozono (1989) for an account of the difference in
the height of the first phrases. Downstep thus defined is a rather general
intonational process in Japanese, where such syntactic information as
category labels is essentially irrelevant.
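The register-lowering effect just described can be made concrete with a small numerical sketch. The code below only illustrates the comparison underlying figures 15.1-15.3; it is not a model proposed in this paper, and the peak targets, the constant downstep ratio, and the function name are invented for the example.

# Illustrative only: accent-triggered downstep as iterative register lowering.
# Ratio and targets are hypothetical.

DOWNSTEP_RATIO = 0.8   # assumed amount by which an accent lowers the register

def peak_values(phrases, top=160.0):
    """phrases: list of booleans, True if the minor phrase is lexically accented.
    Returns the Fo peak (Hz) of each phrase, assuming equal underlying targets."""
    register = top
    peaks = []
    for accented in phrases:
        peaks.append(round(register, 1))
        if accented:                       # an accented phrase lowers the register
            register *= DOWNSTEP_RATIO     # for every subsequent phrase
    return peaks

# (1a) uma'i nomi'mono (accented + accented) vs. (1b) amai nomi'mono (unaccented + accented)
print(peak_values([True, True]))     # second peak downstepped
print(peak_values([False, True]))    # second peak not downstepped

The comparison in the last two lines is the one plotted in figure 15.3: the second phrase comes out lower when the first phrase is accented than when it is unaccented, even though its own specification is identical.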
Previous studies of downstep in Japanese have concentrated on the scaling
of the peak values of phrases, or the values of High tones, which constitute
the ceiling of the pitch register. There is, on the other hand, relatively little
work on the scaling of valley values or Low tones, which supposedly
constitute the bottom of the register. This previous approach to the modeling
of downstep can be justified for the most part by the fact, reported by Kubozono
(1988a), that the values for Low tones vary considerably depending on
factors other than accentual structure.
Downstep seems to play at least two roles in the prosodic structure of
Japanese. It has a "passive" function, so to speak, whereby its absence signals
a sort of pragmatic "break" in the stream of a linguistic message (see
Pierrehumbert and Beckman 1988; Fujimura 1989a). A second and more
active role which this intonational process plays is to signal differences in
syntactic structure by occurring to different degrees in different syntactic
configurations. As we shall see below, however, there is substantial disagree-
ment in the literature as to the details of this syntax-prosody interaction, and
it now seems desirable to consider more experimental evidence and to
characterize this particular instance of syntax-prosody interaction in the
whole prosodic system of Japanese. With this background, this paper
addresses the following two related topics: one is an empirical question of
how the downward pitch shift is conditioned by syntactic structure, while the
other is a theoretical question of how this interaction can or should be
represented in the intonational model of the language.

15.2 Syntactic effects on downstep


In discussing the interaction between syntax and phonology in Japanese, it is
important to note the marked behavior of right-branching structure. In
syntactic terms, Japanese is a "left-branching language" (Kuno 1973), which
takes the left-branching structure (as against the right-branching structure)
in various syntactic constructions such as relative-clause constructions. It has


[Fo contours with peaks (P1, P2) and valleys (V1-V3) marked.]
Figure 15.1 (a) A typical pitch contour for phrase (1a), pronounced in the frame sorewa ... desu ("It is ..."); (b) A typical pitch contour for phrase (1b), pronounced in the frame sorewa ... desu ("It is ...")

been made clear recently that the right-branching structure is marked in
phonology as well, in that it blocks prosodic rules, both lexical and phrasal.
This is illustrated in (2) where the right-branching structure blocks the
application of prosodic rules which would otherwise unify two (or more)
syntactic/morphological units into one prosodic unit.2
2 Interestingly, Clements (1978) reports that prosodic rules in Anlo Ewe and in Italian are also
sensitive to right-branching structure (which he defines by the syntactic notion "left-branch").
Selkirk and Tateishi (1988a, b) propose a slightly different generalization, which will be
discussed below in relation to downstep.

Figure 15.2 Schematic comparison of the pitch patterns of (1a)-type utterances (solid line) and (1b)-type utterances (dotted line), plotted on the basis of averaged values at peaks and valleys

Figure 15.3 Distribution of P1 and P2 in figure 15.2: P2 values as a function of P1 values in utterances of type (1a) (circles) and type (1b) (squares)

(2) a. [A B] -> AB
b. [[A B] C] -> ABC
c. [A [B C]] -> A/BC
d. [[A [B C]] D] -> A/BCD
e. [A [B [C D]]] -> A/B/CD
Examples of such prosodic rules include (a) lexical rules like compound
formation ("compound accent rules"; see Kubozono 1988a, b) and sequen-
tial voicing rules also characteristic of the compound process (Sato n.d.; Otsu
1980), and (b) phrasal rules such as the intonational phrasing process
generally known as "minor phrase formation" (see Fujisaki and Sudo 1971a;
Kubozono 1988a).
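The generalization in (2) can also be stated procedurally. The recursive function below is a minimal sketch under the assumption of strictly binary bracketings; it is not an implementation of any one of the rules just listed (compound accent, sequential voicing, minor phrase formation), only of the branching condition they are said to share: merger applies freely, except that a branching right-hand constituent begins a new prosodic unit. The helper name and data representation are invented for the example.

# A sketch of the branching condition in (2); invented helper, not from the paper.

def prosodic_units(tree):
    """tree is a terminal string or a pair (left, right) of trees.
    Returns the prosodic units, e.g. ['A', 'BCD'] for [[A [B C]] D]."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    left_units = prosodic_units(left)
    if isinstance(right, tuple):                      # branching right sister:
        return left_units + prosodic_units(right)     # merger is blocked
    return left_units[:-1] + [left_units[-1] + right] # otherwise merge rightwards

for tree in [('A', 'B'),                      # (2a) -> AB
             (('A', 'B'), 'C'),               # (2b) -> ABC
             ('A', ('B', 'C')),               # (2c) -> A/BC
             (('A', ('B', 'C')), 'D'),        # (2d) -> A/BCD
             ('A', ('B', ('C', 'D')))]:       # (2e) -> A/B/CD
    print('/'.join(prosodic_units(tree)))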
Given that Japanese shows a left-right asymmetry in these prosodic
processes, it is natural to suspect that the right-branching structure shows a
similar marked behavior in downstep as well. The literature, however, is
divided on this matter. Poser (1984), Pierrehumbert and Beckman (1988),
and Kubozono (1988a, 1989), on the one hand, analyzed many three-phrase
right-branching utterances and concluded that the register shift occurs
between the first two phrases as readily as it occurs between the second and
third phrases in this sentence type: the second phrase is "phonologically
downstepped" as relative to the first phrase in the sense that it is realized at a
lower level when preceded by an accented phrase than when preceded by an
unaccented phrase.
On the other hand, there is a second group who claim that downstep is
blocked between the first two phrases in (at least some type of) the right-
branching structure. Fujisaki (see Hirose, Fujisaki, and Kawai 1985), for
instance, claims that the "accent level rule," equivalent to our downstep rule
(see note 1), is generally blocked between the first two phrases of the right-
branching utterances. Fujisaki's interpretation is obviously based on the
observation that three-phrase right-branching utterances generally show a
markedly higher pitch than left-branching counterparts in the second com-
ponent phrase, a difference which Fujisaki attributes to the difference in the
occurrence or absence of downstep in the relevant position (see figure 15.4
below). Selkirk and Tateishi (1988a, b) outline a similar view, although they
differ from Fujisaki in assuming that downstep is blocked only in some type
of right-branching utterances. They report that downstep is blocked in the
sequences of [Noun-«o [Noun-H0 Noun]] (where no is a relativizer or genitive
particle) as against those of [Adjective [Adjective Noun]], taking this as
evidence for the notion of "maximal projection" in preference to the
generalization based on branching structure (see Selkirk 1986 for detailed
discussion). Selkirk and Tateishi further take this as evidence that the "major
phrase," the larger intonational phrase defined as the domain of downstep,
can be defined in a general form by this new concept.

With a view to reconciling these apparently conflicting views in the
literature, I conducted experiments in which I analyzed utterances of the two
different syntactic structures made by two speakers of Tokyo Japanese (see
Kubozono 1988a and 1989 for the details of the experimental design and
statistical interpretations). In these experiments various linguistic factors
such as the accentual properties and phonological length of component
elements were carefully controlled, as illustrated in (3); the test sentences
included the two types of right-branching utterances such as those in (4) and
(5) which, according to Selkirk and Tateishi, show a difference in downstep.
(3) a. [[ao'yama-ni a'ru] daigaku]
Aoyama-in exist university
"a university in Aoyama"
b. [ao'yama-no [a'ru daigaku]]
Aoyama-Gen certain university
"a certain university in Aoyama"
(4) a. [a'ni-no [me'n-no eri'maki]]
brother-Gen cotton-Gen muffler
"(my) brother's cotton muffler"
b. [ane-no [me'n-no eri'maki]]
"(my) sister's cotton muffler"
(5) a. [ao'i [o'okina eri'maki]]
"a blue, big muffler"
b. [akai [o'okina eri'maki]]
"a red, big muffler"
The results obtained from these experiments can be summarized in three
points. First, they confirmed the view that the two types of branching
structure exhibit distinct downtrend patterns, with the right-branching
utterances generally showing a higher pitch for their second elements than
the left-branching counterparts, as illustrated in figure 15.4.3 The distribu-
tion of the values for Peak2(P2), shown in figure 15.5, does not hint that the
averaged pitch values for this parameter represent two distinct subgroups of
tokens for either of the two types of branching structure (see Beckman and
Pierrehumbert, this volume). Second, it was also observed that the pitch
difference very often propagates to the third phrase, as illustrated in figure
15.6, suggesting that the difference in the second phrases represents a
difference in the height of pitch register, not a difference in local prominence.
Third and most important, the experimental results also confirmed the
observation previously made by Poser, Pierrehumbert and Beckman, and
3 Analysis of the temporal structure of the two patterns has revealed no significant difference,
suggesting that the difference in the pitch height of the second phrase is the primary prosodic
cue to the structural difference.

Figure 15.4 Schematic comparison of the pitch patterns of (3a) (dotted line) and (3b) (solid line), plotted on the basis of averaged values at peaks and valleys

Figure 15.5 Distribution of P1 and P2 in figure 15.4: P2 values as a function of P1 values in utterances of (3a) (squares) and (3b) (circles)

myself that the right-branching structure (as well as the left-branching
structure) undergoes downstep in the phonological sense. This can be seen
from a comparison of the two sentences in each of the pairs in (4) and (5) with

Figure 15.6 Distribution of P2 and P3 in figure 15.4: P3 values as a function of P2 values in utterances of (3a) (squares) and (3b) (circles)

respect to the height of the second component phrases. Figures 15.7 and 15.8
show such a comparison of the pairs in (4) and (5) respectively, where the
peak values of the second phrase are plotted as a function of those of the first
phrase. The results in these figures reveal that the second phrase is realized at
a lower pitch level when preceded by an accented phrase than when preceded
by an unaccented phrase. Noteworthy in this respect is the fact that the two
types of right-branching constructions discriminated by Selkirk and Tateishi
are equally subject to downstep and, moreover, show no substantial differ-
ence from each other in downstep configuration. In fact, the effect of
downstep was observed between the first two phrases of right-branching
utterances irrespective of whether the component phrases were a simple
adjective or a Noun-no sequence, suggesting that at least as far as the results
of my experiments show, it is the notion of branching structure and not that
of maximal projection that leads to a linguistically significant generalization
concerning the occurrence of downstep; in the absence of relevant data and
information, it is not clear where the difference between Selkirk and
Tateishi's experimental results and mine concerning the interaction between
syntax and downstep comes from - it may well be attributable to the factor
of speaker strategies discussed by Beckman and Pierrehumbert (this
volume).
The observation that there are two distinct downstep patterns and that
they can be distinguished in terms of the branching structure of utterances is

Figure 15.7 Distribution of P1 and P2 for the two sentences in (4): P2 values as a function of P1 values in utterances of (4a) (circles) and (4b) (squares)

Figure 15.8 Distribution of P1 and P2 for the two sentences in (5): P2 values as a function of P1 values in utterances of (5a) (circles) and (5b) (squares)


further borne out by experiments in which longer stretches of utterances
were analyzed. Particularly important are the results of the experiments in
which two sets of four-phrase sentences, given in (6) and (7), were analyzed.
Each set consists of three types of sentence all involving an identical syntactic
structure (symmetrically branching structure) but differing in the accented-
ness of their first and/or second elements.
(6) a. [[na'oko-no a'ni-no][ao'i eri'maki]]
"Naoko's brother's blue muffler"
b. [[na'oko-no ane-no][ao'i eri'maki]]
"Naoko's sister's blue muffler"
c. [[naomi-no ane-no][ao'i eri'maki]]
"Naomi's sister's blue muffler"
(7) a. [[na'oko-no a'ni-wa][ro'ndon-ni imasu]]
"Naoko's brother is in London"
b. [[na'oko-no ane wa][ro'ndon-ni imasu]]
"Naoko's sister is in London"
c. [[naomi-no ane-wa][ro'ndon-ni imasu]]
"Naomi's sister is in London"
Utterances of the first type, (6a) and (7a), typically show a pitch pattern
like that in figure 15.9, in which the peak of the third phrase is usually higher
than that of the second phrase. A glance at this pattern alone suggests that
downstep is blocked between these two elements and that some sort of
prosodic boundary must be posited in this position. Comparison of this Fo
pattern with those of other accentual types, however, suggests that this
interpretation cannot be justified.
The relationship between the height of the third phrase and the accented-
ness of the preceding phrases is summarized in figure 15.10, in which the peak
values of the third phrase are plotted as a function of the peak values of the
first phrase. This figure shows that in each of the three cases considered, the
peak values for the third phrase are basically linear functions of those for the
first phrase, distributed along each separable regression line. What this figure
suggests is that the third component phrase is lowered in proportion to the
number of accents in the preceding context, with the phrase realized lower
when preceded by one accent than when preceded by no accent, and still
lower when preceded by two accents. Leaving aside for the moment the fact
that the third phrase is realized at a higher level than the second phrase, the
result in this figure suggests that downstep has occurred iteratively in the
utterances of (6a)/(7a) type without being blocked by the right-branching
structure involved. This, in turn, suggests that occurrence of downstep
cannot be determined by comparing the relative height of two successive
minor phrases observed at the phonetic output (see Beckman and Pierrehum-
bert, this volume).


Figure 15.9 Schematic pitch contour of (6a)-type utterances, plotted on the basis of averaged values at peaks and valleys

Figure 15.10 Distribution of P1 and P3 for the three sentences in (6): P3 values as a function of P1 values in utterances of (6a) (circles), (6b) (squares) and (6c) (crosses)


To sum up the experimental evidence presented so far, it can be concluded
that downstep occurs irrespective of the branching structure involved but
that it occurs to a lesser extent between two elements involving a right-
branching structure than between those involving a left-branching structure.
Viewed differently, this means that downstep serves to disambiguate differ-
ences in branching structure by the magnitudes in which it occurs. Seen in the
light of syntax-prosody mapping, it follows from these consequences that
occurrence or absence of downstep cannot be determined by the branching
structure of utterances, as supposed by Fujisaki, or in terms of the notion of
"maximal projection," as proposed by Selkirk and Tateishi. It follows,
accordingly, that the "major phrase" cannot be defined by syntactic structure
in a straightforward manner; rather, that the mapping of syntactic structure
onto prosodic structure is more complicated than has been previously
assumed (see the commentary to chapters 3 and 4 by Vogel, this volume).

15.3 Modeling syntax-downstep interaction


Given that left-branching and right-branching structures show different
configurations of downstep, there are two approaches to accounting for this
fact. One is to assume two types of downstep which occur in different
magnitudes, one applying over right-branching nodes and the other apply-
ing elsewhere. The other possibility is to assume just one type of down-
step and attribute the syntactically induced difference in question to a
phonetic realization rule independent of downstep. Leaving aside for the
moment the first approach (which is fully described by van den Berg, Gussen-
hoven, and Rietveld, this volume), let us explore the second approach in
detail here and consider its implications for the modeling of Japanese
intonation.

15.3.1 Metrical boost


To account for the observed difference in downstep pattern, I proposed the
concept of "metrical boost" (MB), or an upstep mechanism controlling the
upward shifting pitch register, which triggers a global pitch boost in the
right-branching structure on to the otherwise identical downstep pattern (i.e.
the gradually declining staircase defined by the phonological structure). The
effect of this upstep mechanism is illustrated in figure 15.11; see Kubozono
(1989) concerning the question of whether the effects of multiple boosts are
cumulative.
This analysis is capable of accounting in a straightforward manner for the
paradoxical case discussed in relation to figure 15.9, the case where a
downstepped phrase is realized at a higher pitch level than the phrase which

Figure 15.11 Effect of metrical boost in (3b)-type utterances: basic downstep pattern (dotted line) and downstep pattern on which the effect of metrical boost is superimposed (solid line)

triggers the downward register shift. Under this analysis, it can be under-
stood that the downstepped (i.e. third) phrase has been raised by the phonetic
realization rule of metrical boost to such an extent that it is now realized
higher than the second minor phrase (fig. 15.12). This case is a typical
example where the syntactically induced pitch boost modifies the phonologi-
cally defined downstep pattern. Syntactically more complex utterances show
further complex patterns in downstep as shown in Kubozono (1988a), but all
these patterns can be described as interactions of downstep and metrical
boost, the two rules which control the shifting of pitch register in two
opposite directions.
The notion underlying the rule of metrical boost is supported by yet
another piece of evidence. In previous studies of Japanese intonation, it is
reported that sudden pitch rises occur at major syntactic boundaries such as
sentence, clause, and phrase boundaries. These "juncture phenomena" have
been explained by way of the "resetting" of pitch register or other analogous
machinery in intonational models (see Han 1962; Hakoda and Sato 1980;
Uyeno et al. 1981; Hirose et al. 1984). In the sentences given in (8),
for example, remarkable degrees of pitch rise reportedly occur at the
beginning of the fourth phrase in (8a) and the second phrases in (8b) and
(8c).


Figure 15.12 Effect of metrical boost in (6a)/(7a)-type utterances: basic downstep pattern (dotted line) and downstep pattern on which the effect of metrical boost is superimposed (solid line)

(8) a. [[A [B C]][[D E][F G]]]
[[bo'bu-wa [ro'ndon-ni ite]][[sono a'ni-wa][njuuyo'oku-ni imasu]]]
"Bob is in London, and his brother is in New York"
b. [A [[[B [C D]] E] F]]
[kino'o [[[ro'ndon-de [hito-o korosita]] otoko'-ga] tukama'tta]]
yesterday London-in person-Obj killed man-Nom was caught
"A man who killed a person in London was caught yesterday"
c. [A [[B C] D]]
[Zjo'n-to [[bo'bu-no imooto-no] me'arii]]
John-and Bob-Gen sister-Gen Mary
"John and Mary, Bob's sister"
If these phenomena are analyzed in terms of the branching structure of the
utterances, it will be clear that the sudden pitch rises occur where right-
branching structure is defined in a most multiple fashion, that is, at the
beginning of the element immediately preceded by double or triple "left
branches" (see Clements 1978). If metrical boost is redefined as a general
upstep process, in fact, all these so-called juncture phenomena can now be
treated as different manifestations of the single rule of metrical boost. In
other words, it can be said that pitch register is raised at "major syntactic
boundaries" not directly because of the syntactic boundaries but because
right-branching structure is somehow involved.


According to my previous experiments (Kubozono 1988a), moreover,
there is evidence that the magnitudes of pitch rises (or, upward pitch register
shifts, to be exact) at each syntactic boundary can largely be predicted by the
depth of right-branching structure. Consider the three sentences in (9), for
example, all of which involve the marked right-branching structure (i.e. left-
branches in the bracket notation employed here) between the first and second
elements. These sentences show different degrees of upstep at the beginning
of the second phrases, with (9a) exhibiting a considerably greater effect than
the other two cases.
(9) a. [A [[B C] D]]
[ao'i [[na'oko-ga a'nda] eri'maki]]
"blue muffler which Naoko knitted'
b. [[A [B C]] D]
[[na'oko-no [ao'i eri'maki-no]] nedan]
"Naoko's blue muffler's price"
c. [A [B [C D]]]
[na'oko-no [ao'i [o'okina eri'maki]]]
"Naoko's blue big muffler"
In this light, it can be understood that the reported tendency of sentence
boundaries to induce a greater degree of upstep than clause or phrase
boundaries is simply attributable to the fact that sentence boundaries often,
if not always, involve a greater depth of right-branching structure than other
types of syntactic boundaries.
This line of generalization points to a certain difference between the two
types of pitch register shift: downstep, or the downward register shift, is a
binary process, whereas the upstep mechanism of metrical boost is an n-ary
process. This characterization of metrical boost is worth special attention
because it enables us to eliminate the conventional rule of pitch register reset
from the intonational system of Japanese. Moreover, it enables us to make a
generalization as to the linguistic functions of pitch register shifts in Japanese
in such a way that lexical information (word accent) and phrasal information
(syntactic constituency) are complementary in the use of pitch features.
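The division of labor argued for in this section - a binary, accent-triggered downward register shift plus an n-ary, constituency-triggered upward shift - can be summarized in a short sketch. Everything numerical below is an assumption made for illustration only (the two ratios, the linear scaling of the boost with the number of opening brackets, the flat peak targets, the toy notation in which an apostrophe marks an accented phrase); the sketch makes no claim about actual phonetic values, only about the architecture of the two rules.

import re

DOWNSTEP = 0.8   # hypothetical lowering per preceding accent (binary rule)
BOOST    = 0.1   # hypothetical raising per degree of right-branching depth (n-ary rule)

def register_track(bracketing):
    """bracketing: e.g. "[A' [[B' C'] D]]", where ' marks a lexically accented phrase.
    Returns (phrase, relative register) pairs under downstep plus metrical boost."""
    tokens = re.findall(r"\[+|\]+|[A-Z]'?", bracketing)
    register, out, pending_open, first = 1.0, [], 0, True
    for tok in tokens:
        if tok.startswith('['):
            pending_open += len(tok)              # count "left branches" before the next phrase
        elif tok.startswith(']'):
            continue
        else:
            if not first:
                register *= 1.0 + BOOST * pending_open   # metrical boost, scaled by depth
            out.append((tok, round(register, 3)))
            if tok.endswith("'"):
                register *= DOWNSTEP                      # downstep for all later phrases
            first, pending_open = False, 0
    return out

print(register_track("[[A' B'] C]"))        # left-branching (3a)-type: no boost
print(register_track("[A' [B' C]]"))        # right-branching (3b)-type: boost on the second phrase
print(register_track("[A' [[B' C'] D']]"))  # (9a)-type: double left branch, larger boost

Whether a boosted phrase ends up above or below its downstepping neighbour then depends only on the relative sizes of the two constants; with a boost large enough relative to a single downstep, the third phrase of a (6a)/(7a)-type utterance is realized above the second even though it has been downstepped twice, which is the apparently paradoxical pattern of figure 15.9.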

15.3.2 Implications for intonational representation


Having understood that the upstep rule of metrical boost can be well
supported in the intonational system of Japanese, let us finally consider the
implications of this analysis for the organization of intonational represen-
tation and the interaction between syntax, phonology, and phonetics in the
language.
The most significant implication that emerges is that the phonetic (realiza-
tion) rule of metrical boost requires information concerning the hierarchy of

syntactic structure. Given the orthodox view that intonational representation
is the only source of information available for intonational (i.e. phonetic
realization) rules to derive correct surface pitch patterns, it follows that the
intonational structure itself is hierarchically organized, at least to such an
extent that left-branching and right-branching structures can be readily
differentiated in the domain where downstep is defined.
If we define the major phrase (MP) as the domain where downstep occurs
between adjacent minor phrases (m.p.), the conventional intonational model
can be illustrated in (10). Obviously, this model is incapable of describing the
kind of syntactic information under consideration.

(10) [MP m.p. m.p. m.p. ]

There seem to be two possible solutions to this problem. One is to revise
this intonational representation slightly and to posit an intermediate intona-
tional phrase or level (represented as "IP" here) between the minor phrase
and the major phrase.4 In fact, this is the only possible revision that can be
made of the conventional representation insofar as we take the position that
intonational structure involves an n-ary branching structure. Alternatively, it
is also possible to take a substantially different approach to intonational
phrasing by assuming that intonational structure is binary branching.
Under the first approach, the difference between right-branching and left-
branching syntactic structures is represented as a difference in the number of
intermediate intonational phrases involved, as shown in (11):

(11) a. [Utterance [MP [IP m.p. m.p. m.p. ]]]
     b. [Utterance [MP [IP m.p. ] [IP m.p. m.p. ]]]
     c. [Utterance [MP [IP m.p. m.p. ] [IP m.p. m.p. ]]]
4 "IP" must not be confused with the "intermediate phrase" posited by Beckman and
Pierrehumbert, which they define as the domain of downstep and hence corresponds to our
"major phrase."


(11a) and (11b) are the representations assigned to the two types of three-
phrase sentence in (3a) and (3b) respectively, while (11c) is the representation
assigned to the symmetrically branching four-phrase sentences in (6a)/(7a).
Given the representations in (11), it is possible to account for the syntactic
effect on downstep in two different ways: either we assume that the rule of
metrical boost raises pitch register upwards at the beginning of IPs that do
not begin a MP, or we assume, as proposed by van den Berg, Gussenhoven,
and Rietveld (this volume), that downstep occurs to varying degrees depend-
ing upon whether it occurs within an IP or over two IPs, to a lesser degree in
the latter case than in the former.
Of these two analyses based on the model in (11), the first analysis falls into
several difficulties. One of them is that the motivation for positing the new
intonational phrase (level) lies in, and only in, accounting for the syntactic
effect upon downstep. Moreover, if this analysis is applied to syntactically
more complex sequences of phrases, it may indeed eventually end up
assuming more than one such additional phrase between the minor phrase
and the MP. If this syntactic effect on downstep can be handled by some
other independently motivated mechanism, it would be desirable to do
without any additional phrase or level. A second and more serious problem
arises from the characterization of IP as the trigger of metrical boost. It has
been argued that metrical boost is a general principle of register shift whose
effects can be defined on an n-ary and not binary basis. If we define
occurrence of metrical boost with reference to "IP," we would be obliged to
posit more than one mechanism for upward pitch register shifts, that of MB,
which applies within the major phrase, and that of the conventional reset
rule, which applies at the beginning of every utterance-internal MP. This is
illustrated in a hypothetical intonational representation in (12).

(12) Utterance

m.p. m.p. m.p. m.p. m.p. m.p.


m.p m.p.
T TT T
Reset MB
MB
Again, it would be desirable to posit a single mechanism rather than two if
consequences in both cases are the same. Thus, addition of a third intonatio-
nal level fails to account for the n-ary nature of the upstep mechanism,
thereby failing to define it as a general principle controlling upward register
shift in Japanese. Correspondingly, this second problem gives rise to a third

problem with the revised representation. That is, this analysis assumes two
types of the intermediate phrase, MP-initial IPs which do not trigger the
boost and MP-internal IPs which do trigger it.
Similarly, the analysis proposed by van den Berg, Gussenhoven, and
Rietveld (this volume) poses several difficult problems. While this analysis
may dispense with the principle of metrical boost as far as downstep is
concerned, it falls into the same difficulties just pointed out. Specifically, it
fails to capture the general nature of the upstep principle which can be
defined on the basis of syntactic constituency, requiring us instead to posit
either the conventional rule of register reset or a third type of downstep in
order to account for the upward register shift occurring at the beginning of
every utterance-internal major phrase (see (12)).
If the revised model illustrated in (11) is disfavored because of these
problems, the only way to represent the difference of syntactic structure in
intonational representation will be to take a substantially different approach
to intonational phrasing by abandoning the generally accepted hypothesis
that intonational structure is n-ary branching. Noteworthy in this regard is
the recursive model proposed by Ladd (1986a), in which right-branching and
left-branching structures can be differentiated in a straightforward manner
by a binary branching recursive mechanism. Under this approach, the two
types of pitch pattern in figure 15.4 can be represented as in (13a) and (13b)
respectively, and the four-phrase pattern in figure 15.9 as in (13c).

(13) a. [Utterance [MP [[m.p. m.p. ] m.p. ]]]
     b. [Utterance [MP [m.p. [m.p. m.p. ]]]]      (with Upstep at the second minor phrase)
     c. [Utterance [MP [[m.p. m.p. ] [m.p. m.p. ]]]]  (with Upstep at the third minor phrase)

Interpreting these intonational representations at the phonetic level, the
rule of metrical boost raises pitch register in response to any right-branching
structure defined in the hierarchical representation. The recursive intonatio-
nal structure represented in this way expresses something very directly
related to syntactic constituency, and is different from the structure as
proposed by Poser, and Pierrehumbert and Beckman, which is more or less
distinct from syntactic structure and reflects syntactic structure only
indirectly.


This new approach can solve all the problems with the conventional
approach. That is, it is not necessary to posit any additional intonational
level/phrase which lacks an independent motivation. Nor is it necessary to
postulate more than one mechanism for upstep phenomena because occur-
rence and magnitudes of upsteps can be determined by the prosodic
constituency rather than by intonational category labels, as illustrated in
(14).

(14) Utterance [the remainder of this tree diagram is not recoverable]

15.4 Concluding remarks


The foregoing discussions can be summarized in the following three points.
First, as for the empirical question of how the intonational process of
downstep interacts with syntactic structure, experimental evidence suggests
that the register shift occurs in both of the two types of branching structure,
left-branching and right-branching, and yet disambiguates them by means of
the different degrees to which it occurs. Second, the effects of syntax on
downstep patterns can be modeled by the phonetic realization rule termed
"metrical boost," which triggers a step-like change in the pitch register for
the rest of the utterance, working something like the mechanism of downstep
but in the opposite direction. Moreover, by defining this rule as having an n-
ary effect whose degrees are determined by the depth of right-branching
structure involved - or the number of "left branches" in bracket notation - it
is possible to generalize the effect of this upward register shift on the basis of
syntactic constituency. More significantly, this characterization of metrical
boost enables us to dispense with the conventional rule of register reset in the
intonational system of Japanese.
Given this line of modeling of the syntax-downstep interaction, the
evidence for the syntactic effects on downstep patterns casts doubt upon the
conventional hypothesis that intonational representation involves an n-ary
branching and flat structure. It speaks, instead, for a substantially different
approach to intonational structure as proposed by Ladd (1986a). There may
be alternative approaches to this problem (see Beckman and Pierrehumbert's

commentary in this section), but the evidence presented in this paper suggests
that the new approach is at least worth exploring in more depth.

Comments on chapters 14 and 15


MARY BECKMAN AND JANET PIERREHUMBERT
Chapters 14 and 15 are about downstep and its interactions with phrasal
pitch range, local prominence, and the like - things that cause variation in the
fundamental frequency (Fo) values that realize a particular tonal event.
Sorting out this Fo variation is like investigating any other physical measure
of speech; making the measure necessarily involves making assumptions
about its phonetic control and its linguistic function, and since the assump-
tions shape the investigation, whether acknowledged or not, it is better to
make them explicit.
Leaving aside for the moment assumptions about the control mechanism,
we can classify the assumed linguistic function along two dimensions. The
first is categorical versus continuous: the variation in the measure can
function discretely to symbolize qualitatively different linguistic categories of
some kind or it can function continuously to signal variable quantities of
some linguistic property in an analogue way. The second is paradigmatic
versus syntagmatic: values of the measure can be freely chosen from a
paradigm of contrasting independent values or they can be relationally
dependent on some other value in the context.
Describing downstep along these dimensions, we could focus on different
things, depending on which aspect we are considering at which level of
representation. The aspect of the phonological representation which is
relevant to downstep is fundamentally paradigmatic; it is the tone string,
which depicts a sequence of categorical paradigmatic choices. A syntagmatic
aspect of downstep comes into focus when we look at how it is triggered; the
pitch range at the location of a given tone is lowered, but the lowering
depends upon the tone appearing in a particular tonal context. In Hausa,
where downstep applies at any H following a HL sequence, this contextual
dependence can be stated in terms of the tone string alone. In Japanese and
English, the context must be specified also in terms of a structural feature -
how the tones are organized into pitch accents. The syntagmatic nature of
downstep comes out even more clearly in its phonetic consequences; the new
value is computed for the pitch range relative to its previous value. When this
relational computation applies iteratively in long sequences, the step levels
resulting from the categorical input of H and L tone values tend toward a
continuous scale of Fo values.
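The arithmetic behind this last observation can be displayed under the simplest possible assumption, a constant downstep ratio d with 0 < d < 1 applied at every trigger; this is an expository idealization, not the realization model argued for in any of these papers:

    H_n = d H_{n-1} = d^{n-1} H_1,   so that   H_{n-1} - H_n = d^{n-2} (1 - d) H_1.

Each successive step is thus smaller than the one before it, and after a handful of iterations the attainable levels H_1, d H_1, d^2 H_1, ... are spaced too finely to be read off as members of a small categorical inventory, even though every one of them is the output of a categorical trigger.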

These aspects of downstep differentiate it from other components of
downtrend. The phonological representation of the categorical input trigger,
for example, differentiates downstep from any physiological downtrend that
might be attributed to consequences of respiratory function (e.g. Maeda
1976). The phonetic representation of the syntagmatic choice of output
values differentiates it also from the use of register slope to signal an
interrogative or some degree of finality in a declarative, as in Thorsen's
(1978, 1979) model of declination, or from the use of initial raising and final
lowering of the pitch range to signal the position of the phrase in the
discourse hierarchy of old and new topics, as proposed by Hirschberg and
Pierrehumbert (1986). In either of these models, the declination slope or the
initial or final pitch range value is an independent paradigmatic choice that is
linguistically meaningful in its own right.
These definitional assumptions about its linguistic function suggest several
tactics for designing experiments on downstep and its domain. First, to
differentiate downstep from other components of downtrend, it is important
to identify precisely the phonological trigger. This is trivial in languages such
as Hausa or Japanese, where the trigger involves an obvious lexical contrast.
In languages such as English and Dutch, on the other hand, identifying the
downstep trigger is more difficult and requires an understanding of the
intonational system as a whole. In this case, when we run into difficulties in
the phonological characterization of downstep, it is often a useful tactic to
wonder whether we have the best possible intonational analysis. Thus, given
the admitted awkwardness of their characterization of downstep as a phrasal
feature, we might ask whether van den Berg, Gussenhoven, and Rietveld
(this volume) have found the optimal tonal analysis of Dutch intonation
contours.
We are particularly inclined to ask the question here because of the
striking resemblance to difficulties that have been encountered in some
analyses of English. For example, in translating older British transcription
systems into autosegmental conventions, Ladd uses H* + L for the represen-
tation of a falling Fo around the nuclear accent, precluding its use for a
prenuclear H* + L which contrasts with H* phonologically primarily in its
triggering of downstep. This analysis led him first (in Ladd 1983a) to a
characterization of downstep in English as a paradigmatic feature of H tones,
exactly analogous to van den Berg, Gussenhoven, and Rietveld's characteri-
zation of downstep in Dutch as a paradigmatic choice for the phrase,
specified independently of the choice of tones or pitch accents. In subsequent
work, Ladd has recognized that this characterization does not explain the
syntagmatic distribution of downstepped tones which is the original motiva-
tion for the downstep analysis. He has replaced the featural analysis with the
phonological representation of register relationships among pitch accents


using a recursive metrical tree, as in Liberman and Prince's (1977) account of
stress patterns or Clements's (1981) account of downdrift in many African
tone languages. We feel that a more satisfactory solution is to use H* + L to
transcribe the prenuclear tone in many downstepped contours where the L is
not realized as an actual dip in Fo before the next accent's peak. This more
abstract analysis captures the phonological similarities and the common
thread of pragmatic meaning that is shared by all downstepping pitch accents
in English, whether they are rising (L + H) or falling (H + L) and whether the
following downstepped tone is another pitch accent or is a phrasal H (see
Pierrehumbert and Hirschberg, 1990).
A similar problem arises in the transcription of boundary tones. Gronnum
(this volume) criticizes van den Berg, Gussenhoven, and Rietveld for
transcribing the contour in figure 14.1a with a L% boundary tone, on the
grounds that there is no marked local lowering of Fo. But not transcribing a
L% boundary tone here would mean jettisoning the generalization that all
intonation phrases are marked with boundary tones - that figure 14.1a
contrasts with the initial portion of contour figure 14.2a as well as with
contours with a H% boundary tone, realized as a clear local rise. It would
also jettison the cross-language generalization that final lowering - a
progressive compression of the pitch range that reaches in from the end to
lower all tones in the last half-second or so of the phrase - is typically
associated with L phrase tones (see Pierrehumbert and Beckman 1988: ch.8,
for a review). Rejecting L% boundary tones in Dutch on the grounds that
contours such as figure 14.1a have no more marked local fall than contours
like 14.1b puts a high value on superficial similarities between the F o
representation and the tonal analysis, at the expense of system symmetry and
semantic coherence. On these grounds, we endorse Gussenhoven's (1984,
1988) approach to the analysis of intonation-phrase boundary tones in
Dutch, which he and his co-authors assume in their paper in this volume, and
we cannot agree with Gronnum's argument against transcribing figure 14.1a
with a L% boundary tone. We think that the awkwardness of characterizing
downstep in transcription systems that analyze the nuclear fall in English as a
H* + L pitch accent is symptomatic of a similar confusion between phonetic
representation and phonological analysis.
As Ladd does for English, Gussenhoven (1988) analyzes nuclear falls in
Dutch as H* + L pitch accents. Adopting this analysis, van den Berg,
Gussenhoven, and Rietveld transcribe the falling patterns in figures 14.2a,
14.3a, and 14.5b all as H* + L. We wonder whether this analysis yields the
right generalizations. Might the first Fo peak and subsequent sharp fall in
contour 14.2a be instead a sequence of H* pitch accent followed by a L
phrase accent? This alternative analysis would give an explicit intonational
mark for the edge of this sort of prosodic constituent, just as in English (see
Beckman and Pierrehumbert 1986) and in keeping with the accounts of a
unified prosodic structure proposed on various grounds by Selkirk (1981),
Nespor and Vogel (1986), Beckman (1986), and Pierrehumbert and Beckman
(1988). Might the gradually falling slope after the first peak in figure 14.3a be
instead an interpolation between a H* pitch accent and the L of a following
L + H* accent? This alternative analysis would attribute the contrast between
absence of downstep in figure 14.3a and its presence in 14.5b to the singleton
versus bitonal pitch accent, just as in similar contours in English (Beckman
and Pierrehumbert 1986) and in keeping with the descriptions of other
languages where a bitonal pitch accent triggers downstep, such as Tokyo
Japanese (see Poser 1984; Pierrehumbert and Beckman 1988; Kubozono, this
volume). Because they analyze the falls in figures 14.2a, 14.3a, and 14.5b all
as H* + L, van den Berg, Gussenhoven, and Rietveld must account for the
differences among the contours instead by operations that are specified
independently of the tonal representation. They attribute the step-like
character of the fall in 14.5b to the operation of downstep applied as a
paradigmatic feature to the whole phrase. And they account for the more
gradual nature of the fall in 14.3a by the application of a rule that breaks up
the H* + L accent unit to link its second tone to the syllable just before the
following nuclear pitch accent. Elsewhere, Gussenhoven (1984, 1988) de-
scribes this rightward shift of the second tone as the first step of a two-part
Tone Linking Rule, which in its full application would delete the L entirely to
create a hat pattern. The partial and complete applications of Tone Linking
merge the prenuclear accent phonologically with the nuclear accent, thus
giving the sequence a greater informational integrity. Tone Linking and
downstep are mutually exclusive operations.
In testing for the systemic and semantic coherence of such an intonational
analysis, it is a useful strategy to determine exhaustively which patterns and
contrasts are predicted to exist and which not to exist. One can then
systematically determine whether the ones that are predicted to exist do
contrast with each other and are interpretable in the expected way. Further-
more, one can synthetically produce the ones that are predicted not to exist
and see where they are judged to be ill-formed or are reinterpreted as some
phonetically similar existing pattern. Among other things, Gussenhoven's
analysis of Dutch predicts that two sequences of prenuclear and nuclear
accent which contrast only in whether Tone Linking applied partially or
completely should not contrast categorically in the way that contours like
figure 14.3a and the hat pattern do in English. Also, the description of Tone
Linking implies that the operation does not apply between two prenuclear
accents, so that three-accent phrases are predicted to have a smaller
inventory of patterns than do two-accent phrases. That is, sequences of three
accents within a single phrase should always be downstepped rather than


Tone-linked. More crucially, the analysis predicts the impossibility of three-
accent phrases in which only one of the accents triggers a following
downstep. As Pierrehumbert (1980) points out, the existence of such mixed
downstep. As Pierrehumbert (1980) points out, the existence of such mixed
cases in English precludes an analysis of downstep as an operational feature
of the phrase as a whole. In general, if downstep is to be differentiated from
other components of downtrend, we need to be careful of analyses that make
downstep look like the paradigmatic choice of whether to apply a certain
amount of final lowering to a phrase.
A second tactical point relating to the categorical phonological represen-
tation of downstep is that in looking at the influence on downstep of syntax
or pragmatic focus, one needs first to know whether or not downstep has
occurred, and in order to know this, it is imperative to design the corpus so
that it contrasts the presence of the downstep trigger with its absence. That is,
in order to claim that downstep has occurred, one cannot simply show that a
following peak is lower than an earlier peak; one must demonstrate that the
relationship between the two peaks is different from that between compar-
able peaks in an utterance of a form without the phonological configuration
that triggers downstep. Kubozono reminds us of this in his paper, and it is a
point well worth repeating. Using this tactic can only bring us closer to a
correct understanding of the relationship between syntactic and prosodic
constituents, including the domain of downstep.
A third tactical point is always to remember that other things which
superficially look like downstep in the ways in which they affect pitch range
do not necessarily function linguistically like downstep. For example, it is
generally agreed that downstep has a domain beyond which some sort of
pitch-range reset applies. Since the sorts of things that produce reset seem to
be just like the things that determine stress relationships postlexically -
syntactic organization and pragmatic focus and the like - it is very easy to
assume that this reset will be syntagmatic in the same way that downstep is.
Thus, van den Berg, Gussenhoven, and Rietveld list in their paper only these
two possible characterizations:
1 The reset is a syntagmatic boost that locally undoes downstep.
2 The reset is a syntagmatic register shift by a shift factor that either
(a) reverses the cumulative effects of downstep within the last domain; or
(b) is an independent "phrasal downstep" parameter.
They do not consider a third characterization, suggested by Liberman and
Pierrehumbert (1984) and developed in more detail by Pierrehumbert and
Beckman (1988):
3 The reset is a paradigmatic choice of pitch range for the new phrase.
In this last characterization, the appearance of phrasal downstep in many
experiments would be due to the typical choice of a lower pitch range for the
second phrase of the utterance, reflecting the discourse structure of the
mini-paragraph.
A criticism that has been raised against our characterization in point 3 is
that it introduces too many degrees of freedom. Ladd (1990), for example,
has proposed that instead of independent paradigmatic choices of pitch
range for each phrase and of tonal prominence for each accent, there is only
the limited choice of relative pitch registers that can be represented in binary
branching trees. Kubozono in his paper finds this view attractive and adapts
it to the specification of pitch registers for Japanese major and minor
phrases. Such a phonological characterization may seem in keeping with
results of experiments such as the one that Liberman and Pierrehumbert
(1984) describe, where they had subjects produce sentences with two inton-
ation phrases that were answer and background clauses, and found a very
regular relationship in the heights of the two nuclear accent peaks across ten
different levels of overall vocal effort. Indeed, results such as these are so
reminiscent of the preservation of stress relationships under embedding that
it is easy to see why Ladd wants to attribute the invariant relationship to a
syntagmatic phonological constraint on the pitch range values themselves,
rather than to the constant relative pragmatic saliences.
However, when we consider more closely the circumstances of such results,
this criticism is called into question. The design of Liberman and Pierrehum-
bert's (1984) experiment is typical in that it encouraged the subjects to zero in
on a certain fixed pragmatic relationship - in that case, the relationship of an
answer focus to a background focus. The constant relationship between the
nuclear peak heights for these two foci may well reflect the subject's uniform
strategy for realizing this constant pragmatic relationship. In order to
demonstrate a syntagmatic phonological constraint, we would need to show
that the peak relationships are constant even when we vary the absolute
pragmatic salience of one of the focused elements.
The analogy to stress relationships also fails under closer examination in
that the purely syntagmatic characterization of stress is true only in the
abstract. A relational representation of a stress pattern predicts any number
of surface realizations, involving many paradigmatic choices of different
prominence-lending phonological and phonetic features. For example, the
relatively stronger second syllable of red roses relative to the first might be
realized by the greater prominence of a nuclear accent relative to a prenuclear
accent (typical of the citation form), or by a bigger pitch range for the second
of two nuclear accents (as in a particularly emphatic pronunciation that
breaks the noun phrase into two intermediate phrases), or by the greater
prominence of a prenuclear accent relative to no accent (as in a possible
pronunciation of the sentence The florist's red roses are more expensive).
Similarly, a weak-strong pragmatic relationship for the two nouns in Anna
came with Manny can be realized as a particular choice of pitch ranges for
two intonational phrases, or as the relative prominence of prenuclear versus
nuclear pitch accent if the speaker chooses to produce the sentence as one
intonation phrase. As Jackendoff (1972), Carlson (1983), and others have
pointed out, the utterance in this case has a somewhat different pragmatic
interpretation due to Anna's not being a focus, although Anna still is less
salient pragmatically than Manny.
The possibility of producing either two-foci or one-focus renditions of this
sentence raises an important strategic issue. Liberman and Pierrehumbert
(1984) elicited two-foci productions by constructing a suitable context frame
and by pointing out the precise pragmatic interpretation while demonstrat-
ing the desired intonation pattern. If they had not taken care to do this, some
of their subjects might have given the other interpretation and produced the
other intonation for this sentence, making impossible the desired comparison
of the two nuclear-accent peaks. A more typical method in experiments on
pitch range, however, is to present the subject with a randomized list of
sentences to read without providing explicit cues to the desired pragmatic
and intonational interpretation. In this case, the subject will surely invent an
appropriate pragmatic context, which may vary from experiment to exper-
iment or from utterance to utterance in uncontrolled ways. The effect of this
uncontrolled variation is to have an uncontrolled influence on the phrasal
pitch ranges and on the prominences of pitch accents within a pitch range.
The variability of results concerning the interaction of syntax and downstep
in the literature on Japanese (e.g., Kubozono 1989, this volume; Selkirk
1990; Selkirk and Tateishi 1990) may reflect this lack of control more than it
does anything about the interaction between syntactic constituency and
downstep.
The fact that a sentence can have more than one pragmatic interpretation
also raises a methodological point about statistics: before we can use
averages to summarize data, we need to be sure that the samples over which
we are averaging are homogeneous. For example, both in Poser (1984) and in
Pierrehumbert and Beckman (1988), there were experimental results which
could be interpreted as showing that pragmatic focus reduces but does not
block downstep. When we looked at our own data more closely, however, we
found that the apparent lesser downstep was actually the result of a single
outlier in which the phrasing was somewhat different and downstep had
occurred. Including this token in the average made it appear as if the
downstep factor could be chosen paradigmatically in order to give greater
pitch height than normal to prosodic constituents with narrow focus. The
unaveraged data, however, showed that the interaction is less direct; elements
bearing narrow focus tend to be prosodically separated from preceding
elements and thus are realized in pitch ranges that have not been down-
stepped relative to the pitch range of preceding material. It is possible that
Kubozono could resolve some of the apparent contradictions among his
present results and those of other experiments in the literature on Japanese if
he could find appropriate ways of looking at all of the data token by token.
The specific tactical lesson to draw here is that since our understanding of
pragmatic structure and its relationship to phrasing and tone-scaling is not as
well developed as our understanding of phonological structure and its
interpretation in Fo variation, we need to be very cautious about claiming
from averaged data that downstep occurs to a greater or lesser degree in
some syntactic or rhythmic context. A more general tactical lesson is that we
need to be very ingenious in designing our experiments so as to elicit
productions from our subjects that control all of the relevant parameters. A
major strategic lesson is that we cannot afford to ignore the knotty questions
of semantic and pragmatic representation that are now puzzling linguists
who work in those areas. Indeed, it is possible that any knowledge concern-
ing prosodic structure and prominence that we can bring to these questions
may advance the investigative endeavor in previously unimagined ways.
Returning now to assumptions about control mechanism, there is another
topic touched on in the paper by van den Berg, Gussenhoven, and Rietveld
which also raises a strategic issue of major importance to future progress in
our understanding of tone and intonation. This is the question of how to
model L tones and the bottom of the pitch range. Modeling the behavior of
tones in the upper part of the pitch range - the H tones - is one of the success
stories of laboratory phonology. The continuing controversies about many
details of our understanding (evident in the two papers in this section) should
not be allowed to obscure the broad successes. These include the fact that
[+H] is the best understood distinctive-feature value. While work in speech
acoustics has made stunning progress in relating segmental distinctive
features to dimensions of articulatory control and acoustic variation, the
exact values along these dimensions which a segment will assume in any
given context in running speech have not been very accurately modeled. In
contrast, a number of different approaches to H-tone scaling have given rise
to Fo synthesis programs which can generate quite accurately the contours
found in natural speech. Also, work on H-tone scaling has greatly clarified
the division of labor between phonology and phonetics. In general, it has
indicated that surface phonological representations are more abstract than
was previously supposed, and that much of the burden of describing sound
patterns falls on phonetic implementation rules, which relate surface phono-
logical representations to the physical descriptions of speech. Moreover,
attempts to formulate such rules from the results of appropriately designed
experiments have yielded insights into the role of prosodic structure in speech
production. They have provided additional support for hierarchical struc-
tures in phonology, which now stand supported from both the phonetic and
morphological sides, a fate we might wish on more aspects of phonological
theory.
In view of these successes, it is tempting to tackle L tones with the same
method that worked so well for H tones - namely, algebraic modeling of the
Fo values measured in controlled contexts. Questions suggested under this
approach include: What functions describe the effects of overall pitch range
and local prominence on Fo targets for L tones? What prevents L tones from
assuming values lower than the baseline? In downstep situations, is the
behavior of L tones tied to that of H tones, and if so, by what function?
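To make the notion of "algebraic modeling" concrete, the sketch below shows one simple form such a scaling function can take for H tones: each successive peak is pulled a constant fraction of the way toward a reference line. This is an illustrative toy, not a reproduction of any particular published model; the function name and the parameters (reference_hz, downstep_factor) are assumptions introduced here.

```python
def h_tone_targets(first_peak_hz, reference_hz, downstep_factor, n_accents):
    """Toy H-tone scaling: each downstepped peak lies a constant fraction
    (downstep_factor) of the way between the previous peak and a reference
    line.  Illustrative only; not a specific published model."""
    targets = [first_peak_hz]
    for _ in range(n_accents - 1):
        previous = targets[-1]
        targets.append(reference_hz + downstep_factor * (previous - reference_hz))
    return targets

# Five downstepped H accents starting at 240 Hz over a 100 Hz reference line.
print(h_tone_targets(240.0, 100.0, 0.7, 5))
# -> approximately [240.0, 198.0, 168.6, 148.0, 133.6]
```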
We think it is important to examine the assumptions that underlie these
questions, particularly the assumptions about control mechanisms. We
suggest that it would be a strategic error to apply too narrowly the precedents
of work on H-tone scaling.
Looking first at the physiological control, we see that L-tone scaling is
different from H-tone scaling. A single dominant mechanism, cricothyroid
contraction, appears to be responsible for H tone production, in the sense
that this is the main muscle showing activity when Fo rises into an H tone. In
contrast, no dominant mechanism for L tone production has been found.
Possible mechanisms include:
Cricothyroid relaxation - e.g., Simada and Hirose (1971), looking at the
production of the initial boundary L in Tokyo Japanese; Sagart et al.
(1986), looking at the fourth (falling) tone in Mandarin.
Reduction of subglottal pressure - e.g., Monsen, Engebretson, and Vemula
(1978), comparing L and H boundary tones.
Strap muscle contraction - e.g., Erickson (1976), looking at L tones in Thai;
Sagart et al. (1986), looking at the third (low) tone in Mandarin;
Sugito and Hirose (1978), looking at the initial L in L-initial words
and the accent L in Osaka Japanese; Simada and Hirose (1971) and
Sawashima, Kakita, and Hiki (1973), looking at the accent L in
Tokyo Japanese.
Cricopharyngeus contraction - Honda and Fujimura (1989), looking at L
phrase accents in English.
Some of these mechanisms involve active contraction, whereas others involve
passive relaxation. There is some evidence that the active gesture of strap
muscle contraction comes into play only for L tones produced very low in the
pitch range. For example, of the four Mandarin tones, only the L of the third
tone seems to show sternohyoid activity consistently (see Sagart et al. 1986).
Similarly, the first syllable of L-initial words in Osaka Japanese shows a
marked sternohyoid activity (see Sugito and Hirose 1978) that is not usually
observed in the higher L boundary tone at the beginning of Tokyo Japanese
accentual phrases (see, e.g., Simada and Hirose 1971). Lacking systematic
work on the relation of the different mechanisms to different linguistic
categories, we must entertain the possibility that no single function controls
L-tone scaling. Transitions from L to H tones may bring in several
mechanisms in sequence, as suggested in Pierrehumbert and Beckman (1988).
One of the tactical imports of the different mechanisms is that we need to be
more aware of the physiological constraints on transition shape between
tones; we should not simply adopt the most convenient mathematical
functions that served us so well in H-tone scaling models.
Another common assumption that we must question concerns the func-
tional control of the bottom of the pitch range. We need to ask afresh the
question: Is there a baseline? Does the lowest measured value at the end of an
utterance really reflect a constant floor for the speaker, which controls the
scaling of tones above it, and beyond which the speaker does not aim to
produce nor the hearer perceive any L tone?
Tone-scaling models have parlayed a great deal from assuming a baseline,
on the basis of the common observation that utterance-final L values are
stable for each speaker, regardless of pitch range. On the other hand, it is not
clear how to reconcile this assumption with the observation that nuclear L
tones in English go up with voice level (see e.g. Pierrehumbert, 1989). This
anomaly is perturbing because it is crucial that we have accurate measures of
the L tones; estimates of the baseline from H-tone scaling are quite unstable
in the sense that different assumptions about the effective floor can yield
equally good model fits to H-tone data alone. The assumption that the
bottom of the pitch range is controlled via a fixed baseline comes under
further suspicion when we consider that the last measured Fo value can be at
different places in the phrasal contour, depending on whether and where the
speaker breaks into vocal fry or some other aperiodic mode of vocal-fold
vibration. It is very possible that the region past this point is intended as, and
perceived as, lower than the last point where Fo measurement is possible.
A third assumption that relates to both the physiological and the func-
tional control of L tones concerns the nature of overall pitch-range variation.
It has been easy to assume in H-tone modeling that this variation essentially
involves the control of Fo. Patterns at the top of the range have proved
remarkably stable at the different levels of overall Fo obtained in exper-
iments, allowing the phenomenon to be described with only one or two
model parameters.
We note, however, that the different Fo levels typically are elicited by
instructing the subject to "speak up" to varying degrees. This is really a
variation of overall voice effort, involving both an increased subglottal
pressure and a more pressed vocal-fold configuration. It seems likely,
therefore, that the actual control strategy is more complicated than our H-
tone models make it. While the strategy for controlling overall pitch range
interacts with the physiological control of H tones in apparently simple ways,
its interaction with the possibly less uniform control mechanism for L tones
may yield more complicated Fo patterns. In order to find the invariants in this
interaction, we will probably have to obtain other acoustic measures besides
Fo to examine the other physiological correlates of increased pitch range
besides the increased rate of vocal-fold vibration. Also, it may be that pitch-
range variation is not as uniform functionally as the H-tone results suggest. It
is possible that somewhat different instructions to the subject or somewhat
different pragmatic contexts will emphasize other aspects of the control
strategy, yielding different consequences for Fo variation, particularly at the
bottom of the pitch range.
These questions about L-tone scaling have a broader implication for
research strategy. Work on H tones has brought home to us several
important strategic lessons: experimental designs should orthogonally vary
local and phrasal properties; productions should be properly analyzed
phonologically; and data analysis should seek parallel patterns within data
separated by speaker. We clearly need to apply these lessons in collecting Fo
measurements for L tones. However, to understand fully L tones, we will
need something more. We will need more work relating linguistic to
articulatory parameters. We will need to do physiological experiments in
which we fully control the phonological structure of utterances we elicit, and
we will need to develop acoustic measures that will help to segregate the
articulatory dimensions in large numbers of utterances.

16
Secondary stress: evidence from Modern Greek

AMALIA ARVANITI

16.1 Introduction
The need to express stress subordination in English formally has long been
felt, and many attempts to do so have been made, e.g. Trager and Smith
(1951), Chomsky and Halle (1968). However, until the advent of metrical
phonology (Liberman and Prince 1977) all models tried to express stress
subordination through linear analyses. The great advantage of metrical
phonology is that by presenting stress subordination through a hierarchical
structure it captures the difference in stress values between successive stresses
in an economical and efficient way.
When Liberman and Prince presented their model, one of their intentions
was to put forward a "formalization of the traditional idea of 'stress timing'"
(1977: 250) through the use of the metrical grid. This reference to stress-
timing implies that their analysis mainly referred to the rhythm of English.
However, the principles of metrical phonology have been adopted for the
rhythmic description of other languages (Hayes 1981; Hayes and Puppel
1985; Roca 1986), including Polish and Spanish, which are rhythmically
different from English. The assumption behind studies like Hayes (1981) is
that, by showing that many languages follow the same rhythmic principles as
English, it can be proved that the principles of metrical phonology, namely
binarity of rhythmic patterns and by consequence hierarchical structure, are
universal. However, such evidence cannot adequately prove the universality
of metrical principles; what is needed is evidence that there are no languages
which do not conform to these principles. Thus, it would be interesting to
study a language that does not seem to exhibit a strictly hierarchical, binary
rhythmic structure. If the study of such a language proves this to be the case,
then the claim for the universality of binary rhythm may have to be revised.
One language that seems to show a markedly different kind of rhythmic
patterning from English is Modern Greek.


In fact, the past decade has seen the appearance of a number of studies of
Modern Greek prosody both in phonology (Malikouti-Drachman and
Drachman 1980; Nespor and Vogel 1986, 1989; Berendsen, 1986) and in
phonetics (Dauer 1980; Fourakis 1986; Botinis 1989). These studies show
substantial disagreement concerning the existence and role of secondary
stress in Greek.
By way of introduction I present a few essential and undisputed facts
about Greek stress. First, in Greek, lexical stress conforms to a Stress Well-
formedness Condition (henceforth SWFC), which allows lexical stress on any
one of the last three syllables of a word but no further to the left (Joseph and
Warburton 1987; Malikouti-Drachman and Drachman 1980). Because of the
SWFC, lexical stress moves one syllable to the right of its original position
when affixation results in the stress being more than three syllables from the
end of the word; e.g.
(1) /'maθima/ "lesson" > /'maθima + ta/ "lesson + s" > /ma'θimata/
"lessons"
Second, as can be seen from example (1), lexical stress placement may depend
on morphological factors, but it cannot be predicted by a word's metrical
structure because there are no phonological weight distinctions either among
the Greek vowels, /i, e, a, o, u/, or among syllables of different structure; i.e.
in Greek, all syllables are of equal phonological weight. Therefore, it is quite
common for Greek words with the same segmental structure to have stress
on different syllables; e.g.
(2) a. /'xo.ros/ "space"
b. /xo.'ros/ "dance" (noun)
It is equally possible to find words like those in (3),
(3) a. /'pli.θos/ "crowd"
b. /'plin.θos/ "brick"
where both words are stressed on their first syllable, although this is open in
(3a) and closed in (3b). Finally, when the SWFC is violated by the addition of
an enclitic to a host stressed on the antepenultimate, a stress is added two
syllables to the right of the lexical stress. For example,
(4) /'maθima tu/ > /,maθi'ma tu/ "his lesson"
(5) /'ðose mu to/ > /'ðose 'mu to/ "give it to me"
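The basic facts in (1)-(5) can be summarized in a small illustrative sketch; the flat list-of-syllables representation and the function names are simplifications introduced here for exposition, not part of any of the analyses under discussion.

```python
def swfc_ok(n_syllables, stress_index):
    """Stress Well-formedness Condition: the stress must fall on one of the
    last three syllables (indices count from the left, starting at 0)."""
    return stress_index >= n_syllables - 3

def add_suffix(syllables, stress_index, suffix):
    """Lexical affixation: if the suffix pushes the stress outside the
    three-syllable window, the lexical stress itself moves rightward,
    as in (1) /'maθima/ > /ma'θimata/."""
    word = syllables + suffix
    while not swfc_ok(len(word), stress_index):
        stress_index += 1
    return word, [stress_index]

def add_clitic(syllables, stress_index, clitic):
    """Postlexical cliticization: the host's stress stays put, so a SWFC
    violation is repaired by adding a stress two syllables to its right,
    as in (4) /'maθima tu/ > /,maθi'ma tu/."""
    group = syllables + clitic
    stresses = [stress_index]
    if not swfc_ok(len(group), stress_index):
        stresses.append(stress_index + 2)
    return group, stresses

print(add_suffix(["ma", "θi", "ma"], 0, ["ta"]))  # (['ma', 'θi', 'ma', 'ta'], [1])
print(add_clitic(["ma", "θi", "ma"], 0, ["tu"]))  # (['ma', 'θi', 'ma', 'tu'], [0, 2])
```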
All investigators (Setatos 1974; Malikouti-Drachman and Drachman
1980; Joseph and Warburton 1987; Botinis 1989) accept that the two stresses
in the host-and-clitic group have different prominence values. However, not
all of them agree as to the relative prominence of the two stresses. Most
investigators (Dauer 1980; Malikouti-Drachman and Drachman, 1980;
Joseph and Warburton 1987) agree that the added stress is stronger than the
host's lexical stress. Setatos (1974), however, followed by Nespor and Vogel
(1986, 1989), claims that the host's lexical stress remains the strongest. Thus,
a first point of disagreement emerges: namely, the prominence value of the
SWFC-induced stress.
Botinis (1989) presents an entirely different analysis: influenced perhaps by
work on Swedish prosody, he claims that the SWFC-induced stress of a host-
and-clitic group is a "phrase stress." Botinis admits that "on acoustic
grounds it is questionable if there is enough evidence to differentiate between
word and phrase stress although they have quite different perceptual
dimensions" (1989: 85). However, Botinis's claim that word and phrase stress
are perceptually distinct could be attributed partly to incorrect manipulation
of the Fo contour in the synthesized stimuli of his perceptual experiment and
partly to incorrect interpretation of this experiment's results (for details see
Arvaniti 1990). In fact, Botinis's evidence suggests that the SWFC-induced
stress is acoustically the most prominent in the host-and-clitic group.
Most of the studies (Dauer 1980; Fourakis 1986; Joseph and Warburton
1986) mention stress subordination only in relation to the host-and-clitic
group stress addition, in which case they refer to "secondary stress." In other
words, in most of the studies it is assumed that in Greek each word normally
carries only lexical stress. Phonological studies (Malikouti-Drachman and
Drachman 1980; Nespor and Vogel 1986, 1989), though, assume that, in
addition to lexical stress, Greek exhibits rhythmic stresses which are added at
the surface level. Thus, the presence of rhythmic stresses is the second point
of contention among studies. Nespor and Vogel and Malikouti-Drachman
and Drachman relate rhythmic stress to the "secondary" stress of host-and-
clitic groups but in two different ways.
Nespor and Vogel, on the one hand, propose that rhythm is represented by
the grid which is built "on the basis of the prosodic structure of a given
string" (1989: 70). The grid shows prominence relations among stresses but it
cannot show constituency. Nespor and Vogel (1989) suggest that in Greek
rhythmic stresses appear only when there is a lapse in the first level of the
grid; in other words, whereas a series of unstressed syllables constitutes a
lapse that can trigger rhythmic stress, a series of lexical stresses of equal
prominence does not constitute a lapse. When a lapse occurs, one of the
syllables that has only one asterisk in the grid acquires a second asterisk, i.e. a
rhythmic stress, through the beat addition rule. The placement of rhythmic
stresses is regulated by two preference rules. As has been mentioned, Nespor
and Vogel (1986, 1989) also maintain that the SWFC-induced stress (or
"secondary stress," as they call it) is less prominent than the original lexical
stress of the host; this "secondary stress" is of equal prominence to a

400
16 Amalia Arvaniti

rhythmic stress. According to Nespor and Vogel (1989) the difference


between a SWFC-induced stress and a rhythmic stress lies in the fact that the
former is the result of an obligatory prosodic rule which operates within C
(clitic group) while the latter is the result of Beat Addition, an optional
rhythmic rule which operates in the grid. Examples from Nespor (forthcom-
ing) and Nespor and Vogel (1989) suggest that rhythmic stress and "second-
ary" stress have the same phonetic realization since (a) they are both
represented by two asterisks in the grid and (b) the grid cannot show
constituency differences (i.e. the fact that "secondary stress" belongs to C).
However, if this were correct then the following examples
(6) [o 'ðaska.los tu]c1 ['anikse tin 'porta]c
"His teacher opened the door"
(7) [o 'ðaskalos]c [tu 'anikse tin 'porta]c
"The teacher opened the door to him"
could have the same rhythmic structure since the lapse in (7), i.e. the series of
unstressed syllables /skalos tu/, can only be remedied by adding a rhythmic
stress on /los/. This is not the case, however; the two examples are clearly
differentiated in Greek. Indeed, the subjects who took part in the perceptual
experiments reported in Botinis (1989) could distinguish even synthesized
stimuli of similar structures to (6) and (7) in at least 91 percent of the cases.
Thus, one of the two claims of Nespor and Vogel (1989) must be incorrect: if
/los/ in (6) carries "secondary" stress then "secondary" and rhythmic stress
must be presented in different ways in the grid, or else /los/ in (6) carries the
main stress of the host-and-clitic group, not a "secondary" one.
As has been mentioned, Malikouti-Drachman and Drachman (1980) also
relate secondary and rhythmic stresses, but their approach differs from that
of Nespor and Vogel, in that the former assume that the secondary stress in a
host-and-clitic group is the weakened lexical stress of the host. Rhythmic
stresses are added following the Rhythm Rule which states "Make a trochaic
foot of any adjacent pair of weak syllables to the left of the lexical stress
within the word [word + clitics] (iterative)" (1980: 284).
(8) /i aðerfi mu/ "my sister"
[metrical trees not reproduced]

The stresses on /ðaskalos/ are presented here with the prominence values assumed by Nespor
and Vogel (1989).

(9) /i epavli mas/ "our villa"
[metrical trees not reproduced]


As can be seen from examples (8) and (9), the Rhythm Rule applies not
only to the left but also to the right of the lexical stress; this "refooting"
explains the SWFC-induced stress. However, by equating the metrical
structures of (8) and (9), this analysis cannot differentiate between a metrical
tree with an optional rhythm-induced stress like (8), and a tree with an
obligatory SWFC-induced stress like (9).
To summarize, there seem to be two main interconnected issues that are
addressed by researchers: namely, the presence and nature of rhythmic and
SWFC-induced stress. In brief, Nespor and Vogel (1986, 1989) and Malik-
outi-Drachman and Drachman (1980) alone agree that Greek exhibits
rhythmic stress; and although they disagree as to which of the two stresses in
a host-and-clitic group is the most prominent, they agree that the weaker one
of the two is identical to rhythmic stress. Botinis (1989), on the other hand,
does not mention rhythmic stresses but he proposes two distinct prosodic
categories, i.e. word and phrase stress, to account for the SWFC-induced
stress in host-and-clitic groups. The present paper is an attempt to examine
these issues using acoustical and perceptual evidence rather than impression-
istic data. First, two questions must be answered: (a) whether the stress
added to a host-and-clitic group due to SWFC violation is the most
prominent in the group; (b) whether this added stress is perceptually distinct
from a lexical stress. When answers to these questions are established two
more questions can be addressed: (c) whether "secondary" stress and
rhythmic stress are perceptually and acoustically the same or not; (d) whether
there is any evidence for rhythmic stress. The questions are investigated by
means of two perceptual tests, and acoustic analyses of the data of the second
experiment. Finally, an attempt is made to present the results formally within
a broadly conceived metrical framework.

16.2 Experiment 1
16.2.1 Method
16.2.1.1 Material
The first experiment is a simple perceptual test whose aim is to see first
whether the lexical stress of the host-and-clitic group is more prominent than


Table 16.1 One of the two test pairs (1a and 1b) and one of the distractors (2a
and 2b) in the context in which they were read. The test phrases and distractors
are in bold type.

1 (a) /tu 'ipa to yi'a arista su ke 'xarike po'li/
      "I told him about your 1st class mark and he was very pleased"
  (b) /e'yo tu fo'nazo 'ari stasu ki a'ftos ðe stama'tai/
      "I shout at him Ari stop but he doesn't stop"
2 (a) /pi'stevo 'oti 'ksero to 'mono 'loyo yi'a a'fti tin ka'tastasi/
      "I believe that I know the only reason for this situation"
  (b) /den 'exo a'kusi pi'o vare'to mo'noloyo sto 'θeatro/
      "I haven't listened to a more boring theatrical monologue"

the SWFC-induced one as Setatos (1974) and Nespor and Vogel (1986, 1989)
maintain, and second whether Botinis's phrase stress and word stress are two
perceptually distinct stress categories as his analysis suggests.
Two test pairs were designed (see the parts of table 16.1, 1a and 1b, in bold
type): in each test pair the two members are segmentally identical but have
word boundaries at different places and are orthographically distinct in
Greek. The first member, (a), of each pair consists of one word stressed on
the antepenultimate and followed by an enclitic possessive pronoun. As this
pattern violates the SWFC a stress is added on the last syllable of the word.
The second member, (b), consists of two words which together form a phrase
and which are stressed on the same syllables as member (a). Thus, the
difference between members (a) and (b) of each test pair is that in (a) the
phrase contains a lexical and a SWFC-induced stress whereas in (b) each one
of the two words carries lexical stress on the same syllables as (a). According
to Nespor and Vogel, the most prominent stress in (a) phrases is the lexical
stress of the host while in (b) phrases it is the stress of the second word (i.e.
the one that falls on the same syllable as the SWFC-induced stress in (a))
since the second word is the head of the phonological phrase Φ (1986: 168).
Also, in Botinis's terms (a) and (b) phrases have different stress patterns, (a)
containing one word and one phrase stress and (b) containing two word
stresses; these stress patterns are said by Botinis to be perceptually distinct. If
either Nespor and Vogel or Botinis are correct, (a) and (b) phrases should be
distinguishable.
The test phrases were incorporated into meaningful sentences (see table
16.1). Care was taken to avoid stress clashes, and to design, for each pair,
sentences of similar prosodic structure and length. Two distractor pairs were
devised on the same principle as the test pairs (see table 16.1, 2a and 2b). The
difference is that in the distractors one member contains two words, each one
with its own lexical stress (/'mono 'loyo/ "only reason"), while in the other
member the same sequence of syllables makes one word with lexical stress on
a different syllable from those stressed in the first member
(/mo'noloyo/ "monologue").
The sentences were read by four subjects including the author. Each
subject read the test sentences and the distractors six times from a random-
ized list, typed in Greek. The recorded sentences and the distractors were
digitized at 16 kHz and then were edited so that only the test phrases and
distractors were left. For each test phrase and distractor one token from each
one of the four subjects was selected for the test tape. The tokens chosen
were, according to the author's judgment, those that sounded most natural
by showing minimum coarticulatory interference from the carrier phrase.
To make the listening tape, the test phrases and the distractors were
recorded at a sampling rate of 16 kHz using computer-generated random-
ization by blocks so that each token from each subject was heard twice. Each
test phrase and distractor was preceded by a warning tone. There were 100
msec. of silence between the tone and the following phrase and 2 sec. between
each stimulus and the following tone. Every twenty stimuli there was a 5 sec.
pause. In order for listeners to familiarize themselves with the task, the first
four stimuli were repeated at the end of the tape, and the first four responses
of each listener were discarded. Thus, each subject heard a total of seventy
stimuli: 4 speakers x (4 test phrases + 4 distractors) x 2 blocks + 4
repeated items + 2 stimuli that consisted of two tones each (a result of the
randomization program).

16.2.1.2 Subjects
As mentioned, four subjects took part in the recording. Three of them (two
female, one male) were in their twenties and they were postgraduate students
at the University of Cambridge. The fourth subject was a sixty-year-old
woman visiting Cambridge. All subjects were native speakers of Greek and
spoke the standard dialect. All, apart from the fourth subject, had extensive
knowledge of English. None of the subjects had any history of speech or
hearing problems. Apart from the author all subjects were naive as to the
purpose of the experiment.
Eighteen subjects (seven male and eleven female) did the perceptual test.
They were all native speakers of Greek and had no history of speech or
hearing problems. Twelve of them were between 25 and 40 years old and the
other six were between 40 and 60 years old. Fourteen of them spoke other
languages in addition to Greek but only one had extensive knowledge and
contact with a foreign language (English). All subjects had at least secondary
education and fourteen of them held university degrees. All subjects spoke
Standard Greek, as spoken in Athens, where sixteen of them live. They were
all naive as to the purposes of the experiment.

16.2.1.3 Procedure
The subjects did the test in fairly quiet conditions using headphones and a
portable Sony Stereo Cassette-Corder TCS-450. No subject complained that
their performance might have been marred by noise or poor-quality equip-
ment. The subjects were given a response sheet, typed in Greek, which gave
both possible interpretations of every stimulus in the tape (70 x 2 possible
answers). The task was explained to them and they were urged to give an
answer to all stimuli even if they were not absolutely certain of their answer.
The subjects were not allowed to play back the tape.

16.2.2 Results
The subjects gave a total of 576 responses excluding the distractors (18
subjects x 32 test phrases/answer sheet). There were 290 mistakes, i.e. 50.34
percent of the responses to the test phrases were wrong (identification rate
49.66 percent). The number of mistakes ranged from a minimum of nine (one
subject) to twenty-one (one subject). By contrast, the identification rate of the
distractors was 99.1 percent; out of 18 subjects only two made one and four
mistakes respectively. Most subjects admitted that they could not tell the test
phrases apart although they found the distractors easy to distinguish. Even
the subjects who insisted that they could tell apart the test pairs made as
many mistakes as the rest.
Thus the results of experiment 1 give an answer to the first two questions
addressed here. The results clearly indicate (see table 16.2) that, contrary to
Setatos (1974) and Nespor and Vogel (1986, 1989), the SWFC-induced stress
is the most prominent in the host-and-clitic group, whereas the original
lexical stress of the host weakens. This weakening is similar to that of the
lexical stress of a word which is part of a bigger prosodic constituent, such as
a Φ, without being its head. Also, the results show that in natural speech
Botinis's "phrase stress" is not perceptually distinct from word stress as he
suggests.

16.3 Experiment 2
16.3.1 Method
16.3.1.1 Material
Experiment 2 includes a perceptual test and acoustical analyses of the
utterances used for it. With the answers to questions (a) and (b) established,
this experiment aims at answering the third question addressed here: namely,
whether rhythmic stress and the weakened lexical stress (or "secondary

Table 16.2 Experiment 1: contingency table of type of stimulus by subject
response.

                            Type of stimulus
Response          "1 word" stimulus   "2 word" stimulus   Total

(a) Observed responses
"1 word"                113                 115            228
"2 word"                175                 173            348
Total                   288                 288            576

(b) Expected responses (and deviances)
"1 word"                114 (0.008)         114 (0.008)
"2 word"                174 (0.005)         174 (0.005)

Note: Total deviance (χ2) = 0.026, 1 df. The difference between the relevant conditions
is not significant.

stress") of the host in a host-and-clitic group are perceptually and acousti-


cally the same as Malikouti-Drachman and Drachman suggest. In addition,
this experiment is an attempt to find acoustical evidence for rhythmic stress.
Four pairs of segmentally identical words with different spelling and stress
patterns were chosen (see the parts of table 16.3 in bold type). One word in
each pair has lexical stress on the antepenultimate - (la) in table 16.3 - and
the other on the last syllable - (lb) in table 16.3. These pairs were
incorporated into segmentally identical, but orthographically distinct sen-
tences in which they were followed by a possessive enclitic (see table 16.3).
For clarity, (a) test words will be referred to as SS (for "secondary stress")
and (b) words as RS (for rhythmic stress), thus reflecting the terms used by
various analyses but not necessarily the author's opinion on the nature of
stress in Greek. These two terms will be used throughout with the same
caution. The addition of the enclitic results in a change in the stress pattern of
SS words as required by the SWFC. Thus, these words have a "secondary
stress" on their antepenultimate syllable (i.e. the weakened lexical stress) and
primary stress (i.e. the added stress) on their last syllable. According to
Malikouti-Drachman and Drachman, RS words also have this stress pattern
since (a) polysyllabic words with final stress carry rhythmic stress on their
antepenultimate syllable and (b) rhythmic stress and "secondary stress" are
not distinguished. If their claims are correct, then the SS and RS words of
each test pair are segmentally and metrically identical and therefore
indistinguishable.
Four pairs of distractors incorporated into identical sentences were also
included (see table 16.3, 2a and 2b). These were devised on the same pattern
Table 16.3 One of the test sentence pairs and one of the distractor sentence
pairs of experiment 2. The test words and distractors are in bold type.

1 (a) /mu 'eleye 'oti 'vriski ton .eni'ko tis po'li enoxliti'ko/
      "S/he was telling me that s/he finds her tenant very annoying"
  (b) /mu 'eleye 'oti 'vriski ton eni'ko tis po'li enoxliti'ko/
      "S/he was telling me that s/he finds her "singular"
      [nonuse of politeness forms] very annoying"
2 (a) /i a'poxi tus 'itan po'li me'yali/
"Their hunting-net was very big"
(b) /i apo'xi tus 'itan po'li me'yali/
"Their abstention was very big"

as the test sentences, the difference being that the word pairs in the distractors
differed in the position of the primary stress only; /a'poxi/ "hunting net":
/apo'xi/ "abstention". The sentences were read by six subjects, in conditions
similar to those described for experiment 1, from a typed randomized list
which included six repetitions of each test sentence and distractor. The first
two subjects (El and MK) read the sentences from hand-written cards three
times each.
A listening tape was made in the same way as in experiment 1. The stimuli
were the whole sentences not just the test word pairs. The tape contained one
token of each test sentence and distractor elicited from each subject. There
were 3 sec. of silence between sentences and 5 sec. after every tenth sentence.
The first four sentences were repeated at the end of the tape, and the first four
responses of each listener were discarded. Each subject heard a total of 100
sentences: 6 speakers x (8 test sentences + 8 distractors) + 4 repeated
stimuli.

16.3.1.2 Perceptual experiment


The same speakers that recorded the material of experiment 1 did the
recording of experiment 2. In addition, two more female subjects (El and
MK) of similar age and education as three of the subjects of experiment 1
took part in the recording. All the subjects that took part in perceptual test 1
performed test 2 as well. The responses of one of the subjects who did not
understand what she was asked to do and left most test pairs unmarked were
discarded. The procedure was the same as that described in experiment 1.
The answer sheet gave 200 possible answers, i.e. 100 stimuli x 2 alternatives.

16.3.1.3 Acoustical analyses


All three tokens of each test sentence of the original recording of El and
MK and the first three tokens of HP's recording were digitized at a sampling
rate of 16 kHz and measurements of duration, amplitude, and Fo were
obtained. Comparisons of the antepenultimate and final syllables of the SS
words with the equivalent syllables of the RS words are presented here (see
figures 16.1-16.4, below). For instance, the duration, Fo and amplitude
values of /e/ in SS word /.eni'ko/ "tenant" were compared to those of /e/ in
RS word /eni'ko/ "singular."
Duration was measured from spectrograms. The error range was one pitch
period (about 4-5 msec., as all three subjects were female). Measurements
followed common criteria of segmentation (see Peterson and Lehiste 1960).
VOT was measured as part of the following vowel.
Three different measurements of amplitude were obtained: peak amplitude
(PA), root mean square (RMS) amplitude, and amplitude integral (AI). All
data have been normalized so as to avoid statistical artifacts due to
accidental changes such as a subject's leaning towards the microphone etc.
To achieve normalization, the PA of each syllable was divided by the highest
PA in the word in question while the RMS and AI of each syllable was
divided by the word's RMS and AI respectively; thus the RMS and AI of
each syllable are presented as percentages of the word's RMS and AI
respectively. All results refer to the normalized data. All original measure-
ments were in arbitrary units given by the signal processing package used.
For peak amplitude, measurements were made from waveforms at the
point of highest amplitude of each syllable nucleus. RMS and AI were
measured using a computer program which made use of the amplitude
information available in the original sample files.2 To calculate the RMS,
the amplitude of each point within the range representing the syllable nucleus
was squared and the sum of squared amplitudes was divided by the number
of points; the square root of this measurement represents the average
amplitude of the sound (RMS) and is independent of the sound's duration.
AI measurements were obtained by simply calculating the square root of the
sum of squared amplitudes of all designated points without dividing the sum
by the number of points. In this way, the duration of the sound is taken into
account when its amplitude is measured, as a longer sound of lower
amplitude can have the same amplitude integral as a shorter sound of higher
amplitude. This way of measuring amplitude is based on Beckman (1986);
Beckman, indeed, found that there is strong correlation between stress and
AI for English.
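A sketch of the three amplitude measures and the word-internal normalization just described is given below. The array handling and function names are mine, and the choice to compute the word-level RMS and AI over the concatenated syllable nuclei is one possible reading of the description above, not a detail reported in the text.

```python
import numpy as np

def peak_amplitude(nucleus):
    """Peak amplitude (PA): the largest absolute sample value in the nucleus."""
    return np.max(np.abs(nucleus))

def rms(samples):
    """Root-mean-square amplitude: average amplitude, independent of duration."""
    return np.sqrt(np.mean(np.square(samples)))

def amplitude_integral(samples):
    """Amplitude integral (AI): as RMS but without dividing by the number of
    points, so the duration of the sound is taken into account."""
    return np.sqrt(np.sum(np.square(samples)))

def normalize_word(syllable_nuclei):
    """Normalize per word: each syllable's PA is divided by the word's highest
    PA, and each syllable's RMS and AI are expressed as percentages of the
    word's RMS and AI (here computed over the concatenated nuclei)."""
    pas = [peak_amplitude(n) for n in syllable_nuclei]
    word = np.concatenate(syllable_nuclei)
    word_rms, word_ai = rms(word), amplitude_integral(word)
    return [{"PA": pa / max(pas),
             "RMS": 100.0 * rms(n) / word_rms,
             "AI": 100.0 * amplitude_integral(n) / word_ai}
            for pa, n in zip(pas, syllable_nuclei)]

# Example with two made-up syllable nuclei:
print(normalize_word([np.array([0.1, 0.4, 0.2]), np.array([0.3, 0.8, 0.5, 0.2])]))
```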
Fundamental frequency was measured using the Fo tracker facility of a
signal processing package (Audlab). To ensure the reliability of the Fo
tracks, narrow-band spectrograms were also made and the contour of the
2
I am indebted to Dr. D. Davies and Dr. K. Roussopoulos for writing the program for me.

Table 16.4 Experiment 2: contingency table of type of stimulus by subject
response.

                        Type of stimulus
Response          SS stimulus        RS stimulus        Total

(a) Observed responses
SS                    402                 17              419
RS                      6                391              397
Total                 408                408              816

(b) Expected responses (and deviances)
SS                    209.5 (176.87)     209.5 (176.87)
RS                    198.5 (186.68)     198.5 (186.68)

Note: Total deviance (χ2) = 727.10, 1 df. The result is significant; p<0.001

harmonics tracked and measured. Discontinuities in the Fo tracks were
smoothed out by hand to correspond to the contour of the harmonics in the
narrow-band spectrograms. No actual measurements of Fo are presented
here since what is essential is the difference between the contours of SS and
RS words.
16.3.2 Results
16.3.2.1 Perceptual experiment
The subjects gave a total of 816 responses excluding the distractors (16
subjects x 48 responses/answer sheet). Nine subjects made no mistakes in
the test words and the other seven made a total of 23 mistakes; the test's
identification rate was 97.2 percent. Of the subjects that made mistakes, five
made between 1 and 3 mistakes (2 mistakes on average). Only two subjects
made 6 and 7 mistakes respectively. The distractors' identification rate was
very similar (98.2 percent). Only four people made 1, 2, 3, and 9 mistakes each
in the distractors. The persons who made the highest number of mistakes in the
test words made the highest number of mistakes in the distractors as well. The
results clearly show (see table 16.4) that rhythmic and "secondary stress" can
be easily distinguished by native speakers of Greek; thus, it is incorrect to
equate them as Malikouti-Drachman and Drachman do.
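The deviance figures in table 16.4 follow from the standard chi-square computation for a 2 x 2 contingency table; the sketch below simply makes that arithmetic explicit.

```python
# Observed responses from table 16.4: rows are responses (SS, RS),
# columns are stimulus types (SS, RS).
observed = [[402, 17],
            [6, 391]]

row_totals = [sum(row) for row in observed]        # [419, 397]
col_totals = [sum(col) for col in zip(*observed)]  # [408, 408]
grand_total = sum(row_totals)                      # 816

total_deviance = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        deviance = (obs - expected) ** 2 / expected
        total_deviance += deviance
        print(f"cell ({i}, {j}): expected {expected:.1f}, deviance {deviance:.2f}")

print(f"total deviance (chi-square, 1 df) = {total_deviance:.2f}")
# The four per-cell deviances sum to about 727.1, in line with the total
# reported in the note to table 16.4.
```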

16.3.2.2 Acoustical analyses


Duration. For each test word the duration of two syllables is presented here,
that of the antepenultimate syllable ("secondary" or rhythmic stress) and
that of the last syllable (primary stress). Results for all subjects together are
shown in figure 16.1.
[Figure not reproduced; y-axis: msec.]
Figure 16.1 (a) Means (series 1) and SDs (series 2) of the duration, in msec., of antepenultimate
syllables of SS words (left, upper case) and RS words (right, lower case) for all subjects. (b) Same
measurements for final syllables

The data from the three subjects are pooled, as t-tests performed on each
subject's data separately showed no differences across subjects. One-tailed t-
tests for the data of all three subjects show that, for antepenultimate
syllables, the duration of the antepenult of SS words is significantly longer
than that of RS words in all word pairs (see table 16.5 for the t-test results).
For vowel durations, one-tailed t-tests show that the duration of the


Table 16.5 Results of one-tailed t-tests performed on
the durations of the antepenultimate syllables of SS
and RS words of all test word pairs. In all cases,
df = 16. The syllables that are being compared are in
upper case letters.

Test pair        t      p <

1 ePItropi       5.94   0.0005
2 SIMvuli        3.58   0.005
3 siMEtoxi       4.47   0.0005
4 Eniko          5.6    0.0005

Table 16.6 Results of one-tailed t-tests performed on
the durations of the vowels of antepenultimate syll-
ables of SS and RS words of all test word pairs. In all
cases, df = 16. The vowels that are being compared
are in upper case letters.

Test pair        t      p <

1 epItropi       6.83   0.0005
2 sImvuli        5.75   0.0005
3 simEtoxi       8.22   0.0005
4 Eniko          5.6    0.0005

antepenultimate vowel is also significantly longer in SS words, in all test


word pairs (see table 16.6 for t-test results). By contrast, no significant
differences were found between SS and RS words either in the duration of
their final syllables or in that of final vowels when two-tailed tests were
performed.
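The comparisons in tables 16.5 and 16.6 are ordinary two-sample t-tests. The sketch below shows how one such comparison might be run; the sample size of nine tokens per condition is an assumption (it is consistent with the reported df = 16), and the duration values are invented for illustration, not the experimental data.

```python
from scipy import stats

# Hypothetical antepenult durations (msec.) for one test pair: nine SS tokens
# and nine RS tokens (3 subjects x 3 tokens each), giving df = 9 + 9 - 2 = 16.
# The values below are invented for illustration only.
ss_durations = [182, 175, 190, 178, 185, 181, 176, 188, 179]
rs_durations = [142, 150, 138, 147, 141, 145, 139, 152, 144]

t, p_two_sided = stats.ttest_ind(ss_durations, rs_durations)

# One-tailed test of the directional hypothesis "SS antepenults are longer":
# halve the two-sided p when t lies in the predicted (positive) direction.
p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
df = len(ss_durations) + len(rs_durations) - 2
print(f"t({df}) = {t:.2f}, one-tailed p = {p_one_sided:.6f}")
```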

Amplitude Results of amplitude measurements were not pooled; the measure-
ments differed extensively between subjects so that pooling the data could
result in statistical artifacts. Of all the measurements, only AI shows a
relatively consistent correlation between stress and amplitude, and then only
for subject HP. PA and RMS measurements did not yield any significant
results for any subject.
AI results for HP's data are shown in figure 16.2. One-tailed t-tests showed
that all SS antepenults have significantly higher AI than their RS counter-
parts although the statistical results are not as strong as those of durational
data (see table 16.7 for details). On the other hand, two-tailed t-tests on the
[Figure not reproduced; y-axis: % of word's AI]
Figure 16.2 (a) AI means, expressed as percentages, of antepenultimate syllables of SS words
(left, upper case) and RS words (right, lower case) for subject H.P. (b) Same measurements for
final syllables

final syllables of SS and RS words do not show significant differences
between them. The results of AI measurements for MK, whose data do not
show any correspondence between amplitude and stress, are presented in
figure 16.3.
Subject to further investigation, the results suggest that perhaps amplitude
is not a strong stress correlate in Greek. Botinis (1989) however, found that,
[Figure not reproduced; y-axis: % of word's AI]
Figure 16.3 (a) AI means, expressed as percentages, of antepenultimate syllables of SS words
(left, upper case) and RS words (right, lower case) for subject M.K. (b) Same measurements for
final syllables

in his data, stressed syllables had significantly higher peak amplitude than
unstressed syllables. On the other hand, he also reports that in perceptual
tests amplitude changes did not affect the subjects' stress judgments. These
results could mean that as amplitude is not a robust stress cue, its acoustical
presence is not necessary and some speakers might opt not to use it. Clearly,
both a more detailed investigation into how to measure amplitude, and data

Table 16.7 Results of one-tailed t-tests performed on
the AI of the antepenultimate syllables of SS and RS
words of all test word pairs, for subject HP. In all
cases, df = 4. The syllables that are being compared
are in upper case letters.

Test pair        t      p <

1 ePItropi       6.62   0.005
2 SIMvuli        3.58   0.025
3 siMEtoxi       6.45   0.005
4 Eniko          3.73   0.025

elicited from a larger number of speakers are needed before a conclusion is
reached.

Fundamental frequency Characteristic Fo plots are shown in figure 16.4. There
were no differences either within each subject's data or across subjects. The
Fo contours show a clear difference between SS and RS test words. In
SS words, Fo is high on the antepenult whereas in RS words, Fo is very low
and relatively flat. No important differences between the contours of last
syllables of SS and RS words were found. They all started with slightly low Fo
that rose sharply to a high value. One noticeable effect is that in many cases
the Fo high is not associated with the beginning of the stressed syllable but
rather with its end and the beginning of the following, unstressed syllable.
This seems to be a characteristic of Greek stress as the results of Botinis
(1989) confirm.

16.4 Discussion
The results of the first experiment show that native speakers of Greek cannot
differentiate between the rightmost lexical stress of a phrase and a SWFC-
induced stress which fall on the same syllable of segmentally identical
phrases. This implies that, contrary to the analyses of Setatos (1974) and of
Nespor and Vogel (1986, 1989), the SWFC-induced stress is the most
prominent stress in a host-and-clitic group, in the same way that the most
prominent stress in a Φ is the rightmost lexical stress. This conclusion agrees
with the description of the phenomenon by most analyses of Greek, both
phonological (e.g. Joseph and Warburton 1987; Malikouti-Drachman and
Drachman 1980) and phonetic (e.g. Botinis 1989), and also with the basic
requirement of the SWFC; namely, that the main stress must fall at most
three syllables to the left of its domain boundary. Moreover, the results


[Figure not reproduced; axes: Fo (Hz) against time (sec.)]
Figure 16.4 Characteristic Fo contours together with the corresponding narrow-band
spectrograms for HP's /ton eniko tis/; (a) SS word (b) RS word. The thicker line on the plot
represents the smoothed contour

indicate that Botinis's proposal that the SWFC-induced stress belongs to a
perceptually distinct prosodic category is incorrect.
These results are corroborated by those of experiment 2. Starting with
Malikouti-Drachman and Drachman's assumption that "secondary stress"
(i.e. the weakened lexical stress of the host) and rhythmic stress are
phonetically identical, it was shown that they are in fact very different both
acoustically and perceptually. On the one hand, the syllables that carry
"secondary stress" were shown to be acoustically more prominent than
syllables thought to carry rhythmic stress. On the other hand, no acoustical
evidence for rhythmic stress was found; syllables thought to carry rhythmic
stress exhibited durations and F o contours similar to those of unstressed
syllables. Finally, the data corroborate those of experiment 1 in that the final
syllables in all test word pairs exhibited no acoustical differences between SS
and RS words. These results indicate that the stress of both final syllables is
primary whether lexical or SWFC-induced.
I propose to account for the present results in the following way. Word (or
lexical) stress placement is a lexical process while the SWFC-induced stress in
host-and-clitic groups is the result of postlexical application of the SWFC.
This difference becomes clear if one considers, again, cliticization and
affixation. Although syllable addition is common to both of these processes
they yield different results: whereas affixation results in a shift of the main
stress as in
(10) /'maθima/ "lesson" > /ma'θimata/ "lessons"
cliticization results in a stress addition, as in
(11) /'maθima tu/ > /,maθi'ma tu/ "his lesson"
This is precisely because affixation takes place within the lexical component,
whereas cliticization is a postlexical process. Thus, on leaving the lexical
component, all words except clitics form independent stress domains, like
the final form in (12).

(12) SD              SD                  SD
     pi ra ma   >    pi ra ma + ta   >   pi ra ma ta
     "experiment" > "experiment + s" > "experiments"
     [strong/weak labels of the metrical trees not reproduced]

The fact that all words constitute independent stress domains is true even
of monosyllabic "content" words. The difference between those and clitics
becomes apparent when one considers examples like (13):

(13) /'anapse to 'fos/ "turn on the light"
which shows that SWFC violations do not arise between words because these
form separate stress domains.
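
The lexical side of this account can be stated schematically. The sketch below (in Python) is purely illustrative and not part of the original analysis: it uses an invented toy representation of a word as a list of syllables plus the index of its stressed syllable, and invented function names (swfc_ok, affix), to show the SWFC as a well-formedness check together with the repair by stress shift that accompanies affixation, as in (10) and (12).

    # Illustrative sketch only: the SWFC and the lexical repair by stress shift.

    def swfc_ok(syllables, stress):
        # The main stress may fall at most three syllables to the left of the
        # right edge of its stress domain (here, the end of the syllable list).
        return stress >= len(syllables) - 3

    def affix(syllables, stress, suffix):
        # Affixation applies within the lexical component: if the added suffix
        # causes a SWFC violation, the single lexical stress shifts rightward.
        syllables = syllables + suffix
        while not swfc_ok(syllables, stress):
            stress += 1
        return syllables, stress

    # /'pirama/ "experiment" + /-ta/ -> /pi'ramata/ "experiments", as in (12)
    print(affix(["pi", "ra", "ma"], 0, ["ta"]))   # (['pi', 'ra', 'ma', 'ta'], 1)
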
Clitics, however, remain unattached weak syllables until they are attached
to a host postlexically. In this way, clitics extend the boundaries of words, i.e.
of stress domains (SDs); clitics form compound SDs with their hosts. For
example,
(14)                               SD

      w w s w w           w w s w w             w  w  s  w  w
      ton pa te ra mu  >  ton pa te ra mu   >   to(m) ba te ra mu
      "my father (acc.)"

The stress domain formed by the host and its enclitic still has to conform to
the SWFC. When cliticization does not result in a SWFC violation no change
of stress pattern is necessary. When, however, the SWFC is violated by the
addition of enclitics, the results of the violation are different from those
observed within the lexical component. This is precisely because the host has
already acquired lexical stress and constitutes an independent stress domain
with fixed stress. Thus, in SWFC violations the host's stress cannot move
from its position, as it does within the lexical component. The only
alternative, therefore, is for another stress to be added in such a position that
it can comply with the SWFC, thus producing the stress two syllables to the
right of the host's lexical stress. In this case the compound SD is divided into
two SDs.
(15)          SD                          SD

                                      SD        SD

      w w s w w w                  w w s w s w
      to ti le fo no mas     >     to ti le fo no mas
      "our telephone"

In this way, the subordination of the first stress is captured, as well as the
fact that both stresses still belong to one stress domain, albeit a compound
one, and therefore they are at the same prosodic level. The disadvantage of
this proposal is that there is no motivation for choosing between /fonomas/
and /nomas/ as the second constituent of the compound SD.
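
The postlexical side can be sketched in the same toy terms (again purely illustrative, with an invented function name cliticize): cliticization extends the compound stress domain, the host's lexical stress stays fixed, and a violation of the SWFC is repaired by adding a stress two syllables to the right of the host's lexical stress, which becomes the most prominent stress of the group.

    # Illustrative sketch only (continuing the toy representation above):
    # postlexical application of the SWFC under cliticization.

    def cliticize(host_syllables, host_stress, enclitic):
        syllables = host_syllables + enclitic
        stresses = [host_stress]                    # the host's (weakened) lexical stress
        if len(syllables) - host_stress > 3:        # SWFC violated by the enclitic
            stresses.append(host_stress + 2)        # added stress, two syllables to the right
        return syllables, stresses

    # /pa'tera/ + /mu/ -> /pa'tera mu/: no violation, no change, as in (14)
    print(cliticize(["pa", "te", "ra"], 1, ["mu"]))         # (['pa','te','ra','mu'], [1])

    # /'maθima/ + /tu/ -> /,maθi'ma tu/ "his lesson": stress addition, as in (11)
    print(cliticize(["ma", "θi", "ma"], 0, ["tu"]))         # (['ma','θi','ma','tu'], [0, 2])

    # /to ti'lefono/ + /mas/ "our telephone": stress added on /no/, as in (15)
    print(cliticize(["ti", "le", "fo", "no"], 1, ["mas"]))  # (['ti','le','fo','no','mas'], [1, 3])
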
Finally, another question that emerges from the experimental results is
whether Greek metrical structure needs to present rhythmic stresses at all.
Experimental data indicate that there is no acoustical evidence for rhythmic
stress in Greek and that what has been often described as "secondary stress"
is a weakened lexical stress. This statement may, at first, seem self-evident;
however, it must be remembered that a weakened lexical stress and the
subordinate stresses of words have not always been considered the same.
Liberman and Prince (1977), for instance, refer to Trager and Smith (1951),
who "argued for a distinction between nonprimary stresses within a word,
and subordinated main stresses of independent words, a distinction that
could be expressed by a one-level downgrading of all nonprimary stresses
within the confines of a given word; thus
3 1 2 1
Tennessee but Aral Sea."
(Liberman and Prince 1977: 255). If this line of argument is followed in the
Greek examples, then the lexical stress of the host should become a "3 stress"
rather than a "2 stress" since it becomes a subordinate stress within a word.
This, however, does not happen; on the contrary, the lexical stress of the host
remains strong enough to be perceptually identical to a subordinate main
stress, as experiment 1 showed. In my opinion, these results cast doubt on the
presence of "secondary stress," and consequently rhythmic stress, in Greek.
Although further investigation is necessary before a solution is reached, there
are strong arguments, in addition to the acoustical evidence, for proposing
that Greek does not in fact exhibit rhythmic stress.
Since experimental results fail to show any evidence for rhythmic stresses,
the only argument for their existence appears to be that they are heard.
However, in certain cases, different investigators have conflicting opinions as
to the placement of rhythmic stresses. For instance, on a phrase like
(16) /me'yali katastro'fi/ "great destruction"
the rhythmic stress should be on /ka/, according to the rules of Nespor and
Vogel (1989) while, according to Malikouti-Drachman and Drachman
(1980), the rhythmic stress should fall on /ta/. Differences between the two
analyses become greater as the number of unstressed syllables between lexical
stresses increases.
Moreover, only phonological analyses suggest that Greek exhibits rhyth-
mic stress. Phonetic analyses either do not mention the matter (Dauer 1980;
Botinis 1989), or fail to find evidence for rhythmic stress; for instance,
Fourakis (1986) concludes, on the basis of durational data, that Greek seems
to have a two-way distinction: ±stress, with no gradations. Although more
detailed research into the presence of rhythmic stress is necessary, there is
fairly strong support for postulating an n-ary branching analysis in which no
rhythmic stresses are marked. If further evidence confirms the present results,
then the universality of binary rhythmic patterns could be questioned.

16.5 Conclusion
It has been shown that the Greek Stress Well-Formedness Condition applies
both lexically, moving lexical stress to the right of its original position, and
postlexically, adding a stress two syllables to the right of the host in a host-
and-clitic group. In the latter case, the SWFC-induced stress becomes the
most prominent in the group; this stress was shown to be perceptually
identical to a lexical stress. The weakened lexical stress of the host was shown
to be acoustically and perceptually similar to a subordinate lexical stress and
not to rhythmic stress as has often been thought. The experimental evidence
together with the absence of strong phonological arguments to the contrary
suggests that Greek might not exhibit rhythmic stresses at all.

Appendix 1
The test phrases (bold type) of experiment 1 in the
context in which they were read

1(a) /tu 'ipa yi'a to 'ari'sta su ke 'xarike po'li/
"I told him about your first-class mark and he was very pleased."
(b) /e'yo tu fo'nazo 'ari 'stasu ki a'ftos ðe stama'tai/
"I shout at him 'Ari, stop,' but he doesn't stop."
2(a) /pso'nizi 'panda a'po to psaradi'ko tus/
"S/he always shops from their fishmongery."
(b) /'ixan a'nekaθen psa'ra di'ko tus/
"They have always had their own fishmonger."

Appendix 2
The distractors (bold type) of experiment 1 in the
context in which they were read

1(a) /pi'stevo 'oti 'ksero to 'mono 'loyo yi'a a'fti tin ka'tastasi/
"I believe I know the only reason for this situation."
(b) /den 'exo a'kusi pi'o vare'to mo'noloyo sto 'θeatro/
"I haven't listened to a more boring theatrical monologue."
2(a) /ðe 'θelo 'pare 'dose me a'fto to 'atomo/
"I don't want to have anything to do with this person."
(b) /'ksero 'oti to pa'redose stus dike'uxus/
"I know that he delivered it to the beneficiaries."

Appendix 3
The test sentences of experiment 2.
The test words are in bold type

1(a) /i e,pitro'pi mas 'itan a'kurasti/
"Our commissioners were indefatigable."
(b) /i epitro'pi mas 'itan a'kurasti/
"Our committee was indefatigable."
2(a) /pi'stevo 'oti i ,simvu'li tu 'itan so'fi/
"I believe that his counsellors were wise."
(b) /pi'stevo 'oti i simvu'li tu 'itan so'fi/
"I believe that his advice was wise."
3(a) /no'mizo 'oti i si,meto'xi tu 'ine e'ksisu apa'retiti/
"I think that his co-participants are equally necessary."
(b) /no'mizo 'oti i simeto'xi tu 'ine e'ksisu apa'retiti/
"I think that his participation is equally necessary."
4(a) /mu 'eleye 'oti 'vriski ton ,eni'ko tis po'li enoxliti'ko/
"S/he was telling me that s/he finds her tenant very annoying."
(b) /mu 'eleye 'oti 'vriski ton eni'ko tis po'li enoxliti'ko/
"S/he was telling me that s/he finds her 'singular' [nonuse of
politeness forms] very annoying."

Appendix 4
The distractor sentences of experiment 2.
The distractors are in bold type

1(a) /ma a'fto 'ine porto'kali/
"But this is an orange."
(b) /ma a'fto 'ine portoka'li/
"But this is orange."
2(a) /i a'poxi tus 'itan po'li me'yali/
"Their hunting-net was very big."
(b) /i apo'xi tus 'itan po'li me'yali/
"Their abstention was very big."
3(a) /teli'ka to 'kerdise to me'talio/
"Finally he won the medal."
(b) /teli'ka to 'kerdise to meta'lio/
"Finally he won the mine."
4(a) /stin omi'lia tu ana'ferθike stus 'nomus tus/
"In his speech he referred to their laws."
(b) /stin omi'lia tu ana'ferθike stus no'mus tus/
"In his speech he referred to their counties."

References

Abbreviations
ARIPUC Annual Report, Institute of Phonetics, University of Copenhagen
CSLI Center for the Study of Language and Information
IPO Instituut voor Perceptie Onderzoek (Institute for Perception Research)
IULC Indiana University Linguistics Club
JASA Journal of the Acoustical Society of America
JL Journal of Linguistics
JPhon Journal of Phonetics
JSHR Journal of Speech and Hearing Research
JVLVB Journal of Verbal Learning and Verbal Behaviour
LAGB Linguistics Association of Great Britain
Lg Language
Lg & Sp Language and Speech
LI Linguistic Inquiry
MITPR Massachusetts Institute of Technology, Progress Report
NELS North East Linguistic Society
PERILUS Phonetic Experimental Research at the Institute of Linguistics, University of
Stockholm
Proc. IEEE Int. Conf. Ac., Sp. & Sig. Proc. Proceedings of the Institute of Electrical
and Electronics Engineers Conference on Acoustics, Speech and Signal Processing
PY Phonology Yearbook
RILP Report of the Institute of Logopedics and Phoniatrics
STL-QPSR Quarterly Progress and Status Report, Speech Transmission Laboratory,
Royal Institute of Technology (Stockholm)
WCCFL West Coast Conference on Formal Linguistics
Albrow, K. H. 1975. Prosodic theory, Hungarian and English. Festschrift fur Norman
Denison zum 50. Geburtstag (Grazer Linguistische Studien, 2). Graz: University of
Graz Department of General and Applied Linguistics.
Alfonso, P. and T. Baer. 1982. Dynamics of vowel articulation. Lg & Sp 25: 151-73.

Ali, L. H., T. Gallagher, J. Goldstein, and R. G. Daniloff. 1971. Perception of
coarticulated nasality. JASA 49: 538-40.
Allen, J., M. S. Hunnicutt, and D. Klatt. 1987. From Text to Speech: the MITalk
System. Cambridge: Cambridge University Press.
Anderson, L. B. 1980. Using asymmetrical and gradient data in the study of vowel
harmony. In R. M. Vago (ed.), Issues in Vowel Harmony. Amsterdam: John
Benjamins.
Anderson, M., J. Pierrehumbert, and M. Liberman. 1984. Synthesis by rule of English
intonation patterns. Proc. IEEE Int. Conf. Ac, Sp. & Sig. Proc. 282-4. New
York: IEEE.
Anderson, S. R. 1974. The Organization of Phonology. New York: Academic Press.
1978. Tone features. In V. Fromkin (ed.), Tone: a Linguistic Survey. New York:
Academic Press.
1982. The analysis of French schwa. Lg 58: 535-73.
1986. Differences in rule type and their structural basis. In H. van der Hulst and N.
Smith (eds.), The Structure of Phonological Representations, part 2. Dordrecht:
Foris.
Anderson, S. R. and W. Cooper. Fundamental frequency patterns during sponta-
neous picture description. Ms. University of Iowa.
Archangeli, D. 1984. Underspecification in Yawelmani phonology. Doctoral disser-
tation, Cambridge, MIT.
1985. Yokuts harmony: evidence for coplanar representation in nonlinear phono-
logy. LI 16: 335-72.
1988. Aspects of underspecification theory. Phonology 5.2: 183-207.
Arvaniti, A. 1990. Review of A. Botinis, 1989. Stress and Prosodic Structure in Greek:
a Phonological, Acoustic, Physiological and Perceptual Study. Lund University
Press. JPhon 18: 65-9.
Bach, E. 1968. Two proposals concerning the simplicity metric in phonology. Glossa
4: 3-21.
Barry, M. 1984. Connected speech: processes, motivations and models. Cambridge
Papers in Phonetics and Experimental Linguistics 3.
1985. A palatographic study of connected speech processes. Cambridge Papers in
Phonetics and Experimental Linguistics 4.
1988. Assimilation in English and Russian. Paper presented at the colloquium of
the British Association of Academic Phoneticians, Trinity College, Dublin,
March 1988.
Beattie, G., A. Cutler, and M. Pearson. 1982. Why is Mrs Thatcher interrupted so
often? Nature 300: 744-7.
Beckman, M. E. 1986. Stress and Non-Stress Accent (Netherlands Phonetic Archives
7). Dordrecht: Foris.
Beckman, M. E. and J. Kingston. 1990. Introduction to J. Kingston and M. Beckman
(eds.), Papers in Laboratory Phonology I: Between the Grammar and the Physics
of Speech. Cambridge: Cambridge University Press, 1-16.
Beckman, M. E. and J. B. Pierrehumbert. 1986. Intonational structure in English and
Japanese. PY 3: 255-310.

Beddor, P. S., R. A. Krakow, and L. M. Goldstein. 1986. Perceptual constraints and
phonological change: a study of nasal vowel height. PY 3: 197-217.
Bell-Berti, F. and K. S. Harris. 1981. A temporal model of speech production.
Phonetica 38: 9-20.
Benguerel, A. P. and T. K. Bhatia. 1980. Hindi stop consonants: an acoustic and
fiberscopic study. Phonetica 37: 134-48.
Benguerel, A. P. and H. Cowan. 1974. Coarticulation of upper lip protrusion in
French. Phonetica 30: 41-55.
Berendsen, E. 1986. The Phonology of Cliticization. Dordrecht: Foris.
Bernstein, N. A. 1967. The Coordination and Regulation of Movements. London:
Pergamon.
Bertch, W. F., J. C. Webster, R. G. Klumpp, and P. O. Thomson. 1956. Effects of two
message-storage schemes upon communications within a small problem-solving
group. JASA 28: 550-3.
Bickley, C. 1982. Acoustic analysis and perception of breathy vowels. Working
Papers, MIT Speech Communications 1: 74-83.
Bing, J. M. 1979. Aspects of English prosody. Doctoral dissertation, University of
Massachusetts.
Bird, S. and E. Klein. 1990. Phonological events. JL 26: 33-56.
Bloch, B. 1941. Phonemic overlapping. American Speech 16: 278-84.
Bloomfield, L. 1933. Language. New York: Holt.
Blumstein, S. E. and K. N. Stevens. 1979. Acoustic invariance in speech production:
evidence from the spectral characteristics of stop consonants. JASA 66: 1011-17.
Bolinger, D. 1951. Intonation: levels versus configuration. Word 7: 199-210.
1958. A theory of pitch accent in English. Word 14: 109-49.
1986. Intonation and its Parts. Stanford, CA: Stanford University Press.
Botha, R. P. 1971. Methodological Aspects of Transformational Generative Phonology.
The Hague: Mouton.
Botinis, A. 1989. Stress and Prosodic Structure in Greek: A Phonological, Acoustic,
Physiological and Perceptual Study. Lund: Lund University Press.
Boyce, S. 1986. The "trough" phenomenon in Turkish and English. JASA 80: S.95
(abstract).
Broe, M. 1988. A unification-based approach to prosodic analysis. Edinburgh
Working Papers in Linguistics 21: 63-82.
Bromberger, S. and M. Halle. 1989. Why phonology is different. LI 20.1: 51-69.
Browman, C. P. 1978. Tip of the tongue and slip of the ear: implications for language
processing. UCLA Working Papers in Phonetics, 42.
Browman, C. P. and L. Goldstein. 1985. Dynamic modeling of phonetic structure. In
V. Fromkin (ed.), Phonetic Linguistics. New York: Academic Press.
1986. Towards an articulatory phonology. PY 3: 219-52.
1988. Some notes on syllable structure in articulatory phonology. Phonetica 45:
140-55.
1989. Articulatory gestures as phonological units. Phonology, 6.2: 201-51.
1990. Tiers in articulatory phonology with some implications for casual speech. In
J. Kingston and M. Beckman (eds.), Papers in Laboratory Phonology I: Between
the Grammar and the Physics of Speech. Cambridge: Cambridge University Press,
341-76.
Browman, C. P., L. Goldstein, E. L. Saltzman, and C. Smith. 1986. GEST: a
computational model for speech production using dynamically defined articula-
tory gestures. JASA, 80, Suppl. 1 S97 (abstract).
Browman, C. P., L. Goldstein, J. A. S. Kelso, P. Rubin, and E. L. Saltzman. 1984.
Articulatory synthesis from underlying dynamics. JASA 75: S22-3. (abstract).
Brown, G., K. Currie, and J. Kenworthy. 1980. Questions of Intonation. London:
Croom Helm.
Brown, R. W. and D. McNeill. 1966. The "tip of the tongue" phenomenon. JVLVB
5: 325-37.
Bruce, G. 1977. Swedish word accents in sentence perspective. Lund: Gleerup.
1982a. Developing the Swedish intonation model. Working Papers, Department of
Linguistics, University of Lund, 22: 51-116.
1982b. Textual aspects of prosody in Swedish. Phonetica 39: 274-87.
Bruce, G. and E. Garding. 1978. A prosodic typology for Swedish dialects. In E.
Garding, G. Bruce, and R. Bannert (eds.), Nordic Prosody. Lund: Gleerup.
Bullock, D. and S. Grossberg. 1988. The VITE model: a neural command circuit for
generating arm and articulator trajectories. In J. A. S. Kelso, A. J. Mandell, and
M. F. Shlesinger (eds.), Dynamic Patterns in Complex Systems. Singapore:
World Scientific, 305-26.
Carlson, L. 1983. Dialogue Games: an Approach to Discourse Analysis (Synthese
Language Library 17). Dordrecht: Reidel.
Carnochan, J. 1957. Gemination in Hausa. In Studies in Linguistic Analysis. The
Philological Society, Oxford: Basil Blackwell.
Catford, J. C. 1977. Fundamental Problems in Phonetics. Edinburgh: Edinburgh
University Press.
Chang, N-C. 1958. Tones and intonation in the Chengtu dialect (Szechuan, China).
Phonetica 2: 59-84.
Chao, Y. R. 1932. A preliminary study of English intonation (with American
variants) and its Chinese equivalents. T'sai Yuan Pei Anniversary Volume, Suppl.
Vol. 1 Bulletin of the Institute of History and Philology of the Academica Sinica.
Peiping.
Chatterjee, S. K. 1975. Origin and Development of the Bengali Language. Calcutta:
Rupa.
Chiba, T. and M. Kajiyama. 1941. The Vowel. Its Nature and Structure. Tokyo:
Taseikan.
Choi, J. 1989. Some theoretical issues in the analysis of consonant to vowel spreading
in Kabardian. MA thesis, Department of Linguistics, UCLA.
Chomsky, N. 1964. The nature of structural descriptions. In N. Chomsky, Current
Issues in Linguistic Theory. The Hague: Mouton.
1965. Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.
Chomsky, N. and M. Halle. 1965. Some controversial questions in phonological
theory. JL 1:97-138.
1968. The Sound Pattern of English. New York: Harper and Row.

Clark, M. 1990. The Tonal System of Igbo. Dordrecht: Foris.
Clements, G. N. 1976. The autosegmental treatment of vowel harmony. In W.
Dressler and O. Pfeiffer (eds.), Phonologica 1976. Innsbruck: Innsbrucker
Beitrage zur Sprachwissenschaft.
1978. Tone and syntax in Ewe. In D. J. Napoli (ed.), Elements of Tone, Stress, and
Intonation. Georgetown: Georgetown University Press.
1979. The description of terrace-level tone languages. Lg 55: 536-58.
1981. The hierarchical representation of tone features. Harvard Studies in Phono-
logy 2: 50-105.
1984. Principles of tone assignment in Kikuyu. In G. N. Clements and J. Goldsmith
(eds.), Autosegmental Studies in Bantu Tone. Dordrecht: Foris, 281-339.
1985. The geometry of phonological features. PY 2: 225-52.
1986. Compensatory lengthening and consonant gemination in Luganda. In L.
Wetzels and E. Sezer (eds.), Studies in Compensatory Lengthening. Dordrecht:
Foris.
1987. Phonological feature representation and the description of intrusive stops.
Papers from the Parasession on Autosegmental and Metrical Phonology. Chicago
Linguistics Society, University of Chicago.
1990a. The role of the sonority cycle in core syllabification. In J. Kingston and M.
Beckman (eds.), Papers in Laboratory Phonology I: Between the Grammar and the
Physics of Speech. Cambridge: Cambridge University Press, 283-333.
1990b. The status of register in intonation theory. In J. Kingston and M. Beckman
(eds.), Papers in Laboratory Phonology I: Between the Grammar and the Physics
of Speech. Cambridge: Cambridge University Press, 58-71.
Clements, G. N. and J. Goldsmith. 1984. Introduction. In G. N. Clements and J.
Goldsmith (eds.), Autosegmental Studies in Bantu Tone. Dordrecht: Foris.
Clements, G. N. and S. J. Keyser. 1983. CV Phonology: a Generative Theory of the
Syllable. Cambridge, MA: MIT Press.
Cohen, A. and J. 't Hart. 1967. On the anatomy of intonation. Lingua 19: 177-92.
Cohen, A., R. Collier, and J. 't Hart. 1982. Declination: construct or intrinsic feature
of speech pitch? Phonetica 39: 254-73.
Cohen, J. and P. Cohen. 1983. Applied Multiple Regression/Correlation Analysis for
the Behavioral Sciences, 2nd edn. Hillsdale, NJ: Lawrence Erlbaum.
Coleman, J. 1987. Knowledge-based generation of speech synthesis parameters. Ms.
Experimental Phonetics Laboratory, Department of Language and Linguistic
Science, University of York.
1989. The phonetic interpretation of headed phonological structures containing
overlapping constituents. Manuscript.
Coleman, J. and J. Local. 1991. "Constraints" in autosegmental phonology. To
appear in Linguistics and Philosophy.
Collier, R. 1989. On the phonology of Dutch intonation. In F. J. Heyvaert and F.
Steurs (eds.), Worlds behind Words. Leuven: Leuven University Press.
Collier, R. and J. 't Hart. 1981. Cursus Nederlandse intonatie. Leuven: Acco.
Connell, B. and D. R. Ladd 1990. Aspects of pitch realisation in Yoruba. Phonology
7: 1-29.

Cooper, A. forthcoming. Stress effects on laryngeal gestures.
Cooper, W. E. and J. M. Paccia-Cooper. 1980. Syntax and Speech. Cambridge, MA:
Harvard University Press.
Cooper, W. E. and J. Sorensen, 1981. Fundamental Frequency in Sentence Production.
New York: Springer.
Costa, P. J. and I. G. Mattingly. 1981. Production and perception of phonetic
contrast during phonetic change. Status Report on Speech Research Sr-67/68.
New Haven: Haskins Laboratories, 191-6.
Cotton, S. and F. Grosjean. 1984. The gating paradigm: a comparison of successive
and individual presentation formats. Perception and Psychophysics 35: 41-8.
Crompton, A. 1982. Syllables and segments in speech production. In A. Cutler (ed.),
Slips of the Tongue and Language Production. Amsterdam: Mouton.
Crystal, D. 1969. Prosodic Systems and Intonation in English. Cambridge: Cambridge
University Press.
Cutler, A. 1980. Errors of stress and intonation. In V. A. Fromkin (ed.), Errors in
Linguistic Performance. New York: Academic Press.
1987. Phonological structure in speech recognition. PY 3: 161-78.
Cutler, A., J. Mehler, D. Norris, and J. Segui. 1986. The syllable's differing role in the
segmentation of French and English. Journal of Memory and Language 25:
385-400.
Dalby, J. M. 1986. Phonetic Structure of Fast Speech in American English. Bloom-
ington: IULC.
Daniloff, R. and R. E. Hammarberg. 1973. On defining coarticulation. JPhon, 1:
239-48.
Daniloff, R., G. Shuckers, and L. Feth. 1980. The Physiology of Speech and Hearing.
Englewood Cliffs, NJ: Prentice Hall.
Dauer, R. M. 1980. Stress and rhythm in modern Greek. Doctoral dissertation,
University of Edinburgh.
Delattre, P. 1966. Les Dix Intonations de base du Francais. French Review 40: 1-14.
1971. Pharyngeal features in the consonants of Arabic, German, Spanish, French,
and American English. Phonetica 23: 129-55.
Dell, F. forthcoming. L'Accentuation dans les phrases en Francais. In F. Dell, J.-R.
Vergnaud, and D. Hirst (eds.), Les Representations en phonologie. Paris:
Hermann.
Dev, A. T. 1973. Students' Favourite Dictionary. Calcutta: Dev Sahitya Kutir.
Diehl, R. and K. Kluender. 1989. On the objects of speech perception. Ecological
Psychology 1.2: 121-44.
Dinnsen, D. A. 1983. On the Characterization of Phonological Neutralization. IULC.
1985. A re-examination of phonological neutralisation. JL 21: 265-79.
Dixit, R. P. 1987. In defense of the phonetic adequacy of the traditional term "voiced
aspirated." UCLA Working Papers in Phonetics 67: 103-11.
Dobson, E. J. 1968. English Pronunciation. 1500-1700. 2nd edn. Oxford: Oxford
University Press.
Docherty, G. J. 1989. An experimental phonetic study of the timing of voicing in
English obstruents. Doctoral dissertation, University of Edinburgh.

Downing, B. 1970. Syntactic structure and phonological phrasing in English.
Doctoral dissertation, University of Texas.
Dowty, D. R., R. E. Wall, and S. Peters. 1981. Introduction to Montague Semantics.
Dordrecht: Reidel.
Erikson, D. 1976. A physiological analysis of the tones of Thai. Doctoral dissertation,
University of Connecticut.
Erikson, Y. 1973. Preliminary evidence of syllable locked temporal control of Fo.
STL-QPSR 2-3: 23-30.
Erikson, Y. and M. Alstermark. 1972. Fundamental frequency correlates of the grave
accent in Swedish: the effect of vowel duration. STL-QPSR 2-3: 53-60.
Fant, G. 1959. Acoustic analysis and synthesis of speech with applications to
Swedish. Ericsson Technics 1.
1960. Acoustic Theory of Speech Production. The Hague: Mouton.
Fant, G. and Q. Linn. 1988. Frequency domain interpretation and derivation of
glottal flow parameters. STL-QPSR 2-3: 1-21.
Ferguson, C. A. and M. Chowdhury. 1960. The phonemes of Bengali. Lg 36.1: 22-59.
Firth, J. R. 1948. Sounds and prosodies. Transactions of the Philological Society
127-52; also in F. R. Palmer (ed.) Prosodic Analysis. Oxford: Oxford University
Press.
1957. A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis.
The Philological Society, Oxford: Basil Blackwell.
Fischer-Jorgensen, E. 1975. Trends in Phonological Theory. Copenhagen: Akademisk.
Flanagan, J. L. 1972. Speech Analysis, Synthesis, and Perception, 2nd edn. Berlin:
Springer.
Folkins, J. W. and J. H. Abbs. 1975. Lip and jaw motor control during speech:
responses to resistive loading of the jaw. JSHR 18: 207-20.
Foss, D. J. 1969. Decision processes during sentence comprehension: effects of lexical
item difficulty and position upon decision times. JVLVB 8: 457-62.
Foss, D. J. and M. A. Blank. 1980. Identifying the speech codes. Cognitive Psychology
12: 1-31.
Foss, D. J. and M. A. Gernsbacher. 1983. Cracking the dual code: toward a unitary
model of phoneme identification. JVLVB 22: 609-32.
Foss, D. J. and D. A. Swinney. 1973. On the psychological reality of the phoneme:
perception, identification, and consciousness. JVLVB 12: 246-57.
Foss, D. J., D. A. Harwood, and M. A. Blank. 1980. Deciphering decoding decisions:
data and devices. In R. A. Cole (ed.), The Perception and Production of Fluent
Speech, Hillsdale, NJ: Lawrence Erlbaum.
Fourakis M. 1986. An acoustic study of the effects of tempo and stress on segmental
intervals in modern Greek. Phonetica 43: 172-88.
Fourakis, M. and R. Port. 1986. Stop epenthesis in English. JPhon 14: 197-221.
Fowler, C. A. 1977. Timing Control in Speech Production. Bloomington, IULC.
1980. Coarticulation and theories of extrinsic timing control. JPhon 8: 113-33.
1981a. Perception and production of coarticulation among stressed and unstressed
vowels. JSHR 24: 127-39.

1981b. A relationship between coarticulation and compensatory shortening. Pho-
netica 38: 35-50.
1985. Current perspectives on language and speech perception: a critical overview.
In R. Daniloff (ed.), Speech Science: Recent Advances. San Diego, CA:
College-Hill.
1986. An event approach to the study of speech perception from a direct-realist
perspective. JPhon 14: 3-28.
Fowler, C. A., P. Rubin, R. E. Remez, and M. T. Turvey. 1980. Implications for
speech production of a skilled theory of action. In B. Butterworth (ed.),
Language Production I. London: Academic Press.
Frederiksen, J. R. 1967. Cognitive factors in the recognition of ambiguous auditory
and visual stimuli. (Monograph) Journal of Personality and Social Psychology
7.
Franke, F. 1889. Die Umgangssprache der Nieder-Lausitz in ihren Lauten. Phone-
tische Studien II, 21.
Fritzell, B. 1969. The velopharyngeal muscles in speech. Acta Otolaryngologica.
Suppl. 250.
Fromkin, V. A. 1971. The non-anomalous nature of anomalous utterances. Lg 47:
27-52.
1976. Putting the emPHAsis on the wrong sylLABle. In L. M. Hyman (ed.), Studies
in Stress and Accent. Los Angeles: University of Southern California.
Fry, D. B. 1955. Duration and intensity as physical correlates of linguistic stress.
JASA 27: 765-8.
1958. Experiments in the perception of stress. Lg & Sp 1: 126-52.
Fudge, E. C. 1987. Branching structure within the syllable. JL 23: 359-77.
Fujimura, O. 1962. Analysis of nasal consonants. JASA 34: 1865-75.
1986. Relative invariance of articulatory movements: an iceberg model. In J. S.
Perkell and D. H. Klatt (eds.), Invariance and Variability in Speech Processes.
Hillsdale, NJ: Lawrence Erlbaum.
1987. Fundamentals and applications in speech production research. Proceedings
of the Eleventh International Congress of Phonetic Sciences. 6: 10-27.
1989a. An overview of phonetic and phonological research. Nihongo to Nihongo
Kyooiku 2: 365-89. (Tokyo: Meiji-shoin.)
1989b. Comments on "On the quantal nature of speech", by K. N. Stevens. JPhon
17: 87-90.
1990. Toward a model of articulatory control: comments on Browman and
Goldstein's paper. In J. Kingston and M. Beckman (eds.), Papers in Laboratory
Phonology I: Between the Grammar and the Physics of Speech. Cambridge:
Cambridge University Press, 377-81.
Fujimura, O. and M. Sawashima. 1971. Consonant sequences and laryngeal control.
Annual Bulletin of the Research Institute of Logopedics and Phoniatrics 5: 1-6.
Fujisaki, H. and K. Hirose. 1984. Analysis of voice fundamental frequency contours
for declarative sentences of Japanese. Journal of the Acoustical Society of Japan
5.4: 233-42.

Fujisaki, H. and H. Keikichi. 1982. Modelling the dynamic characteristics of voice
fundamental frequency with applications to analysis and synthesis of intonation.
Preprints of Papers, Working Group on Intonation, Thirteenth International
Congress of Linguists, Tokyo.
Fujisaki, H. and S. Nagashima. 1969. A model for the synthesis of pitch contours of
connected speech. Tokyo University Engineering Research Institute Annual
Report 28: 53-60.
Fujisaki, H. and H. Sudo. 1971a. A generative model for the prosody of connected
speech in Japanese. Tokyo University Engineering Research Institute Annual
Report 30: 75-80.
1971b. Synthesis by rule of prosodic features of Japanese. Proceedings of the
Seventh International Congress of Acoustics 3: 133-6.
Fujisaki, H., M. Sugito, K. Hirose, and N. Takahashi. 1983. Word accent and
sentence intonation in foreign language learning. Preprints of Papers, Working
Group on Intonation, Thirteenth International Congress of Linguists, Tokyo:
109-19.
Gage, W. 1958. Grammatical structures in American English intonation. Doctoral
dissertation, Cornell University.
Gamkrelidze, T. V. 1975. On the correlation of stops and fricatives in a phonological
system. Lingua 35: 231-61.
Garding, E. 1983. A generative model of intonation. In A. Cutler and D. R. Ladd
(eds.), Prosody: Models and Measurements. Heidelberg: Springer.
Garding, E., A. Botinis, and P. Touati. 1982. A comparative study of Swedish, Greek
and French intonation. Working Papers, Department of Linguistics, University of
Lund, 22: 137-52.
Gay, T. 1977. Articulatory movements in VCV sequences. JASA 62: 183-93.
1978. Articulatory units: segments or syllables. In A. Bell and J. Hooper (eds.),
Segments and Syllables. Amsterdam: North Holland.
1981. Mechanisms in the control of speech rate. Phonetica 38: 148-58.
Gazdar, G., E. Klein, G. Pullum, and I. Sag. 1985. Generalised Phrase Structure
Grammar. London: Basil Blackwell.
Gimson, A. C. 1960. The instability of English alveolar articulations. Le Maitre
Phonetique 113: 7-10.
1970. An Introduction to the Pronunciation of English. London: Edward Arnold.
Gobl, C. 1988. Voice source dynamics in connected speech. STL-QPSR 1: 123-59.
Goldsmith, J. 1976. Autosegmental Phonology. MIT Doctoral dissertation. New
York: Garland, 1979.
1984. Tone and accent in Tonga. In G. N. Clements and J. Goldsmith (eds.),
Autosegmental Studies in Bantu Tone (Publications in African Languages and
Linguistics 3). Dordrecht: Foris, 19-51.
Gracco, V. and J. Abbs. Variant and invariant characteristics of speech movements.
Experimental Brain Research 65: 165-6.
Greene, P.H. 1971. Introduction. In I. M. Gelfand, V. S. Gurfinkel, S. V. Fomin, and
M. L. Tsetlin (eds.), Models of Structural Functional Organization of Certain
Biological Systems. Cambridge, MA: MIT Press, xi-xxxi.

Gronnum, N. forthcoming. Prosodic parameters in a variety of regional Danish
standard languages, with a view towards Swedish and German. To appear in
Phonetica.
Grosjean, F. 1980. Spoken word recognition processes and the gating paradigm.
Perception and Psychophysics 28: 267-83.
Günther, H. 1988. Oblique word forms in visual word recognition. Linguistics 26:
583-600.
Gussenhoven, C. 1983. Focus, mode and the nucleus. JL 19: 377-417.
1984. On the Grammar and Semantics of Sentence Accents. Dordrecht: Foris.
1988. Adequacy in intonational analysis: the case of Dutch. In H. van der Hulst
and N. Smith (eds.), Autosegmental Studies in Pitch Accent. Dordrecht: Foris.
forthcoming. Intonational phrasing and the prosodic hierarchy. Phonologica 1988.
Cambridge: Cambridge University Press.
Gussenhoven, C. and T. Rietveld. 1988. Fundamental frequency declination in
Dutch: testing three hypotheses. JPhon 16: 355-69.
Hakoda, K. and H. Sato. 1980. Prosodic rules in connected speech synthesis. Trans.
IECE. 63-D No. 9: 715-22.
Halle, M. and K. N. Stevens. 1971. A note on laryngeal features. MITPR 101:
198-213.
Halle, M. and J. Vergnaud. 1980. Three dimensional phonology. Journal of Linguistic
Research 1: 83-105.
Halliday, M. A. K. 1967. Intonation and Grammar in British English. The Hague:
Mouton.
Hammond, M. 1988. On deriving the well-formedness condition. LI 19: 319-25.
Han, M. S. 1962. Japanese Phonology: An Analysis Based on Sound Spectrograms.
Tokyo: Kenkyusha.
Haraguchi, S. 1977. The Tone Pattern of Japanese: An Autosegmental Theory of
Tonology. Tokyo: Kaitakushi.
Hardcastle, W. J. 1972. The use of electropalatography in phonetic research.
Phonetica 25: 197-215.
1976. Physiology of Speech Production: An Introduction for speech scientists.
London: Academic Press.
Harris, Z. H. 1944. Simultaneous components in phonology. Lg 20: 181-205.
Harshman, R., P. Ladefoged and L. Goldstein. 1977. Factor analysis of tongue
shapes. JASA 62: 693-707.
't Hart, J. 1979a. Naar automatisch genereeren van toonhoogte-contouren voor
tamelijk lange stukken spraak. IPO Technical Report No. 353, Eindhoven.
1979b. Explorations in automatic stylization of Fo curves. IPO Annual Progress
Report 14: 61-5.
1981. Differential sensitivity to pitch distance, particularly in speech. JASA 69:
811-21.
't Hart, J. and A. Cohen. 1973. Intonation by rule: a perceptual quest. JPhon 1:
309-27.
't Hart, J. and R. Collier. 1975. Integrating different levels of intonation analysis.
JPhon 3: 235-55.

1979. On the interaction of accentuation and intonation in Dutch. Proceedings of
The Ninth International Congress of Phonetic Sciences 2: 385-402.
Hawking, S. W. 1988. A Brief History of Time. London: Bantam Press.
Hawkins, S. 1984. On the development of motor control in speech: evidence from
studies of temporal coordination. In N. J. Lass (ed.), Speech and Language:
Advances in Basic Research and Practice 11. ZY1-1A.
Hayes, B. 1981. A Metrical theory of stress rules. Bloomington: IULC.
1986. Inalterability in CV Phonology. Lg 62.2: 321-51.
1989. Compensatory lengthening in moraic phonology. LI 20.2: 253-306.
Hayes, B. and A. Lahiri. 1991. Bengali intonational phonology. Natural Language
and Linguistic Theory 9.1: 47-96.
Hayes, B. and S. Puppel. 1985. On the rhythm rule in Polish. In H. van der Hulst and
N. Smith (eds.), Advances in Non-Linear Phonology. Dordrecht: Foris.
Hebb, D. O. 1949. The Organization of Behavior. New York: Wiley.
Helfrich, H. 1979. Age markers in speech. In K. R. Scherer and H. Giles (eds.), Social
Markers in Speech. Cambridge: Cambridge University Press.
Henderson, J. B. 1984. Velopharyngeal function in oral and nasal vowels: a cross-
language study. Doctoral dissertation, University of Connecticut.
Henke, W. L. 1966. Dynamic articulatory model of speech production using
computer simulation. Doctoral dissertation, MIT.
Hewlett, N. 1988. Acoustic properties of /k/ and /t/ in normal and phonologically
disordered speech. Clinical Linguistics and Phonetics 2: 29-45.
Hirose, K., H. Fujisaki, and H. Kawai. 1985. A system for synthesis of connected
speech - special emphasis on the synthesis of prosodic features. Onsei Ken-
kyuukai S85-13: 325-32. The Acoustical Society of Japan.
Hirose, K., H. Fujisaki, M. Yamaguchi, and M. Yokoo. 1984. Synthesis of funda-
mental frequency contours of Japanese sentences based on syntactic structure (in
Japanese). Onsei Kenkyuukai S83-70: 547-54. The Acoustical Society of Japan.
Hirschberg, J. and J. Pierrehumbert. 1986. The intonational structuring of discourse.
Proceedings of the 24th Annual Meeting, Association for Computational Linguis-
tics, 136-44.
Hjelmslev, L. 1953. Prolegomena to a Theory of Language, Memoir 7, translated by
F. J. Whitfield. Baltimore: Waverly Press.
Hockett, C. F. 1958. A Course in Modern Linguistics. New York: Macmillan.
Hombert, J-M. 1986. Word games: some implications for analysis of tone and other
phonological processes. In J. J. Ohala and J. J. Jaeger (eds.), Experimental
Phonology. Orlando, FL: Academic Press.
Honda, K. and O. Fujimura. 1989. Intrinsic vowel Fo and phrase-final lowering:
Phonological vs. biological explanations. Paper presented at the 6th Vocal Fold
Physiology Conference, Stockholm, August 1989.
Hooper, J. B. 1976. An Introduction to Natural Generative Phonology. New York:
Academic Press.
Houlihan, K. and G. K. Iverson. 1979. Functionally constrained phonology. In D.
Dinnsen (ed.), Current Approaches to Phonological Theory. Bloomington:
Indiana University Press.

Householder, F. 1957. Accent, juncture, intonation, and my grandfather's reader.
Word 13: 234-45.
1965. On some recent claims in phonological theory. JL 1: 13-34.
Huang, C-T. J. 1980. The metrical structure of terraced level tones. In J. Jensen (ed.),
NELS 11. Department of Linguistics, University of Ottawa.
Huggins, A. W. F. 1964. Distortion of the temporal pattern of speech: interruption
and alternation. JASA 36: 1055-64.
Huss, V. 1978. English word stress in the post-nuclear position. Phonetica 35: 86-105.
Hyman, L. M. 1975. Phonology: Theory and Analysis. New York: Holt, Rinehart, and
Winston.
1985. A Theory of Phonological Weight. Dordrecht: Foris.
Jackendoff, R. 1972. Semantic Interpretation in Generative Grammar. Cambridge,
MA: MIT Press.
Jakobson, R., G. Fant, and M. Halle. 1952. Preliminaries to Speech Analysis: the
Distinctive Features and their Correlates. Cambridge, MA: MIT Press.
Jespersen, O. 1904. Phonetische Grundfragen. Leipzig: Teubner.
1920. Lehrbuch der Phonetik. Leipzig: Teubner.
Johnson, C. D. 1972. Formal Aspects of Phonological Description. The Hague:
Mouton.
Jones, D. 1909. Intonation Curves. Leipzig: Teubner.
1940. An Outline of English Phonetics. Cambridge: Heffer.
Joos, M. 1957. Readings in Linguistics 1. Chicago: University of Chicago Press.
Joseph, B. D. and I. Warburton. 1987. Modern Greek. London: Croom Helm.
Kahn, D. 1976. Syllable Based Generalizations in English Phonology. Bloomington,
IULC.
Kaisse, E. M. 1985. Connected Speech: the Interaction of Syntax and Phonology. New
York: Academic Press.
Kawasaki, H. 1982. An acoustical basis for universal constraints on sound sequences.
Doctoral dissertation, University of California, Berkeley.
1986. Phonetic explanation for phonological universals: the case of distinctive
vowel nasalization. In J. J. Ohala and J. J. Jaeger (eds.), Experimental Phonology.
Orlando, FL: Academic Press.
Kay, B. A., K. G. Munhall, E. Vatikiotis-Bateson, and J. A. S. Kelso. 1985. A note on
processing kinematic data: sampling, filtering and differentiation. Haskins
Laboratories Status Report on Speech Research SR-81: 291-303.
Kaye, J. 1988. The ultimate phonological units - features or elements? Handout for
paper delivered to LAGB, Durham Spring 1988.
Kaye, J., J. Lowenstamm and J.-R. Vergnaud. 1985. The internal structure of
phonological elements: a theory of charm and government. PY 2: 305-28.
Keating, P. A. 1983. Comments on the jaw and syllable structure. JPhon 11: 401-6.
1985. Universal phonetics and the organization of grammars. In V. Fromkin (ed.),
Phonetic Linguistics: Essays in Honor of Peter Ladefoged. Orlando, FL: Aca-
demic Press.
1988a. Underspecification in phonetics. Phonology 5: 275-92.
1988b. The phonology-phonetics interface. In F. Newmeyer (ed.), Cambridge
Linguistic Survey, vol. 1: Linguistic Theory: Foundations. Cambridge: Cambridge
University Press.
Kelly, J. and J. Local. 1986. Long domain resonance patterns in English. In
International Conference on Speech Input/Output; Techniques and Applications.
IEE Conference Publication 258: 304-9.
1989. Doing Phonology: Observing, Recording, Interpreting. Manchester: Manches-
ter University Press.
Kelso, J. A. S. and B. Tuller. 1984. A dynamical basis for action systems. In M.
Gazzaniga (ed.), Handbook of Cognitive Neuroscience. New York: Plenum,
321-56.
Kelso, J. A. S., E. Saltzman and B. Tuller. 1986a. The dynamic perspective on speech
production: data and theory. JPhon 14: 29-59.
1986b. Intentional contents, communicative context, and task dynamics: a reply to
the commentators. JPhon 14: 171-96.
Kelso, J. A. S., K. G. Holt, P. N. Kugler, and M. T. Turvey. 1980. On the concept of
coordinative structures as dissipative structures, II: Empirical lines of conver-
gence. In G. E. Stelmach and J. Requin (eds.), Tutorials in Motor Behavior.
Amsterdam: North-Holland, 49-70.
Kelso, J. A. S., B. Tuller, E. Vatikiotis-Bateson, and C. A. Fowler. 1984. Functionally
specific articulatory cooperation following jaw perturbations during speech:
evidence for coordinative structures. Journal of Experimental Psychology: Hu-
man Perception and Performance 10: 812-32.
Kelso, J. A. S., E. Vatikiotis-Bateson, E. L. Saltzman, and B. Kay. 1985. A qualitative
dynamic analysis of reiterant speech production: phase portraits, kinematics,
and dynamic modeling. JASA 77: 266-80.
Kenstowicz, M. 1970. On the notation of vowel length in Lithuanian. Papers in
Linguistics 3: 73-113.
Kent, R. D. 1983. The segmental organization of speech. In P. F. MacNeilage (ed.),
The Production of Speech. New York: Springer.
Kerswill, P. 1985. A socio-phonetic study of connected speech processes in Cam-
bridge English: an outline and some results. Cambridge Papers in Phonetics and
Experimental Linguistics. 4.
1987. Levels of linguistic variation in Durham. JL 23: 25-49.
Kerswill, P. and S. Wright. 1989. On the limits of auditory transcription: a
sociophonetic approach. York Papers in Linguistics 14: 35-59.
Kewley-Port, D. 1982. Measurement of formant transitions in naturally produced
stop consonant-vowel syllables. JASA 72.2: 379-81.
King, M. 1983. Transformational parsing. In M. King (ed.), Natural Language
Parsing. London: Academic Press.
Kingston, J. 1990. Articulatory binding. In J. Kingston and M. Beckman (eds.),
Papers in Laboratory Phonology I: Between the Grammar and the Physics of
Speech. Cambridge: Cambridge University Press, 406-34.
Kingston, J. and M. E. Beckman (eds.). 1990. Papers in Laboratory Phonology I:
Between the Grammar and the Physics of Speech. Cambridge: Cambridge
University Press.
Kingston, J. and R. Diehl. forthcoming. Phonetic knowledge and explanation. Ms.,
University of Massachusetts, Amherst, and University of Texas, Austin.
Kiparsky, P. 1979. Metrical structure assignment is cyclic. LI 10: 421-41.
1985. Some consequences of Lexical Phonology. PY 2: 85-138.
Kiparsky, P. and C. Kiparsky. 1967. Fact. In M. Bierwisch and K. E. Heidolph (eds.),
Progress in Linguistics. The Hague: Mouton.
Klatt, D. H. 1976. Linguistic uses of segmental duration in English: acoustic and
perceptual evidence. JASA 59: 1208-21.
1980. Software for a cascade/parallel formant synthesizer. JASA 63.3: 971-95.
Klein, E. 1987. Towards a declarative phonology. Ms., University of Edinburgh.
Kohler, K. J. 1976. Die Instabilität wortfinaler Alveolarplosive im Deutschen: eine
elektropalatographische Untersuchung. Phonetica 33: 1-30.
1979a. Kommunikative Aspekte satzphonetischer Prozesse im Deutschen. In H.
Vater (ed.), Phonologische Probleme des Deutschen. Tubingen: Gunther Narr,
13-40.
1979b. Dimensions in the perception of fortis and lenis plosives. Phonetica 36:
332-43.
1990. Segmental reduction in connected speech in German: phonological facts and
phonetic explanations. In W. J. Hardcastle and A. Marchal (eds.), Speech
Production and Speech Modelling. Dordrecht: Kluwer, 62-92.
Kohler, K. J., W. A. van Dommelen, and G. Timmermann. 1981. Die Merkmalpaare
stimmhaft/stimmlos und fortis/lenis in der Konsonantenproduktion und -per-
zeption des heutigen Standard-Franzosisch. Institut fur Phonetik, Universitat
Kiel. Arbeitsberichte, 14.
Koutsoudas, A., G. Sanders and C. Noll. 1974. On the application of phonological
rules. Lg 50: 1-28.
Kozhevnikov, V. A. and L. A. Chistovich. 1965. Speech Articulation and Perception
(Joint Publications Research Service, 30). Washington, DC.
Krakow, R. A. 1989. The articulatory organization of syllables: a kinematic analysis
of labial and velar gestures. Doctoral dissertation, Yale University.
Krull, D. 1987. Second formant locus patterns as a measure of consonant-vowel
coarticulation. PERILUS V. Institute of Linguistics, University of Stockholm.
1989. Consonant-vowel coarticulation in continuous speech and in reference
words. STL-QPSR 1: 101-5.
Kruyt, J. G. 1985. Accents from speakers to listeners. An experimental study of the
production and perception of accent patterns in Dutch. Doctoral dissertation,
University of Leiden.
Kubozono, H. 1985. On the syntax and prosody of Japanese compounds. Work in
Progress 18: 60-85. Department of Linguistics, University of Edinburgh.
1988a. The organisation of Japanese prosody. Doctoral dissertation, University of
Edinburgh.
1988b. Constraints on phonological compound formation. English Linguistics 5:
150-69.
1988c. Dynamics of Japanese intonation. Ms., Nanzan University.
1989. Syntactic and rhythmic effects on downstep in Japanese. Phonology 6.1:
39-67.
Kucera, H. and W. N. Francis. 1967. Computational Analysis of Present-day
American English. Providence RI: Brown University Press.
Kuno, S. 1973. The Structure of the Japanese Language. Cambridge, MA: MIT
Press.
Kutik, E., W. E. Cooper, and S. Boyce. 1983. Declination of fundamental frequency
in speakers' production of parenthetical and main clauses. JASA 73: 1731-8.
Ladd, D. R. 1980. The Structure of Intonational Meaning: Evidence from English.
Bloomington: Indiana University Press.
1983a. Phonological features of intonational peaks. Lg 59: 721-59.
1983b. Levels versus configurations, revisited. In F. B. Agard, G. B. Kelley, A.
Makkai, and V. B. Makkai, (eds.), Essays in Honor of Charles F. Hockett.
Leiden: E. J. Brill.
1984. Declination: a review and some hypotheses. PY 1: 53-74.
1986a. Intonational phrasing: the case for recursive prosodic structure. PY 3:
311-40.
1986b. The representation of European downstep. Paper presented at the autumn
meeting of the LAGB, Edinburgh.
1987a. Description of research on the procedures for assigning Fo to utterances.
CSTR Text-to-Speech Status Report. Edinburgh: Centre for Speech Technology
Research.
1987b. A phonological model of intonation for use in speech synthesis by rule.
Proceedings of the European Conference on Speech Technology, Edinburgh.

1988. Declination "reset" and the hierarchical organization of utterances. JASA
84: 530-44.
1990. Metrical representation of pitch register. In J. Kingston and M. Beckman
(eds.), Papers in Laboratory Phonology I: Between the Grammar and the Physics
of Speech. Cambridge: Cambridge University Press, 35-57.
Ladd, D. R. and K. Silverman. 1984. Vowel intrinsic pitch in connected speech.
Phonetica 41: 41-50.
Ladd, D. R., K. Silverman, F. Tolkmitt, G. Bergmann, and K. R. Scherer. 1985.
Evidence for the independent function of intonation contour, pitch range, and
voice quality. JASA 78: 435-44.
Ladefoged, P. 1971. Preliminaries to Linguistic Phonetics. Chicago: University of
Chicago Press.
1977. The abyss between phonetics and phonology. Chicago Linguistic Society 13:
225-35.
1980. What are linguistic sounds made of? Lg 56: 485-502.
1982. A Course in Phonetics, 2nd edn. New York: Harcourt Brace Jovanovich.
Ladefoged, P. and N. Antonanzas-Baroso. 1985. Computer measures of breathy
voice quality. UCLA Working Papers in Phonetics 61: 79-86.
Ladefoged, P. and M. Halle. 1988. Some major features of the International Phonetic
Alphabet. Lg 64: 577-82.
Ladefoged, P. and M. Lindau. 1989. Modeling articulatory-acoustics relations: a
comment on Stevens' "On the quantal nature of speech." JPhon 17: 99-106.

Ladefoged, P. and I. Maddieson. 1989. Multiply articulated segments and the feature
hierarchy. UCLA Working Papers in Phonetics 72: 116-38.
Lahiri, A. and J. Hankamer. 1988. The timing of geminate consonants. JPhon 16:
327-38.
Lahiri, A. and J. Koreman. 1988. Syllable weight and quantity in Dutch. WCCFL 7:
217-28.
Lakoff, R. 1973. Language and woman's place. Language in Society 2: 45-79.
Langmeier, C., U. Luders, L. Schiefer, and B. Modi. 1987. An acoustic study on
murmured and "tight" phonation in Gujarati dialects - a preliminary report.
Proceedings of the Eleventh International Congress of Phonetic Sciences 1:
328-31.
Lapointe, S. G. 1977. Recursiveness and deletion. Linguistic Analysis 3.3: 227-65.
Lashley, K. S. 1930. Basic neural mechanisms in behavior. Psychological Review 37:
1-24.
Lass, R. 1976. English Phonology and Phonological Theory: Synchronic and Diachro-
nic Studies. Cambridge: Cambridge University Press.
1984a. Vowel system universals and typology: prologue to theory. PY 1: 75-112.
1984b. Phonology: an Introduction to Basic Concepts. Cambridge: Cambridge
University Press.
Lea, W. A. 1980. Prosodic aids to speech recognition. In W. A. Lea (ed.), Trends in
Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall.
Leben, W. R. 1973. Suprasegmental phonology. Doctoral dissertation, MIT.
1976. The tones in English intonation. Linguistic Analysis 2: 69-107.
1978. The representation of tone. In V. Fromkin (ed.), Tone: a Linguistic Survey.
New York: Academic Press.
Lehiste, I. 1970. Suprasegmentals. Cambridge, MA: MIT Press.
1972. The timing of utterances and linguistic boundaries. JASA 51: 2018-24.
1975. The phonetic structure of paragraphs. In A. Cohen and S. G. Nooteboom
(eds.), Structure and Process in Speech Perception. Heidelberg: Springer.
1980. Phonetic manifestation of syntactic structure in English. Annual Bulletin,
University of Tokyo RILP 14: 1-28.
Liberman, A. M. and I. G. Mattingly. 1985. The motor theory of speech perception
revised. Cognition 21: 1-36.
Liberman, M. Y. 1975. The Intonational System of English. Doctoral dissertation,
MIT. Distributed in 1978 by IULC.
Liberman, M. Y. and J. Pierrehumbert. 1984. Intonational invariance under changes
in pitch range and length. In M. Aronoff and R. T. Oehrle (eds.), Language
Sound Structure: Studies in Phonology Presented to Morris Halle. Cambridge,
MA: MIT Press, 157-233.
Liberman, M. Y. and A. Prince. 1977. On stress and linguistic rhythm. LI 8: 249-
336.
Licklider, J. C. R. and G. A. Miller. 1951. The perception of speech. In S. S. Stevens
(ed.), Handbook of Experimental Psychology. New York: John Wiley.
Lieberman, P. 1967. Intonation, Perception, and Language. Cambridge, MA: MIT
Press.

Lindau, M. and P. Ladefoged. 1986. Variability of feature specifications. In J. S.
Perkell and D. Klatt (eds.), Invariance and Variability of Speech Processes.
Hillsdale, NJ: Lawrence Erlbaum.
Lindblom, B. 1963. Spectrographic study of vowel reduction. JASA 35: 1773-81.
1983. Economy of speech gestures. In P. MacNeilage (ed.), The Production of
Speech. New York: Springer.
1984. Can the models of evolutionary biology be applied to phonetic problems? In
M. P. R. van den Broeke and A. Cohen (eds.), Proceedings of the Ninth
International Congress of Phonetic Sciences, Dordrecht: Foris.
1989. Phonetic invariance and the adaptive nature of speech. In B. A. G.
Elsdendoorn and H. Bouma (eds.), Working Models of Human Perception.
London: Academic Press.
Lindblom, B. and R. Lindgren. 1985. Speaker-listener interaction and phonetic
variation. PERILUS IV, Institute of Linguistics, University of Stockholm.
Lindblom, B. and J. Lubker. 1985. The speech homunculus and a problem of
phonetic linguistics. In V. Fromkin (ed.), Essays in Honor of Peter Ladefoged.
Orlando, FL: Academic Press.
Lindsey, G. 1985. Intonation and interrogation: tonal structure and the expression of
a pragmatic function in English and other languages. Doctoral dissertation,
UCLA.
Linell, P. 1979. Psychological Reality in Phonology. Cambridge: Cambridge Univer-
sity Press.
Local, J. K. 1990. Some rhythm, resonance and quality variations in urban Tyneside
speech. In S. Ramsaran (ed.), Studies in the Pronunciation of English: a Com-
memorative Volume in Honour of A. C. Gimson. London: Routledge, 282-92.
Local, J. K. and J. Kelly. 1986. Projection and "silences": notes on phonetic detail
and conversational structure. Human Studies 9: 185-204.
Lodge, K. R. 1984. Studies in the Phonology of Colloquial English. London: Croom
Helm.
Lofqvist, A., T. Baer, N. S. McGarr, and R. S. Story. 1989. The cricothyroid muscle
in voicing control. JASA 85: 1314-21.
Lubker, J. 1968. An EMG-cinefluorographic investigation of velar function during
normal speech production. Cleft Palate Journal 5.1.
1981. Temporal aspects of speech production: anticipatory labial coarticulation.
Phonetica 38: 51-65.
Lyons, J. 1962. Phonemic and non-phonemic phonology. International Journal of
American Linguistics, 28: 127-33.
McCarthy, J. J. 1979. Formal problems in Semitic phonology and morphology.
Doctoral dissertation, MIT.
1981. A prosodic theory of nonconcatenative morphology. LI 12.3: 373-418.
1989. Feature geometry and dependency. Phonetica 43: 84-108.
McCarthy, J. J. and A. Prince. 1986. Prosodic Morphology. Manuscript to appear
with MIT Press.
McCawley, J. D. 1968. The Phonological Component of a Grammar of Japanese. The
Hague: Mouton.

Macchi, M. 1985. Segmental and suprasegmental features and lip and jaw articula-
tors. Doctoral dissertation, New York University.
1988. Labial articulation patterns associated with segmental features and syllables
in English. Phonetica 45: 109-21.
McClelland, J. L. and J. L. Elman. 1986. The TRACE model of speech perception.
Cognitive Psychology 18: 1-86.
McCroskey, R. L., Jr. 1957. Effect of speech on metabolism. Journal of Speech and
Hearing Disorders 22: 46-52.
MacKay, D. G. 1972. The structure of words and syllables: evidence from errors in
speech. Cognitive Psychology 3: 210-27.
Maddieson, I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press.
Maeda, S. 1974. A characterization of fundamental frequency contours of speech.
MIT Quarterly Progress Report 114: 193-211.
1976. A characterization of American English intonation. Doctoral dissertation,
MIT.
Magen, H. 1989. An acoustic study of vowel-to-vowel coarticulation in English.
Doctoral dissertation, Yale University.
Makkai, V. B. 1972. Phonological Theory: Evolution and Current Practice. New York:
Holt, Rinehart, and Winston.
Malikouti-Drachman, A. and B. Drachman. 1980. Slogan chanting and speech
rhythm in Greek. In W. Dressler, O. Pfeiffer, and J. Rennison (eds.), Phonologica
1980. Innsbruck: Innsbrucker Beitrage zur Sprachwissenschaft.
Mandelbrot, B. 1954. Structure formelle des textes et communication. Word 10: 1-27.
Marslen-Wilson, W. D. 1984. Function and process in spoken word-recognition. In
H. Bouma and D. G. Bouwhuis (eds.), Attention and Performance X: Control of
Language Processes. Hillsdale, NJ: Lawrence Erlbaum.
1987. Functional parallelism in spoken word-recognition. In U. Frauenfelder and
L. K. Tyler (eds.), Spoken Word Recognition. Cambridge, MA: MIT Press.
Mascaro, J. 1983. Phonological levels and assimilatory processes. Ms., Universitat
Autonoma de Barcelona.
Massaro, D. W. 1972. Preperceptual images, processing time and perceptual units in
auditory perception. Psychological Review 79: 124-45.
Mehler, J. 1981. The role of syllables in speech processing. Philosophical Transactions
of the Royal Society B295: 333-52.
Mehler, J., J. Y. Dommergues, U. Frauenfelder, and J. Segui. 1981. The syllable's role
in speech segmentation. JVLVB 20: 298-305.
Mehrota, R. C. 1980. Hindi Phonology. Raipur.
Menn, L. and S. Boyce. 1982. Fundamental frequency and discourse structure. Lg &
Sp 25: 341-83.
Menzerath, P. and A. de Lacerda. 1933. Koartikulation, Steuerung und Lautabgren-
zung. Bonn.
Miller, J. E. and O. Fujimura. 1982. Graphic displays of combined presentations of
acoustic and articulatory information. The Bell System Technical Journal 61:
799-810.
Mills, C. B. 1980. Effects of context on reaction time to phonemes. JVLVB 19: 75-83.

Mohanan, K. P. 1983. The structure of the melody. Ms., MIT and University of
Singapore.
1986. The Theory of Lexical Phonology. Dordrecht: Reidel.
Monsen, R. B., A. M. Engebretson, and N. R. Vermula. 1978. Indirect assessment of
the contribution of sub-glottal pressure and vocal fold tension to changes
of fundamental frequency in English. JASA 64: 65-80.
Munhall, K., D. Ostry, and A. Parush. 1985. Characteristics of velocity profiles
of speech movements. Journal of Experimental Psychology: Human Perception
and Performance 2: 457-74.
Nakatani, L. and J. Schaffer. 1978. Hearing "words" without words: prosodic cues
for word perception. JASA 63: 234-45.
Nathan, G. S. 1983. The case for place - English rapid speech autosegmentally.
Chicago Linguistic Society 19: 309-16.
Nearey, T. M. 1980. On the physical interpretation of vowel quality: cinefluoro-
graphic and acoustic evidence. JPhon 8: 213-41.
Nelson, W. 1983. Physical principles for economies of speech movements. Biological
Cybernetics 46: 135-47.
Nespor, M. 1988. Rithmika charaktiristika tis Ellinikis (Rhythmic Characteristics of
Greek). Studies in Greek Linguistics: Proceedings of the 9th Annual Meeting
of the Department of Linguistics, Faculty of Philosophy, Aristotelian University
of Thessaloniki.
Nespor, M. and I. Vogel. 1982. Prosodic domains of external sandhi rules. In H. van
der Hulst and N. Smith (eds.) Advances in Non-Linear Phonology. Dordrecht:
Foris.
1983. Prosodic structure above the word. In A. Cutler and D. R. Ladd (eds.),
Prosody: Models and Measurements. Heidelberg: Springer.
1986. Prosodic Phonology. Dordrecht: Foris.
1989. On clashes and lapses. Phonology 6: 69-116.
Nittrouer, S., M. Studdert-Kennedy and R. S. McGowan. 1989. The emergence of
phonetic segments: evidence from the spectral structure of fricative-vowel
syllables spoken by children and adults. JSHR 32: 120-32.
Nittrouer, S., K. Munhall, J. A. S. Kelso, E. Tuller, and K. S. Harris. 1988. Patterns
of interarticulator phasing and their relationship to linguistic structure. JASA
84: 1653-61.
Nolan, F. J. 1986. The implications of partial assimilation and incomplete neutralisa-
tion. Cambridge Papers in Phonetics and Experimental Linguistics, 5.
Norris, D. G. and A. Cutler. 1988. The relative accessibility of phonemes and
syllables. Perception and Psychophysics 45: 485-93.
O'Connor, J. D. and G. F. Arnold. 1973. Intonation of Colloquial English, 2nd edn.
London: Longman.
Ohala, J. J. 1974. Experimental historical phonology. In J. M. Anderson and C. Jones
(eds.), Historical Linguistics, vol. II: Theory and Description in Phonology.
North-Holland, Amsterdam, 353-89.
1975. Phonetic explanations for nasal sound patterns. In C. A. Ferguson, L. M.

Hyman, and J. J. Ohala (eds.), Nasalfest: Papers from a Symposium on Nasals
and Nasalization. Stanford: Language Universals Project.
1976. A model of speech aerodynamics. Report of the Phonology Laboratory
(Berkeley) 1: 93-107.
1978. Phonological notations as models. In W. U. Dressler and W. Meid (eds.),
Proceedings of the Twelfth International Congress of Linguists, Vienna 1977.
Innsbruck: Innsbrucker Beiträge zur Sprachwissenschaft.
1979a. Universals of labial velars and de Saussure's chess analogy. Proceedings of
the Ninth International Congress of Phonetic Sciences, vol. II. Copenhagen:
Institute of Phonetics.
1979b. The contribution of acoustic phonetics to phonology. In B. Lindblom and
S. Ohman (eds.), Frontiers of Speech Communication Research. London: Aca-
demic Press.
1981a. Speech timing as a tool in phonology. Phonetica 43: 84-108.
1981b. The listener as a source of sound change. In C. S. Masek, R. A. Hendrick,
and M. F. Miller (eds.), Papers from the Parasession on Language and Behavior.
Chicago: Chicago Linguistic Society.
1982. Physiological mechanisms underlying tone and intonation. In H. Fujisaki
and E. Garding (eds.), Preprints, Working Group on Intonation, Thirteenth
International Congress of Linguists, Tokyo, 29 Aug.-4 Sept. 1982. Tokyo.
1983. The origin of sound patterns in vocal tract constraints. In P. F. MacNeilage
(ed.), The Production of Speech. New York: Springer.
1985a. Linguistics and automatic speech processing. In R. De Mori and C.-Y. Suen
(eds.), New Systems and Architectures for Automatic Speech Recognition and
Synthesis. Berlin: Springer.
1985b. Around flat. In V. Fromkin (ed.), Phonetic Linguistics. Essays in Honor of
Peter Ladefoged. Orlando, FL: Academic Press.
1986. Phonological evidence for top down processing in speech perception. In J. S.
Perkell and D. H. Klatt (eds.), Invariance and Variability in Speech Processes.
Hillsdale, NJ: Lawrence Erlbaum.
1987. Explanation in phonology: opinions and examples. In W. U. Dressler, H. C.
Luschützky, O. E. Pfeiffer, and J. R. Rennison (eds.), Phonologica 1984.
Cambridge: Cambridge University Press.
1989. Sound change is drawn from a pool of synchronic variation. In L. E. Breivik
and E. H. Jahr (eds.), Language Change: Do we Know its Causes Yet? (Trends in
Linguistics). Berlin: Mouton de Gruyter.
1990a. The phonetics and phonology of aspects of assimilation. In J. Kingston and
M. Beckman (eds.), Papers in Laboratory Phonology I. Between the Grammar and
Physics of Speech. Cambridge: Cambridge University Press, 258-75.
1990b. The generality of articulatory binding: comments on Kingston's "Articula-
tory binding". In J. Kingston and M. Beckman (eds.), Papers in Laboratory
Phonology I. Between the Grammar and Physics of Speech. Cambridge: Cam-
bridge University Press, 445-50.
forthcoming. The costs and benefits of phonological analysis. In P. Downing,

S. Lima, and M. Noonan (eds.), Literacy and Linguistics. Amsterdam: John
Benjamins.
Ohala, J. J. and B. W. Eukel. 1987. Explaining the intrinsic pitch of vowels. In R.
Channon and L. Shockey (eds.), In Honor of Ilse Lehiste. Ilse Lehiste Pühendusteos.
Dordrecht: Foris, 207-15.
Ohala, J. J. and D. Feder. 1987. Listeners' identity of speech sounds is influenced by
adjacent "restored" phonemes. Proceedings of the Eleventh International Con-
gress of Phonetic Sciences. 4: 120-3.
Ohala, J. J. and J. J. Jaeger (eds.), 1986. Experimental Phonology. Orlando, FL:
Academic Press.
Ohala, J. J. and H. Kawasaki. 1984. Phonetics and prosodic phonology. PY 1:
113-27.
Ohala, J. J. and J. Lorentz. 1977. The story of [w]: an exercise in the phonetic
explanation for sound patterns. Berkeley Linguistic Society, Proceedings, Annual
Meeting 3: 577-99.
Ohala, J. J. and C. J. Riordan. 1980. Passive vocal tract enlargement during voiced
stops. Report of the Phonology Laboratory, Berkeley 5: 78-87.
Ohala, J. J., M. Amador, L. Araujo, S. Pearson, and M. Peet. 1984. Use of synthetic
speech parameters to estimate success of word recognition. JASA 75: S.93.
Ohala, M. 1979. Phonological features of Hindi stops. South Asian Languages
Analysis 1: 79-87.
1983. Aspects of Hindi Phonology. Delhi: Motilal Banarsidass.
Ohman, S. E. G. 1966a. Coarticulation in VCV utterances: spectrographic measure-
ments. JASA 39: 151-68.
1966b. Perception of segments of VCCV utterances. JASA 40: 979-88.
Olive, J. 1975. Fundamental frequency rules for the synthesis of simple declarative
sentences. JASA 57: 476-82.
Oller, D. K. 1973. The effect of position in utterance on speech segment duration in
English. JASA 54: 1235-47.
O'Shaughnessy, D. 1979. Linguistic features in fundamental frequency patterns.
JPhon 7: 119-45.
O'Shaughnessy, D. and J. Allen. 1983. Linguistic modality effects on fundamental
frequency in speech. JASA 74, 1155-71.
Ostry, D. J. and K. G. Munhall. 1985. Control of rate and duration of speech
movements. JASA 77: 640-8.
Ostry, D. J., E. Keller, and A. Parush. 1983. Similarities in the control of speech
articulators and the limbs: kinematics of tongue dorsum movement in
speech. Journal of Experimental Psychology: Human Perception and Performance
9: 622-36.
Otsu, Y. 1980. Some aspects of rendaku in Japanese and related problems. Theoreti-
cal Issues in Japanese Linguistics (MIT Working Papers in Linguistics 2).
Palmer, F. R. 1970. Prosodic Analysis. Oxford: Oxford University Press.
Perkell, J. S. 1969. Physiology of Speech Production: Results and implications of a

quantitative cineradiographic study (Research Monograph 53). Cambridge, MA:
MIT Press.
Peters, S. 1973. On restricting deletion transformations. In M. Gross, M. H. Halle,
and M. Schützenberger (eds.), The Formal Analysis of Language. The Hague:
Mouton.
Peters, S. and R. W. Ritchie 1972. On the generative power of transformational
grammars. Information Sciences 6: 49-83.
Peterson, G. E. and H. Barney. 1952. Control methods used in a study of vowels.
JASA 24: 175-84.
Peterson, G. E. and I. Lehiste. 1960. Duration of syllable nuclei in English. JASA 32:
693-703.
Pierrehumbert, J. 1980. The Phonetics and Phonology of English Intonation. Doctoral
dissertation, MIT; distributed 1988, Bloomington: IULC.
1981. Synthesizing intonation. JASA 70: 985-95.
forthcoming. A preliminary study of the consequences of intonation for the voice
source. STL-QPSR 4: 23-36.
Pierrehumbert, J. and M. E. Beckman. 1988. Japanese Tone Structure (Linguistic
Inquiry Monograph Series 15). Cambridge, MA: MIT Press.
Pierrehumbert, J. and J. Hirschberg. 1990. The meaning of intonation contours in the
interpretation of discourse. In P. Cohen, J. Morgan, and M. Pollack (eds.), Plans
and Intentions in Communication. Cambridge, MA: MIT Press, 271-312.
Pike, K. L. 1945. The Intonation of American English. Ann Arbor: University of
Michigan Press.
Pollard, C. and I. Sag. 1987. Information-based Syntax and Semantics. Stanford:
CSLI.
Poon, P. G. and C. A. Mateer. 1985. A study of Nepali stop consonants. Phonetica
42: 39-47.
Poser, W. J. 1984. The phonetics and phonology of tone and intonation in Japanese.
Doctoral dissertation, MIT.
Pulleyblank, D. 1986. Tone in Lexical Phonology. Dordrecht: Reidel.
1989. Non-linear phonology. Annual Review of Anthropology 18: 203-26.
Pullum, G. K. 1978. Rule Interaction and the Organization of a Grammar. New York:
Garland.
Recasens, D. 1987. An acoustic analysis of V-to-C and V-to-V coarticulatory effects
in Catalan and Spanish VCV sequences. JPhon 15: 299-312.
Repp, B. R. 1981. On levels of description in speech research. JASA 69.5: 1462-4.
1986. Some observations on the development of anticipatory coarticulation. JASA
79: 1616-19.
Rialland, A. and M. B. Badjime. 1989. Réanalyse des tons du Bambara: des tons du
nom à l'organisation générale du système. Studies in African Linguistics 20.1:
1-28.
Rietveld, A. C. M. and C. Gussenhoven. 1985. On the relation between pitch
excursion size and pitch prominence. JPhon 13: 299-308.

Riordan, C. 1977. Control of vocal tract length in speech. JASA 62: 998-1002.
Roach, P. 1983. English Phonetics and Phonology: a Practical Course. Cambridge:
Cambridge University Press.
Roca, I. 1986. Secondary stress and metrical rhythm. PY 3: 341-70.
Roudet, L. 1910. Éléments de phonétique générale. Paris.
Rubin, P., T. Baer, and P. Mermelstein. 1981. An articulatory synthesizer for
perceptual research. JASA 70: 321-8.
Sagart, L., P. Halle, B. de Boysson-Bardies, and C. Arabia-Guidet. 1986. Tone
production in modern standard Chinese: an electromyographic investigation.
Paper presented at nineteenth International Conference on Sino-Tibetan Lan-
guages and Linguistics, Columbus, OH, 12-14 September 1986.
Sagey, E. 1986a. The representation of features and relations in non-linear phono-
logy. Doctoral dissertation, MIT.
1986b. On the representation of complex segments and their formulation in
Kinyarwanda. In E. Sezer and L. Wetzels (eds.), Studies in Compensatory
Lengthening. Dordrecht: Foris.
Salasoo, A. and D. Pisoni. 1985. Interaction of knowledge sources in spoken word identification.
Sereno, J. A., S. R. Baum, G. C. Marean, and P. Lieberman. 1987. Acoustic analysis
and perceptual data on anticipatory labial coarticulation in adults and children.
JASA 81: 512-19.
Setatos, M. 1974. Phonologia tis Kinis Neoellinikis (Phonology of Standard Greek).
Athens: Papazisis.
Sharf, D. J. and R. N. Ohde. 1981. Physiologic, acoustic and perceptual aspects of
coarticulation: implications for the remediation of articulatory disorders. In
N. J. Lass (ed.), Speech and Language: Advances in Basic Research and Practice,
vol. 5. New York: Academic Press, 153-247.
Shattuck-Hufnagel, S. and D. H. Klatt. 1979. Minimal use of features and marked-
ness in speech production: evidence from speech errors. JVLVB 18: 41-55.
Shattuck-Hufnagel, S., V. W. Zue, and J. Bernstein. 1978. An acoustic study of
palatalization of fricatives in American English. JASA 64: S92(A).
Shieber, S. M. 1986. An Introduction to Unification-based Approaches to Grammar.
Stanford: CSLI.
Sievers, E. 1901. Grundzüge der Phonetik. Leipzig: Breitkopf and Härtel.
Silverman, K. E. A. and J. Pierrehumbert. 1990. The timing of prenuclear high
accents in English. In J. Kingston and M. Beckman (eds.), Papers in Laboratory
Phonology I. Between the Grammar and Physics of Speech. Cambridge: Cam-
bridge University Press, 72-106.
Simada, Z. and H. Hirose. 1971. Physiological correlates of Japanese accent patterns.
Annual Bulletin of the Research Institute of Logopedics and Phoniatrics 5: 41-9.
Soundararaj, F. 1986. Acoustic phonetic correlates of prominence in Tamil words.
Work in Progress 19: 16-35. Department of Linguistics, University of
Edinburgh.
Sprigg, R. K. 1963. Prosodic analysis and phonological formulae, in Tibeto-Burman
linguistic comparison. In H. L. Shorto (ed.), Linguistic Comparison in South East
Asia and the Pacific. London: School of Oriental and African Studies.
1972. A polysystemic approach, in proto-Tibetan reconstruction, to tone and
syllable initial consonant clusters. Bulletin of the School of Oriental and African
Studies, 35. 3: 546-87.
Steele, S. 1986. Interaction of vowel Fo and prosody. Phonetica 43: 92-105.
1987. Nuclear accent Fo peak location: effects of rate, vowel, and number of
following syllables. JASA 80: Suppl. 1, S51.
Steele, S. and M. Y. Liberman. 1987. The shape and alignment of rising intonation.
JASA 81: S52.
Steriade, D. 1982. Greek prosodies and the nature of syllabification. Doctoral
dissertation, MIT.
Stetson, R. 1951. Motor Phonetics: a Study of Speech Movements in action. Amster-
dam: North-Holland.
Stevens, K. N. 1972. The quantal nature of speech: evidence from articulatory-
acoustic data. In E. E. David, Jr. and P. B. Denes (eds.), Human Communication:
a Unified View. New York: McGraw-Hill.
1989. On the quantal nature of speech. JPhon 17: 3-45.

Stevens, K. N. and S. J. Keyser 1989. Primary features and their enhancement in
consonants. Lg 65: 81-106.
Stevens, K. N., S. J. Keyser, and H. Kawasaki. 1986. Toward a phonetic and
phonological theory of redundant features. In J. S. Perkell and D. H. Klatt
(eds.), Invariance and Variability in Speech Processes. Hillsdale, NJ: Lawrence
Erlbaum.
Strange, W., R. R. Verbrugge, D. P. Shankweiler, and T. R. Edman. 1976. Consonant
environment specifies vowel identity. JASA 60: 213-24.
Sugito, M. and H. Hirose. 1978. An electromyographic study of the Kinki accent.
Annual Bulletin of the Research Institute of Logopedics and Phoniatrics 12: 35-51.
Summers, W. V. 1987. Effects of stress and final consonant voicing on vowel
production: articulatory and acoustic analyses. JASA 82: 847-63.
Summers, W. V., D. B. Pisoni, R. H. Bernacki, R. I. Pedlow, and M. A. Stokes. 1988.
Effects of noise on speech production: acoustic and perceptual analyses. JASA
84: 917-28.
Sussman, H., P. MacNeilage, and R. Hanson. 1973. Labial and mandibular dynamics
during the production of bilabial consonants: preliminary observations. JSHR
17: 397-420.
Sweet, H. 1877. A Handbook of Phonetics. Oxford: Oxford University Press.
Swinney, D. and P. Prather. 1980. Phoneme identification in a phoneme monitoring
experiment: the variable role of uncertainty about vowel contexts. Perception and
Psychophysics 27: 104-10.
Taft, M. 1978. Evidence that auditory word perception is not continuous: The DAZE
effect. Paper presented at the fifth Australian Experimental Psychology Confer-
ence, La Trobe University.
1979. Recognition of affixed words and the word frequency effect. Memory and
Cognition 7: 263-72.
Talkin, D. 1989. Voicing epoch determination with dynamic programming. JASA 85:
S149(A).
Thorsen, N. 1978. An acoustical analysis of Danish intonation. JPhon 6: 151-75.
1979. Interpreting raw fundamental frequency tracings of Danish. Phonetica 36:
57-78.
1980a. A study of the perception of sentence intonation: evidence from Danish.
JASA 67: 1014-30.
1980b. Intonation contours and stress group patterns in declarative sentences of
varying length in ASC Danish. ARIPUC 14: 1-29.
1981. Intonation contours and stress group patterns in declarative sentences of
varying length in ASC Danish - supplementary data. ARIPUC 15: 13-47.
1983. Standard Danish sentence intonation - phonetic data and their represen-
tation. Folia Linguistica 17: 187-220.
1984a. Variability and invariance in Danish stress group patterns. Phonetica 41:
88-102.
1984b. Intonation and text in standard Danish with special reference to the abstract
representation of intonation. In W. U. Dressler, H. C. Luschützky, O. E.

Pfeiffer, and J. R. Rennison (eds.), Phonologica 1984. Cambridge: Cambridge
University Press.
1985. Intonation and text in Standard Danish. JASA 77: 1205-16.
1986. Sentence intonation in textual context: supplementary data. JASA 80:
1041-7.
Touati, P. 1987. Structures prosodiques du Suédois et du Français. Lund: Lund
University Press.
Thrainsson, H. 1978. On the phonology of Icelandic preaspiration. Nordic Journal of
Linguistics 1: 3-54.
Trager, G. L. and H. L. Smith. 1951. An Outline of English Structure (Studies in
Linguistics, Occasional Papers 3). Norman, OK: Battenburg Press.
Trim, J. L. M. 1959. Major and minor tone-groups in English. Le Maître Phonétique
111: 26-9.
Trubetzkoy, N. S. 1939. Grundzüge der Phonologie. Transl. Principles of Phonology,
C. A. M. Baltaxe, 1969. Berkeley: University of California Press.
Turnbaugh, K. R., P. R. Hoffman, R. G. Daniloff, and R. Absher. 1985. Stop-vowel
coarticulation in 3-year-old, 5-year-old, and adult speakers. JASA 77: 1256-7.
Tyler, L. K. and J. Wessels. 1983. Quantifying contextual contributions to word
recognition processes. Perception and Psychophysics 34: 409-20.
Uldall, E. T. 1958. American "molar" r and "flapped" r. Revista do Laboratório de
Fonética Experimental (Coimbra) 4: 103-6.
Uyeno, T., H. Hayashibe, K. Imai, H. Imagawa, and S. Kiritani. 1981. Syntactic
structures and prosody in Japanese: a case study on pitch contours and the
pauses at phrase boundaries. University of Tokyo, Research Institute of Logo-
paedics and Phoniatrics, Annual Bulletin 1: 91-108.
Vaissiere, J. 1988. Prediction of velum movement from phonological specifications.
Phonetica 45: 122-39.
van der Hulst, H. and N. Smith. 1982. The Structure of Phonological Representations.
Part I. Dordrecht: Foris.
Vatikiotis-Bateson, E. 1988. Linguistic Structure and Articulatory Dynamics. Doc-
toral dissertation, Indiana University. Distributed by IULC.
Vogten, L. 1985. LVS-manual. Speech processing programs on IPO-VAX 11/780.
Eindhoven: Institute for Perception Research.
Waibel, A. 1988. Prosody and Speech Recognition. London: Pitman; San Mateo:
Morgan Kaufman.
Wang, W. S-Y. and J. Crawford. 1960. Frequency studies of English consonants. Lg
&Sp 3: 131-9.
Warren, P. and W. D. Marslen-Wilson. 1987. Continuous uptake of acoustic cues in
spoken word recognition. Perception and Psychophysics 43: 262-75.
1988. Cues to lexical choice: discriminating place and voice. Perception and
Psychophysics 44: 21-30.
Warren, R. M. 1970. Perceptual restoration of missing speech sounds. Science 167:
392-3.
Wells, J. C. 1982. Accents of English 1: An Introduction. Cambridge: Cambridge
University Press.

Westbury, J. R. 1983. Enlargement of the supraglottal cavity and its relation to stop
consonant voicing. JASA 73: 1322-36.
Whitney, W. 1879. Sanskrit Grammar. Cambridge, MA: Harvard University Press.
Williams, B. 1985. Pitch and duration in Welsh stress perception: the implications for
intonation. JPhon 13: 381-406.
Williams, C. E. and K. N. Stevens. 1972. Emotions and speech: some acoustical
correlates. JASA 52: 1238-50.
Wood, S. 1982. X-ray and model studies of vowel articulation. Working Papers, Dept.
of Linguistics, Lund University 23.
Wright, J. T. 1986. The behavior of nasalized vowels in the perceptual vowel space. In
Ohala, J. J. and J. J. Jaeger (eds.), 1986. Experimental Phonology. Orlando, FL:
Academic Press.
Wright, S. and P. Kerswill. 1988. On the perception of connected speech processes.
Paper delivered to LAGB, Durham, Spring 1988.
Yaeger, M. 1975. Vowel harmony in Montreal French. JASA 57: S69.
Yip, M. 1989. Feature geometry and co-occurrence restrictions. Phonology 6: 349-74.
Zimmer, K. E. 1969. Psychological correlates of some Turkish morpheme structure
conditions. Lg 46: 309-21.
Zsiga, E. and D. Byrd. 1988. Phasing in consonant clusters: articulatory and acoustic
effects. Ms.
Zue, V. W. and S. Shattuck-Hufnagel. 1980. Palatalization of /s/ in American
English: when is a /š/ not a /š/? JASA 67: S27.

Name index

Abbs, J. H., 11, 122 Bolinger, D., 322, 325, 332


Albrow, K. H., 194 Botha, R. P., 191
Alfonso, P., 26 Botinis, A., 399-403, 412, 414, 418
Ali, L. H. 255 Boves, L., 5
Allen, J., 116,328 Boyce, S., 64, 326
Alstermark, M., 178 Broe, M., 4, 94, 149, 198
Anderson, L. B., 178 Bromberger, S., 196
Anderson, S. R., 5, 27, 192, 200, 298 Browman, C. P., 2-4, 9, 14, 16, 18, 20-7, 30,
Antonanzas-Baroso, N., 304 42, 44, 56-67, 69-70, 87, 116, 122, 128,
Archangeli, D., 154, 198, 234 136, 165, 190, 194, 199-200, 225, 257,
Aristotle, 227 287-8, 314
Arvaniti, A., 4, 398 Brown, R. W., 181
Bruce, G., 5, 90, 325
Baer, T., 26-7 Bullock, D., 70
Barney, H., 169 Byrd, D., 288
Barry, M., 205, 264, 267-8, 278
Beckman, M. E., 2-4, 64, 68, 77, 84, 87-90, Carlson, L., 393
92-4, 111, 114, 118, 120, 122-3, 125-6, Carnochan, J., 196
193, 196, 324, 326-7, 331, 333-4, 342-3, Catford, J. C , 313
345, 368-9, 372-3, 375, 385-7, 389-91, Chang, N.-C., 328
393, 396 Chiba, T., 170, 175
Beddor, P. S., 179, 181 Chistovich, L. A., 141
Bell-Berti, F., 26, 141 Chomsky, N., 116, 149, 165, 167, 183, 195-
Benguerel, A. P., 311 6, 216, 296-8, 398
Berendsen, E., 399 Chowdhury, M., 235
van den Berg, R., 125-6, 326, 331, 334-5, Clements, G. N., 84, 150, 155, 158-9, 178,
359-61, 363-4, 366-7, 379, 384-5, 388- 183-7, 192, 198, 230, 262, 285, 343, 370,
91, 394 381,389
Bernstein, J., 210 Cohen, A., 322, 331, 339
Bernstein, N. A., 14 Cohen, J., 63
Bertsch, W. F., 168 Cohen, P., 63
Bever, T. G., 293 Coleman, J., 192
Bhatia, T. K., 311 Collier, R., 322, 331
Bickley, C , 304 Cooper, A., 121
Blank, M. A., 292-3 Cooper, W. E., 213, 333-5
Bloomfield, L., 313 Costa, P. J., 280
Blumstein, S. E., 136 Crawford, J., 176


Crompton, A., 294 Gernsbacher, M. A., 292


Crystal, D., 325 Gimson, A. C, 143, 203-5, 219
Cutler, A., 2, 5, 180, 213, 290, 293-5 Gobl, C., 91, 95
Goldsmith, J., 150, 166, 184, 192
Dalby, J. M., 203 Goldstein, L., 2-3, 9, 14, 16, 18, 20-7, 30,
Daniloff, R., 136, 141 35, 42, 44, 56-67, 69-70, 87, 116, 120,
Dauer, R. M., 399-400, 418 122, 128, 136, 165, 179, 181, 190, 194,
Delattre, P., 172 199-200,225,288,314
Dev, A. T., 239 Gracco, V., 122
Diehl, R., 65 Greene, P. H., 14
Dinnsen, D. A., 274 Gronnum, N. (formerly Thorsen), 328, 334,
Dixit, R. P., 300 346, 359, 365, 367, 388-9
Dobson, E. J., 169 Grosjean, F., 236
Docherty, G. J., 279 Grossberg, S., 70
van Dommelen, W. A., 142 Gunther, H., 257
Dowty, D. R., 194 Gussenhoven, C, 125-6, 326, 331, 334-5,
Drachman, B., 399-402, 406, 409, 414, 416, 338, 339, 341, 347, 359-61, 363-4, 366-
418 7, 379, 384-5, 388-91, 394
Dunn, M., 260
Hakoda, K., 380
Edman, T. R., 173 Halle, M., 116, 118, 149, 151-2, 159, 167,
Edwards, J., 2, 3, 68, 87-9, 92, 111, 114, 120, 183, 195-6, 199, 216, 227, 279, 296-8,
122-3, 125-6 398
Engebretson, A. M., 395 Hammarberg, R. E., 141
Erikson, D., 178, 395 Han, M. S., 380
Hankamer, J., 249, 252, 278
Fant, G., 93, 100, 118, 144, 170, 175, 199, Hanson, R., 122
227, 279 Hardcastle, W. J., 67, 264
Feder, D., 256 Harris, K. S., 26, 141, 261
Ferguson, C. A., 235 Harshman, R., 35
Feth, L., 136 't Hart, J., 322, 325, 331-3, 339, 360
Firth, J. R., 142, 192, 225, 261 Harwood, D. A., 292-3
Flanagan, J. L., 144 Haskins Laboratories, 3, 68-9, 122, 288
Fletcher, J., 2-3, 5, 68, 87-9, 92, 111, 114, Hawking, S. W., 188
120, 122-3, 125-6 Hawkins S., 3, 5, 56, 59, 69
Folkins, J. W., 11 Hayes, B., 4, 151, 230, 280, 327, 398
Foss, D. J., 292-3 Hebb, D. O., 10
Fourakis, M., 279, 399-^00, 418 Henderson, J. B., 260
Fowler, C. A., 14, 18, 26, 56, 69, 201, 225, Henke, W. L., 141
278 Hewlett, N., 128-9, 138^1, 143-5
Francis, W. N., 239 Hiki, S., 395
Franke, F., 66 Hirose, K., 171, 348, 372, 380, 395^6
Frauenfelder, U., 293 Hirschberg, J., 334, 388-9
Frederiksen, J. R., 256 Hombert, J.-M., 5, 180
Fromkin, V., 180, 292 Honda, K., 395
Fry, D. B., 332 Hooper, J. B., 228
Fudge, E. C, 197 Houlihan, K., 316
Fujimura, O., 2-3, 21, 31, 87, 89, 117-9, 174, House, J., 5
176, 369, 395 Huggins, A. W. F., 294
Fujisaki, H., 348, 368, 372, 379 van der Hulst, H., 332
Hunnicutt, M. S., 116
Gamkrelidze, T. V., 176 Huss, V., 333
Garding, E., 331, 348, 366
Gay, T., 64, 68, 201
Gazdar, G., 216 Isard, S. D., 5

Iverson, G. K., 317 Lass, R., 155, 160, 198,203,219
Lea, W. A., 116
Jackendoff, R., 393 Leben, W. R., 166, 185
Jaeger, J. J., 185 Lehiste, L, 89, 172
Jakobson, R., 118, 199, 227, 279 Liberman, A. M., 26, 64, 83-^, 90, 178,
Jesperson, O., 66 398
Johnson, C. D., 191 Liberman, M. Y., 326, 332, 335, 347, 354^5,
Jones, D., 204 389, 391-3, 418
Joseph, B. D., 399^00, 414 Licklider, J. C. R., 168
Lieberman, P., 129, 327
Kahn, D., 122 Lindau, M., 31, 193
Kajiyama, M., 170, 175 Lindblom, B., 5, 65, 129, 143, 167-8, 286
Kakita, Y., 395 Lindgren, R., 129
Kawai, H., 372 Lindsey, G., 328
Kawasaki, H., 167-9, 182, 256 Linell, P., 191
Kay, B. A., 72 Linn, Q., 100
Kaye, J., 193, 196 Local, J., 4, 142, 190, 196, 204, 210, 213,
Keating, P., 26, 59, 123, 283 216, 224^8
Keller, E., 60, 69 Lodge, K. R., 205
Kelly, J., 5, 142, 190, 204, 210, 213, 226 Lofqvist, A., 171
Kelso, J. A. S., 10-11, 13-14, 18, 20, 27-8, Lorentz, J., 170, 176
60, 65, 68-70, 123 Lowenstamm, J., 196
Kenstowicz, M., 150 Lubker, J., 168
Kent, R. D., 136 Lyons, J., 142
Kerswill, P., 264, 267-8, 270, 278
Kewley-Port, D., 194 McCarthy, J. J., 150, 155, 159, 178, 187, 192,
Keyser, S. J., 150 230
King, M., 191 Macchi, M., 87, 123
Kingston, J., 3, 60, 65, 121, 177 McCroskey, R. L., 169
Kiparsky, P., 236 McGowan, R. S., 129
Kiritani, S., 118 MacKay, D. G., 294
Klatt, D. H., 68, 116, 180, 190, 194-5 McNeil, D., 181
Klein, E., 200 MacNeilage, P. F., 14, 122
Kluender, K., 65 Maddieson, I., 176, 282
Kohler, K. J., 4, 142-3, 224, 225 Maeda, S., 326, 335, 388
Koreman, J., 230 Magen, H., 27, 56
Koutsoudas, A., 191 Malikouti-Drachman, A., 399-402, 406, 409,
Kozhevnikov, V. A., 141 414,416,418
Krakow, R. A., 181, 258 Mandelbrot, B., 173
Krull, D., 129 Marslen-Wilson, W. D., 2, 229, 231, 233,
Kubuzono, H., 125-6, 331, 345, 368-9, 372- 237-8, 255-7
3, 379-80, 382, 391-4 Mascaro, J., 5
Kucera, H., 239 Massaro, D. W., 294
Kuno, S., 369 Mattingly, I. G., 26, 280
Max-Planck Speech Laboratory, 239
de Lacerda, A., 139-40 Mehler, J., 293-4
Ladd, D. R., 94, 321, 325-6, 333^6, 342, Mehrota, R. C , 299
343, 346-9, 355, 385-6, 388, 392 Menn, L., 326
Ladefoged, P., 31, 35-6, 93, 137, 141, 159, Menzerath, P., 139-40
191, 193-5, 282, 296-8, 304, 311-12, Mermelstein, P., 27
314 Miller, G. A., 168
Lahiri, A., 2, 229-30, 249, 252, 255-7, 274, Miller, J. E., 31
327 Mills, C. B., 293
Lapointe, S. G., 191 Mohanan, K. P., 159, 200, 203, 205
Lashley, K. S., 10 Monsen, R. B., 395

Munhall, K., 9, 13-15, 17, 19, 23, 30, 68-70, Rossi, M., 4
116, 121 Roudet, L., 66
Rubin, P., 27
Nakatani, L., 333
Nathan, G. S., 205 Sag, I., 217
Neary, T. M., 35 Sagart, L., 395
Nelson, W., 60 Sagey, E., 153, 160, 187, 198, 282
Nespor, M., 94, 125, 334, 390, 399-403, 405, Saltzman, E., 9-10, 13-15, 17, 19-20, 23, 27-
414,418 8, 30, 45, 60, 65, 69, 116, 122-3, 288
Nittrouer, S., 71, 129, 136 Samuel, A. G., 294
Nolan, F., 2, 4, 261, 267, 280-8 Sanders, A. G., 191
Noll, C , 191 Sato, H., 372, 380
Norris, D. G., 293 De Saussure, F., 227
Savin, H. B., 293
Ohala, J. J., 65, 137, 167-8, 170, 172, 176-9, Sawashima, M., 118, 171, 395
181-9, 225, 247, 255-6, 286-7 Schaffer, J., 333
Ohala, M., 296-8, 310-12 Schein, B., 162
Ohde, R. N., 129, 137 Scherzer, J., 294
Ohman, S., 26, 67, 173, 178, 201 Schiefer, L., 296-9, 301, 304-5, 311-18
O'Shaughnessy, D., 328 Schindler, F., 143
Ostry, D. J., 60, 68-70, 121 Scott, D., 213
Otsu, Y., 372 Segui, J., 238, 203-4
Selkirk, E., 5, 83, 94, 125, 313, 370, 372-3,
Paccia-Cooper, J-M., 213, 333-4 375, 379, 390, 393
Parush, A., 60, 69, 121 Sereno, J. A., 129
Perkell, S., 35, 67, 201 Setatos, M., 399-400, 403, 405, 414
Peters, P. S., 191 Sharf, D. J., 129, 137
Peters, S., 194 Shattuck-Hufnagel, S., 210
Peterson, G. E., 169 Shieber, S. M., 192, 197, 217
Pickering, B., 5 Shockey, L., 128, 138-41, 143-5
Pierrehumbert, J., 2-4, 64, 84, 90, 92-4, 97, Shuckers, G., 136
117-27, 193, 283, 324-7, 331-2, 334-5, Sievers, E., 66
342-3, 345, 347-8, 354-5, 368-9, 372-3, Silverman, K., 84, 92
375, 385-93, 396 Simada, Z., 395-6
Pike, K., 325 Smith, H. L., 398, 418
Plato, 180 Smith, N., 332
Pollard, C , 217 Sorensen, J., 335
Port, R., 279 Sprigg, R. K., 199
Poser, W., 368, 372-3, 385, 390, 393 Steele, S., 86, 178, 347
Prather, P., 293 Steriade, D., 153, 162
Prince, A., 230, 316, 332, 389, 398, 418 Stetson, R., 65
Pulleyblank, D., 153 Stevens, K. N., 136, 173, 175, 206-7
Pullum, G. K., 191 Strange, W., 173
Puppel, S., 398 Studdert-Kennedy, M., 129
Sudo, H., 368, 372
Rialland, A., 185 Sugito, M , 395
Recasens, D., 5, 56 Summers, W. V., 75, 143
Repp, B. R., 129 Sussman, H., 14, 122
Rietveld, T., 125-6, 326, 331, 334-5, 347, Sweet, H., 204
359-61, 363-4, 366-7, 379, 384-5, 388- Swinney, D. A., 293
91, 394
Riordan, C. J., 171, 301 Taft, M., 257
Ritchie, R. W., 191 Talkin, D., 2-3, 90, 92, 99, 117-27
Roach, P., 204, 219 Tateishi, K., 5, 370, 372-3, 379, 393
Roca, I., 398 Terken, J., 5

Thorsen, N., see Gronnum Vogten, L., 352
Thrainsson, H., 185 Voltaire, 176
Timmermann, G., 142
Touati, P., 331 Waibel, A., 116
Trager, G. L., 398, 418 Wall, R. E., 194
Trubetzkoy, N. S., 176, 193 Wang, W. S-Y., 176
Tuller, B., 13-14, 18, 65, 123 Warburton, I., 399, 414
Turnbaugh, K. R., 129 Warren, P., 233, 237-8, 257
Tyler, L. K., 236 Watson, I., 5
Wessels, J., 236
Uldall, E. T., 172 Westbury, J. R., 58
Uyeno, T., 380 Whitney, W., 162
Williams, B., 333
Vaissiere, J., 119 Wood, S., 31
Vatikiotis-Bateson, E., 86 Wright, J. T., 179, 270
Verbrugge, D. P., 173
Vergnaud, J-R., 151-2, 196 Yaeger, M., 178
Vermula, N. R., 395
Vogel, I., 2, 94, 124-5, 334, 379, 390, 399- Zimmer, K. E., 178
403, 405, 414, 418 Zue, V. W., 210

Subject index

abstract mass, see task dynamics of nasality, 234


accent, ch. 3 passim, 109, 359-67, ch. 14 passim partial, 152, 159, 267n.
kinematics of, 73, 105, 109 partial assimilation in Kolami, 153
accentual lengthening, 84, 86 perception of, 268-76
accent level rule, 372 of place, 278, 283-5; in English, 287
nuclear, 109, 388, 392 regressive assimilation in Hausa, 152
word accent, 327; in Japanese, 382 single-feature assimilation, 159
phrasal accent, 120, 327 site, 264
action theory, 14, 19 total, 159
aerodynamics, 172 association domain (AD), 126, 337-47,
aerodynamic variables, 23 349-51, 356, 358
airflow, 137, 179 assonance, 180
volume velocity of airflow, 144 autosegmental phonology, 119, 149, ch. 6
affricates, 150 passim, 166-7, 177-9, 184, 191-2, 201,
alliteration, 180 230,235,261,288
allophone separation, see coarticulation autosegmental spreading, 152
allophonic variation, 30, 90, 93, 117, 191, autosegmental tier(s), 94-5, 118, 152, 159,
272, 316 164
alveolar stops, 143 autosegments, 166, 180, 183, 206, 285
alveolar weakening, 283, 285 left-to-right mapping, 166, 177
ambisyllabicity, 197
anapest, 180 Bambara, 185-6
articulatory setting, 19, 22, 57, 66, 137 Bantu languages, 186
articulatory syllable, see coarticulation base of articulation, see articulatory setting
articulatory synthesis, 29, 268 baseline (for Fo), 326, 329-34, 396
aspiration, ch. 12 passim; see also stops, Bengali, ch. 9 passim, 214, 327, 333n.
voice-onset time boundary strength, 333-4; see also phrasing
assimilation, 4, 149-55, 158-9, 181, ch. 8 boundary tones, 322, 326-7, 339, 363-4,
passim, ch. 10 passim; see also 389
coarticulation breathy phonation, 93, 300, 303
assimilation site, 264 breathy voiced stops in Hindi, ch. 12
autosegmental representation of, 263 passim
coda and onset assimilation, 197, 213
in German, 226 casual speech, 30, 60
gradual, 264, 268-9 catathesis, see downstep
multiplanar and coplanar formulations of, cinefluorography, 35
153-4 coarticulation, 14, 23, 24, 56-8, 63-5, ch. 5

passim, 198, 202-3, 213-15, 218, 293; of syllables, 76
see also assimilation, coproduction speech timing, 68
accounts of; articulatory syllable, 141; see also length, geminates
feature spreading, 129, 141, 142, 154, Dutch, 325-6, 333n., ch. 14 passim,
158, 162 389-90
anticipatory effects, 56 dysarthic speech, 58
CV coarticulation, 129-30, 133, 138-9,
213-14 effector system, see task dynamics
domain of, 141, 203 electroglottography (EGG), 93, 97, 99, 118
in connected speech, 135 electromyography (EMG), 114, 267
measures of; allophone separation, 129, electropalatography (EPG), 206, 207, 209,
133, 136, 140; target-locus relationship, 213n., 263-73
129, 135, 140 emphasis, 348
VV coarticulation, 178 English, 10, 35, 204, 207, 237-9, 258, 278,
cocatenation, 201-2, ch. % passim 280-1, 283, 325, 328, 333n. 336, 338n.
coda, see syllable 387, 389-91, 398, 408
cohort model of spoken word recognition, English vowels, 35
229-33, 238n. Middle English, 169
compensation, see immediate compensation nasalized vowels in, 238-9, 246-8, 255-6
compositionality of phonological features, Old English, 169
199 epenthesis, 151, 153, 191, 224, 279
compound accent rules, 372 in Palestinian, 151
connected speech processes, 135, 142, 264 Ewe, 370n.
consonant harmony, 162-3 exponency, see phonetic exponency
contour segment, 150; see also segment
coordination, temporal, 13, 56-8, 169, 171— F o , see intonation, pitch accent, pitch range
2, 174, 176, 177, 181, 184; see also task features, 117-8, ch. 6 passim, 178-80, 196,
dynamics 199, 203, 231, 262, 281-3, 294, 297-8,
coproduction, 24, 26, 55, 57, 63, 67, 201, 311-18, 394
278; see also coarticulation articulator-bound, 159
CV-skeleton (CV-core, CV-tier), 150-2, 158, category-valued, 156
166-7, 177, 183, 230; see also emergence of, 93
autosegmental phonology interdependencies among, 178
intrinsic content of, 177
damping, 15-18, 43, 89; see also task n-ary valued, 156,316,318
dynamics natural classes of, 155, 161, 187, 296, 313
critical, 15, 18-20, 64, 87 phonetic corelates of, 93, 174—6, 184, 194-7
Danish, 345, 361, 366 suprasegmental, 189
declarative phonology, 198, 202, ch. % passim feature asynchrony, 186, 187, 189
declination, 117, 329-31, 346; see also feature dependence, 155
downstep feature geometry, ch. 6 passim, 178-9,
deletion, 202, 224 187
demisyllable, 294 feature specification theory, 230
dependency phonology, 191 feature spreading, see coarticulation
devoicing, 142 feature structures, 197, 200, 214-19, 224
diphone, 294 fiberscope, 118
displacement, see task dynamics final lengthening, 77-9, 83, 84, 89, 110, 116;
distinctive features, see features see also phrasing
downstep, 95, 331, ch. 14 passim, ch. 15 kinematics of, 77
passim; see also declination final lowering, 116, 326n, 354, 364, 389; see
accentual, 335, 343-7, 351-2, 358, 366-7 also phrasing
in Japanese, ch. 15 passim Firthian phonology, 184, 190, 192, 193-4,
phrasal, 335, 345-7, 350, 356-8, 366-7 198, 218-9
duration focus, pragmatic, 390-2
as cue to stress, 333 foot, 142, 191

French, 27, 294 integer coindexing, 218
fricatives, 129 intonation, 92, 116, ch. 13 passim, 327,
spectra of, 129 394
voiceless, 171 effects on source characteristics, 92
fundamental frequency, (Fo), see intonation, of Dutch, ch. 14 passim
pitch accent, pitch range notation of, 326, ch. 14 passim
intonational boundary, 109; see also
gating task, 236, 237-9, 240, 242, 244^5, boundary tones
250-1, 253, 257-8 intonational locus, ch. 14 passim
hysteresis effects in, 237n., 256 inverse filtering, 93, 95
geminate consonants, 150-1, 153, ch. 9 Italian, 287, 370n.
passim
geminate integrity, 151 Japanese, 324, 331, 333n., 345, ch. 15 passim
generalized phrase-structure grammar jaw, see mandible
(GPSG), 156
generative phonology, 155; see also SPE Kabardian, 283n.
German, 226, 326, 336 Kikuyu, 185
GEST, 29f., 288 Kolami, 153
gesture, ch. 1 passim, 27, 7If., 288, 298; see
also task dynamics Latin, 287
opening, 73-9 length (feature), ch. 9 passim
closing, 73-9 lexical-functional grammar (LFG), 156
relation between observed and predicted lexical items, mental representation of, 229-
duration of, 79-81 31, ch. 9 passim, 255, 291
gestural amplitude, 68-70 lexical stress, see stress
gestural blending, 19, 30 long vowels, 150
gestural duration, 88-9, 120, 124 Ludikya, 186; see also word games
gestural magnitude, 105, 109, 111, 114, Luganda, 185
116, 120, 124, 169
gestural overlap, 14, 22, 28, 289; see also Maithili, 298
coarticulation Mandarin, 395
gestural phasing, 17, 18, 21, 44, 70, 72, 76 mandible, 10-13, 87-8, 114, 122-3, 140, 144
gestural score, 14, 17, 22, 27-9, 30f, 44-6, durations of movements, 88-9
52-5, 57, 60, 64, 165 Marathi, 316
gestural stiffness, 11, 16, 17, 19, 44, 68-70, metrical boost (MB), 379-86
76; lower limit on, 80-2 metrical phonology, 180, 257, 331, ch. 16
gestural structure, 28 passim
gestural truncation, 70-2 major phrase, 373, 379, 383-5, 392; see also
glottal configurations in [h] and [?], 93-4 phrasing
glottal cycle, 93-4, 98, 100 formation of, 372
graph theory, 157, 196-7, 210 mass-spring system, see task dynamics
Greek, 177, ch. 16 passim minor phrase, 377, 392; see also phrasing
Gujarati, 309n. formation of, 372
mora, 86, 324
Hausa, 387-8 stability of, 186
Hindi: stops in, 296, ch. 12 passim; see also morphology and word recognition, 257
stops morphophonologization, 274
motor equivalence, see task dynamics
iamb, 180 movement trajectories, see task dynamics
icebergs, 89, 174n. movement velocity, see task dynamics
Icelandic, 185, 187
Igbo,185-6 n-Retroflexion, 150, 162, 163
immediate compensation, 10, 13 nasal prosodies, 178
initial lengthening, 116 nasalization, 178n., 179, 182, 243, 247n.; see
insertion, see epenthesis also assimilation, nasality

nasality, 234 prosody, 91, 180, ch. 13 passim, 399-400
in English and Bengali, ch. 9 passim prosodic effects across gesture type, 120-2
Nati, see n-Retroflexion prosodic structure, ch. 3 passim, 331-2,
Nepali, 309n. 338, 369, 394; effect on segmental
neutralization, 213, 231, 274 production, 112
neutral attractor, see task dynamics phrasal vs. word prosody, 91, 92, 94-7,
neutral vowel, 65; see also schwa 114-6, 121
nuclear tone, 325 Proto-Indo-European, 286
nucleus, see syllable psycholinguistics, 2, 229, 254, 293-5
Punjabi, 297
object, see task dynamics
obligatory contour principle, 166; see also rhythm, 398-402; see also stress
autosegmental phonology realization rules, 124, 215-16, 386
onset, see syllable redundancy, 173, ch. 9 passim
optoelectronic tracking, 72 register shift, 372, 380, 384-6; see also
overlap, 28, 30, 56, 57, 141, 289; see also downstep, upstep
coarticulation, task dynamics relaxation target, see task dynamics
reset, 335, 343-7, 349, 366, 380, 384f.; see
perturbation experiments, 11 also phrasing, pitch range, register shift
phase, 72, 87, 3 1 3 ^ as local boost, 343
phase space, see task dynamics as register shift, 343-7
phoneme monitoring task, 292 retroflex sounds, 162; see also n-Retroflexion
phonetic exponency, 194, 202-3, 218-19, 225 rewrite rules, 226
phonetic representations, 118, 193-4, 224-5, rhyme (rime), see syllable
278, 315, 388 rule interaction, 191
phonological domains, consistency of, 126 rule ordering, 191
phonological representations, 118, 149, 200, Russian, 278
225, 285, 315-8, 388, 394; see also
autosegmental phonology, Firthian Sanskrit, 317
phonology, metrical phonology schwa, 26, ch. 2 passim
abstractness of, 93, 229-30 epenthetic, 53
monostratal, 193 in English, ch. 2 passim
phonology-phonetics interface, 3, 24, 58, in French, 27
124-7, 193, 195, 205, 224-8, 394 intrusive, 59
phrase-final lengthening, see final lengthening simulations of, 51, 52, 65
phrase-initial lengthening, see initial targetless, ch. 2 passim, 66
lengthening unspecified, 52, 53
phrasing, 73, 337, 366 Scots, 160
effects on /h/, 109-14 secondary articulation, 179
effects on gestural magnitude, 111-4 secondary stress, ch. 16 passim
phrase boundary effects, 110-14, 336-8, segment, 4, 128, 141, ch. 6 passim, ch. 7
363; see also phrasal vs. word prosody passim, 198, 201, ch. 10 passim, ch. 11
Pig Latin, 292; see also word games passim
pitch accent, 322-6, 332, ch. 14 passim boundaries, 181
361-7, 387-90; see also accent contour segments, 150, 160, 282, 285
bitonal, 326 complex segments, 150, 160, 283, 285
pitch change, 360 segmental categories, 207
pitch range, 330, 348, 352, 356, 358, 387, hierarchical view of segment structure, 161
392-7; see also register shift, declination relation to prosodic structure, 94, 117-8
point attractor, see task dynamics spatial and temporal targets in, 3
prenasalized stops, 150 steady-state segments, 172-6
prevoiced stops, ch. 12 passim segment transitions, 172—4, 176, 181
primary stress, 407-9; see also stress, Semitic verb morphology, 186
secondary stress shared feature convention, 166
prominence, 348, 387, 392-3 skeletal slots, 154; see also CV-skeleton

skeleton, see CV-skeleton targets, phonetic, ch. 2 passim, ch. 3 passim,
slips of the tongue, see speech errors 322, 325
sonority, 68, 69, 83-5, 123-4 targets, tonal, ch. 14 passim
sonority space, 84, 86 scaling of, 347-51
sound change, 182, 286-7 task dynamics, 3, ch. 1 passim, ch. 2 passim,
The Sound Pattern of English (SPE), 116, 68-72; see also gesture, overlap
149, 155, 191-2, 195, 201, 296-7 abstract mass, 20
speech errors, 180, 292, 294 articulator weightings, 17
speech rate 23, 65, 136, 264, 268-9 articulator motions, 46
speech recognition, 294 articulator space, 12
speech style, 23, 129, 137, 144-5, 267, 362; body space, 12
see also casual speech coordination in, 23, 65
speech synthesis, 29, 101, 102, 190, 192, 225, displacement, 16, 70, 72-3, 76, 80, 122
268 effector system, 12, 13
speech technology, 116 evidence from acquisition of skilled
stiffness, see task dynamics movements, 22, 24, 143
stops, ch. 5 passim, 171, 175-6, ch. 12 passim mass-spring system, 15, 17, 19, 87-9
bursts, 131-3, 137, 140, 144, 172, 176 motor equivalence, 10
preaspirated stops in Icelandic, 185, 187 movement trajectories, 10, 16
stress, 4, 120-3, 151, 191, 392, ch. 16 passim; movement velocity, 70; lower limit on, 82
see also accent neutral attractor, 19, 22
accent, 332 object, 11
contrastive, 91 oscillation in task dynamics, 87
effects of phrasal stress on /h/, 103-9 phase space, 18
in English, 332 point attractor, 15
in Greek, 4, ch. 16 passim relaxation target, 66
lexical stress, 399-402, 405-6, 417-19 stiffness, 16, 17, 87
rhythmic stress, 416, 418-19; see also task dynamic model, 9, 27, 57, 69
rhythm task space, 12
word stress, 120-1 tract variables, ch. 1 passim, 23, 27, 28f.,
sub-lexical representations, 291, 295 122-3, 124; passive, 29
Swedish, 325, 328, 400 tempo, 68, 79-84; see also speech rate, final
syllabification, 294 lengthening
syllable, 73-5, 84, 90, 109, 120-3, 142-3, temporal modulation, 89
181, 191, 194, 293-4, 304, 324, 332-3, temporal overlap, see overlap
360-3, 399-401, 408-13, 417 timing slots, 166, 177; see also CV-skeleton
accented, 73-7, 84-6, 94, 324, 337-8, 341, tip of the tongue (TOT) recall, 180-1
359-60 Tokyo X-ray archive, 31
coda, 199, 213, 216-20, 228 tonal space, 329-31, 334
deaccented, 94 tone(s), 84, 116, 150-1, 185, 282, 326-9, ch.
durations of, 76, 77, 79f, 80-2 14 passim, ch. 15 passim; see also
nucleus, 204, 213, 214 downstep, upstep
onset, 119, 191, 200, 213, 216-18 as a grammatical function, 186
prosodic organization of, 83 boundary tones, 322, 326-7, 339, 363-4,
rhyme (rime), 180, 200, 214 389
unaccented, 77 in Igbo, 185, 186
syllable target, 294 in Kikuyu, 185
syntax, effects on downstep, 369, 379 and prosodic environment, 90, 91, 94
syntax-phonology interaction, 369, ch. 15 tone scaling, 84, 347-51, 395-7
passim; see also phrasing tone spreading, 178, 336, 341
of English monosyllables, 214 starred and unstarred, 326, 336
validation using synthetic signals, 101, 102 trace model of speech perception, 233n.
tract variables, see task dynamics
Tamil, 333n. transformational generative phonology
target parameter, 29, 139 (TGP), see SPE

transillumination, 121 voice-onset time (VOT), 121-2, 131, 279,
trills, 179 296-8,311-12
trochee, 180 voicing, strategies for maintaining, 58
tunes, 324; see also CV-skeleton vowel centralization, 129-30, 133, 135, 137;
see also schwa, neutral vowel
underspecification, 26, 28, 54, 119, 200, 216- vowel harmony, 142, 152, 154, 178, 187
19, 198, 228, 255, 257 vowel space, 129, 169
unification grammar (UG), 156, 192
upstep, 348, 379, 382, 385 Welsh, 333
within-speaker variability, 290
variation word games, 180, 181, 294
interspeaker, 68, 119
intraspeaker, 290 X-ray micro-beam, 32-40, 46-50, 62-3, 267
situation-specific, 290
contextual variation of phonetic units, 55 Yoruba, 326

