Chapter 2. Acoustic Theory of Speech Production

2.1 Introduction
Speech is an acoustic signal with communicative intent, having structure at many
levels, e.g. syllables, words, phrases and sentences. Current speech recognition
equipment is mainly concerned with decoding the words (although the technology is now
in transition to a higher level of understanding), but it should not be forgotten that other
kinds of information are also carried. These may convey the speaker's emotional or
physical state, mark grammatical structure, or help to control the dialogue between two
or more speakers.
A complete model of speech production must account for all processes occurring
from the inception of the intent to communicate, to the final acoustic signal. The science
of Phonetics has concentrated on the processes between articulation and the acoustic
signal, but the preceding neurological stages have not been neglected (Laver, 1979;
Abbs, 1996). Figure 2-1 shows the processes occurring at the most abstract level
(adapted from Laver, 1979). Feedback loops exist at many levels within this system, but
are not shown here, in order to keep the diagram simple. As this work is concerned with
the effects of stress on speech production, Figure 2-1 includes an indication of the levels
at which various kinds of stressors affect the process (based on Murray, Baber and
South, 1996). Stressors are classified according to the mechanism of their main effect.
Thus the psychological stressors, e.g. fear, act at the highest levels of the process
through the conscious mind and may affect ideation and linguistic programming. At the
next level, the speaker perceives a problem with communication, such as a noisy
environment, and alters the articulatory targets to compensate.
Physiological stressors do not affect any mental processes directly, but alter the way in
which the firing patterns of the motor neurons are translated into forces by the muscles.
Figure 2-1 Outline of the overall process of speech production. [Diagram: the stages Ideation, Linguistic Programming, Generation of Articulatory Targets, Generation of Neuromuscular Commands, Muscular Actions, Articulator Movements and Speech; inputs labelled Consciousness, Emotional state and para-linguistics; auditory and proprioceptive feedback; and the levels affected by 3rd order (psychological), 2nd order (perceptual), 1st order (physiological) and 0th order (physical) stressors.]
Such stressors would include drugs or fatigue, for example. At the physical level, the
stressors are forces acting on the articulators but the forces generated by the muscles are
assumed to be unaffected.
In reality, it is not possible to separate the effects of stressors so neatly. A
physical stressor such as vibration may induce a degree of fear and also fatigue if
exposure continues over a long period of time. Nevertheless, it seems useful to base this
classification on the primary effect of the stressors.
In this work, we are primarily concerned with changes occurring at the
articulatory and acoustic levels, as a result of the speaker being subjected to physical
stresses. While the speaker will attempt to compensate for the effects of such stresses
(given that he is able to monitor his speech production), it is apparent that such attempts
are not always completely successful, resulting in speech that is different in some ways
from normal speech. The nature and extent of the changes vary considerably from
person to person but some general patterns have been observed in studies of the effects
of physical stresses (Bond, Moore, Anderson, 1986; Taylor and Leeks, 1989). It may
also be expected that variability will be less with a highly selected, trained, and motivated
set of subjects, such as the RAF personnel who were the speakers for some of the
recordings studied in this work.
Speech is produced by a combination of one or more sources of acoustic energy
and resonant cavities formed by the oral and nasal tracts. The excitation may be supplied
by a periodic sound produced by the vocal folds, or by quasi-random noise produced by
turbulence occurring when air is forced through a constriction in the vocal tract. Both
kinds of excitation may be used together. The sound produced is then modified by the
resonances of the vocal tract, and sometimes the nasal cavity, and subsequently radiated
from the mouth or nostrils, which also affects its spectral characteristics. Different
sounds are produced by moving the articulators, i.e. the tongue, lips, velum, and to a
lesser extent the jaw. This is generally known as the Source-Filter Model of speech
production (Fant, 1970).
The information is carried largely in spectral and temporal patterns which change
on a time scale of between 10 ms and 100 ms. At the most basic level, it may be
considered that certain spectral (and temporal) patterns correspond to linguistic units
such as phonemes, syllables or words. However, the history of speech recognition
technology has shown that this is a gross over-simplification (Ainsworth, 1988, p41-53;
Markowitz, 1996, p11-17): the correspondence between a given spectral pattern and a
certain phoneme may be very weak, and highly dependent on context. In fluent speech,
an articulator may not reach its target position for one phoneme before beginning to
move towards that for the next phoneme. In addition, the movements of the different
articulators are not always exactly synchronised with each other or with voicing, so the
various acoustic features of adjacent phonemes may overlap.
2.2 Physiology of the Vocal System
Some knowledge of the various physiological systems involved is necessary to
understand speech production. The three major systems of the body involved are the
respiratory system, the larynx, and the articulatory system. The respiratory system,
which consists of the lungs, chest wall and diaphragm (Figure 2-2), provides the airflow
and aerodynamic energy; the larynx generates periodic acoustic energy and is used for
some other functions, such as the articulation of the glottal stop, and the articulatory
system controls the overall spectral envelope and thus the sound quality. Non-periodic
sound energy may also be generated within the articulatory system. Most phonetics
textbooks contain descriptions of the physiology of speech production at some level; what is
presented here is intended to be of a level adequate for an understanding of the possible
effects of physical stresses such as g-force and pressure breathing. Detailed descriptions
of respiratory and laryngeal function can be found in (Stevens, 1998, Chapter 1).
2.2.1 The Respiratory System
The flow of air into and out of the lungs is controlled mainly by the intercostal
muscles, which connect the ribs together (Ladefoged, 1967). The external intercostals
are connected so as to lift the rib cage outwards when they are activated, thus expanding
the lungs. The internal intercostals act in the opposite way, pulling the rib cage down
and reducing lung volume. If neither set of muscles is activated (nor any others
associated with lung function), the rib cage will settle at its equilibrium point, determined
by elastic forces.
Figure 2-2 Major parts of the Vocal System
Air is taken into the lungs by a combination of expansion of the chest wall and
lowering of the diaphragm, thus increasing the volume of the lungs. The consequent
reduction in pressure causes air to flow in through the nose or mouth and trachea.
Expiration is achieved mainly by the elastic contraction of the chest wall and lung,
reducing lung volume and raising the internal pressure above that of the atmosphere. If
expiration needs to continue past the equilibrium point (about 40% of total lung
capacity), muscles (initially the internal intercostals) will be brought into play to continue
the compression of the lungs.
The production of speech requires a constant sub-glottal pressure of 6-10 cm H₂O
to be maintained against the aerodynamic resistance of the larynx and/or vocal tract.
The rate of air flow is less than occurs in normal breathing, so muscular activity must be
involved at lung volumes above the equilibrium point to resist the normal elastic forces.
As the intensity and pitch of voiced sounds are strongly dependent on sub-glottal pressure,
there must be constant and very precise adjustments of muscle tensions to control
prosodic aspects of the speech produced.
2.2.2 The Larynx
The larynx is a complex structure of cartilage, ligament and muscle, situated at
the top of the trachea. A very detailed description of its structure and function may be
found in (Orlikoff and Kahane, 1996). It may have originated as a simple sphincter to
close off the trachea to prevent food or water from entering the lungs, but has evolved
into a highly specialised organ of speech. The vocal folds, consisting of muscle and
ligament, stretch across the opening of the larynx and are capable of closing it off
completely by being brought together (adduction), or of being opened (abduction)
sufficiently to allow relatively unrestricted airflow. In addition there is a continuum of
states between being tightly adducted and wide open; during production of voiced
speech, the folds are adducted lightly. In this state, the increased air pressure in the
lungs is capable of forcing the vocal folds apart. The reduction in pressure in the space
between the vocal folds (which is called the glottis) due to the Bernoulli effect, combined
with their elasticity and muscle tension, causes them to be drawn together again. The
combination of aerodynamic and mechanical effects causes a regular opening and closing
of the vocal folds, allowing periodic puffs of air to enter the vocal tract. This is the
source of acoustic energy in voiced speech.
The frequency of oscillation of the vocal folds is dependent on the mass of the
tissues involved, the tension in the controlling muscles, and, to a lesser extent, the sub-
glottal air pressure. Males, having more massive vocal folds, have lower pitched voices
than females, but the muscular control is such that individuals may be able to phonate
over a range of nearly three octaves. In normal speech, the degree of adduction of and
muscle tension in the vocal folds are constantly changing to differentiate voiced sounds
from unvoiced, and to control the prosodic features. In addition, an intermediate degree
of abduction is used to produce the glottal fricative /h/.
2.2.3 The Articulatory System
The articulatory system consists of the tongue, lips, velum, jaw, teeth and palate
(Figure 2-3). The palate and upper teeth are sometimes referred to as passive
articulators (Laver, 1994, p136) because they do not move, but provide fixed surfaces
against which the active articulators (the tongue, lips, jaw and velum) operate. These
control the shape of the oral cavity, and whether the nasal cavity is connected to it, and
hence control the degree to which the various frequencies present in the excitation signal
are amplified or attenuated. In the case of vowels, the articulatory system only modifies
the sound produced by the vocal folds, but in certain consonants, notably the fricatives in
English, it also supplies the excitation. Some other languages include non-pulmonic
sounds such as clicks, in which the airflow is not generated by the respiratory system, but
within the articulatory system.
Figure 2-3 Components of the Vocal Tract (Adapted from Holmes, 1988)
The dominant factors in controlling the frequency response of the vocal tract are
the position and degree of the main constriction (Fant, 1970; Kent and Read, 1992). The
tongue is an extremely flexible organ and plays the most important part in controlling the
shape of the oral cavity and hence its acoustic frequency response. In addition, it may be
used to close off the cavity momentarily for some stop consonants, or for longer periods
for nasal consonants.
The velum, or soft palate, has a relatively simple function, that of controlling the
aperture between the oral and nasal cavities, which may be open to varying degrees or
completely closed.
The lips can be used to close the oral cavity completely, to produce a narrow
constriction causing turbulent airflow for certain consonants, or, by protruding forwards,
alter the effective length of the vocal tract.
The jaw is the carrier of the tongue, lower lip and lower teeth, and its
movements during speech are largely to extend the range of movements of the tongue
and lips (Laver, 1994, p124).
2.3 Voice Source
The excitation, or source of the sound energy for the speech signal, may take two
forms, periodic or aperiodic (although speech sounds may include both). In the former
case, the sound energy is generated by periodic vibrations of the vocal folds as explained
above, and has a fundamental frequency and an identifiable pitch. The progressive
opening of the vocal folds to a maximum and then closing again results in a pulse of air
entering the pharynx. Figure 2-4(a) shows a typical glottal pulse waveform in terms of
volume velocity against time. The pulse has a relatively slow rise at the beginning, but a
steep trailing edge and abrupt finish. The spectrum of a repeated sequence of such
pulses is shown in Figure 2-4(b). The amplitude of the harmonics decreases at a rate of
around 12 dB per octave, although the exact roll-off is dependent on the shape of the
glottal pulse and varies from person to person and with supra-segmental aspects of
speech.
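As a rough illustration of the roll-off described above, the following Python sketch lists the level of each harmonic relative to the fundamental for an idealised source whose spectrum falls at exactly 12 dB per octave. The function name and the 100 Hz fundamental are illustrative assumptions; as noted, the roll-off of a real glottal source varies with pulse shape and speaker.

```python
import math


def harmonic_levels_db(f0_hz, num_harmonics, rolloff_db_per_octave=12.0):
    """Return (frequency, level) pairs for an idealised glottal source.

    Levels are in dB relative to the fundamental and fall by a fixed
    number of dB for every doubling of frequency (an idealisation).
    """
    levels = []
    for n in range(1, num_harmonics + 1):
        octaves_above_f0 = math.log2(n)  # harmonic n lies log2(n) octaves above F0
        levels.append((n * f0_hz, -rolloff_db_per_octave * octaves_above_f0))
    return levels


# Hypothetical example: a 100 Hz fundamental and its first eight harmonics.
for freq, level in harmonic_levels_db(100.0, 8):
    print(f"{freq:7.1f} Hz  {level:7.2f} dB")
```

Under this idealisation the second harmonic lies 12 dB below the fundamental and the fourth harmonic 24 dB below it.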
Figure 2-4 (a) Waveform of the Glottal pulse (Ainsworth 1988)
Aperiodic excitation is generated by constricting the vocal tract to such an extent
that the airflow ceases to be laminar and becomes turbulent. This generates random
noise the spectrum of which is determined by the exact mechanism of turbulence
production. In the simplest cases, a jet of air emerges from the constriction and becomes
turbulent as it mixes with the surrounding air; in other cases, such as /s/ and /S/, it has
been argued that the sound is generated at the point where the jet of air strikes a rigid
surface approximately normal to its velocity (Shadle and Scully, 1995). In either case,
the sound generated will be modified by the resonances of those parts of the vocal tract
anterior to the constriction.
2.4 Filter Models.
The envelope of the speech spectrum is controlled by the shape of the vocal tract,
which amplifies or attenuates the various frequencies produced by the sound source.
The vocal tract is a non-uniform tube, generally about 17 cm long in the adult male (14
cm in the adult female) from the larynx to the lips. The tube is regarded as closed at the
larynx and open at the lips. The cross-sectional area may vary from 10 cm² to zero and
is controlled mainly by movements of the tongue. The shape of the vocal tract is
complex but quite reasonable results can be achieved by approximating it as a small
number of uniform tubelets connected together. Analytical solutions can be obtained for
up to nine tubelets in theory, but the practical limit is four or five (Schoentgen and
Ciocea, 1994). For more accurate representations, numerical methods must be used, but
the power of computers is such that models using 40 or more tubes are now
commonplace (see, for example, Story, Titze, and Hoffman, 1996, 1998).
Several standard texts cover the mathematical analysis of the vocal tract response
in great detail (e.g. Fant, 1970; Flanagan, 1972; Deller, Proakis, and Hansen 1993;
Stevens, 1998). The following sections illustrate the principles involved, using a model
consisting of only two uniform tubes, with suitable terminations to represent the closed
glottis and lip radiation.
2.4.1 Uniform Tube.
It is usual for analyses of vocal tract response to begin by considering a uniform
tube, which is considered to be an approximation of the configuration of the vocal tract
when producing the neutral vowel, or schwa, /@/, such as occurs in the first syllable of
ago. Recent work using MRI data (Story and Titze, 1998) suggests that the actual
shape of the vocal tract for the neutral vowel is significantly different from being a
uniform tube, but produces formants at almost the same frequencies. Nevertheless, a
single uniform tube is still the simplest case and therefore the best point at which to start
the analysis.
An analytical solution of plane wave propagation in a straight tube of uniform
cross section may be obtained by analogy with electrical transmission lines. The voltage
in the electrical circuit is analogous to pressure in the acoustic system, and current
corresponds to volume velocity. This electrical analogy is valid only if the lateral
dimensions of the tube are small compared with the wavelength of the sound, so that the
sound can be considered to propagate as a plane wave (Flanagan, 1972). This is a
reasonable approximation up to about 5 kHz, which, while not the full bandwidth of
speech, is enough for practical purposes. Figure 2-5 shows the cross section of a
uniform tube of length L and cross-sectional area A, and its T-section equivalent circuit.
The impedances in the equivalent circuit are defined by:
$$Z_a = Z_0 \tanh\left(\frac{\gamma L}{2}\right), \qquad Z_b = Z_0 \,\mathrm{cosech}(\gamma L)$$
where $Z_0$ is the characteristic impedance which, if losses are ignored, is equal to
$\rho c/A$, where $\rho$ is the density of the air and $c$ the velocity of sound. $\gamma$ is the propagation
constant, equal to $j\omega/c$ in the lossless case, where $\omega$ is the angular frequency and $j = \sqrt{-1}$.
In order to represent the production of the neutral vowel, terminations must be
added to the uniform tube model to represent the glottis and the lip opening. The lip
opening appears as a very low impedance, the outside volume being very much larger
than the vocal tract. The tube is considered to be closed at the laryngeal end, as the
glottis is closed for about half of the cycle and the peak value of glottal area during the
open phase is only about 15 mm² (Flanagan, 1972). During vowel production, the
smallest constriction of the vocal tract will be several times larger than this.
Figure 2-5 Uniform tube (a) and its T-section equivalent circuit (b). [Panel (a): tube of area A and length L. Panel (b): T-section with two series arms $Z_a$ and a shunt arm $Z_b$, terminal voltages $V_1$, $V_2$ and currents $I_1$, $I_2$.]
Figure 2-6 shows the tube open at one end with the excitation supplied by a solid
piston at the closed (laryngeal) end, and its equivalent circuit.
The voltages around the right hand loop must sum to zero, so, using Ohm's Law:
$$(I_2 - I_1) Z_b + I_2 Z_a = 0$$
Hence the transfer function of the tube is:
$$\frac{I_2}{I_1} = \frac{Z_b}{Z_a + Z_b}$$
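Figure 2-6 Neutral vocal tract model (a) and equivalent circuit (b). [Panel (a): tube of area A and length L. Panel (b): T-section with two series arms $Z_a$ and a shunt arm $Z_b$, input voltage $V_1$ and currents $I_1$, $I_2$.]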
Substituting the equations for $Z_a$ and $Z_b$ given above and simplifying:
$$\frac{I_2}{I_1} = \frac{1}{\cos(\omega L / c)}$$
Resonances occur when $\cos(\omega L/c) = 0$, i.e. when $\omega L/c = (2n-1)\pi/2$, or,
$$F_n = \frac{(2n-1)\,c}{4L}$$
where n = 1, 2, 3, ... and $F_n$ is the frequency in Hertz. For a tube of length L = 17 cm, the
first resonance occurs at about 500 Hz, with others at 1000 Hz intervals.
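The uniform-tube result can be checked numerically. The Python sketch below evaluates the T-section impedances for the lossless case, forms the transfer function from them, and lists the quarter-wavelength resonances. The values c = 340 m/s and an air density of 1.2 kg/m³ are assumptions chosen for illustration (they are not stated in the text), and the function names are hypothetical.

```python
import cmath
import math

SPEED_OF_SOUND = 340.0  # m/s, assumed value
AIR_DENSITY = 1.2       # kg/m^3, assumed value


def t_section_impedances(freq_hz, length_m, area_m2,
                         c=SPEED_OF_SOUND, rho=AIR_DENSITY):
    """Lossless T-section impedances Z_a and Z_b of a uniform tube."""
    gamma = 1j * 2.0 * math.pi * freq_hz / c   # propagation constant j*omega/c
    z0 = rho * c / area_m2                     # characteristic impedance
    z_a = z0 * cmath.tanh(gamma * length_m / 2.0)
    z_b = z0 / cmath.sinh(gamma * length_m)    # cosech = 1 / sinh
    return z_a, z_b


def transfer_function(freq_hz, length_m, area_m2):
    """Volume-velocity ratio I2/I1 = Z_b / (Z_a + Z_b); freq_hz must be > 0."""
    z_a, z_b = t_section_impedances(freq_hz, length_m, area_m2)
    return z_b / (z_a + z_b)


def uniform_tube_resonances(length_m, count, c=SPEED_OF_SOUND):
    """Resonances F_n = (2n - 1) c / (4 L) for n = 1..count."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]


# 17 cm tube, as in the text; the 5 cm^2 area is a hypothetical value and
# cancels out of the transfer function.
print(uniform_tube_resonances(0.17, 3))
print(abs(transfer_function(200.0, 0.17, 5e-4)))
```

The cross-sectional area scales both impedances equally, so it has no effect on the resonance frequencies of a single uniform tube; only the length and the speed of sound matter.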
The curvature of the vocal tract has also been shown to have a small effect on the
formant frequencies (Sondhi, 1986), but this effect is not significant given the level of
accuracy expected from such a simple model.
2.4.2 Two Tube Model
A uniform tube of fixed length, with fixed terminations, is capable of producing
only a single, fixed set of resonances. A model consisting of two uniform tubes joined
end to end (Figure 2-7) can produce a wide variety of responses if the lengths and cross-
sectional areas of each tube are varied (but keeping the total length of the two tubes
constant).
As for the single tube case, the glottis is modelled as a high impedance source
and the lip opening as a short circuit. The loop equations are:
$$(I_2 - I_1) Z_{b1} + I_2 (Z_{a1} + Z_{a2}) + (I_2 - I_3) Z_{b2} = 0$$
$$(I_3 - I_2) Z_{b2} + I_3 Z_{a2} = 0$$
Eliminating $I_2$ and simplifying gives the transfer function:
$$\frac{I_3}{I_1} = \frac{1}{\left(1 + \dfrac{Z_{a2}}{Z_{b2}}\right)\left(1 + \dfrac{Z_{a1}}{Z_{b1}} + \dfrac{Z_{a2}}{Z_{b1}}\right) + \dfrac{Z_{a2}}{Z_{b1}}}$$
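Substituting the formulae for $Z_a$ and $Z_b$ given above, it can be shown that the
resonances of the transfer function occur when
$$\frac{A_1}{A_2}\tanh(\gamma_2 L_2) = -\coth(\gamma_1 L_1)$$
where $\gamma_1$ and $\gamma_2$ are the propagation constants of the two tubes.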
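Figure 2-7 Two tube model of vocal tract and its equivalent circuit. [Panel (a): two tubes of length $L_1$ and $L_2$ with cross-sectional areas $A_1$ and $A_2$. Panel (b): two cascaded T-sections with arms $Z_{a1}$, $Z_{b1}$ and $Z_{a2}$, $Z_{b2}$, input voltage $V_1$ and currents $I_1$, $I_2$, $I_3$.]
In the lossless case, this reduces to:
$$\frac{A_1}{A_2}\tan\left(\frac{\omega L_2}{c}\right) = \cot\left(\frac{\omega L_1}{c}\right)$$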
The roots of this equation may be found by graphical methods; several examples
are given by Flanagan (1972).
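As an alternative to the graphical approach, the roots can be located numerically. The Python sketch below is one possible way to do this, not the method used in the sources cited. It multiplies the lossless condition through by sin(ωL1/c)·cos(ωL2/c) to avoid the singularities of tan and cot, scans for sign changes, and refines each root by bisection. The speed of sound (340 m/s), the scan step and the function name are assumptions.

```python
import math


def two_tube_resonances(l1_m, l2_m, a1, a2, c=340.0,
                        f_max_hz=5000.0, step_hz=1.0):
    """Resonance frequencies (Hz) of a lossless two-tube model.

    Tube 1 (length l1_m, area a1) is at the glottis, tube 2 at the lips.
    Roots of g(f) = cos(k L1) cos(k L2) - (A1/A2) sin(k L1) sin(k L2),
    with k = 2 pi f / c, are the solutions of
    (A1/A2) tan(k L2) = cot(k L1) with the fractions cleared.
    """
    ratio = a1 / a2

    def g(freq_hz):
        k = 2.0 * math.pi * freq_hz / c
        return (math.cos(k * l1_m) * math.cos(k * l2_m)
                - ratio * math.sin(k * l1_m) * math.sin(k * l2_m))

    roots = []
    for i in range(1, int(f_max_hz / step_hz)):
        lo, hi = i * step_hz, (i + 1) * step_hz
        g_lo, g_hi = g(lo), g(hi)
        # Skip intervals with no sign change; a zero exactly at `lo`
        # was already captured as the upper end of the previous interval.
        if g_lo == 0.0 or g_lo * g_hi > 0.0:
            continue
        for _ in range(60):  # bisection refinement
            mid = 0.5 * (lo + hi)
            if g(lo) * g(mid) <= 0.0:
                hi = mid
            else:
                lo = mid
        roots.append(0.5 * (lo + hi))
    return roots


# Sanity check: equal areas reduce to a single 17 cm uniform tube.
print(two_tube_resonances(0.085, 0.085, 1.0, 1.0)[:3])

# Hypothetical dimensions: narrow back cavity, wide front cavity.
print(two_tube_resonances(0.09, 0.08, 1.0e-4, 7.0e-4)[:3])
```

With equal areas the two tubes form a single uniform tube, so the first call should reproduce the 500, 1500 and 2500 Hz resonances obtained earlier; this is a convenient check on any implementation.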
2.4.3 Perturbation Theory
The formant positions of the various vowel sounds can be explained by
considering the effect of constricting the vocal tract area function at various points (Kent
and Read 1992, p24; Stevens, 1998, p148). This is known as perturbation theory. As
explained above, a uniform tube of length 17 cm. produces formants at 500 Hz, 1500
Hz, 2500 Hz and so on. Each of these formants is associated with a standing wave in the
tube, as shown in Figure 2-8 which illustrates the waves in terms of volume velocity.
The volume velocity must always be zero at the closed (laryngeal) end of the
tube, and maximum at the open end. The effect of a constriction in the tube on each
formant frequency will depend on the position of the constriction with respect to the
nodes and anti-nodes of the standing wave. The effects are as follows:
- A constriction near a volume velocity maximum decreases the formant frequency.
- A constriction near a volume velocity minimum increases the formant frequency.
It can be seen immediately from Figure 2-8 that a constriction at the lips, where
all standing waves have a volume velocity maximum, will decrease all formant
frequencies. (This effect is compounded because constriction of the lips is generally
accompanied by protrusion, hence lengthening the vocal tract and decreasing all formant
frequencies further.) At other positions within the vocal tract, however, each formant is
likely to be affected differently. As the first formant standing wave is only a quarter
wavelength, a constriction anywhere in about the front two-thirds of the vocal tract will
cause the frequency to decrease. Only in the pharynx will a constriction cause the first
formant frequency to increase.
Figure 2-8 Standing waves in a uniform tube closed at one end and open at the other. [Plot of relative volume velocity (-1.5 to 1.5) against distance from the larynx (0 to 17 cm), with curves F1, F2 and F3 and their negatives -F1, -F2 and -F3.]
The second formant standing wave has a node about 11 cm. from the larynx, in
the area of the front of the tongue (see Figure 2-3). An anti-node occurs further back in
the pharynx, at about 6 cm. A constriction at the front will therefore raise F2 while one
at the back will decrease it.
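The node and anti-node positions quoted above follow from the standing-wave shapes. The Python sketch below assumes the lossless uniform tube, in which the volume velocity of the n-th resonance varies as sin((2n-1)πx/2L) with x measured from the closed (laryngeal) end; the function name is hypothetical.

```python
def velocity_nodes_and_antinodes(n, length_cm=17.0):
    """Positions (cm from the larynx) of volume-velocity nodes and
    anti-nodes for the n-th resonance of a closed-open uniform tube.

    Assumes U_n(x) proportional to sin((2n - 1) * pi * x / (2 L)).
    """
    odd = 2 * n - 1
    # Nodes: the sine argument is a multiple of pi, i.e. x = 2 m L / (2n - 1).
    nodes = [2 * m * length_cm / odd for m in range(n)]
    # Anti-nodes: argument is an odd multiple of pi/2, x = (2m + 1) L / (2n - 1).
    antinodes = [(2 * m + 1) * length_cm / odd for m in range(n)]
    return nodes, antinodes


for formant in (1, 2, 3):
    nodes, antinodes = velocity_nodes_and_antinodes(formant)
    print(f"F{formant}: nodes at {[round(x, 1) for x in nodes]} cm, "
          f"anti-nodes at {[round(x, 1) for x in antinodes]} cm")
```

For the second formant of a 17 cm tube this gives a node at about 11.3 cm and an anti-node at about 5.7 cm, in agreement with the figures of about 11 cm and 6 cm given in the text. Every resonance has a node at the larynx and an anti-node at the lips.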
These rules can be used to explain the general structure of the F1-F2 chart, in
which vowels are represented in terms of the frequencies of their first two formants.
Figure 2-9 shows such a chart, based on data from American English subjects (Peterson
and Barney, 1952) except for the vowels /O, @/ which are from British English subjects.
The neutral vowel /@/ appears roughly in the middle, with F1 = 500 Hz and F2 = 1430
Hz. To form the high front vowel /i/, the front of the tongue is raised close to the palate.
This is the area of the F2 node, so F2 is increased, while, as explained above, F1 is
decreased. As the height of the front of the tongue is progressively reduced in the
sequence of vowels /i/, /I/, /e/, /{/, F2 is reduced and F1 increased. A constriction at the
back of the tongue, combined with lip-rounding, causes F1 and F2 both to be low and
forms the vowel /u/. As the tongue height decreases in the back vowel sequence /u/, /U/,
/O/, /V/, /A/ accompanied by a reduction in lip-rounding, both F1 and F2 increase.
Figure 2-9 F1/F2 chart illustrating the "Vowel Triangle" (for male speakers). [Plot of F2, Hz (500 to 2300) against F1, Hz (200 to 800), showing the vowels /i/, /I/, /e/, /{/, /A/, /u/, /@/, /O/, /U/, /Q/ and /V/.]
2.4.4 Distinctive Regions and Modes Model
As the number of tubelets used in a vocal tract model increases, so does the
number of parameters needed to describe the model. When larger numbers of tubelets
are used, they are usually all of equal length so that the only parameters required are the
cross-sectional areas. The Distinctive Regions and Modes (DRM) model (Mrayati,
Carré and Guérin, 1988) represents the vocal tract as a series of concatenated tubes of
differing but fixed lengths. The lengths of the tubes are derived from perturbation theory
(see section 2.4.3), such that the boundaries between the tubes are determined by the
zero-crossings of the sensitivity functions of the formants in a uniform tube. If only the
first two formants are taken into account, there are four zero-crossings in the sensitivity
function, so the model will have four tubes with lengths L/6, L/3, L/3, L/6, where L is
the total length of the model. Adding the sensitivity function of the third formant results
in an eight-tube model with tube lengths of L/10, L/15, 2L/15, L/5, L/5, 2L/15, L/15,
and L/10. In this case, the tubes in the model correspond approximately to distinct
regions of the vocal tract marked A, B, C, D, D, C, B, A in Figure 2-10.
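The tube lengths quoted above can be reproduced with a short calculation. The Python sketch below assumes that the sensitivity function of the n-th formant of a uniform closed-open tube varies as cos((2n-1)πx/L), so that its zero-crossings lie at x = (2m+1)L/(2(2n-1)); this functional form is an assumption used for illustration, and the function name is hypothetical. Exact fractions are used so that the results can be compared directly with the text.

```python
from fractions import Fraction


def drm_region_lengths(num_formants):
    """Region lengths, as fractions of the total length L, obtained by
    cutting a uniform tube at the zero-crossings of the sensitivity
    functions of the first `num_formants` formants.

    Assumes sensitivity ~ cos((2n - 1) * pi * x / L) for formant n.
    """
    crossings = set()
    for n in range(1, num_formants + 1):
        odd = 2 * n - 1
        for m in range(odd):
            crossings.add(Fraction(2 * m + 1, 2 * odd))
    bounds = [Fraction(0)] + sorted(crossings) + [Fraction(1)]
    return [upper - lower for lower, upper in zip(bounds, bounds[1:])]


print([str(f) for f in drm_region_lengths(2)])  # four-tube model
print([str(f) for f in drm_region_lengths(3)])  # eight-tube model

# Region lengths in cm for a hypothetical 17 cm vocal tract.
print([round(float(f) * 17.0, 2) for f in drm_region_lengths(3)])
```

With two formants the calculation yields L/6, L/3, L/3, L/6, and with three it yields L/10, L/15, 2L/15, L/5, L/5, 2L/15, L/15, L/10, matching the lengths given above.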
This model has been shown to be capable of representing measured area
functions more accurately than a model with eight tubes of equal length (Ciocea and
Schoentgen, 1998, p55). In this study, each model was fitted to the measured area
function by matching the volume of each tubelet to the volume of the corresponding
region of the measured area function. The goodness-of-fit was determined by summing
the absolute differences between the model and the measured area function at each
measurement point, and dividing by the total volume of the measured area function.
Comparisons were made using five area functions from (Fant, 1970) and twelve from
(Story, Titze, and Hoffman, 1996) and in every case the DRM model fitted the measured
area function more closely than the model with equal tube lengths.
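The fitting and scoring procedure described above might be sketched as follows. This is an illustrative reading of the description, not the code used in the cited study. It assumes that the measured area function is sampled at equally spaced points, assigns each sample to a tubelet by the position of its midpoint, and sets each tubelet area to the mean of its samples so that the volumes match. With equal spacing, the sum of absolute differences divided by the total volume reduces to a ratio of sums. All names are hypothetical.

```python
from bisect import bisect_right


def fit_equal_volume_tubelets(measured_areas, length_fractions):
    """Return the model area at each measurement point.

    measured_areas: areas at equally spaced points from glottis to lips.
    length_fractions: tubelet lengths as fractions of the total length.
    Each tubelet takes the mean area of the samples it covers, which
    preserves the volume of that region of the measured area function.
    """
    n = len(measured_areas)
    bounds, total = [], 0.0
    for fraction in length_fractions:
        total += fraction
        bounds.append(total)  # cumulative upper boundary of each tubelet
    groups = [[] for _ in length_fractions]
    for i, area in enumerate(measured_areas):
        midpoint = (i + 0.5) / n
        index = min(bisect_right(bounds, midpoint), len(groups) - 1)
        groups[index].append(area)
    model = []
    for members in groups:
        if not members:
            raise ValueError("a tubelet contains no measurement points")
        mean_area = sum(members) / len(members)
        model.extend([mean_area] * len(members))
    return model


def fit_error(measured_areas, model_areas):
    """Sum of absolute differences divided by the total measured volume
    (the common sample spacing cancels)."""
    differences = sum(abs(m - a) for m, a in zip(model_areas, measured_areas))
    return differences / sum(measured_areas)


# Hypothetical area function (cm^2) and a two-tubelet model.
measured = [1.0, 3.0, 1.0, 3.0]
model = fit_equal_volume_tubelets(measured, [0.5, 0.5])
print(model, fit_error(measured, model))
```

The same functions can be called with the eight DRM length fractions and with eight equal fractions to compare the two kinds of model on a given area function.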
The DRM model has been criticised (Boë and Perrier, 1990) on the grounds that
the sensitivity functions underlying perturbation theory are only applicable to small
perturbations from the uniform tube, but the use of a model in which the tubelets
correspond (albeit approximately) to distinctive regions of the vocal tract would seem to
have advantages when considering the effects of positive pressure breathing.
Figure 2-10 DRM model superimposed on mid-sagittal vocal tract profile (from Carré and Mrayati, 1991)
2.4.5 Radiation Load
Standard treatments of the acoustics of speech production naturally assume that
the speaker is in the open air. In this case, the transfer function may include a radiation
model, which converts the volume velocity at the lips into a pressure wave in the far
field. These effects may be modelled in terms of the electrical analogy as an inductance
in series with a resistance (Figure 2-11).
The main effect of the radiation model is to produce a high frequency boost of
approximately 6 dB per octave (Flanagan 1972) on the axis of the mouth. The variations
of response for angles within about 60° of the axis are small, but become significant
behind the head. There is also a slight reduction in formant frequencies and increase of
formant bandwidths.
Figure 2-11 Simple model of radiation load. [Circuit: source, vocal tract filter, and a radiation load consisting of an inductive reactance ωL in series with a resistance R.]
2.5 Oxygen Mask Load
The crew of military fast-jets always wear oxygen masks, which serve several
purposes. The survival and functioning of a human under reduced atmospheric pressure
depends on the maintenance of a minimum partial pressure of oxygen in his breathing
gas. As the ambient pressure is reduced, the proportion of oxygen in the gas must be
increased, until, at a pressure corresponding to an altitude of about 40,000 ft, pure
oxygen must be supplied. If the ambient pressure drops below this level, the pressure of
the breathing gas must be maintained at a minimum of 130 mmHg to avoid severe
hypoxia (Ernsting 1966). The primary function of the oxygen mask is to allow this
degree of control over the breathing gas mixture and pressure. The cockpits of modern
military aircraft are pressurised to some degree but not to the same extent as civil
airliners.
A secondary function of the mask is to shield the pilot's microphone from the
high levels of noise present in the cockpit, thus aiding communications. In current
designs of mask, the microphone is mounted in the anterior end of the mask, directly in
front of the lips and about 3 cm. away from them (Figure 2-12). The mask gives 10-20
dB of attenuation, depending on how well it fits and whether the expiratory valve is open
or closed. The valve will normally be open when the pilot is speaking. The noise
attenuation is also a function of frequency (James, 1991).
A third function of the mask is to allow pressurisation of the breathing gas for G
protection. It has been shown that increasing the pressure of the breathing gas above
ambient pressure causes a reflex increase in blood pressure (Ernsting 1966) which helps
to maintain consciousness under high levels of vertical acceleration (+Gz). Combat
aircraft currently under development will be capable of manoeuvring at well over 6 g, so
means must be supplied to keep the pilot functioning under these conditions.
Placing a small closed (or nearly so) cavity over the mouth seems likely to have a
significant effect on the response of the vocal system, but the literature on this subject is
rather sparse. Early work studied the effects of diving masks. Morrow (1948) studied a
model of vowel production consisting of only two capacitances and two inductances, to
which he added an extra capacitance to model the effect of a non-radiating enclosure
over the mouth. Extensive measurements were made, showing a tendency for the cavity
to increase the frequencies of the formants, but the results were inconsistent. Later
work by the same author (Morrow and Brouns, 1971) used an acoustic impedance
calibrator to determine the effects of various mask cavity sizes in helium under high
pressure as well as air at normal pressure. The sound source in these experiments was
specially constructed to have a high acoustic impedance.
Figure 2-12 RAF P/Q type oxygen mask (DERA Photographic Library)
Singer (1981) examined the effects of the pilot's mask and microphone response
on the performance of LPC vocoders. He considered the possibility that additional
resonances introduced by the mask may lead to the failure of the 10th-order linear
prediction model used in military vocoders. The acoustic effects of the mask were
modelled by replacing the resistance in the radiation model (Figure 2-11) by a
capacitance as shown in Figure 2-13, ignoring the loss due to the expiratory valve. As in
(Morrow and Brouns, 1971), the vocal tract was assumed to be a high impedance source
in comparison with the load. It is stated that the model predicts a reduction of the
bandwidth for the high frequency formants, changes in formant frequencies and possibly
the appearance of additional formants. Singer (1981) gives no details of the analysis.
Figure 2-13 Model of acoustic load of oxygen mask (after Singer 1981). [Circuit: source, vocal tract filter, and an oxygen mask load consisting of an inductive reactance ωL and a capacitive reactance 1/ωC.]
The response is also analysed in terms of the ratio of the pressure response at the
lips with the mask to that in free air. This shows a low frequency boost of 12 dB/oct, a
zero in the response where $\omega L = \frac{1}{\omega C}$, and an asymptotic approach to unity gain at
high frequencies. It should be noted that the pressure response at the lips is not
necessarily the same as that at the microphone, which is at the opposite end of the mask
from the lips in most designs of oxygen mask.
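Since no details of Singer's analysis are available, the following Python sketch shows only one simple reading that reproduces the three features just described. It assumes a high-impedance source, a mask load of an inductance in series with a capacitance, and a free-air load dominated by the inductance alone (the radiation resistance is neglected). Under those assumptions the pressure ratio is 1 - 1/(ω²LC). The component values are arbitrary and chosen only to place the zero at 1 kHz; they are not measured data.

```python
import math


def mask_to_free_air_ratio(freq_hz, inductance, capacitance):
    """Pressure ratio (mask / free air) at the lips for a constant
    volume-velocity source.

    Assumes mask load j*w*L + 1/(j*w*C) and free-air load j*w*L only,
    which gives the real-valued ratio 1 - 1 / (w^2 * L * C).
    """
    omega = 2.0 * math.pi * freq_hz
    return 1.0 - 1.0 / (omega ** 2 * inductance * capacitance)


def zero_frequency_hz(inductance, capacitance):
    """Frequency at which the two reactances cancel: w = 1 / sqrt(L C)."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance * capacitance))


# Arbitrary normalised values placing the zero at 1 kHz (not measured data).
L_ASSUMED = 1.0
C_ASSUMED = 1.0 / (2.0 * math.pi * 1000.0) ** 2

for f in (10.0, 20.0, 1000.0, 100000.0):
    ratio = abs(mask_to_free_air_ratio(f, L_ASSUMED, C_ASSUMED))
    level = 20.0 * math.log10(ratio) if ratio > 1e-9 else float("-inf")
    print(f"{f:9.1f} Hz  {level:8.2f} dB")
```

Well below the zero the magnitude falls by a factor of about four for each doubling of frequency, i.e. roughly 12 dB per octave; it vanishes where the reactances cancel and tends to unity at high frequencies.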
Singer also analysed the spectra of vowel segments produced in the oxygen mask
fitted with a noise cancelling (pressure gradient) microphone. He reports no additional
resonances and no significant changes to formant frequencies, although the bandwidths
of higher formants were reduced, as predicted. Speech intelligibility studies using the
Diagnostic Rhyme Test procedure compared performance of the standard LPC-10
vocoder on oxygen mask speech with that of a 12th-order LPC algorithm. The standard
vocoder produced a loss of intelligibility of about 10% compared with unprocessed
speech; the LPC-12 algorithm performed no better, tending to confirm that the mask
produced no significant additional resonances within the 4 kHz bandwidth of the vocoder
algorithms.
Later work (Wheeler, Elliott, and Darlington, 1984; Gant, 1986) actually
measured the acoustic impedances of the mask and the vocal tract. A probe was
constructed, consisting of a miniature source and a miniature pressure microphone
mounted close together. The probe was calibrated in a small cavity, and then checked in
a straight, uniform tube, open at one end. The measured impedance of the tube was a
good approximation to the true value, which was easily calculated. The probe was then
used to measure the input impedance of an oxygen mask with the aid of a mannequin
head. Measurements were made with the expiratory valve both open and closed; little
difference was seen between the two conditions.
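The tube check is convenient because the ideal result has a closed form. As a minimal
sketch, assuming a lossless uniform tube whose open end presents zero acoustic impedance
(no radiation load or wall losses), the input impedance is j(rho c / A) tan(kL). The
function names, the air constants and the tube dimensions used here are illustrative
assumptions and are not taken from the studies cited.

```python
import math

RHO_AIR = 1.2    # kg/m^3, approximate density of air (assumed)
C_SOUND = 343.0  # m/s, approximate speed of sound at room temperature (assumed)

def open_tube_input_impedance(freq_hz, length_m, area_m2):
    """Input impedance of an ideal lossless tube that is open at the far end."""
    k = 2.0 * math.pi * freq_hz / C_SOUND      # wavenumber
    z0 = RHO_AIR * C_SOUND / area_m2           # characteristic acoustic impedance
    return complex(0.0, z0 * math.tan(k * length_m))

def impedance_minima(length_m, count):
    """Frequencies at which the ideal input impedance is zero: f = n c / (2 L)."""
    return [n * C_SOUND / (2.0 * length_m) for n in range(1, count + 1)]

if __name__ == "__main__":
    # Hypothetical 0.5 m tube with a 1 cm^2 cross-section.
    print(impedance_minima(0.5, 3))
    print(open_tube_input_impedance(100.0, 0.5, 1.0e-4))
```

A real tube departs from this ideal near the impedance maxima and minima, where losses
and the radiation load at the open end limit the measured values.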
The probe was also used to measure the impedance of the vocal tract from the
mouth. Subjects were required to close their lips around the probe while keeping their
vocal tract in the configuration used for pronouncing various vowels. It is not stated
whether the glottis was open or closed. The impedance varies with frequency, vocal
tract configuration and the individual, but these measurements showed that it is of the
same order of magnitude as that of the mask over most of the frequency range
considered (up to 5 kHz). It was concluded that the assumption made by Singer that the
vocal tract impedance was much higher was incorrect. However, the internal volume of
the mask used by the US Air Force is greater than that of the mask used by the RAF and
this would result in weaker coupling with the vocal tract, possibly accounting for the
differences between these two studies.
These investigators then proceeded to make recordings of the same subjects in a
mask and in free field conditions, using the same microphone in both cases. The
differences in spectra were calculated for the phones [i], [u], [m], and [S] uttered in
isolation and for a long term average of continuous speech. It was concluded that the
oxygen mask does cause significant changes to the spectral content of speech (as much
as 20 dB in some cases), but the changes are both speaker and utterance dependent. The
low frequency boost was confirmed, however.
Another study (Bond, Moore and Gable, 1988) compared utterances recorded in
an oxygen mask with the same words spoken by the same subjects into a boom
microphone. The recordings were made with ambient noise levels between 85 and 100
dB, with the subject wearing a flying helmet. The results show that the mask causes a
compression of the vowel space in the F1-F2 plane, particularly in the F1 dimension.
This was considered to be a result of the effective lengthening of the vocal tract by some
3 cm. The restriction of jaw movement by the mask was also thought to play a part in
reducing the vowel space.
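The order of magnitude of a 3 cm lengthening can be illustrated with the uniform tube
approximation, in which a tract closed at the glottis and open at the lips resonates at
odd multiples of c/4L. The sketch below assumes a neutral tract length of 17 cm and a
sound speed of 350 m/s. Both are typical textbook values and are not taken from Bond,
Moore and Gable (1988).

```python
def uniform_tube_formants(length_m, count=3, c_sound=350.0):
    """Quarter-wave resonances (2n - 1) c / (4 L) of a uniform closed-open tube."""
    return [(2 * n - 1) * c_sound / (4.0 * length_m) for n in range(1, count + 1)]

if __name__ == "__main__":
    unmasked = uniform_tube_formants(0.17)   # assumed neutral tract length
    masked = uniform_tube_formants(0.20)     # same tract lengthened by 3 cm
    for n, (f_a, f_b) in enumerate(zip(unmasked, masked), start=1):
        print(f"F{n}: {f_a:7.1f} Hz -> {f_b:7.1f} Hz")
```

In this idealisation every resonance is scaled by the same factor, 17/20. It therefore
shows only the general lowering of the resonances, and does not by itself reproduce the
greater compression observed in the F1 dimension.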
2.6 Speech Produced under Sustained Acceleration
The accelerations encountered in flight are classified according to the direction in
which they act relative to the aircraft as shown in Figure 2-14 (Glaister and Prior, 1999,
p133)¹. The x-axis is the fore-and-aft direction, with +x forwards, the y-axis is the
lateral direction, with +y being to the right, and the z-axis is vertical, +z being upwards.
In normal flight, the highest levels of acceleration will occur in the +z direction, as this is
the direction in which the wings are designed to produce lift. Turns are achieved by
banking the aircraft in the desired direction, then pulling back on the control column to
increase the lift generated by the wings; thus the acceleration is still +Gz relative to the
aircraft. Modern agile combat aircraft are designed to pull as much as 9g; higher levels
could be achieved but for the problem of keeping the crew conscious and functioning.
Straight and level flight is, of course, 1g; some negative Gz may be encountered in
inverted turns, but this will seldom be greater than -1g.
Figure 2-14 Convention for reference frame axes (+X, +Y and +Z)
¹ This is the convention adopted in the field of aerospace medicine. Aeronautical engineers use
a different convention, in which the direction of +Z is reversed (Bramwell, 1976).
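The link between bank angle and +Gz in a turn can be made concrete with the standard
relation for a level, coordinated turn, in which the vertical component of lift balances
the weight, so that the load factor is 1/cos(bank angle). This relation is not taken from
the sources cited in this section and is included only as an illustration. The function
names are hypothetical.

```python
import math

def load_factor(bank_deg):
    """Load factor (in g) of a level, coordinated turn at the given bank angle."""
    return 1.0 / math.cos(math.radians(bank_deg))

def bank_angle_deg(load_factor_g):
    """Bank angle in degrees needed to sustain a given load factor in a level turn."""
    return math.degrees(math.acos(1.0 / load_factor_g))

if __name__ == "__main__":
    for g in (1.4, 3.0, 6.0, 9.0):
        print(f"{g:4.1f} g  ->  bank angle {bank_angle_deg(g):5.1f} deg")
```

Under this idealisation a 60 degree bank corresponds to 2g, while the 9g quoted above
requires a bank angle of nearly 84 degrees.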
The accelerations encountered on the other axes are relatively low: Gx levels of
the order of +1g will be fairly common, but significant levels of Gy occur only during
spins or other abnormal manoeuvres.
Only one study of the effects of sustained acceleration on the acoustic-phonetic
characteristics of speech gives any details of the results (Bond, Moore and Anderson,
1986). Recordings were made from two male subjects at 1g and 6g; five words were
uttered in isolation five times each at 1g and from two to four times at 6g. The subjects
wore oxygen masks, but no details are given of any other protective clothing. The
recordings were analysed for fundamental frequency, vowel and diphthong formant
frequencies, word and segment duration, and amplitude. At 6g, the mean fundamental
frequency of both speakers in stressed syllables was increased relative to that at 1g, and
the range was greater. In unstressed syllables, one speaker showed no effect while the
other increased his range only. Vowel formant frequencies were measured for /i, e, u/.
The mean values of the first formant frequency of /i/ and /e/ increased under G for both
speakers, while the second formant was lowered for these vowels. In /u/, the two
speakers' first formants changed in opposite directions, while the second formant was
raised for both speakers. No consistent effects were seen in the third formant. Overall,
the general effect was a reduction of the vowel space on the F1-F2 plane. The two
diphthongs studied, /aI/ and /@U/, showed consistent changes: F1 was increased and F2
decreased for both components.
The two subjects had very different speaking styles with respect to word
duration; one showed very little variation between utterances of the same word, while
the other showed large variations, even under normal conditions. The consistent speaker
showed a slight increase in duration for most words at 6g, while the other showed no
consistent pattern. Analysis of segment durations showed that most of the increase of
word duration was accounted for by variation in the lengths of the vowels.
These results appear to show some consistent patterns, but must be treated with
caution because of the small samples involved. Clearly, a more extensive analysis must
be made before any general conclusions can be drawn.
Gulli et al. (1992) also reported analysis of speech recorded under acceleration.
Five male subjects and one female recorded isolated vowels, disyllables, words and short
phrases over a range of G levels. Exact details of the vocabulary and recording
conditions are not given, although it is stated that the subjects wore an oxygen mask,
flying helmet and anti-G trousers. A second phase of the experiment included recordings of
cockpit command phrases of up to eight words, for tests on speech recognition
equipment. During the second phase, the acceleration levels were 1.4, 3, and 6g, but the
acceleration was not continuous. During each run, three 15 s periods of 3g or 6g were
separated by similar duration periods of 1.4g, the whole run lasting just under two
minutes. This is much more like the conditions of air-to-air combat than a continuous
period at a steady acceleration. Although it is not stated, it is likely that the recordings
were made in the same manner in the first phase.
Measurements were made of fundamental frequency, formant frequencies,
spectral slope and total energy using a variety of analysis techniques. Unfortunately,
detailed results were not included in this paper. It is stated that fundamental frequency
and overall energy increased considerably at high G, that high frequencies generally were
reinforced (i.e. a change of spectral slope), and that the higher formants (meaning F3 and
F4) became more variable. The latter finding is related, in a very vague and qualitative
fashion, to displacement of the constriction in the vocal tract towards the glottis, but no
serious attempt is made to describe the effects in detail or to account for them in
articulatory terms. From the point of view of the current work, this paper is very
disappointing, as it concentrates on the signal processing techniques used in the work
and not on the results.
2.7 Speech Produced under Positive Pressure Breathing
As described above, positive pressure breathing may be used to maintain the
pilot's consciousness during high G turns. While very detailed studies have been made of
the physiological effects of pressure breathing (Ernsting, 1966), the author has found no
previous work on its effects on speech production. It is obvious that increasing the
pressure inside the vocal tract relative to that outside will tend to expand the cheeks and
throat (Figure 2-15), increasing the cross-sectional area of at least some parts, and hence
changing the resonant frequencies of the vocal cavity. It is also apparent from listening
to speech produced under these conditions, that the production of sounds requiring the
contact of two articulators becomes difficult. This is particularly noticeable in bilabial
and alveolar consonants, but less so in velar consonants.
It is possible to estimate the effects of positive pressure breathing on the formant
frequencies of vowels via an n-tube model such as was described in Section 2.4. The
main difficulty lies in estimating the magnitude of the displacement of the vocal tract
walls under pressure. Chapter 3 describes two models of vowel production under
positive pressure breathing, the first using a simple four tube model, the second using the
Dynamic Regions and Modes model.
Figure 2-15 (left) Subject prior to pressure breathing, (right) Neck distension
during pressure breathing at 70 mmHg (from Ernsting 1966)
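The kind of estimate referred to above can be sketched with the simplest non-uniform
case, a lossless two-tube tract closed at the glottis and open at the lips, whose
resonances satisfy A_back sin(k l_back) sin(k l_front) = A_front cos(k l_back) cos(k l_front).
This is not one of the models described in Chapter 3. The tube dimensions and the 20%
expansion of the back cavity are hypothetical values chosen purely for illustration.

```python
import math

def two_tube_formants(l_back, a_back, l_front, a_front,
                      c_sound=350.0, f_max=4000.0, step_hz=1.0):
    """Resonances (Hz) of a lossless two-tube tract, closed at glottis, open at lips."""
    def g(freq_hz):
        # Zero of g <=> tan(k l_back) * tan(k l_front) = a_front / a_back.
        k = 2.0 * math.pi * freq_hz / c_sound
        return (a_back * math.sin(k * l_back) * math.sin(k * l_front)
                - a_front * math.cos(k * l_back) * math.cos(k * l_front))

    formants = []
    lo = step_hz
    while lo < f_max:
        hi = lo + step_hz
        g_lo, g_hi = g(lo), g(hi)
        if g_lo == 0.0:
            formants.append(lo)              # root falls exactly on a grid point
        elif g_lo * g_hi < 0.0:
            a, b = lo, hi                    # bracketed root: refine by bisection
            for _ in range(50):
                mid = 0.5 * (a + b)
                if g(a) * g(mid) <= 0.0:
                    b = mid
                else:
                    a = mid
            formants.append(0.5 * (a + b))
        lo = hi
    return formants

if __name__ == "__main__":
    # Hypothetical tract: two 8.75 cm sections of 4 cm^2 each (a uniform tube).
    baseline = two_tube_formants(0.0875, 4.0e-4, 0.0875, 4.0e-4)
    # Hypothetical 20% increase in back-cavity area under pressure breathing.
    expanded = two_tube_formants(0.0875, 4.8e-4, 0.0875, 4.0e-4)
    print([round(f) for f in baseline[:3]])
    print([round(f) for f in expanded[:3]])
```

With these assumed values the first resonance falls from 500 Hz to about 471 Hz when the
back section is enlarged. The size of any real effect depends on the wall displacement,
which is the quantity identified above as difficult to estimate.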
