Substituting the formulae for $Z_a$ and $Z_b$ given above, it can be shown that the
resonances of the transfer function occur when
$$\frac{A_1}{A_2} \tanh(\gamma_2 L_2) = -\coth(\gamma_1 L_1)$$
Figure 2-7 Two tube model of vocal tract and its equivalent circuit: (a) two tubes of
lengths L1, L2 and cross-sectional areas A1, A2; (b) equivalent circuit with elements
Za1, Zb1, Za2, Zb2, voltage V1 and currents I1, I2, I3
In the lossless case, this reduces to:
$$\frac{A_1}{A_2} \tan\left(\frac{\omega L_2}{c}\right) = \cot\left(\frac{\omega L_1}{c}\right)$$
The roots of this equation may be found by graphical methods; several examples
are given by Flanagan (1972).
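As an alternative to the graphical method, the roots of the lossless condition can be located numerically. The following is a minimal sketch, not the method used in the sources cited: it assumes a sound speed of 340 m/s (the value implied by a 17 cm tube resonating at 500 Hz), rewrites the condition as A1 sin(kL1) sin(kL2) - A2 cos(kL1) cos(kL2) = 0 to avoid the poles of tan and cot, and uses a hypothetical function name and search grid.

```python
import math

C_SOUND = 340.0  # speed of sound in m/s (assumed value)


def two_tube_resonances(a1, l1, a2, l2, f_max=4000.0, step=1.0):
    """Resonances (Hz) below f_max of the lossless two-tube model.

    Solves (A1/A2) tan(w L2 / c) = cot(w L1 / c) in the pole-free form
    g(f) = A1 sin(k L1) sin(k L2) - A2 cos(k L1) cos(k L2) = 0, k = 2 pi f / c.
    Areas and lengths are in SI units; only the area ratio matters.
    """
    def g(freq):
        k = 2.0 * math.pi * freq / C_SOUND
        return (a1 * math.sin(k * l1) * math.sin(k * l2)
                - a2 * math.cos(k * l1) * math.cos(k * l2))

    roots = []
    f_lo = 0.5 * step
    g_lo = g(f_lo)
    while f_lo < f_max:
        f_hi = f_lo + step
        g_hi = g(f_hi)
        if g_lo * g_hi < 0.0:  # a sign change brackets a root
            lo, hi = f_lo, f_hi
            for _ in range(50):  # bisection
                mid = 0.5 * (lo + hi)
                if g(lo) * g(mid) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            roots.append(0.5 * (lo + hi))
        elif g_hi == 0.0:  # root falls exactly on a grid point
            roots.append(f_hi)
        f_lo, g_lo = f_hi, g_hi
    return roots


# Hypothetical example: narrow 9 cm back tube, wide 8 cm front tube.
example = two_tube_resonances(1.0e-4, 0.09, 7.0e-4, 0.08)
```

With equal areas the condition collapses to cos(k(L1 + L2)) = 0, so the sketch should return the uniform-tube resonances, which gives a convenient check on the reconstruction.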
2.4.3 Perturbation Theory
The formant positions of the various vowel sounds can be explained by
considering the effect of constricting the vocal tract area function at various points (Kent
and Read 1992, p24; Stevens, 1998, p148). This is known as perturbation theory. As
explained above, a uniform tube of length 17 cm. produces formants at 500 Hz, 1500
Hz, 2500 Hz and so on. Each of these formants is associated with a standing wave in the
tube, as shown in Figure 2-8 which illustrates the waves in terms of volume velocity.
The volume velocity must always be zero at the closed (laryngeal) end of the
tube, and maximum at the open end. The effect of a constriction in the tube on each
formant frequency will depend on the position of the constriction with respect to the
nodes and anti-nodes of the standing wave. The effects are as follows:
- A constriction near a volume velocity maximum decreases the formant frequency.
- A constriction near a volume velocity minimum increases the formant frequency.
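The uniform-tube formant values quoted above follow from the quarter-wave resonance formula F_n = (2n - 1)c / 4L. The short sketch below evaluates it, assuming a sound speed of 340 m/s (an assumption, chosen because it reproduces the 500 Hz first formant for a 17 cm tube); the function name is illustrative only.

```python
C_SOUND = 340.0  # speed of sound in m/s (assumed value)


def uniform_tube_formants(length_m, count=3, c=C_SOUND):
    """Resonances of a lossless uniform tube closed at one end, open at the other."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]


# A 17 cm tube gives approximately 500, 1500 and 2500 Hz.
formants_17cm = uniform_tube_formants(0.17)
```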
It can be seen immediately from Figure 2-8 that a constriction at the lips, where
all standing waves have a volume velocity maximum, will decrease all formant
frequencies. (This effect is compounded because constriction of the lips is generally
accompanied by protrusion, hence lengthening the vocal tract and decreasing all formant
frequencies further.) At other positions within the vocal tract, however, each formant is
likely to be affected differently. As the first formant standing wave is only a quarter
wavelength, a constriction anywhere in about the front two-thirds of the vocal tract will
cause the frequency to decrease. Only in the pharynx will a constriction cause the first
formant frequency to increase.
The second formant standing wave has a node about 11 cm. from the larynx, in
the area of the front of the tongue (see Figure 2-3). An anti-node occurs further back in
the pharynx, at about 6 cm. A constriction at the front will therefore raise F2 while one
at the back will decrease it.
Figure 2-8 Standing waves in a uniform tube closed at one end and open at the other
(relative volume velocity against distance from larynx, 0 to 17 cm, for F1, F2 and F3)
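The node and anti-node positions quoted for F2 can be checked from the standing-wave shape. The sketch below assumes that the volume velocity of the n-th mode of a uniform closed-open tube varies as sin((2n - 1)πx / 2L), with x measured from the closed (laryngeal) end; the function name is hypothetical.

```python
def velocity_nodes_and_antinodes(n, length_cm=17.0):
    """Positions (cm from the closed end) of volume velocity zeros and maxima.

    Assumes U_n(x) is proportional to sin((2n - 1) * pi * x / (2 * L)).
    """
    denom = 2 * n - 1
    nodes = [2 * m * length_cm / denom for m in range(n)]            # |U| = 0
    antinodes = [(2 * m + 1) * length_cm / denom for m in range(n)]  # |U| maximal
    return nodes, antinodes


# For F2 (n = 2): nodes at 0 and 2L/3, anti-nodes at L/3 and L.
f2_nodes, f2_antinodes = velocity_nodes_and_antinodes(2)
```

For a 17 cm tube, 2L/3 is about 11.3 cm and L/3 about 5.7 cm, in line with the approximate figures of 11 cm and 6 cm given in the text.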
These rules can be used to explain the general structure of the F1-F2 chart, in
which vowels are represented in terms of the frequencies of their first two formants.
Figure 2-9 shows such a chart, based on data from American English subjects (Peterson
and Barney, 1952) except for the vowels /O, @/ which are from British English subjects.
The neutral vowel /@/ appears roughly in the middle, with F1 = 500 Hz and F2 = 1430
Hz. To form the high front vowel /i/, the front of the tongue is raised close to the palate.
This is the area of the F2 node, so F2 is increased, while, as explained above, F1 is
decreased. As the height of the front of the tongue is progressively reduced in the
sequence of vowels /i/, /I/, /e/, /{/, F2 is reduced and F1 increased. A constriction at the
back of the tongue, combined with lip-rounding, causes F1 and F2 both to be low and
forms the vowel /u/. As the tongue height decreases in the back vowel sequence /u/, /U/,
/O/, /V/, /A/, accompanied by a reduction in lip-rounding, both F1 and F2 increase.
Figure 2-9 F1/F2 chart illustrating the "Vowel Triangle" (for male speakers): F1 (200 to
800 Hz) on the horizontal axis and F2 (500 to 2300 Hz) on the vertical axis, with the
vowels /i/, /I/, /e/, /{/, /A/, /u/, /@/, /O/, /U/, /Q/, /V/ plotted
2.4.4 Distinctive Regions and Modes Model
As the number of tubelets used in a vocal tract model increases, so does the
number of parameters needed to describe the model. When larger numbers of tubelets
are used, they are usually all of equal length so that the only parameters required are the
cross-sectional areas. The Distinctive Regions and Modes (DRM) model (Mrayati,
Carré and Guérin, 1988) represents the vocal tract as a series of concatenated tubes of
differing but fixed lengths. The lengths of the tubes are derived from perturbation theory
(see section 2.4.3), such that the boundaries between the tubes are determined by the
zero-crossings of the sensitivity functions of the formants in a uniform tube. If only the
first two formants are taken into account, there are four zero-crossings in the sensitivity
function, so the model will have four tubes with lengths L/6, L/3, L/3, L/6, where L is
the total length of the model. Adding the sensitivity function of the third formant results
in an eight-tube model with tube lengths of L/10, L/15, 2L/15, L/5, L/5, 2L/15, L/15,
and L/10. In this case, the tubes in the model correspond approximately to distinct
regions of the vocal tract marked A, B, C, D, D, C, B, A in Figure 2-10.
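The tube lengths quoted above can be reproduced from the zero-crossings directly. The sketch below assumes that the sensitivity function of the n-th formant of a uniform closed-open tube is proportional to cos((2n - 1)πx / L), so that its zeros lie at x/L = (2m + 1) / (2(2n - 1)); this form is an assumption that follows from the standing-wave shapes and is not taken from the DRM papers. The function name is hypothetical.

```python
from fractions import Fraction


def drm_tube_lengths(num_formants):
    """Tube lengths, as fractions of the total length L, for a DRM-style model.

    Boundaries are the zero-crossings of cos((2n - 1) * pi * x / L) for
    n = 1 .. num_formants (assumed form of the sensitivity functions).
    """
    boundaries = {Fraction(0), Fraction(1)}
    for n in range(1, num_formants + 1):
        for m in range(2 * n - 1):
            boundaries.add(Fraction(2 * m + 1, 2 * (2 * n - 1)))
    ordered = sorted(boundaries)
    return [hi - lo for lo, hi in zip(ordered, ordered[1:])]


four_tube = drm_tube_lengths(2)   # L/6, L/3, L/3, L/6
eight_tube = drm_tube_lengths(3)  # L/10, L/15, 2L/15, L/5, L/5, 2L/15, L/15, L/10
```

Note that the zero-crossing at L/2 is shared by all the sensitivity functions, which is why two formants yield three distinct internal boundaries (four tubes) and three formants yield seven (eight tubes).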
This model has been shown to be capable of representing measured area
functions more accurately than a model with eight tubes of equal length (Ciocea and
Schoentgen, 1998, p55). In this study, each model was fitted to the measured area
function by matching the volume of each tubelet to the volume of the corresponding
region of the measured area function. The goodness-of-fit was determined by summing
the absolute differences between the model and the measured area function at each
measurement point, and dividing by the total volume of the measured area function.
Comparisons were made using five area functions from Fant (1970) and twelve from
Story, Titze, and Hoffman (1996), and in every case the DRM model fitted the measured
area function more closely than the model with equal tube lengths.
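The fitting procedure described above can be sketched as follows. This is an illustration under simplifying assumptions rather than the implementation used in the study: the measured area function is taken to be a list of equal-thickness slices, each slice is assigned to the tube containing its midpoint, and the slice thickness is taken as the unit of length so that the total volume is simply the sum of the areas. All names and the sample data are hypothetical.

```python
def fit_tubes(areas, boundaries):
    """Piecewise-constant fit that preserves the volume within each tube.

    areas: measured areas of equal-thickness slices, larynx to lips.
    boundaries: tube boundaries as fractions of total length, from 0 to 1.
    """
    n = len(areas)
    model = [0.0] * n
    for lo, hi in zip(boundaries, boundaries[1:]):
        # Slices whose midpoint lies inside this tube.
        idx = [i for i in range(n) if lo <= (i + 0.5) / n < hi]
        if idx:
            mean_area = sum(areas[i] for i in idx) / len(idx)  # equal volume
            for i in idx:
                model[i] = mean_area
    return model


def fit_error(areas, model):
    """Sum of absolute area differences divided by total volume (unit slices)."""
    return sum(abs(m - a) for m, a in zip(model, areas)) / sum(areas)


# Hypothetical area function (cm^2), 30 slices from larynx to lips.
measured = [1.0 + 0.2 * i if i < 15 else 4.0 - 0.2 * (i - 15) for i in range(30)]
drm_bounds = [0, 1 / 10, 1 / 6, 3 / 10, 1 / 2, 7 / 10, 5 / 6, 9 / 10, 1]
equal_bounds = [i / 8 for i in range(9)]
drm_error = fit_error(measured, fit_tubes(measured, drm_bounds))
equal_error = fit_error(measured, fit_tubes(measured, equal_bounds))
```

The sample data are synthetic, so the two error values illustrate the calculation only and say nothing about which model fits real vocal tract shapes better.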
The DRM model has been criticised (Boë and Perrier, 1990) on the grounds that
the sensitivity functions underlying perturbation theory are only applicable to small
perturbations from the uniform tube, but the use of a model in which the tubelets
correspond (albeit approximately) to distinctive regions of the vocal tract would seem to
have advantages when considering the effects of positive pressure breathing.
Figure 2-10 DRM model superimposed on mid-sagittal vocal tract profile (from Carré
and Mrayati, 1991)
2.4.5 Radiation Load
Standard treatments of the acoustics of speech production naturally assume that
the speaker is in the open air. In this case, the transfer function may include a radiation
model, which converts the volume velocity at the lips into a pressure wave in the far
field. These effects may be modelled in terms of the electrical analogy as an inductance
in series with a resistance (Figure 2-11).
The main effect of the radiation model is to produce a high frequency boost of
approximately 6 dB per octave (Flanagan 1972) on the axis of the mouth. The variations
of response for angles within about 60° of the axis are small, but become significant
behind the head. There is also a slight reduction in formant frequencies and increase of
formant bandwidths.
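The 6 dB per octave figure can be illustrated with an idealised calculation. The sketch below assumes that the far-field pressure is proportional to jω times the volume velocity at the lips (a simple-source idealisation, which is an assumption here rather than the detailed model of Figure 2-11), so that doubling the frequency doubles the gain.

```python
import math


def radiation_gain_db(freq_hz):
    """Gain of an ideal differentiator |j * 2 * pi * f| in dB (arbitrary reference)."""
    return 20.0 * math.log10(2.0 * math.pi * freq_hz)


# Gain change over one octave: 20 * log10(2), about 6.02 dB.
boost_per_octave = radiation_gain_db(2000.0) - radiation_gain_db(1000.0)
```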
Figure 2-11 Simple model of radiation load: the source and vocal tract filter feed a
radiation load consisting of an inductive reactance ωL and a resistance R
2.5 Oxygen Mask Load
The crew of military fast-jets always wear oxygen masks, which serve several
purposes. The survival and functioning of a human under reduced atmospheric pressure
depends on the maintenance of a minimum partial pressure of oxygen in his breathing
gas. As the ambient pressure is reduced, the proportion of oxygen in the gas must be
increased, until, at a pressure corresponding to an altitude of about 40,000 ft, pure
oxygen must be supplied. If the ambient pressure drops below this level, the pressure of
the breathing gas must be maintained at a minimum of 130 mmHg to avoid severe
hypoxia (Ernsting 1966). The primary function of the oxygen mask is to allow this
degree of control over the breathing gas mixture and pressure. The cockpits of modern
military aircraft are pressurised to some degree but not to the same extent as civil
airliners.
A secondary function of the mask is to shield the pilot's microphone from the
high levels of noise present in the cockpit, thus aiding communications. In current
designs of mask, the microphone is mounted in the anterior end of the mask, directly in
front of the lips and about 3 cm. away from them (Figure 2-12). The mask gives 10-20
dB of attenuation, depending on how well it fits and whether the expiratory valve is open
or closed. The valve will normally be open when the pilot is speaking. The noise
attenuation is also a function of frequency (James, 1991).
A third function of the mask is to allow pressurisation of the breathing gas for G
protection. It has been shown that increasing the pressure of the breathing gas above
ambient pressure causes a reflex increase in blood pressure (Ernsting 1966) which helps
to maintain consciousness under high levels of vertical acceleration (+Gz). Combat
aircraft currently under development will be capable of manoeuvring at well over 6g, so
means must be supplied to keep the pilot functioning under these conditions.
Placing a small closed (or nearly so) cavity over the mouth seems likely to have a
significant effect on the response of the vocal system, but the literature on this subject is
rather sparse. Early work studied the effects of diving masks. Morrow (1948) studied a
model of vowel production consisting of only two capacitances and two inductances, to
which he added an extra capacitance to model the effect of a non-radiating enclosure
over the mouth. Extensive measurements were made, showing a tendency for the cavity
to increase the frequencies of the formants, but the results were inconsistent. Later
work by the same author (Morrow and Brouns, 1971) used an acoustic impedance
calibrator to determine the effects of various mask cavity sizes in helium under high
pressure as well as air at normal pressure. The sound source in these experiments was
specially constructed to have a high acoustic impedance.
Figure 2-12 RAF P/Q type oxygen mask (DERA Photographic Library)
Singer (1981) examined the effects of the pilot's mask and microphone response
on the performance of LPC vocoders. He considered the possibility that additional
resonances introduced by the mask may lead to the failure of the 10th-order linear
prediction model used in military vocoders. The acoustic effects of the mask were
modelled by replacing the resistance in the radiation model (Figure 2-11) by a
capacitance as shown in Figure 2-13, ignoring the loss due to the expiratory valve. As in
Morrow and Brouns (1971), the vocal tract was assumed to be a high impedance source
in comparison with the load. It is stated that the model predicts a reduction of the
bandwidth for the high frequency formants, changes in formant frequencies and possibly
the appearance of additional formants. Singer (1981) gives no details of the analysis.
The response is also analysed in terms of the ratio of the pressure response at the
lips with the mask to that in free air. This shows a low frequency boost of 12 dB/oct, a
zero in the response where $\omega L = 1/(\omega C)$, and an asymptotic approach to unity gain at
high frequencies. It should be noted that the pressure response at the lips is not
necessarily the same as that at the microphone, which is at the opposite end of the mask
from the lips in most designs of oxygen mask.
Figure 2-13 Model of acoustic load of oxygen mask (after Singer 1981): the source and
vocal tract filter feed an oxygen mask load consisting of reactances ωL and 1/ωC
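The three features of this response can be reproduced with a very simple calculation. Since Singer's analysis is not given, the sketch below rests on stated assumptions: the vocal tract is an ideal high-impedance (volume velocity) source, the mask load is an inductance in series with a capacitance, and the free-air load is dominated by the inductance, the radiation resistance being neglected. Under those assumptions the ratio is 1 - 1/(ω²LC). The component values are normalised and purely hypothetical.

```python
import math


def mask_to_free_air_ratio_db(freq_hz, inductance, capacitance):
    """Magnitude in dB of (jwL + 1/(jwC)) / (jwL) = 1 - 1/(w^2 L C).

    Assumes an ideal volume velocity source and neglects the free-air
    radiation resistance; both are simplifying assumptions.
    """
    w = 2.0 * math.pi * freq_hz
    ratio = abs(1.0 - 1.0 / (w * w * inductance * capacitance))
    if ratio == 0.0:
        return float("-inf")  # exactly at the zero, where wL = 1/(wC)
    return 20.0 * math.log10(ratio)


L_MASK, C_MASK = 1.0, 1.0  # normalised, hypothetical values
f_zero = 1.0 / (2.0 * math.pi * math.sqrt(L_MASK * C_MASK))  # zero of the response

# Well below the zero, halving the frequency raises the ratio by about 12 dB.
low_slope = (mask_to_free_air_ratio_db(f_zero / 200.0, L_MASK, C_MASK)
             - mask_to_free_air_ratio_db(f_zero / 100.0, L_MASK, C_MASK))
# Well above the zero, the ratio approaches 0 dB (unity gain).
high_gain = mask_to_free_air_ratio_db(1000.0 * f_zero, L_MASK, C_MASK)
```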
Singer also analysed the spectra of vowel segments produced in the oxygen mask
fitted with a noise cancelling (pressure gradient) microphone. He reports no additional
resonances and no significant changes to formant frequencies, although the bandwidths
of higher formants were reduced, as predicted. Speech intelligibility studies using the
Diagnostic Rhyme Test procedure compared performance of the standard LPC-10
vocoder on oxygen mask speech with that of a 12th-order LPC algorithm. The standard
vocoder produced a loss of intelligibility of about 10% compared with unprocessed
speech; the LPC-12 algorithm performed no better, tending to confirm that the mask
produced no significant additional resonances within the 4 kHz bandwidth of the vocoder
algorithms.
Later work (Wheeler, Elliott, and Darlington, 1984; Gant, 1986) actually
measured the acoustic impedances of the mask and the vocal tract. A probe was
constructed, consisting of a miniature source and a miniature pressure microphone
mounted close together. The probe was calibrated in a small cavity, and then checked in
a straight, uniform tube, open at one end. The measured impedance of the tube was a
good approximation to the true value, which was easily calculated. The probe was then
used to measure the input impedance of an oxygen mask with the aid of a mannequin
head. Measurements were made with the expiratory valve both open and closed; little
difference was seen between the two conditions.
The probe was also used to measure the impedance of the vocal tract from the
mouth. Subjects were required to close their lips around the probe while keeping their
vocal tract in the configuration used for pronouncing various vowels. It is not stated
whether the glottis was open or closed. The impedance varies with frequency, vocal
tract configuration and the individual, but these measurements showed that it is of the
same order of magnitude as that of the mask over most of the frequency range
considered (up to 5 kHz). It was concluded that the assumption made by Singer that the
vocal tract impedance was much higher was incorrect. However, the internal volume of
the mask used by the US Air Force is greater than that of the mask used by the RAF and
this would result in weaker coupling with the vocal tract, possibly accounting for the
differences between these two studies.
These investigators then proceeded to make recordings of the same subjects in a
mask and in free field conditions, using the same microphone in both cases. The
differences in spectra were calculated for the phones [i], [u], [m], and [S] uttered in
isolation and for a long term average of continuous speech. It was concluded that the
oxygen mask does cause significant changes to the spectral content of speech (as much
as 20 dB in some cases), but the changes are both speaker and utterance dependent. The
low frequency boost was confirmed, however.
Another study (Bond, Moore and Gable, 1988) compared utterances recorded in
an oxygen mask with the same words spoken by the same subjects into a boom
microphone. The recordings were made with ambient noise levels between 85 and 100
dB, with the subject wearing a flying helmet. The results show that the mask causes a
compression of the vowel space in the F1-F2 plane, particularly in the F1 dimension.
This was considered to be a result of the effective lengthening of the vocal tract by some
3 cm. The restriction of jaw movement by the mask was also thought to play a part in
reducing the vowel space.
2.6 Speech Produced under Sustained Acceleration
The accelerations encountered in flight are classified according to the direction in
which they act relative to the aircraft, as shown in Figure 2-14 (Glaister and Prior, 1999,
p133).¹ The x-axis is the fore-and-aft direction, with +x forwards, the y-axis is the
lateral direction, with +y being to the right, and the z-axis is vertical, +z being upwards.
In normal flight, the highest levels of acceleration will occur in the +z direction, as this is
the direction in which the wings are designed to produce lift. Turns are achieved by
banking the aircraft in the desired direction, then pulling back on the control column to
increase the lift generated by the wings, so the acceleration is still +Gz relative to the
aircraft. Modern agile combat aircraft are designed to pull as much as 9g; higher levels
could be achieved but for the problem of keeping the crew conscious and functioning.
Straight and level flight is, of course, 1g; some negative Gz may be encountered in
inverted turns, but this will seldom be greater than -1g.
¹ This is the convention adopted in the field of aerospace medicine. Aeronautical engineers use
a different convention, in which the direction of +Z is reversed (Bramwell, 1976).
Figure 2-14 Convention for reference frame axes (+X, +Y, +Z)
The accelerations encountered on the other axes are relatively low: Gx levels of
the order of +1g will be fairly common, but significant levels of Gy occur only during
spins or other abnormal manoeuvres.
Only one study of the effects of sustained acceleration on the acoustic-phonetic
characteristics of speech gives any details of the results (Bond, Moore and Anderson,
1986). Recordings were made from two male subjects at 1g and 6g; five words were
uttered in isolation five times each at 1g and from two to four times at 6g. The subjects
wore oxygen masks, but no details are given of any other protective clothing. The
recordings were analysed for fundamental frequency, vowel and diphthong formant
frequencies, word and segment duration, and amplitude. At 6g, the mean fundamental
frequency of both speakers in stressed syllables was increased relative to that at 1g, and
the range was greater. In unstressed syllables, one speaker showed no effect while the
other increased his range only. Vowel formant frequencies were measured for /i, e, u/.
The mean values of the first formant frequency of /i/ and /e/ increased under G for both
speakers, while the second formant was lowered for these vowels. In /u/, the two
speakers' first formants changed in opposite directions, while the second formant was
raised for both speakers. No consistent effects were seen in the third formant. Overall,
the general effect was a reduction of the vowel space on the F1-F2 plane. The two
diphthongs studied, /aI/ and /@U/, showed consistent changes: F1 was increased and F2
decreased for both components.
The two subjects had very different speaking styles with respect to word
duration; one showed very little variation between utterances of the same word, while
the other showed large variations, even under normal conditions. The consistent speaker
showed a slight increase in duration for most words at 6g, while the other showed no
consistent pattern. Analysis of segment durations showed that most of the increase of
word duration was accounted for by variation in the lengths of the vowels.
These results appear to show some consistent patterns, but must be treated with
caution because of the small samples involved. Clearly, a more extensive analysis must
be made before any general conclusions can be drawn.
Gulli et al. (1992) also reported analysis of speech recorded under acceleration.
Five male subjects and one female recorded isolated vowels, disyllables, words and short
phrases over a range of G levels. Exact details of the vocabulary and recording
conditions are not given, although it is stated that the subjects wore an oxygen mask, flying
helmet and anti-G trousers. A second phase of the experiment included recordings of
cockpit command phrases of up to eight words, for tests on speech recognition
equipment. During the second phase, the acceleration levels were 1.4, 3, and 6g, but the
acceleration was not continuous. During each run, three 15 s periods of 3g or 6g were
separated by similar duration periods of 1.4g, the whole run lasting just under two
minutes. This is much more like the conditions of air-to-air combat than a continuous
period at a steady acceleration. Although it is not stated, it is likely that the recordings
were made in the same manner in the first phase.
Measurements were made of fundamental frequency, formant frequencies,
spectral slope and total energy using a variety of analysis techniques. Unfortunately,
detailed results were not included in this paper. It is stated that fundamental frequency
and overall energy increased considerably at high G, that high frequencies generally were
reinforced (i.e. a change of spectral slope), and that the higher formants (meaning F3 and
F4) became more variable. The latter finding is related, in a very vague and qualitative
fashion, to displacement of the constriction in the vocal tract towards the glottis, but no
serious attempt is made to describe the effects in detail or to account for them in
articulatory terms. From the point of view of the current work, this paper is very
disappointing, as it concentrates on the signal processing techniques used in the work
and not on the results.
2.7 Speech Produced under Positive Pressure Breathing
As described above, positive pressure breathing may be used to maintain the
pilot's consciousness during high G turns. While very detailed studies have been made of
the physiological effects of pressure breathing (Ernsting, 1966), the author has found no
previous work on its effects on speech production. It is obvious that increasing the
pressure inside the vocal tract relative to that outside will tend to expand the cheeks and
throat (Figure 2-15), increasing the cross-sectional area of at least some parts, and hence
changing the resonant frequencies of the vocal cavity. It is also apparent from listening
to speech produced under these conditions that the production of sounds requiring the
contact of two articulators becomes difficult. This is particularly noticeable in bilabial
and alveolar consonants, but less so in velar consonants.
It is possible to estimate the effects of positive pressure breathing on the formant
frequencies of vowels via an n-tube model such as was described in Section 2.4. The
main difficulty lies in estimating the magnitude of the displacement of the vocal tract
walls under pressure. Chapter 3 describes two models of vowel production under
positive pressure breathing, the first using a simple four tube model, the second using the
Distinctive Regions and Modes model.
Figure 2-15 (left) Subject prior to pressure breathing, (right) Neck distension
during pressure breathing at 70 mmHg (from Ernsting 1966)
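To indicate how an n-tube model can turn a change in cross-sectional area into a change in formant frequencies, the following sketch finds the resonances of a lossless chain of tubes. It is not the model developed in Chapter 3. It assumes a sound speed of 340 m/s, a closed glottis, an ideal open end at the lips (zero pressure, no radiation load) and rigid, lossless walls; the areas, lengths and the 30% expansion are hypothetical values chosen only for illustration.

```python
import math

C_SOUND = 340.0  # speed of sound in m/s (assumed value)


def chain_d(freq_hz, tubes):
    """D element of the lossless chain matrix for tubes listed glottis to lips.

    Each tube is (area, length). Every matrix has the form [[a, j*b], [j*c, d]]
    with a, b, c, d real, so real arithmetic suffices. Resonances are the
    zeros of D. The factor rho*c is set to 1 because it cancels in D.
    """
    k = 2.0 * math.pi * freq_hz / C_SOUND
    a, b, c, d = 1.0, 0.0, 0.0, 1.0  # identity matrix
    for area, length in tubes:
        cs, sn = math.cos(k * length), math.sin(k * length)
        a2, b2, c2, d2 = cs, sn / area, sn * area, cs
        a, b, c, d = (a * a2 - b * c2, a * b2 + b * d2,
                      c * a2 + d * c2, d * d2 - c * b2)
    return d


def formants(tubes, f_max=4000.0, step=5.0):
    """Resonance frequencies (Hz) below f_max, by grid search and bisection."""
    roots = []
    f_lo = 0.5 * step
    g_lo = chain_d(f_lo, tubes)
    while f_lo < f_max:
        f_hi = f_lo + step
        g_hi = chain_d(f_hi, tubes)
        if g_lo * g_hi < 0.0:  # a sign change brackets a root
            lo, hi = f_lo, f_hi
            for _ in range(40):
                mid = 0.5 * (lo + hi)
                if chain_d(lo, tubes) * chain_d(mid, tubes) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            roots.append(0.5 * (lo + hi))
        elif g_hi == 0.0:  # root falls exactly on a grid point
            roots.append(f_hi)
        f_lo, g_lo = f_hi, g_hi
    return roots


# Hypothetical four-tube shape (m^2, m), glottis to lips.
normal = [(2.0e-4, 0.0425), (1.0e-4, 0.0425), (4.0e-4, 0.0425), (3.0e-4, 0.0425)]
# Hypothetical pressure breathing case: the two rear tubes expand by 30%.
expanded = [(a * 1.3, l) if i < 2 else (a, l) for i, (a, l) in enumerate(normal)]
formants_normal = formants(normal)
formants_expanded = formants(expanded)
```

In this lossless idealisation only the ratios of adjacent areas matter, so a uniform expansion of the whole tract would leave the formants unchanged; it is the non-uniform expansion of particular regions that shifts them.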