
Available online at www.sciencedirect.com

Computer Speech and Language 28 (2014) 580–597

Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise?☆
Maëva Garnier ∗, Nathalie Henrich
Department of Speech and Cognition, GIPSA-Lab (UMR 5216: CNRS, INPG, University Stendhal, UJF), Grenoble, France
Received 14 October 2012; received in revised form 26 July 2013; accepted 29 July 2013
Available online 15 August 2013

Abstract
What makes speech produced in the presence of noise (Lombard speech) more intelligible than conversational speech produced
in quiet conditions? This study investigates the hypothesis that speakers modify their speech in the presence of noise in such a way
that acoustic contrasts between their speech and the background noise are enhanced, which would improve speech audibility.
Ten French speakers were recorded while playing an interactive game first in quiet condition, then in two types of noisy conditions
with different spectral characteristics: a broadband noise (BB) and a cocktail-party noise (CKTL), both played over loudspeakers
at 86 dB SPL.
Similarly to Lu and Cooke (2009b), our results suggest no systematic “active” adaptation of the whole speech spectrum or vocal
intensity to the spectral characteristics of the ambient noise. Regardless of the type of noise, the gender or the type of speech segment,
the primary strategy was to speak louder in noise, with a greater adaptation in BB noise and an emphasis on vowels rather than any
type of consonants.
Active strategies were evidenced, but were subtle and of second order to the primary strategy of speaking louder: for each gender,
fundamental frequency (f0) and first formant frequency (F1) were modified in cocktail-party noise in a way that optimized the release
in energetic masking induced by this type of noise. Furthermore, speakers showed two additional modifications as compared to
shouted speech, which therefore cannot be interpreted in terms of vocal effort only: they enhanced the modulation of their speech
in f0 and vocal intensity and they boosted their speech spectrum specifically around 3 kHz, in the region of maximum ear sensitivity
associated with the actor’s or singer’s formant.
© 2013 Elsevier Ltd. All rights reserved.

Keywords: Lombard speech; Noise; Production; Speech audibility; Auditory detection; Segregation; Energetic masking

1. Introduction

Noise exposure triggers an adaptation in speech production, commonly referred to as the Lombard effect. When
communicating in noisy environments, speakers commonly increase vocal intensity and fundamental frequency (f0)
as compared to communicating in quiet environments (Castellanos et al., 1996; Junqua, 1993; Van Summers et al.,
1988). Speech produced in noise (also called Lombard speech) is also characterized by a higher first-formant frequency
of vowels (F1), boosted energy above 2 kHz and increased vowel/consonant (V/C) ratio in both vocal intensity and
duration (Boril and Pollak, 2005; Castellanos et al., 1996; Egan, 1972; Junqua, 1993; Kadiri, 1998; Mokbel, 1992;
Stanton et al., 1988; Van Summers et al., 1988).

☆ This paper has been recommended for acceptance by ‘Dr. Martin Cooke’.
∗ Corresponding author. Tel.: +33 4 76 57 50 61.
E-mail address: maeva.garnier@gipsa-lab.grenoble-inp.fr (M. Garnier).

0885-2308/$ – see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.csl.2013.07.005

Lombard speech has been shown to be more intelligible than conversational speech produced in quiet condition
(Dreher and O’Neill, 1957; Lu and Cooke, 2008; Pittman and Wiley, 2001; Van Summers et al., 1988). It remains
unclear which speech modifications contribute to this gain in intelligibility and which aspects of intelligibility are
improved by this speech adaptation (phoneme recognition, speech audibility, utterance parsing, etc.).
In this article, we focus on whether and how speakers may try to improve their audibility in noise, i.e. the detection and
perception of speech information by their interlocutor within the background noise. Before envisaging and suggesting
some possible strategies, let us first review the current knowledge on the mechanisms and factors that influence speech
perception in noise.
First, it is known that the audibility of a sound is considerably degraded when it is heard simultaneously with a
competing noise or sound stream that contains energy in the same critical frequency bands. The energetic-masking
effect increases with increased spectral overlap and decreased signal-to-noise ratio (SNR) between the target sound and
the masker (Hornsby and Ricketts, 2001; French and Steinberg, 1947). In the case of speech, multi-talker noise degrades
the perception of vowels more than consonants, whereas white Gaussian noise has the opposite effect (Junqua, 1993).
Similarly, speech is more degraded by a competing speech produced by a speaker of the same gender (Brungart, 2001).
Auditory fusion is another perceptual phenomenon that occurs when two or more sound streams are heard at the
same time. The concurrent streams are interpreted as coming from the same source when they present similar acoustic
characteristics (such as the average intensity, pitch and timbre, but also in the temporal modulation of these parameters).
They are segregated from each other when acoustic contrasts exceed a given threshold (Darwin et al., 2003).
Both phenomena of energetic masking and auditory fusion underlie the “cocktail-party effect”, i.e. the difficulty
of following a voice and understanding what is said within a multitude of other competing voices (Arons, 1992).
Psychoacoustic research showed how the segregation of a target speech from another competing speech is particularly
difficult when both voices are similar in spectral content and fundamental frequency (f0) (Assmann and Summerfield,
1990), first-formant frequency (F1) (Darwin et al., 2003), when the target voice is not modulated in f0 (Marin and
McAdams, 1991), and when it is compressed in amplitude (Hornsby and Ricketts, 2001).
What could speakers then do to improve their speech audibility and segregation in noise?

(1) Speakers may try to decrease the amount of energetic masking and enhance acoustic contrasts by increasing the
global vocal intensity of their speech, and more specifically the spectral energy in frequency regions where the
background noise presents maximum energy (boosting strategies).
(2) They may try to shift the spectral energy, or at least important phonetic cues coded in frequency (f0, formants), to
spectral bands where the background noise presents minimum energy (bypass strategies).
(3) They may try to increase the temporal modulation of their speech in f0 and vocal intensity (modulation strategies).

Evidence of boosting strategies was provided by Mokbel (1992) and Junqua et al. (1998). Mokbel (1992) demonstrated
that a speaker enhanced his speech energy more in the frequency band where the noise was concentrated. In
a single-talker experiment comparing speech adaptation to broadband noises filtered by different band-pass filters,
Junqua et al. (1998) showed that, at constant masker level, the increase of vocal intensity varies with noise spectral
tilt. Bypass strategies were observed in recent studies (Lu and Cooke, 2008, 2009a,b) that showed how the center of
gravity (CoG) of the speech spectrum increases in frequency when speaking in low-pass noises (multi-babble noise,
driving noise or low-pass filtered broadband noise). However, they did not observe such bypass strategies in high-pass
filtered noises (Lu and Cooke, 2009a,b), in which the speech CoG and additional spectral cues (f0, F1) were not shifted
down but were still shifted to higher frequencies. No study has specifically explored the use of modulation strategies in
noise. Nevertheless, enhanced intonation contours and wider f0 ranges in noise have been reported by several authors
(Boril and Pollak, 2005; Garnier et al., 2006, 2010; Welby, 2006).
In line with these previous studies, this study aims at examining whether speakers adopt boosting, bypass or
modulation strategies in noise to adapt to the spectral characteristics of the background noise and enhance acoustic
contrasts between their speech and that noise. Two different types of noise frequently encountered in ecological
situations were chosen here to test speech adaptation to varying energetic masking: a broadband noise (with equally
distributed energy below 10 kHz) and a cocktail-party noise (with concentrated energy below 1 kHz). Similarly to
previous studies, evidence of boosting, bypass or modulation strategies was sought in the global adaptation of speech
intensity and spectrum. In addition, more specific and local adaptations were sought (1) in the modification of spectral
cues such as f0 and F1, (2) in the temporal modulation of f0 and vocal intensity, and (3) in the potentially different effects
that noise type can have on the speech modifications made by both genders and for different types of speech segments.

Table 1
List and phonetic transcription of the 16 French target words used in the study.
Bijou [biʒu]      Chausson [ʃosɔ̃]   Cochon [koʃɔ̃]    Dauphin [dofɛ̃]
Fusil [fyzi]      Gitans [ʒitã]      Guenon [gønɔ̃]    Lagon [lagɔ̃]
Marie [mɛRi]      Navet [navɛ]       Panda [pãda]      Requin [Rekɛ̃]
Sommet [sɔmmɛ]    Toupie [tupi]      Vallée [vale]     Zébu [zeby]

2. Materials and methods

2.1. Speech production experiment

The corpus is similar to the experiments presented in Garnier (2008) and Garnier et al. (2010).

2.1.1. Participants
Ten native French speakers (five men and five women) aged 20–28 years old took part in the recording. Only one
of them had some basic knowledge about the Lombard effect. None reported any speech or hearing difficulties.

2.1.2. Task
To account for the effect of communicative interaction on the Lombard effect, speakers were recorded while
playing a collaborative game in pairs. The game was inspired by the Map Task game (Brown et al., 1983). Its rules are
summarized here, and more details can be found in Garnier et al. (2010). The speakers had to exchange information
about 16 items drawn on their map, so as to reconstruct a path that linked the items. The items corresponded to target
words comprised of two CV (consonant–vowel) syllables (see Table 1). They were selected to represent most of the
French phonemes (all of the vowels except [œ], all the consonants except [ŋ] and none of the semi-vowels [w], [j]
and [ɥ]). The experimental condition aimed at reproducing as much as possible a realistic situation of face-to-face
interaction in noisy conditions. Speakers were seated two meters from and facing each other. They could use audio
and visual information from the face only, since hands and head movements were constrained. No carrier sentence
was imposed, so as to preserve spontaneity. The speech content could not be predicted so that speakers had to adjust
their intelligibility level. An example of utterances produced by the speech partners (translated from French) is given
below.1 As in realistic noisy conditions, speakers were sometimes unsuccessful in their communication and had to
repeat or reformulate their utterance. These repetitions were included in the long-term acoustical analysis (long-term
average spectrum, distribution of f0 and vocal intensity over the game duration). Only the first occurrence of the target
words was considered for the analysis of syllables and segments.

2.1.3. Experimental conditions


Speakers played the game in a sound-treated booth in three sound conditions: (1) quiet, (2) 86 dB SPL of broadband
noise (BB) and (3) 86 dB SPL of non-intelligible cocktail-party noise (CKTL). The two types of noise were selected
from the BD Bruit database (Zeiliger et al., 1994). Their spectral characteristics are given in Fig. 1. In the BB noise,
spectral energy is attenuated above 10 kHz. The CKTL noise is made from the non-intelligible speech of four males
and four females. The spectral energy is concentrated below 800 Hz. It presents two maxima around 170 and 500 Hz,
and a local minimum around 340 Hz. Noise files were 3 min long. Noise started with a fade-in and was turned off once
the two speakers had completed the game after approximately 2–3 min.
To avoid perturbing the speakers’ self-monitoring feedback and their perception of the background noise, the two
noises were played over two loudspeakers (Tannoy System 600) instead of headphones. Loudspeakers were positioned
1.5 m from the speakers in each lateral direction and at the level of their ears. Noise levels were calibrated using a
1/2-inch pressure microphone (B&K 4165) and an artificial head placed where the speaker would be seated inside the
booth.

Fig. 1. Spectral characteristics of the broadband noise (BB) and cocktail-party noise (CKTL).

1 Leader: “The first item is the shark (/Rəkɛ̃/ in French).”
Follower: “Well... the shark is associated with the summit (/sɔmmɛ/ in French)”
Leader: “with the what?”
Follower: “with the summit!”
Follower: “Yes”
Leader: “Ah, ok. Hmm... Then I’m going to the pig (/koʃɔ̃/ in French)”

2.1.4. Audio recordings


The audio speech signal was recorded with a cardioid headset microphone (Beyerdynamic Opus 54) placed about
5 cm in front of the mouth. The signal was pre-amplified (RME Octamic) and sampled at 44.1 kHz and 16 bits (RME
ADI 8 Pro converter and RME DIGI 9652 HDSP sound card). Speech intensity was calibrated prior to the experiment
by recording the audio signal of a sustained vowel produced by the speaker, and by measuring the corresponding sound
pressure level (SPL) at the microphone with a digital Sound Level Meter.
Noise was removed from the speech recordings using a dedicated noise-canceling method (Ternström et al., 2002).
Before each noisy condition, 10 s of white noise was played into the loudspeakers and recorded at the microphone with
the speaker remaining quiet, which enabled the estimation of the impulse response of the loudspeaker-to-microphone
channel. For each noisy condition, it was then possible to estimate the entire noise signal that would have been recorded
at the microphone if the speaker had remained quiet, and to subtract this estimation, in the time-domain, from the actual
noisy recording of speech. The result of this subtraction gave the audio signal of speech produced in noise, with little
remaining signal from the surrounding noise. Such a method can be used in laboratory conditions where the noise
signal is perfectly known, and where loudspeakers and microphones remain at the same place. However, the position
of the speaker in the room also affects the channel estimation. As a consequence, it is necessary to restrain speakers’
movements to maximize the denoising performance and to guarantee the validity of acoustic measurements. When
head movements are restrained, the possible bias introduced by the denoising is less than 0.2 dB for intensity of voiced
segments, 0.7 dB for unvoiced ones, 0.13 tones for f0, 10 Hz for the first formant frequency and 3 Hz for the centroid
of the speech spectrum (Garnier et al., 2010). In this experiment participants remained still during the 2–3 min of noise
exposure. In between each speaking condition the participants were allowed to move and relax. The headset microphone
was firmly attached to the participant’s head to avoid any movement of the face away from the microphone. The map
used for the interactive game was placed high enough on a stand so that the speakers could see both the map and their
speech partner by moving their eyes but not their head. Writing on the map could be achieved with wrist movements
only.
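
The two steps described above (estimating the loudspeaker-to-microphone channel from a known probe noise, then subtracting the predicted noise in the time domain) can be sketched as follows. This is only a minimal illustration in Python with NumPy, not the procedure of Ternström et al. (2002): the least-squares FIR fit, the function names `estimate_fir` and `remove_known_noise`, the filter length and the synthetic signals are all assumptions made for the example.

```python
import numpy as np

def estimate_fir(played, recorded, n_taps):
    """Least-squares FIR estimate of the loudspeaker-to-microphone channel (assumed model)."""
    # Column k holds the played signal delayed by k samples.
    cols = [np.concatenate([np.zeros(k), played[:len(played) - k]]) for k in range(n_taps)]
    X = np.stack(cols, axis=1)
    h, *_ = np.linalg.lstsq(X, recorded[:len(played)], rcond=None)
    return h

def remove_known_noise(noisy_recording, noise_played, h):
    """Subtract, in the time domain, the noise predicted at the microphone."""
    predicted = np.convolve(noise_played, h)[:len(noisy_recording)]
    return noisy_recording - predicted

rng = np.random.default_rng(0)
true_h = np.array([0.0, 0.6, 0.3, -0.1])           # hypothetical channel
probe = rng.standard_normal(2000)                   # white-noise probe, speaker silent
probe_at_mic = np.convolve(probe, true_h)[:len(probe)]
h_est = estimate_fir(probe, probe_at_mic, n_taps=8)

noise = rng.standard_normal(4000)                   # known noise played during the task
speech = np.sin(2 * np.pi * 0.01 * np.arange(4000))  # stand-in for the speech signal
recording = speech + np.convolve(noise, true_h)[:4000]
cleaned = remove_known_noise(recording, noise, h_est)
```

In this idealized simulation the channel does not change between the probe and the task, which is why the residual noise is negligible; as noted in the text, any movement of the speaker alters the real channel and degrades the subtraction.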

2.2. Analysis

Sentences, syllables and segments of the target words were manually segmented from the audio signal using Praat
software. All the acoustic analyses were made with MATLAB.

2.2.1. Long-term spectral and frequency descriptors


The long-term average spectrum (LTAS) was computed from 0 to 6 kHz over the concatenated utterances produced
in each experimental condition by each speaker (∼1 min 30 s). The LTAS center of gravity (CoG) was measured from
speech normalized in intensity in order to compare the energy distribution between conversational and Lombard speech
spectra. The mean energies (in dB SPL) in the 0–1 kHz frequency band (corresponding to f0 and F1), in the 2–4 kHz
band (corresponding to the actor’s formant area), and in the remaining 1–2 kHz and 4–6 kHz bands were derived from
this LTAS.
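
As an illustration of these long-term descriptors, the sketch below computes an averaged power spectrum, its center of gravity up to 6 kHz and mean band levels. It is a hedged example in Python with NumPy, not the MATLAB analysis used in the study: the frame length, the Hann window, the non-overlapping framing and the function names are assumptions, and the band levels are in dB relative to an arbitrary reference (a calibration step would be needed to obtain dB SPL).

```python
import numpy as np

def ltas(signal, fs, n_fft=1024):
    """Average power spectrum over non-overlapping Hann-windowed frames."""
    n_frames = len(signal) // n_fft
    frames = signal[:n_frames * n_fft].reshape(n_frames, n_fft) * np.hanning(n_fft)
    power = np.mean(np.abs(np.fft.rfft(frames, axis=1)) ** 2, axis=0)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    return freqs, power

def center_of_gravity(freqs, power, f_max=6000.0):
    """Power-weighted mean frequency of the spectrum up to f_max."""
    keep = freqs <= f_max
    return float(np.sum(freqs[keep] * power[keep]) / np.sum(power[keep]))

def band_level_db(freqs, power, f_lo, f_hi):
    """Mean power in [f_lo, f_hi), in dB re an arbitrary reference."""
    band = (freqs >= f_lo) & (freqs < f_hi)
    return float(10.0 * np.log10(np.mean(power[band])))

# Synthetic check: a 1 kHz tone with a little noise instead of real speech.
fs = 16000
t = np.arange(2 * fs) / fs
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 1000.0 * t) + 0.01 * rng.standard_normal(t.size)

freqs, power = ltas(signal, fs)
cog = center_of_gravity(freqs, power)
bands = [(0, 1000), (1000, 2000), (2000, 4000), (4000, 6000)]
levels = {band: band_level_db(freqs, power, *band) for band in bands}
```

Because the center of gravity is a ratio of power-weighted sums, it does not change when the signal is rescaled, which is consistent with measuring it on intensity-normalized speech.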
The fundamental frequency (f0) was estimated by autocorrelation over the voiced parts of these concatenated
utterances. The distribution of f0 values was computed using a quantization step of 5 Hz and was normalized so that
it summed to 100%. The mode of the f0 distribution was detected to estimate the f0 value that was produced
the most frequently by a speaker in a given condition. The amplitude of f0 modulation was expressed in tones by
discarding the f0 values that corresponded to the lowest 15% of the distribution (i.e. the least frequent occurrences of f0)
and then taking the width of the remaining distribution.
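
One possible reading of this procedure is sketched below in Python with NumPy. Several points are assumptions rather than details given in the text: the least frequent 5 Hz bins are discarded until at most 15% of the distribution has been removed, the width is taken between the lowest and highest remaining bin centers, and a "tone" is taken to be a whole tone (six per octave). The function names are hypothetical.

```python
import numpy as np

def f0_distribution(f0_hz, step_hz=5.0):
    """Histogram of f0 values in step_hz-wide bins, normalized to sum to 100%."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    start = np.floor(f0_hz.min() / step_hz) * step_hz
    n_bins = int(np.floor((f0_hz.max() - start) / step_hz)) + 1
    edges = start + step_hz * np.arange(n_bins + 1)
    counts, _ = np.histogram(f0_hz, bins=edges)
    percent = 100.0 * counts / counts.sum()
    centers = edges[:-1] + step_hz / 2.0
    return centers, percent

def f0_mode(centers, percent):
    """Center of the most populated bin."""
    return float(centers[np.argmax(percent)])

def modulation_width_tones(centers, percent, discard_percent=15.0):
    """Width (in whole tones) of the distribution after dropping the rarest bins."""
    order = np.argsort(percent)                 # least frequent bins first
    cumulative = np.cumsum(percent[order])
    kept = order[cumulative > discard_percent]  # bins that survive the cut
    lo, hi = centers[kept].min(), centers[kept].max()
    return float(6.0 * np.log2(hi / lo))

# Toy data: 50% of frames near 102 Hz, 40% near 202 Hz, 10% near 402 Hz.
f0_values = np.array([102.0] * 50 + [202.0] * 40 + [402.0] * 10)
centers, percent = f0_distribution(f0_values)
mode_hz = f0_mode(centers, percent)
width_tones = modulation_width_tones(centers, percent)
```

In the toy data, the rare excursion near 402 Hz accounts for only 10% of the frames, so it is discarded and does not inflate the modulation width.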

2.2.2. Syllable descriptors


Mean f0 and mean vocal intensity were measured for each CV syllable of the target word. Female speakers have
higher f0 and greater intra-speaker f0 variations (in Hertz) than males. To account for the gender difference and allow
intra-speaker comparisons, f0 was expressed in tones (relative to fref = 50 Hz). The magnitude of amplitude modulation was
calculated for each CV syllable as the difference between the maximal intensity of the vowel and the minimal intensity
of the preceding consonant.
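
The two syllable descriptors can be illustrated with a short Python sketch. The conversion assumes that a "tone" denotes a whole tone (two semitones, hence six tones per octave), which the text does not state explicitly; the function names and the example intensity values are hypothetical.

```python
import math

F_REF_HZ = 50.0          # reference frequency given in the text
TONES_PER_OCTAVE = 6.0   # assumption: one tone = one whole tone = two semitones

def hz_to_tones(f0_hz, f_ref_hz=F_REF_HZ):
    """Express f0 on a logarithmic scale, in tones above the reference frequency."""
    return TONES_PER_OCTAVE * math.log2(f0_hz / f_ref_hz)

def syllable_amplitude_modulation(consonant_db, vowel_db):
    """Maximal vowel intensity minus minimal intensity of the preceding consonant (dB)."""
    return max(vowel_db) - min(consonant_db)

low_voice = hz_to_tones(100.0)    # one octave above 50 Hz
high_voice = hz_to_tones(200.0)   # two octaves above 50 Hz
modulation_db = syllable_amplitude_modulation([55.0, 52.0, 58.0], [70.0, 74.0, 71.0])
```

On this logarithmic scale, a given musical interval corresponds to the same number of tones whatever the speaker's average f0, which is what makes male and female f0 variations comparable.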

2.2.3. Segment descriptors


For each talker and each condition, the mean frequency of the first formant was semi-automatically measured for
the vowels [a] (4 occurrences), [i] (5 occurrences) and [u] (2 occurrences) of the target word, using a conventional
autocorrelation-based LPC method.
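
For readers unfamiliar with the technique, the sketch below shows the general principle of autocorrelation-based LPC formant estimation in Python with NumPy. It is not the authors' implementation: the model order, the Hamming window, the root-picking rule and the function names are assumptions, and the manual checking implied by "semi-automatically" is not represented. The example recovers the resonance of a synthetic second-order resonator rather than a real vowel.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC: solve the normal equations R a = r."""
    w = frame * np.hamming(len(frame))
    r = np.array([np.dot(w[:len(w) - k], w[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def formants_hz(a, fs):
    """Resonance frequencies from the roots of 1 - a1 z^-1 - ... - ap z^-p."""
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]           # keep one root per conjugate pair
    return np.sort(np.angle(roots) * fs / (2.0 * np.pi))

# Synthetic signal: white noise filtered by a resonator centered on 700 Hz.
fs = 10000
rng = np.random.default_rng(0)
radius, f_res = 0.97, 700.0
theta = 2.0 * np.pi * f_res / fs
a1, a2 = 2.0 * radius * np.cos(theta), -radius ** 2
excitation = rng.standard_normal(20000)
y = np.zeros_like(excitation)
y[0] = excitation[0]
y[1] = excitation[1] + a1 * y[0]
for n in range(2, len(y)):
    y[n] = excitation[n] + a1 * y[n - 1] + a2 * y[n - 2]

f1_estimate = formants_hz(lpc_coefficients(y, order=2), fs)[0]
```

For real vowels a higher model order is needed to capture several formants, and the lowest plausible resonance would then be taken as F1.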
Segment mean duration and mean vocal intensity were measured for vowels (32 measurements), sonorants with
a formant structure ([n], [m] and [l]: 8 measurements), and unvoiced consonants ([f], [s], [ʃ], [p], [t] and [k]: 12
measurements) of the target word.

2.2.4. Statistical analysis


Using SPSS software, a one-way analysis of variance (ANOVA) with repeated measures was conducted on each
parameter, considering one mean value per speaker and per condition.
Main effects were tested first for the factor CONDITION (three levels: CONDITION 1 – quiet (i.e. no noise),
CONDITION 2 – BB noise at 86 dB SPL, CONDITION 3 – CKTL noise at 86 dB SPL) and the inter-subject factor
GENDER (two levels: 1-female, 2-male). The statistical interaction between these two factors was also tested.
Secondly, specific contrasts were examined using Bonferroni adjustments. The effect of noise exposure on speech
production was tested as the contrast between quiet (CONDITION 1) and the two types of noise (CONDITION 2 and
3). The effect of noise type on speech modification was tested as the contrast between BB noise (CONDITION 2) and
CKTL noise (CONDITION 3). The influence of gender on speech modification in noise (GENDER × (CONDITION
1 vs. CONDITIONS 2–3)) and on the effect of noise type (GENDER × (CONDITION 2 vs. CONDITION 3)) were
also tested.
All the results are reported in Figs. 2–4 and 6–8. The conventional notation was adopted for indicating statistical
significance of the ANOVA tests: *** for p < .001, ** for p < .01, * for p < .05 and ns (not significant) for p > .05.

3. Results

This section explores the three types of strategies mentioned in the introduction that speakers may adopt to improve
their speech audibility and segregation from a background noise (boosting, bypass, modulation). Where necessary, the
hypothesis is stated again, together with the expected results for a given strategy. The observations are then described
and compared to the expectations.

Fig. 2. Effect of noise exposure on syllable mean intensity as a function of noise type (Broadband noise (BB), Cocktail-party noise (CKTL)) and
speaker’s gender. On the left graph, error bars indicate the standard deviation over the five speakers of each gender. The black horizontal line across
the three panels represents the level of BB and CKTL noises (86 dB SPL). It shows how the increase of vocal intensity in noisy conditions enables
the speakers to have a positive signal to noise ratio, whereas it would be negative if they did not adapt from quiet to noise. The table on the right side
summarizes the main effects and interaction of the factors CONDITION and GENDER on the syllable mean intensity. Specific contrasts tested the
effect of noise exposure ((BB and CKTL noises) vs. quiet) and noise type (BB vs. CKTL), as well as the influence of gender (Females vs. Males)
on these effects.

3.1. Boosting strategies

3.1.1. Adaptation of speech intensity to the degree of energetic masking


Fig. 2 shows the variation of syllable intensity with noise exposure, noise type and speaker’s gender. As expected,
syllable intensity increased significantly when speakers adapted from quiet (i.e. 40 dB SPL of room noise) to 86 dB
SPL of noise, regardless of the type of noise. The average increase was 16.6 ± 3.1 dB (footnote 2; see Fig. 2). Despite the increase
of syllable intensity in noise, the SNR decreased on average by 29.4 dB between speech produced in quiet
and that produced in noise. However, it would have decreased by 46 dB if the speakers had not adapted their
vocal intensity in noise. Fig. 2 shows how speech modification contributes to maintaining a positive SNR in noise of
12.5 ± 3.7 dB on average, whereas, without adaptation, it would have been negative for all but one male speaker.
The CKTL noise is made of simultaneous voices and therefore induces greater energetic masking of speech than does
BB noise. However, speakers did not increase their syllable intensity more in CKTL noise. On the contrary, the increase
of syllable intensity was significantly greater in BB noise (17.7 ± 2.7 dB) than in CKTL noise (15.4 ± 3.8 dB)
(see Fig. 2).
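
The SNR figures quoted in this paragraph follow from simple level arithmetic, reproduced here as a small Python check. The variable names are illustrative; the numerical values are those reported in the text.

```python
QUIET_NOISE_DB_SPL = 40.0          # room noise in the quiet condition
NOISE_DB_SPL = 86.0                # BB or CKTL noise level
MEAN_INTENSITY_INCREASE_DB = 16.6  # average increase of syllable intensity in noise

# The background level rises by 46 dB from quiet to noise.
noise_increase_db = NOISE_DB_SPL - QUIET_NOISE_DB_SPL

# Without vocal adaptation, the SNR would drop by the full noise increase.
snr_change_without_adaptation_db = -noise_increase_db

# With adaptation, the drop is reduced by the increase in speech level.
snr_change_with_adaptation_db = MEAN_INTENSITY_INCREASE_DB - noise_increase_db
```

The Lombard increase of vocal intensity therefore compensates for roughly a third of the 46 dB rise in background level.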
Furthermore, the long-term average spectrum of the CKTL noise demonstrates a first energy peak around 170 Hz
and a globally more similar envelope to the female voice spectrum. In comparison, BB noise has a flat spectrum, and
thus does not induce greater energetic masking on one gender over another. However, no significant interaction was
observed between the noise type and the speaker’s gender: females did not increase their syllable intensity more than
males in either BB noise or in CKTL noise (see Fig. 2).

3.1.2. Adaptation of segment intensity and duration to the degree of energetic masking
Vowels and sonorants have a greater spectral overlap with the CKTL noise, compared with the BB noise. On the
contrary, the spectrum of unvoiced fricatives ([s], [ʃ], [f]) and stop consonants ([p], [t] and [k]) is more similar to the
BB noise spectrum. Consequently, if speakers adopted boosting strategies to compensate for the degree of energetic
masking, we could expect that:

- [h1]: the variation of sonorants’ intensity and duration follows the same tendency as that of vowels, because they demonstrate
a comparable spectrum.

2 This standard deviation is calculated from the mean modification observed in each participant (thus over 10 values).

Fig. 3. Effect of noise exposure on segment intensity (a) and duration (b), as a function of noise type (Broadband noise (BB), Cocktail-party noise
(CKTL)) and segment type (vowels, sonorants [n], [m] and [l], and unvoiced plosive and fricative consonants).
On the left graphs, each bar represents the average variation of intensity or duration from quiet to noise, as well as the inter-speaker variability in
this adaptation.
The table on the right side summarizes the main effect of the factor CONDITION on segment intensity and duration. Specific contrasts tested the
effect of noise exposure ((BB and CKTL noises) vs. quiet) and noise type (BB vs. CKTL).

- [h2]: there is an interaction between the segment type and the noise type. Vowels and sonorants would be more
emphasized in intensity and duration in CKTL noise compared to BB noise [h2a], and more emphasized than
unvoiced consonants in CKTL noise [h2b]. On the contrary, unvoiced consonants may be more emphasized in
intensity and duration in BB noise compared to CKTL noise [h2c], and more emphasized than vowels and sonorants
in BB noise [h2d].

Fig. 3 shows how speakers modified the vocal intensity and the duration of vowels, sonorants and unvoiced fricatives
and plosives in BB and CKTL noises. Results of statistical analysis are given in the table on its right side.
The results did not support our hypotheses.
Sonorants’ intensity and duration were not found to be modified in noise in a similar way to vowels (hypothesis
[h1]) but rather similarly to unvoiced consonants. The intensity of sonorants and unvoiced consonants increased on
average by 12.3 dB and 12.8 dB respectively, whereas vowels’ intensity increased by 17.3 dB. Vowels were significantly
lengthened (on average by 33 ± 12 ms), whereas unvoiced fricatives and plosives were significantly shortened
(by 10 ± 7 ms). Sonorants were shortened by 6 ms, although not significantly.
In contradiction to our hypotheses [h2], no significant interaction between segment type and noise type was found in the
variation of segment intensity and duration with noise exposure. Instead, the following general tendencies were
observed:

- Regardless of the type of noise and the type of consonants (unvoiced ones or sonorants), vocal intensity increased
more for vowels than consonants, and segment duration increased for vowels whereas it tended to decrease for
consonants. As a consequence, in all cases, the vowel-to-consonant ratio was found to increase in noise in both
intensity (4.7 ± 1.8 dB) and duration (41 ± 10 ms).
- Vowel intensity and duration increased more in BB noise (by 18.7 ± 3.1 dB) than in CKTL noise (by
15.8 ± 3.7 dB) (contrary to hypothesis [h2a]), whereas no significant effect of noise type was found on consonants’ intensity
and duration (contrary to hypothesis [h2c]).

Fig. 4. Effect of noise exposure on speech spectrum, as a function of noise type (Broadband noise (BB), Cocktail-party noise (CKTL)) and speaker’s
gender.
On the left graphs, lines represent the spectrum envelope stylized from the energy measured in the four frequency bands 0–1 kHz, 1–2 kHz, 2–4 kHz
and 4–6 kHz. Speech signals were normalized in intensity. Error bars indicate the standard deviation over the five speakers of each gender.
The table on the right side summarizes the main effects and interaction of the factors CONDITION and GENDER on the energy in the 0–1 kHz,
1–2 kHz, 2–4 kHz and 4–6 kHz frequency bands. Specific contrasts tested the effect of noise exposure ((BB and CKTL noises) vs. quiet) and noise
type (BB vs. CKTL), as well as the influence of gender (Females vs. Males) on these effects.

3.1.3. Specific boost of the speech spectrum in regions of maximum energetic masking
As shown in Fig. 1, the CKTL noise presents maximum energy below 1 kHz, then decreased energy with increasing
frequencies above 1 kHz. The BB noise presents equal energy below 10 kHz. If the speakers boosted the energy of their
speech in the frequency bands where the background noise presents maximum energy, a greater increase of spectral
energy in the 0–1 kHz frequency band should be found in CKTL noise compared to BB noise.
Fig. 4 shows how male and female speakers modified the energy distribution of their voice spectrum in BB and
CKTL noises. Results of statistical analysis are given in the table on its right side.
At normalized intensities, the spectra of voices produced in CKTL noise did not present enhanced energy in the
0–1 kHz band, compared to voices produced in BB noise. Noise type did not show any significant influence on the
distribution of voice energy in higher frequency bands either. Instead, the following general tendencies were observed:

- For both types of noise, female voices showed a similar amount of energy in the 0–1 kHz band compared to
voices produced in quiet condition, whereas Lombard male voices showed slightly less energy in the 0–1 kHz band,
compared to voices produced in quiet condition.

Fig. 5. Examples of typical spectrum envelopes observed in quiet and 86 dB SPL of noise for the vowels [a], [i] and [u]. The selected graphs
correspond to female productions from different speakers for each vowel, but from the same speaker in quiet and cocktail-party noise (CKTL). For
comparison purposes, spectra have been normalized to their maximum amplitude. The spectrum enhancement observed between 2 and 4 kHz in
noise involves different formants for the vowels [a], [i] and [u]. Two clustering strategies were observed for the vowel [i].

- For both males and females, and for both types of noise, the voice spectrum was significantly boosted in the 1–2 kHz
and 2–4 kHz bands but decreased in energy at high frequencies (above 4 kHz for males, and above 6 kHz for females),
compared to conversational voice produced in quiet condition.

Fig. 5 gives an example of typical spectral envelopes observed in this study for vowels produced by female speakers
in quiet and noisy conditions. These graphs show how the boosted energy of Lombard speech in the 1–4 kHz region
comes not only from a flatter spectral slope but also from the specific enhancement of the amplitude of higher formants.
Different formants were involved in this enhancement, depending on the vowel ([a], [i] or [u]). For the vowel [i], two
different strategies of formant clustering were observed across speakers and experimental conditions: (1) two clusterings
of F2–F3 and of F4–F5 (illustrated in Fig. 5, bottom left panel); (2) one clustering of F2–F3–F4 (illustrated in Fig. 5,
bottom right panel).

3.2. Bypass strategies

3.2.1. Global shift of spectral energy away from regions of maximum energetic masking
An alternative to boosting speech energy in the frequency regions of maximum energetic masking is to shift
the energy of the speech spectrum away from these masking regions. As CKTL noise presents maximum energy below
1 kHz, one may hypothesize that the speech spectral energy is shifted further toward higher frequencies in CKTL noise than in
BB noise.

Fig. 6. Effect of noise exposure on the CoG of the global speech spectrum, as a function of noise type (Broadband noise (BB), Cocktail-party noise
(CKTL)) and speaker’s gender. The top left plot represents the spectral envelope of the CKTL noise. The dashed vertical lines across the panels
correspond to the frequency regions where the CKTL noise spectrum presents local maxima (∼170 Hz and 500 Hz). Error bars indicate the standard
deviation over the five speakers of each gender.
The table on the right side summarizes the main effects and interaction of the factors CONDITION and GENDER on the CoG of the speech spectrum.
Specific contrasts tested the effect of noise exposure ((BB and CKTL noises) vs. quiet) and noise type (BB vs. CKTL), as well as the influence of
gender (Females vs. Males) on these effects.

Fig. 6 shows the variation of the CoG of the speech spectrum (including voiced and unvoiced segments) with noise
exposure, noise type and speaker’s gender. The table on its right side gives the results of statistical analysis. In both types
of noise, speakers shifted their speech spectrum toward higher frequencies. This adaptation translated into a significant
increase of speech CoG (by 533 ± 368 Hz on average), which was greater for female speakers (786 ± 360 Hz) than
for male speakers (280 ± 122 Hz). In CKTL noise, this spectral shift raised the speech CoG of female speakers
above 800 Hz, where the energy of that background noise is considerably attenuated (see Fig. 1). For male
speakers, however, the speech CoG in CKTL noise remained in the 500–900 Hz region, above the frequency of the
second peak of the CKTL noise spectrum, but not at a sufficiently high frequency to benefit from a significant release
from energetic masking. Furthermore, contrary to our expectations, speakers did not raise their speech CoG
more in CKTL noise than in BB noise; they increased it to a similar extent in both types of noise.
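
To make the CoG measure concrete, the following minimal sketch computes a spectral centre of gravity as the power-weighted mean frequency of a spectrum. This definition is an assumption made for illustration (the analysis settings of the study are not restated in this section), and the function name and the two toy spectra are hypothetical, not data from the experiment.

```python
def spectral_cog(freqs_hz, powers):
    """Power-weighted mean frequency (Hz) of a spectrum.

    Assumed definition: CoG = sum(f * P(f)) / sum(P(f)).
    """
    if len(freqs_hz) != len(powers):
        raise ValueError("freqs_hz and powers must have the same length")
    total = sum(powers)
    if total <= 0:
        raise ValueError("spectrum has no energy")
    return sum(f * p for f, p in zip(freqs_hz, powers)) / total


# Toy spectra (hypothetical values, not measurements from the study).
freqs = [250, 500, 1000, 2000, 4000]
quiet = [1.0, 0.8, 0.3, 0.05, 0.01]   # steep spectral slope
lombard = [1.0, 0.9, 0.6, 0.4, 0.1]   # flatter slope, more high-frequency energy

cog_quiet = spectral_cog(freqs, quiet)
cog_lombard = spectral_cog(freqs, lombard)
print(f"CoG quiet: {cog_quiet:.0f} Hz, CoG Lombard: {cog_lombard:.0f} Hz")
```

With these toy values, the flatter spectrum yields the higher CoG, which is the direction of the shift reported above for both types of noise.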

3.2.2. Shift of the f0 information to regions of minimum energetic masking


The CKTL noise spectrum presents a local minimum of energy around 340 Hz, which releases the energetic masking
by 6 dB in comparison to the two adjacent peaks of energy around 170 Hz and 500 Hz (see Fig. 1). An f0 frequency
of 340 Hz seems an accessible target for female speakers. Consequently, a bypass strategy would involve raising their
voice in CKTL noise so that their f0 reaches the region of reduced energetic masking. In contrast, male speakers
rarely raise their speaking voice to an f0 higher than 250–300 Hz, so they cannot reach the local minimum of the CKTL
noise spectrum. In their case, a bypass strategy would consist in limiting the increase of their pitch in CKTL noise in
order to maintain their f0 range below the first peak of energy of the CKTL noise (around 170 Hz, see Fig. 1 and top
panel in Fig. 7).
Fig. 7 shows the distribution of f0 values for all the speakers in quiet condition, BB noise and CKTL noise. Its
bottom table presents the variation of mean f0 for the syllables of the target words, as a function of noise type and
speaker’s gender. As expected, all the female speakers were found to increase their syllable f0 in CKTL noise (by
4.5 ± 0.5 tones on average) so that their most frequent f0 (corresponding to the mode of the f0 distribution) in CKTL
noise was found on average at 334 ± 17 Hz, i.e. 5.8 tones away from the first peak of the CKTL spectrum, and as close
as 0.2 tones to its local minimum of energy. However, contrary to our expectations, male speakers were not found to
limit the increase of their f0 in CKTL noise. Like females, they raised their syllable f0 in CKTL noise by 4 ± 1 tones on average.
This brought their most frequent f0 in CKTL noise to 180 ± 38 Hz on average, i.e. as close as 0.5 tones from the center
frequency of the main peak of the CKTL spectrum.
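
The distances in tones quoted above can be checked with a short calculation. The sketch below assumes that one tone equals two semitones (six tones per octave); the function name is hypothetical, and the frequencies are the mean values reported in the text.

```python
import math


def interval_in_tones(f_from_hz, f_to_hz):
    """Signed interval from f_from_hz to f_to_hz, in whole tones.

    Assumption: 1 tone = 2 semitones, i.e. 6 tones per octave.
    """
    if f_from_hz <= 0 or f_to_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 6.0 * math.log2(f_to_hz / f_from_hz)


CKTL_PEAK_HZ = 170.0      # first peak of the CKTL noise spectrum
CKTL_MINIMUM_HZ = 340.0   # local minimum of the CKTL noise spectrum

female_mode_f0 = 334.0    # mean most frequent f0 of females in CKTL noise
male_mode_f0 = 180.0      # mean most frequent f0 of males in CKTL noise

female_from_peak = interval_in_tones(CKTL_PEAK_HZ, female_mode_f0)
female_to_minimum = interval_in_tones(female_mode_f0, CKTL_MINIMUM_HZ)
male_from_peak = interval_in_tones(CKTL_PEAK_HZ, male_mode_f0)

print(round(female_from_peak, 1), round(female_to_minimum, 1), round(male_from_peak, 1))
```

Under this assumption the three intervals round to 5.8, 0.2 and 0.5 tones, in line with the values given in the paragraph above.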

Fig. 7. Effect of noise exposure on the distribution of f0 values, as a function of noise type (Broadband noise (BB), Cocktail-party noise (CKTL))
for the ten speakers of this study. These distributions are normalized in amplitude so their integration sums to 100%. The spectrum profile of the
CKTL noise is represented at the top. The dashed vertical lines across the panels correspond to the frequency area where the CKTL noise spectrum
presents a local maximum (∼170 Hz) and minimum (∼340 Hz).
The bottom table summarizes the main effects and interaction of the factors CONDITION and GENDER on the mean syllable f0 . Specific contrasts
tested the effect of noise exposure ((BB and CKTL noises) vs. quiet) and noise type (BB vs. CKTL), as well as the influence of gender (Females vs.
Males) on these effects.

When considering vowel spectra, the amplitude of the first two voice harmonics was found to be comparable in
CKTL noise for male speakers (H1–H2 = 0.1 ± 2.1 dB SPL), whereas the amplitude of the first harmonic was always
much greater than that of the second harmonic for female speakers in the same condition (H1–H2 = 8.4 ± 2.9 dB SPL).
Thus, even if male speakers do not raise f0 high enough to reach the local minimum of the CKTL noise spectrum
(around 340 Hz), they may raise it high enough for the second harmonic (2f0 ) to be located in that frequency region.
The most frequent value of 2f0 for males was indeed measured on average around 360 Hz in CKTL noise.
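
The reasoning about which harmonic can occupy the spectral dip of the CKTL noise can be sketched as follows. The helper below is hypothetical and only illustrates the argument: for a given f0, it returns the harmonic whose frequency lies closest, on a logarithmic scale, to a target frequency (here the local minimum around 340 Hz).

```python
import math


def closest_harmonic(f0_hz, target_hz, max_harmonic=10):
    """Return (n, n * f0_hz) for the harmonic closest to target_hz on a log scale."""
    if f0_hz <= 0 or target_hz <= 0:
        raise ValueError("frequencies must be positive")
    best_n = min(
        range(1, max_harmonic + 1),
        key=lambda n: abs(math.log2(n * f0_hz / target_hz)),
    )
    return best_n, best_n * f0_hz


CKTL_MINIMUM_HZ = 340.0  # local minimum of the CKTL noise spectrum

# Mean most frequent f0 values reported for CKTL noise.
male_result = closest_harmonic(180.0, CKTL_MINIMUM_HZ)    # second harmonic, 360 Hz
female_result = closest_harmonic(334.0, CKTL_MINIMUM_HZ)  # first harmonic, 334 Hz
print(male_result, female_result)
```

For the mean male f0 of 180 Hz the closest harmonic is 2f0 at 360 Hz, whereas for the mean female f0 of 334 Hz it is the fundamental itself.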
As the BB noise spectrum presents no local minimum in energy below 10 kHz, one might expect speakers to still raise
f0 in BB noise, because of the well-known relationship between the variation of f0 and vocal intensity (Titze, 1989),
but to a lesser extent than in CKTL noise, and without any specific adjustment of the f0 distribution around
340 Hz.

Fig. 8. Effect of noise exposure on the first formant frequency of [a], [i] and [u] vowels, as a function of noise type (Broadband noise (BB),
Cocktail-party noise (CKTL)) and speaker’s gender.
On the left plots, the dashed vertical line across the panels corresponds to the frequency area where the CKTL noise spectrum presents a local
minimum (∼340 Hz). Error bars indicate the standard deviation over the five speakers of each gender.
The table on the right side summarizes the main effects and interaction of the factors CONDITION and GENDER on the first formant frequency
of the three cardinal vowels of French: [a], [i] and [u]. Specific contrasts tested the effect of noise exposure ((BB and CKTL noises) vs. quiet) and
noise type (BB vs. CKTL), as well as the influence of gender (Females vs. Males) on these effects.

Syllable f0 was indeed found to increase significantly in BB noise (by 5.5 ± 0.5 tones on average), to a similar
extent for both males and females. However, contrary to our expectations, this increase was greater in BB noise than
in CKTL noise (by 1.1 ± 0.5 tones). As a consequence, the most frequent f0 of female speakers and the most frequent
2f0 value of male speakers in BB noise were not specifically adjusted around 340 Hz.

3.2.3. Shift of the F1 information to regions of minimum energetic masking


Similarly to the f0 range, the range of F1 is situated below 1 kHz, where the CKTL noise spectrum has maximum
energy and exerts greater energetic masking. A bypass strategy would be to adjust F1 to frequency regions where the
CKTL noise has reduced energy. The region around 340 Hz, where the CKTL noise spectrum has a local minimum in
energy, may be an accessible range to shift the F1 of closed vowels. On the other hand, the F1 of open vowels would
have to be shifted above 800 Hz if speakers aimed at improving the audibility of that information.
Fig. 8 represents the variation of F1 with noise exposure for the two closed vowels of French, [i] and [u], and for
the most open vowel [a], as a function of noise type and speaker’s gender. The table on its right side gives the results
of statistical analysis. F2 was not examined here, since its range is already situated above 800 Hz, where the energy of
the CKTL noise is considerably reduced, and since our research question is not whether the vowel space is reduced or
expanded in noise, but whether the variations of formant frequencies can contribute to reducing the degree
of energetic masking.
For the open vowel [a], F1 was found to increase on average by 140 ± 30 Hz, to a similar extent for both genders
and both types of noise. For females, F1 values were raised above 800 Hz, where the energy of the CKTL noise is
considerably decreased. For males, however, F1 values in noise remained just below 700 Hz, in the frequency region
corresponding to the second peak of the CKTL noise.
For the closed vowels [i] and [u], females were found to increase F1 by an average of 145 ± 36 Hz, similarly to
the open vowel [a], whereas males increased F1 by only 72 ± 24 Hz on average. Thus, gender was found to have a
significant effect on the modification of F1 in noise for these two closed vowels. Furthermore, a significant interaction
was observed between noise type and speaker’s gender in the variation of F1 for these closed vowels; F1 tended
to increase more in BB noise for females (by 43 Hz on average) and more in CKTL noise for males (by 21 Hz on
average). As a result, modifications of the closed vowels [i] and [u] in CKTL noise contributed, for both genders,
toward shifting their first formant away from the first main peak of the CKTL noise spectrum (around 170 Hz), and
close to the frequency where the CKTL noise spectrum presents a local minimum (around 340 Hz) (see Fig. 8). These
modifications appear to be specifically adapted to the characteristics of the CKTL noise, as they were not observed in
BB noise.

Fig. 9. Effect of noise exposure on the amount of speech modulation in intensity and frequency, as a function of noise type (Broadband noise (BB),
Cocktail-party noise (CKTL)) and speaker’s gender.
On the left plots, error bars indicate the standard deviation over the five speakers of each gender.
The bottom table summarizes the main effects and interaction of the factors CONDITION and GENDER on speech modulation in amplitude and
frequency. Specific contrasts tested the effect of noise exposure ((BB and CKTL noises) vs. quiet) and noise type (BB vs. CKTL), as well as the
influence of gender (Females vs. Males) on these effects.

3.3. Modulation strategies

3.3.1. Modulation of vocal intensity within each syllable


Speakers significantly enhanced the intensity dynamics of their syllables when they adapted to noise, by 3.3 ± 1.2 dB
SPL on average (see Fig. 9a). A greater modulation of syllable intensity was observed in BB noise (4.8 ± 2.0 dB SPL)
than in CKTL noise (1.8 ± 1.1 dB SPL), and this significant effect of noise type was more pronounced in females than
in males.

3.3.2. Modulation of f0 over the long term


In Fig. 7, it can be seen that speakers not only raise f0 when they adapt from quiet to 86 dB SPL of noise but also
significantly widen their f0 range, by an average of 0.8 ± 0.4 tones. The amount of widening was not significantly
affected by gender or noise type (see Fig. 9b).

4. Discussion and conclusions

4.1. Do speakers adopt boosting, bypass or modulation strategies to improve their audibility in noise?

As in previous studies, we found a significant effect of noise type on the modification of speech intensity but
no significant effect on voice spectrum. In this study, this effect of the noise type can be summarized by two main
conclusions:

(1) The relative energy between 0 and 1 kHz was not found to be more enhanced in CKTL noise, nor was the
    speech CoG shifted further toward high frequencies in CKTL noise.
       So, contrary to Mokbel (1992), who showed that a speaker enhanced his speech energy more in the frequency
    band where noise was concentrated, and Junqua et al. (1998), who showed that the increase of vocal intensity
    varied with noise spectral tilt at constant masker level, our results do not support the existence of such boosting
    strategies in speakers communicating in low frequency noise. However, our observations are consistent with those
    of Lu and Cooke (2009a,b), who also observed a greater shift of speech CoG toward high frequencies in broadband
    noises and in high frequency noises than in low frequency noises.
(2) Noise type was not found to interact with the type of speech segment or with the speaker’s gender in the
    modification of segments’ intensity and duration, syllable intensity and speech CoG:
- Regardless of the speaker’s gender and the type of segment, a greater increase of these parameters was observed
in BB noise compared to CKTL noise, although CKTL noise induces greater energetic masking on speech.
This result is in agreement with Egan (1972) and Lu and Cooke (2009b), who observed a greater increase of
voice intensity in broadband noise than in noise enhanced in low frequencies or high frequencies. It does not
reflect the results of previous studies (Junqua et al., 1998; Ternström et al., 2002; Jung, 2012), in which white
noise, speech-shaped noise, music noise or driving noise did not induce any different adaptation in vocal effort.
However, these studies compared noise types at similar perceived loudness (in dB A), whereas our own study
compared CKTL and BB noises at similar physical levels in dB SPL.
- Regardless of the noise type, vocal intensity and duration were found to increase significantly more for vowels
than for consonants. Sonorant consonants followed the same tendency as unvoiced consonants, despite their
spectral similarity to vowels. This observation is consistent with other studies (Castellanos et al., 1996; Junqua,
1993) that also reported the vowel-to-consonant ratio to increase in both intensity and duration in noise for each
kind of consonant. However, this greater adaptation of vowels could also result from the small vocabulary size
and the weak confusability between the target words of our experiment, which were only distinguishable by
their vowels (except the pairs ‘cochon’ (pig) vs. ‘chausson’ (slipper), and ‘navet’ (turnip) vs. ‘vallée’ (valley)).
- Regardless of the noise type, females raised their CoG more with noise exposure than did males. This partly comes
from a notable decrease of the relative energy above 4 kHz in males. Again, this observation is in agreement with
previous studies (Castellanos et al., 1996; Junqua, 1993) that reported an increase of speech energy between 4
and 5 kHz in female Lombard speech, whereas the greatest enhancement of speech energy in males was observed
between 2 and 4 kHz (Castellanos et al., 1996; Junqua, 1993).

   As a consequence, our observations do not support the hypothesis that speakers modify their speech intensity and
spectrum in noise in a way that specifically compensates for the degree of energetic masking, using either a boosting
or a bypass strategy. Instead, our results support the idea that speakers adapt their level of vocal effort to the
perceived loudness of the background noise, which is greater in broadband noise than in low frequency noises such as
the CKTL noise of this study (because of the greater ear sensitivity in the 2–4 kHz frequency band). Thus, regardless of
the spectral content of the background noise, the type of speech segment and the speaker’s gender, the main strategy
to cope with noise appears to consist in increasing vocal intensity to a level that preserves a positive SNR (in dB
SPL).
This main strategy of increasing vocal effort can very well account for the raised f0 , F1 and CoG observed in
Lombard speech, as these acoustical modifications commonly accompany the increase of vocal intensity (Titze and
Sundberg, 1992; Sundberg and Nordenberg, 2006) and are observed in shouted speech, even in absence of background
noise (Rostolland, 1982; Schulman, 1989; Lienard and Di Benedetto, 1999; Traunmüller and Eriksson, 2000).
   Nevertheless, additional modifications that can improve speech audibility in noise have also been observed in this
study, although they are not directly related to the increase of vocal intensity.
In CKTL noise, speakers of both genders shifted their f0 distribution so that their most frequent f0 or 2f0 value was
specifically adjusted in the frequency region where the CKTL noise presented a local minimum in energy. Likewise,
we observed an interaction between noise type, vowel type (open vs. closed) and speaker’s gender. In CKTL noise,
speakers of both genders specifically adjusted the F1 of closed vowels ([i] and [u]) in the frequency region where the
CKTL noise had a local minimum in energy. Such an improved contrast in spectrum, f0 and F1 is predicted to release
the energetic masking effect of the CKTL noise and to improve the segregation of Lombard speech from another
competing speech stream (Culling and Darwin, 1993; Darwin, 1981).
Furthermore, a more detailed examination of the spectra of Lombard vowels indicated that the shift of speech CoG
toward higher frequencies came not only from the flattening of the spectral slope, as commonly observed when vocal
effort is increased, but also from specific higher-formants clustering and amplitude enhancement. This recalls the
“singing formant” or the “actor’s formant” observed in opera singers who sing over an orchestra, or in stage actors who
have to project their voice at distance (Bele, 2006). This boosting strategy enhances energy around 3–4 kHz, where the
human ear is most sensitive to sound pressure level, and improves voice ‘projection’ and audibility. Such an energy
boost of the 1–4 kHz region is not observed in the shouted speech of untrained speakers, while it is reported in clear
speech (Krause and Braida, 2004). Both artificial flattening of the spectral tilt and artificial enhancement of speech
energy above 1.5 kHz have proved to be efficient techniques of speech enhancement (Horwitz et al., 2008; Skowronski
and Harris, 2006).
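
As an illustration of the energy boost discussed above, the following minimal sketch expresses the energy in a frequency band relative to the total energy of a spectrum, in dB. The function name and the toy spectra are hypothetical and are not data from this study; the sketch only shows how a flatter spectral slope translates into a higher relative energy in the 1–4 kHz band.

```python
import math


def band_energy_ratio_db(freqs_hz, powers, low_hz, high_hz):
    """Energy in [low_hz, high_hz] relative to the total energy, in dB."""
    if len(freqs_hz) != len(powers):
        raise ValueError("freqs_hz and powers must have the same length")
    total = sum(powers)
    band = sum(p for f, p in zip(freqs_hz, powers) if low_hz <= f <= high_hz)
    if total <= 0 or band <= 0:
        raise ValueError("band or total energy is zero")
    return 10.0 * math.log10(band / total)


# Toy spectra (hypothetical values, not measurements from the study).
freqs = [250, 500, 1000, 2000, 4000]
quiet = [1.0, 0.8, 0.3, 0.05, 0.01]   # steep spectral slope
lombard = [1.0, 0.9, 0.6, 0.4, 0.1]   # flatter slope, enhanced higher formants

quiet_db = band_energy_ratio_db(freqs, quiet, 1000, 4000)
lombard_db = band_energy_ratio_db(freqs, lombard, 1000, 4000)
boost_db = lombard_db - quiet_db
print(f"Relative 1-4 kHz energy: {quiet_db:.1f} dB (quiet), {lombard_db:.1f} dB (Lombard)")
```

A positive `boost_db` indicates that a larger share of the speech energy falls in the 1–4 kHz band for the flatter spectrum.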
Lastly, modulation strategies were observed, as greater modulations of f0 and vocal intensity were found in Lombard
speech than in speech produced in quiet, both for male and female speakers. These modulation strategies are not
directly related to increased vocal effort, as shouted voice demonstrates reduced f0 modulations (Rostolland, 1982).
In contrast, a wider f0 range (Picheny et al., 1986) and an increased low-frequency modulation of the intensity
envelope (Krause and Braida, 2004) have also been observed in clear speech, and in speakers who are intrinsically
more intelligible than others (Bradlow et al., 1996). Several psychoacoustic studies have shown that vowels degraded
by an interfering sound are better detected and recognized when their f0 is modulated (Ishizuka and Aikawa, 2002).
Sound stream segregation is also improved by intensity modulation and temporal fluctuations (Ishizuka and Aikawa,
2002). Synthesis of natural f0 -contour variations has been shown to improve intelligibility of speech compared to
flat contours (Laures and Weismer, 1999). Segmental intelligibility was worsened when applying compression effects
that decrease the modulation in intensity (Boike and Souza, 2000; Hornsby and Ricketts, 2001). Consequently, the
enhanced modulation of f0 and vocal intensity observed here in Lombard speech is likely to reflect the intention of
the speaker to improve his intelligibility.

4.2. The Lombard effect, a two-level adaptation

   In agreement with Lu and Cooke (2009b), our results suggest that there is no systematic “active” adaptation to the
spectral characteristics of the ambient noise. The increase in vocal intensity, the spectral shift, the rise in f0 and F1
appear to be primary and unavoidable features of Lombard speech, regardless of the type of noise and the speaker. The
same tendencies have been observed in shouted speech, in the absence of ambient noise (Lienard and Di Benedetto,
1999; Stanton et al., 1988; Titze, 1989), and in Lombard speech with or without interaction (Amazi and Garber, 1982;
Garnier et al., 2010). All these modifications may be interpreted as the different facets of one main action, which is to
increase voice intensity in noisy conditions.
However, when compatible with this primary tendency, there also appear to be subtler and secondary variations
of these same parameters in ways that optimize the acoustic contrast with the background noise. For example, the
distribution of f0 values is not only shifted to higher frequencies in noise (primary strategy of increasing voice intensity),
but it is also widened and specifically adjusted in the frequency area where the CKTL noise has minimum energy
(secondary strategies for improving acoustic contrasts).
   This leads us to support the idea that speech adaptation to noise may be broken down into two hierarchical “levels”
of strategies. These primary and secondary speech adaptations can be related to different mechanisms underlying the
Lombard effect. Thus, primary and unvarying modifications of speech in noise, related to the global increase of vocal
effort, may very well be related to the automatic and uncontrollable regulation of voice intensity that leads speakers to
raise their voice when they get an attenuated feedback of their own voice (Pick et al., 1989). Furthermore, the Lombard
effect also depends on communicative interaction, as vocal intensity and related parameters are more modified in noise
for interactive than non-interactive situations (Amazi and Garber, 1982; Garnier et al., 2010). Additional modifications
that do not relate directly to vocal intensity are observed in noise for interactive situations only (Garnier et al., 2010).
This communicative mechanism that also contributes to the Lombard effect may very well account for the secondary
modifications of speech parameters observed in the current study, when they are compatible with the primary ones.

4.3. Implications of this work

Speech modifications examined in this study have applications to the improvement of speech perception when
speech is broadcast in noisy environments, such as train stations, car interiors, places with ventilation or multi-talker
noise. In background noise of low frequency energy, it may be beneficial to shift speech f0 , formants and spectrum
toward higher frequencies, as is observed in Lombard speech. Irrespective of the noise type, speech auditory detection
and segregation may be improved by enhancing both frequency and amplitude modulation of speech, and boosting
speech energy in the 2–4 kHz frequency band.
It is more difficult to predict the perceptual consequence of conjoint speech modifications, as their effects are not
simply additive. Thus, conjoint modifications may either lead to counterproductive outcomes or may have particularly
positive consequences for speech auditory detection. For example, auditory detection may be particularly improved
when enhancing speech modulation together with shifting the speech spectrum toward the 2–4 kHz region, a frequency
band where the human ear is most sensitive not only to sound pressure level but also to acoustic contrasts (Jesteadt
et al., 1977; Wier et al., 1977).
   Likewise, speech modifications can have multiple effects on the different aspects of intelligibility (audibility,
phoneme recognition, utterance parsing). For instance, the enhanced modulation of speech in f0 and intensity may
improve both speech segregation from a background noise and the segmentation of the utterance into lexical units (Garnier
et al., 2010; Welby, 2006). On the other hand, raised f0 and F1 may improve speech segregation from a multi-talker
noise, but may simultaneously degrade acoustic cues to vowel recognition.
These results may also have further applications for injury prevention and therapy in the case of vocal misuse and
abuse in noisy working places (preschool teachers, bartenders, factory workers, etc.). Among the different observed
modifications, enhancing speech modulation and spectral energy in the 2–4 kHz region are communicative techniques
that can be taught to people in order to improve their speech audibility in a safer way than simply increasing vocal
intensity.

Acknowledgements

We are grateful to Danièle Dubois for fruitful discussions on the methodological aspects of this study. We also
warmly thank the 10 speakers who kindly agreed to participate in this experiment, despite the discomfort of the noisy
situations.

References

Amazi, D.K., Garber, S.R., 1982. The Lombard sign as a function of age and task. J. Speech Lang. Hear. Res. 25, 581–585.
Arons, B., 1992. A review of the cocktail party effect. J. Amer. Voice I/O Soc. 12 (7), 35–50.
Assmann, P.F., Summerfield, Q., 1990. Modeling the perception of concurrent vowels: vowels with different fundamental frequencies. J. Acoust.
Soc. Am. 88, 680–697.
Bele, I.V., 2006. The Speaker’s formant. J. Voice 20, 555–578.
Boike, K.T., Souza, P.E., 2000. Effect of compression ratio on speech recognition and speech-quality ratings with wide dynamic range compression
amplification. J. Speech Lang. Hear. Res. 43, 456–468.
Boril, H., Pollak, P., 2005. Design and collection of Czech Lombard database. In: Proceedings of ICSLP, Lisbon, Portugal, pp. 1577–1580.
Bradlow, A.R., Torretta, G.M., Pisoni, D.B., 1996. Intelligibility of normal speech I: global and fine-grained acoustic-phonetic talker characteristics.
Sp. Commun. 20, 255–272.
Brown, G., Anderson, A., Yule, G., Shillcock, R., 1983. Teaching Talk. Cambridge University Press, Cambridge.
Brungart, D., 2001. Informational and energetic masking effects in the perception of two simultaneous talkers. J. Acoust. Soc. Am. 109, 1101–1109.
Castellanos, A., Benedi, J.M., Casacuberta, F., 1996. An analysis of general acoustic-phonetic features for Spanish speech produced with the
Lombard effect. Sp. Commun. 20, 23–35.
Culling, J.F., Darwin, C.J., 1993. Perceptual separation of simultaneous vowels: within and across-formant grouping by F0. J. Acoust. Soc. Am. 93,
3454–3467.
Darwin, C.J., 1981. Perceptual grouping of speech components differing in fundamental frequency and onset-time. Q. J. Exp. Psychol. 33A,
   185–207.
Darwin, C.J., Brungart, D.S., Simpson, B.D., 2003. Effects of fundamental frequency and vocal-tract length changes on attention to one of two
simultaneous talkers. J. Acoust. Soc. Am. 114, 2913–2922.
Dreher, J.J., O’Neill, J., 1957. Effects of ambient noise on speaker intelligibility for words and phrases. J. Acoust. Soc. Am. 29, 1320–1323.
Egan, J.J., 1972. Psychoacoustics of the Lombard voice response. J. Aud. Res. 12, 318–324.
French, N., Steinberg, J., 1947. Factors governing the intelligibility of speech sounds. J. Acoust. Soc. Am. 19, 90–119.
Garnier, M., 2008. May speech modifications in noise contribute to enhance audio-visible cues to segment perception? In: Proceedings of AVSP,
Moreton Island, Australia, pp. 95–100.
Garnier, M., Dohen, M., Lœvenbruck, H., Welby, P., Bailly, L., 2006. The Lombard effect: a physiological reflex or a controlled intelligibility
   enhancement? In: Proceedings of ISSP, Ubatuba, Brazil, pp. 255–262.
Garnier, M., Henrich, N., Dubois, D., 2010. Influence of sound immersion and communicative interaction on the Lombard effect. J. Speech Lang.
Hear. Res. 53, 588–608.
Hornsby, B.W., Ricketts, T.A., 2001. The effects of compression ratio, signal-to-noise ratio, and level on speech recognition in normal-hearing
listeners. J. Acoust. Soc. Am. 109, 2964–2973.
Horwitz, A.R., Ahlstrom, J.B., Dubno, J.R., 2008. Factors affecting the benefits of high-frequency amplification. J. Speech Lang. Hear. Res. 51,
798–813.
Ishizuka, K., Aikawa, K., 2002. Effect of F0 fluctuation and amplitude modulation of natural vowels on vowel identification in noisy environments.
   In: Proceedings of ICSLP, Denver, USA, pp. 1633–1636.
Jesteadt, W., Wier, C.C., Green, D.M., 1977. Intensity discrimination as a function of frequency and sensation level. J. Acoust. Soc. Am. 61, 169–177.
Junqua, J., 1993. The Lombard reflex and its role on human listeners and automatic speech recognizers. J. Acoust. Soc. Am. 93, 510–524.
Jung, O., 2012. On the Lombard effect induced by vehicle interior driving noises, regarding sound pressure level and long-term average speech
spectrum. Acta Acust. United Acust. 98 (2), 334–341.
Junqua, J.-C., Fincke, S., et al., 1998. Influence of the speaking style and the noise spectral tilt on the Lombard reflex and automatic speech recognition.
In: Proceedings of ICSLP, Sydney.
Kadiri, N., 1998. Conséquences d’un environnement bruité sur la production de la parole. [Effect of noise exposure on speech production]. Toulouse
University, France (Ph.D. Thesis).
Krause, J.C., Braida, L.D., 2004. Acoustic properties of naturally produced clear speech at normal speaking rates. J. Acoust. Soc. Am. 115, 362–378.
Laures, J.S., Weismer, G., 1999. The effects of a flattened fundamental frequency on intelligibility at the sentence level. J. Speech Lang. Hear. Res.
42, 1148–1156.
Lienard, J.S., Di Benedetto, M.G., 1999. Effect of vocal effort on spectral properties of vowels. J. Acoust. Soc. Am. 106, 411–422.
Lu, Y., Cooke, M., 2008. Speech production modifications produced by competing talkers, babble, and stationary noise. J. Acoust. Soc. Am. 124,
3261–3275.
Lu, Y., Cooke, M., 2009a. The contribution of changes in F0 and spectral tilt to increased intelligibility of speech produced in noise. Sp. Commun.
51, 1253–1262.
Lu, Y., Cooke, M., 2009b. Speech production modifications produced in the presence of low-pass and high-pass filtered noise. J. Acoust. Soc. Am.
126, 1495–1499.
Marin, C.M., McAdams, S., 1991. Segregation of concurrent sounds II: effects of spectral envelope tracing, frequency modulation coherence, and
frequency modulation width. J. Acoust. Soc. Am. 89, 341–351.
Mokbel, C., 1992. Reconnaissance de la parole dans le bruit: bruitage/débruitage [Speech recognition in noise: Noise degradation vs. cancelation].
Ecole Nationale Supérieure des Télécommunications, Paris, France (Ph.D. Thesis).
Picheny, M.A., Durlach, N.I., Braida, L.D., 1986. Speaking clearly for the hard of hearing II: acoustic characteristics of clear and conversational
speech. J. Speech Hear. Res. 29, 434–446.
Pick, H.L., Siegel, G.M., Fox, P.W., Garber, S.R., Kearney, J.K., 1989. Inhibiting the Lombard effect. J. Acoust. Soc. Am. 85, 894–900.
Pittman, A.L., Wiley, T.L., 2001. Recognition of speech produced in noise. J. Speech Lang. Hear. Res. 44, 487–496.
Rostolland, 1982. Phonetic structure of shouted voice. Acta Acust. 51, 80–89.
Schulman, 1989. Articulatory dynamics of loud and normal speech. J. Acoust. Soc. Am. 85, 295–312.
Skowronski, M.D., Harris, J.G., 2006. Applied principles of clear and Lombard speech for automated intelligibility enhancement in noisy environ-
ments. Sp. Commun. 48, 549–558.
Stanton, B.J., Jamieson, L.H., Allen, G.D., 1988. Acoustics-phonetic analysis of loud and Lombard speech in simulated cockpit conditions. In:
Proceedings of ICASSP, New York, USA, pp. 331–334.
Sundberg, J., Nordenberg, M., 2006. Effects of vocal loudness variation on spectrum balance as reflected by the alpha measure of long-term-average
spectra of speech. J. Acoust. Soc. Am. 120, 453–457.
Ternström, S., Sodersten, M., Bohman, M., 2002. Cancellation of simulated environmental noise as a tool for measuring vocal performance during
noise exposure. J. Voice 16, 195–206.
Titze, I.R., 1989. On the relation between subglottal pressure and fundamental frequency in phonation. J. Acoust. Soc. Am. 85, 901–906.
Titze, I.R., Sundberg, J., 1992. Vocal intensity in speakers and singers. J. Acoust. Soc. Am. 91 (5), 2936–2946.
Traunmüller, H., Eriksson, A., 2000. Acoustic effects of variation in vocal effort by men, women, and children. J. Acoust. Soc. Am. 107, 3438–3451.
Van Summers, W., Pisoni, D.B., Bernacki, R.H., Pedlow, R.I., Stokes, M.A., 1988. Effects of noise on speech production: acoustic and perceptual
analyses. J. Acoust. Soc. Am. 84, 917–928.
Welby, P., 2006. Intonational differences in Lombard speech: looking beyond F0 range. In: Proceedings of Speech Prosody, Dresden, Germany, pp.
763–766.
Wier, C.C., Jesteadt, W., Green, D.M., 1977. Frequency discrimination as a function of frequency and sensation level. J. Acoust. Soc. Am. 61,
178–184.
Zeiliger, J., Serignat, J.F., Autresserre, D., Meunier, C., 1994. BD Bruit, une base de données de parole de locuteurs soumis à du bruit [BD Bruit, a
database of speakers exposed to noise]. In: Proceedings of JEP, Grenoble, France, pp. 287–290.
