
Francesca Ortolani - Binaural approach in acoustic scene simulations in audio forensics

Radio Shack School of Audio Technologies 2014

Binaural approach in acoustic scene simulations in audio forensics

Francesca Ortolani
Ironbridge Electronics

Gunshots, footsteps, door noises, vehicles, barking dogs: each sound could be critical evidence at the
crime scene. Along with acoustic measurements (such as reverberation time, wall insulation and
outdoor noise), binaural miking in acoustic simulations is a powerful instrument for
testing a witness's ability to hear sound evidence and to judge its direction of arrival at
the crime scene. This article is a short introduction to binaural audio and reverberation issues.

1. Introduction and background

Binaural recording is a 3D audio technique intended for playback through headphones or a pair of
speakers. It should not be mistaken for stereophonic sound, which refers instead to a
two-dimensional soundscape. Binaural miking aims at rendering the exact position of the sources
relative to the listener in a 3D environment, so that sound is all around a listener wearing
headphones or sitting in front of the speakers. Binaural audio through speakers is much more
difficult to implement because of cross-talk between the channels.

There are three principal techniques for obtaining a binaural rendering:

- The use of a dummy head, an artificial human-like head equipped with microphones in the
  ears, placed at the eardrum positions; the dummy head stands in for the human head and its
  characteristic acoustic shadow, which depends on the direction of arrival of the impinging wavefront.
- The simulation of a human head and the computation of an HRTF (Head-Related Transfer
  Function).
- Experiments with real human heads and real people wearing microphone capsules in the
  ears.

The first tests in binaural audio go back to 1881. A pair of carbon microphones was placed in front of
the stage of the Opera in Paris, spaced like human ears. The acoustic signals were transduced
and sent to their destination by telephone.
Decades later, in 1931-32, the researchers at Bell Telephone Laboratories in New Jersey (USA)
laid some of the most important foundations of electroacoustics. Harvey Fletcher, the father of
stereophonic sound, also known for the Fletcher-Munson loudness curves (a graph showing
human auditory sensitivity, expressed in dB SPL, as a function of frequency along with isophonic
curves), started investigating the nature of sound, speech and hearing and patented the
first medical acoustic hearing aid device.
An accurate description of each part of an acoustic system, in particular the speaking
source, the microphone, the electric transmission line and the receiver (headphones, loudspeaker or
listener), soon became essential.

The first binaural dummy head was called Oscar. It was built in 1932 at Bell Labs. It was made of
wax and had a pair of dynamic microphones with a diameter of 1.4 inches (about 3.56 cm) placed
close to the ears.
The experiments revealed that the inaccuracies in localisation, above all in sensing the distance
of the source, were due to the discrepancy between the mean aperture of the real
(human) ear and the dimensions of the mic diaphragm (1.4 inches is comparable to a wavelength at 9

Later binaural heads became more sophisticated, apart from some rudimentary experiments made
with spheres and opposed mics. Commercial dummy heads are produced by Neumann, AKG,
Sennheiser, Brüel & Kjær, Knowles Electronics and others. It is also possible to purchase in-ear
microphones (these mics look like in-ear monitors).

2. Binaural Duplex theory

Sound localisation is the ability to pinpoint the position of one or more sound sources in
terms of distance, azimuth and elevation. Position is not encoded directly by the
receptor cells of the auditory system, as it is on the retina in sight. Instead, it has to be
computed from other information. The main cues available are the ITD (Interaural
Time Difference) and the ILD (Interaural Level Difference), also named IID (Interaural Intensity Difference).

Sound takes different times to reach the two ears, which produces the ITD, while the ILD is produced
by the shadowing effect of the head at the ear opposite to the direction of arrival, which blocks part of the energy
carried by the sound.
Stern et al. in [1] point out how the ITD and ILD work on complementary frequency ranges (at least
for what concerns free space and simple point sources).
The ILDs are prominent for frequencies above 1.5 kHz, because in the high-end of the audible
spectrum the head has dimensions comparable with the wavelengths of the impinging sound
waves, thus reflecting a significant portion of the sound.
ITDs are present at all frequencies, but only at lower frequencies are periodic sounds decoded
unambiguously. In other words, the maximum physically possible ITD must be less than half a period of
the signal at the frequency considered. The reason is that the two ears sample the sound in space:
in order to avoid spatial aliasing, the Nyquist theorem must be respected in space.

Since the maximum ITD for a human head is about 660 µs, the ITDs useful for localisation are
those below 1.5 kHz.
Given that different azimuth and elevation angles generate more or less the same ITD, and that the ITD is
approximately constant with frequency and from listener to listener, the ITD [2] does not locate the
source position unequivocally.
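
As a quick worked check of these figures (a sketch that assumes only the 660 µs maximum ITD quoted above; the subscripts are labels introduced here for illustration), the frequencies at which the maximum ITD equals a full period and half a period of the signal are:

```latex
\[
f_{\mathrm{full}} = \frac{1}{\mathrm{ITD}_{\max}} = \frac{1}{660\ \mu\mathrm{s}} \approx 1515\ \mathrm{Hz},
\qquad
f_{\mathrm{half}} = \frac{1}{2\,\mathrm{ITD}_{\max}} = \frac{1}{1320\ \mu\mathrm{s}} \approx 758\ \mathrm{Hz}
\]
```

The 1.5 kHz limit therefore corresponds to the maximum ITD spanning one full period, while the stricter half-period condition is already reached near 760 Hz. Between the two, only the largest ITDs (sources far to one side) exceed half a period.
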
The ILD instead varies more from listener to listener and depends quite critically on
frequency. That is why the ILD turns out to be more useful for source localisation.
Referring to Figure 1, the ITD can be calculated with Woodworth's formula (extended for
frequencies below 1.5 kHz) [3]:

\[
\mathrm{ITD}(\theta, \phi) = \frac{a\,(\theta + \sin\theta)}{c}\,\cos\phi \tag{1.1}
\]

where a is the radius of the head (assumed spherical), c is the speed of sound (about 344 m/s
in air), θ is the azimuth (in radians) and φ is the elevation.

Figure 1 Duplex binaural model

Woodworth's original formulation did not consider that the ITD diminishes as the source moves
away from the listener's horizontal plane. The factor cos φ accounts for that.
At frequencies above 1.5 kHz Woodworth's formula becomes less accurate.
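
Equation (1.1) can be evaluated numerically as follows. This is a minimal Python sketch, not the author's implementation: the function name and the 8.75 cm head radius are assumptions chosen for illustration, and both angles are taken in radians.

```python
import math


def woodworth_itd(azimuth_rad, elevation_rad=0.0,
                  head_radius_m=0.0875, speed_of_sound=344.0):
    """ITD in seconds from equation (1.1): a * (theta + sin(theta)) / c * cos(phi).

    head_radius_m is an assumed typical value, not a figure from the article.
    """
    theta = azimuth_rad
    lateral_term = head_radius_m * (theta + math.sin(theta)) / speed_of_sound
    return lateral_term * math.cos(elevation_rad)


# Sweep the azimuth on the horizontal plane (elevation = 0).
for az_deg in (0, 30, 60, 90):
    itd = woodworth_itd(math.radians(az_deg))
    print(f"azimuth {az_deg:2d} deg -> ITD {itd * 1e6:6.1f} us")
```

With these assumed values a source at 90° azimuth on the horizontal plane gives roughly 650 µs, in line with the maximum ITD quoted earlier, and raising the source to 60° elevation halves that value through the cos φ factor.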

Consider another sensory cue: the IPD (Interaural Phase Difference). When the IPD is greater than or
equal to 180°, the source position is ambiguous. This ambiguity is induced by the periodic nature
of phase.

Since the wavelength is shorter at higher frequencies, the distance between
the ears may be greater than a wavelength for sufficiently high frequencies (>1.5 kHz).
This gives rise to spatial aliasing, as mentioned above, and the perceived ITD proves to be
ambiguous, as the ear is responsive only to phase and not to time differences.
So, for such frequencies the supporting information of the ILD is necessary. In other words, a high
frequency signal (>1.5 kHz) can produce an ITD greater than the signal period, while for a low frequency
signal the ITD stays within one period of the signal itself, so the phase difference
perceived at the ears allows the listener to evaluate the ITD unambiguously.
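
The phase ambiguity can be illustrated with a few lines of Python. This is a hedged sketch for a pure tone: the function names are hypothetical, and the 400 µs ITD is an arbitrary example value. It converts an ITD into an IPD, wraps the phase into the interval from -180° to 180° (all that a phase-sensitive detector can observe), and converts it back into an apparent ITD.

```python
def ipd_degrees(frequency_hz, itd_s):
    """Interaural phase difference of a pure tone, before wrapping."""
    return 360.0 * frequency_hz * itd_s


def wrap_degrees(phase_deg):
    """Wrap a phase into the interval [-180, 180) degrees."""
    return (phase_deg + 180.0) % 360.0 - 180.0


def itd_from_wrapped_phase(frequency_hz, itd_s):
    """ITD that would be inferred from the wrapped phase alone."""
    wrapped = wrap_degrees(ipd_degrees(frequency_hz, itd_s))
    return wrapped / (360.0 * frequency_hz)


true_itd = 400e-6  # example value, seconds
for f in (500.0, 2000.0):
    apparent = itd_from_wrapped_phase(f, true_itd)
    print(f"{f:6.0f} Hz: IPD {ipd_degrees(f, true_itd):6.1f} deg, "
          f"apparent ITD {apparent * 1e6:7.1f} us")
```

At 500 Hz the apparent ITD equals the true one. At 2 kHz the 288° phase difference wraps to -72°, so the apparent ITD has the wrong sign and magnitude, which is the ambiguity described above.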

The human listener takes advantage of the ITD and ILD cues in order to decode the direction of
arrival of the sound. In addition, the pinna of the ear colours
the sound, depending on the direction of arrival of the sound wave.
Figure B.6 in Annexe B shows the human ear anatomy.
In general, localisation performance is excellent in the frontal part of the horizontal plane, good at
the rear and much worse in the vertical plane.

3. HRTF (Head-Related Transfer Function)

In linear systems analysis the transfer function is defined as the complex ratio between the output
signal spectrum and the input signal spectrum. The transfer function H(f) of a linear time-invariant
system is then defined as:

\[
H(f) = \frac{Y(f)}{X(f)} \tag{1.2}
\]

where Y(f) is the output signal frequency spectrum and X(f) is the input signal spectrum.

A transfer function (or, equivalently in the time domain, the impulse response) characterizes and
describes the channel between the source and the receiver.
The HRTF is the head-related transfer function, a function of 4 variables (f, r, θ, φ), where f is the
frequency, r is the distance of the source from the listener, and θ and φ are respectively the azimuth
and the elevation.
There is a left HRTF and a right HRTF, which describe the channel between the source and
the left and right ear respectively.

An HRTF matches only one configuration of source and receiver, and the left and right HRTFs
embed the ITD and ILD information.

In order to extract the ITD and ILD from the HRTFs [4] we compute the modulus of the interaural
transfer function, which is defined as:

\[
H_{INT}(\omega) = \frac{H^{R}_{r,\theta,\phi}(\omega)}{H^{L}_{r,\theta,\phi}(\omega)} \tag{1.3}
\]

where ω = 2πf is the angular frequency and the superscripts R, L denote respectively the right and
left channel. The modulus (expressed in dB) of H_INT(ω) is:

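\[
\left| H_{INT}(\omega) \right|_{dB} = 20\log_{10}\left| H_{INT}(\omega) \right| = \left| H^{R}_{r,\theta,\phi}(\omega) \right|_{dB} - \left| H^{L}_{r,\theta,\phi}(\omega) \right|_{dB} \tag{1.4}
\]

and it is an expression of the ILD with varying frequency.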

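The group delay is defined as:

\[
-\frac{d\,\Phi_{INT}(\omega)}{d\omega} = -\frac{d\left(\Phi_{R}(\omega) - \Phi_{L}(\omega)\right)}{d\omega} \tag{1.5}
\]

where Φ_INT(ω) = ∠H_INT(ω), Φ_R(ω) = ∠H^R_{r,θ,φ}(ω) and Φ_L(ω) = ∠H^L_{r,θ,φ}(ω) are the phase
responses; the group delay represents the ITD with
varying frequency.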
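
The extraction of ILD and ITD from a pair of HRTFs can be sketched numerically. The example below is illustrative only: it replaces measured HRTFs with a hypothetical toy model in which each ear is a pure gain and delay, and it approximates the phase derivative with a finite difference. All names and values are assumptions.

```python
import cmath
import math


def toy_hrtf(gain, delay_s):
    """Hypothetical HRTF: a frequency-independent gain and a pure delay."""
    def h(omega):
        return gain * cmath.exp(-1j * omega * delay_s)
    return h


def ild_db(h_left, h_right, omega):
    """Modulus of the interaural transfer function in dB, as in (1.4)."""
    h_int = h_right(omega) / h_left(omega)
    return 20.0 * math.log10(abs(h_int))


def itd_seconds(h_left, h_right, omega, d_omega=1.0):
    """Group delay of the interaural transfer function, as in (1.5).

    The derivative of the phase is approximated by a finite difference.
    """
    phase_a = cmath.phase(h_right(omega) / h_left(omega))
    phase_b = cmath.phase(h_right(omega + d_omega) / h_left(omega + d_omega))
    # Wrap the phase step so a jump across +/- pi does not corrupt the slope.
    d_phase = (phase_b - phase_a + math.pi) % (2.0 * math.pi) - math.pi
    return -d_phase / d_omega


# Source on the left: the right ear receives a later, weaker signal.
h_left = toy_hrtf(gain=1.0, delay_s=0.2e-3)
h_right = toy_hrtf(gain=0.5, delay_s=0.6e-3)
omega = 2.0 * math.pi * 1000.0
print(f"ILD = {ild_db(h_left, h_right, omega):.2f} dB")
print(f"ITD = {itd_seconds(h_left, h_right, omega) * 1e6:.1f} us")
```

For this toy pair the ILD is about -6 dB (the right ear is at half amplitude) and the ITD is 400 µs (the right ear lags the left). Measured HRTFs would give values that vary with frequency.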

The HRTF method does not replace the Duplex model, but supports it in order to obtain a
better estimate of the source position in the listening environment. Research has led to more
elaborate human models, and BRTFs (Body-Related Transfer Functions) were introduced.
These transfer functions describe human body vibrations and their transmission to the
auditory system, which contribute to the overall sound perception.

4. Reverberation effects on source localisation

The first effect of reverberation is to reduce the efficacy of the information used by the listener. In
the presence of reflections, in fact, the auditory system may decode the source position erroneously
because high-intensity echoes interfere with the direct sound and are superimposed on it.

Tests in anechoic and reverberant chambers have produced some significant results.
Some of them are listed below:

- a wide-band stationary noise is localised less accurately in a reverberant field [5];
- a wide-band noise is localised more easily in a reverberant field than single sparse tones, that is,
  localisation improves with the spectral density of the source;
- sounds with a steep attack transient are localised better than others, regardless of the
  reverberation decay time.

Thanks to the precedence effect the listener is able to weight wavefronts differently.
The first wavefront to arrive (direct sound) is given more weight.
Steep attack transients trigger the precedence effect and favour localisation, while stationary
sources are decoded with greater difficulty.

Studies about separation of concurrent sources [6] revealed that the capability of grouping sounds
according to their harmonicity depends on the extent of their fundamental frequency (f0) range.
In free space the listener identifies with ease those sounds having a proper pitch difference.
The situation changes in a reverberant environment: direct sound and echoes with the same
fundamental frequencies relate to each other on the basis of harmonic ratios. A flutter in f0 causes
inharmonicity between direct sound and reflections. For this reason, in a reverberant field, the
mechanism of grouping sounds according to harmonicity loses its efficacy.

In a confined space distance perception depends on 2 factors [7]: the ratio between direct sound
energy and reflected energy, and the time delay between direct sound and reflected sound.
Room reflections also contribute to the general spatial impression characterising the listening
environment and the sound source.

Two parameters are introduced here: ASW (Apparent Source Width), which is the perceived
width of a source, and LEV (Listener Envelopment), the perception of being surrounded by sound.
ASW is determined principally by the level of the early lateral reflections within the first
80 ms from the arrival of the direct sound at the listener's position.
LEV is the subjective sensation of immersion produced by reverb and depends on the nature of the
late reflections, in particular on their distribution over time, their level and direction of arrival.

In [8] Bradley explains how ASW and LEV are caused by the precedence effect, during which the
direct sound and the early reflections are temporally merged together.

ASW is explained by the fact that the identification of the direction of arrival is
distorted by the enhancement of the merged event produced by early lateral reflections.

The reverberation tail instead is not merged with the direct sound and appears as a sort of diffuse
halo around it, giving the listener a sense of immersion quantifiable with LEV.

5. Reverberation effects on sound perception

When sound is emitted in a reverberant environment, reflections are superimposed on the direct
sound, spreading it in time and filling with energy those gaps in time that
would have remained empty in an anechoic environment, that is, a reflectionless chamber.
As an immediate effect, the listener experiences a reduction in signal intelligibility.
Figure 2 (a) (b) compares the spectrograms (see footnote 1) of the vocal recording "Thank you for travelling on the
Central line" by a female speaker in case (a), a moderately reverberant environment, and case
(b), with a 1-second decay-time reverb.

Figure 2 (b) highlights reverberation filling up the darker, lower-energy zones, which are
much sharper in Figure 2 (a).
The attack phonemes are less susceptible to the effect of reverberation than the final (release)
ones.

Footnote 1. Spectrogram: time vs frequency representation of a signal.

Figure 2 Spectrogram of speech (a) short reverb (b) long reverb

The masking effect of reverb on words is discussed in [9], [10].
Self-masking is the phenomenon in which the initial part of a phoneme masks the final
part of the same phoneme, smearing the transients.
In overlap-masking the echoes of earlier phonemes mask the following phonemes.
Early and late reflections have different effects on speech intelligibility [7].
The early reflections are highly correlated with the original speech and can assist its
understanding by increasing its loudness, that is, "that attribute of auditory sensation in terms of which
sounds can be ordered on a scale extending from quiet to loud" [11].
The late reflections are less correlated with the original signal and for this reason they behave like
additive noise.
Lochner and Burger [12] defined a parameter for speech intelligibility based on the ratio
between the energy carried by the early and the late reflections.
This parameter is included in standard ISO 3382 ("Measurement of the reverberation time of
rooms with reference to other acoustical parameters") with the name Early-to-Late Index, or
Clarity, where the crossover time separating the early reflections from the tail of the reverb is 50
ms for speech and 80 ms for music:

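\[
C_{50} = 10\log_{10}\frac{\int_{0}^{50\,\mathrm{ms}} h^{2}(t)\,dt}{\int_{50\,\mathrm{ms}}^{\infty} h^{2}(t)\,dt}
\qquad
C_{80} = 10\log_{10}\frac{\int_{0}^{80\,\mathrm{ms}} h^{2}(t)\,dt}{\int_{80\,\mathrm{ms}}^{\infty} h^{2}(t)\,dt}
\]

where h(t) is the room impulse response.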
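
For a sampled impulse response the two integrals become sums of squared samples. The following Python sketch shows the computation under stated assumptions: the function names are hypothetical, and the impulse response is a synthetic, noise-free exponential envelope with a 1-second RT60, not a measured room response.

```python
import math


def clarity_db(impulse_response, sample_rate_hz, split_ms):
    """Early-to-late energy ratio in dB (C50 for split_ms=50, C80 for split_ms=80)."""
    split = int(round(split_ms * 1e-3 * sample_rate_hz))
    early = sum(x * x for x in impulse_response[:split])
    late = sum(x * x for x in impulse_response[split:])
    return 10.0 * math.log10(early / late)


def synthetic_decay(rt60_s, sample_rate_hz, duration_s):
    """Idealised envelope whose level falls by 60 dB every rt60_s seconds."""
    n = int(duration_s * sample_rate_hz)
    return [10.0 ** (-3.0 * i / (sample_rate_hz * rt60_s)) for i in range(n)]


fs = 8000
h = synthetic_decay(rt60_s=1.0, sample_rate_hz=fs, duration_s=2.0)
print(f"C50 = {clarity_db(h, fs, 50):.2f} dB")
print(f"C80 = {clarity_db(h, fs, 80):.2f} dB")
```

For this idealised decay C50 is close to 0 dB and C80 is close to 3 dB. A real impulse response, with a distinct direct sound and discrete early reflections, would typically give higher values at the same reverberation time.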

It is known that reverberation colours the signal spectrum. Apparently, human listeners are able to
compensate for this spectral distortion. This is going to be an object of study in machine

6. The use of the binaural techniques on the crime scene

Binaural audio can be useful in many ways.

Dummy head on the crime scene

Suppose a witness claims to have heard a gunshot coming from a certain direction, or to have suspected the
presence of an ill-intentioned person after hearing a dog bark or a car speed away at full throttle.
Is this the truth? A dummy head would be an appropriate support for the inquiry in order to test
the veracity of the statements. The dummy head would be placed at the witness's position and
the acoustic environment would be reconstructed, recorded and processed for further
investigation and analysis.
The recorded tracks can then be treated with advanced digital signal processing:
after recording the acoustic scene, it is possible to enhance certain sounds, apply noise-cancelling
algorithms and estimate the direction of arrival of key pieces of evidence.

The use of HRTFs for the acoustic identikit

Just like an identikit, it is possible to reconstruct a binaural acoustic scene for a potential
witness to listen to. The aim is to understand whether what the witness heard at the real scene is
similar to the artificial binaural soundscape.
The tools for accomplishing this are HRIRs (head-related impulse responses, the time-domain counterparts of HRTFs) and HRTFs.
One could simulate an artificial room, placing the virtual sources relative to the listener by
convolving sounds with impulse responses.
It is also possible to add an artificial reverb after having modelled the room properly.
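
The convolution step can be sketched in plain Python. This is a minimal illustration rather than a production renderer: the two HRIRs are hypothetical toy responses (a unit impulse for the left ear, a delayed and attenuated impulse for the right ear) standing in for measured data, and the function names are assumptions.

```python
def convolve(signal, kernel):
    """Direct-form linear convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Place a mono source by filtering it with one HRIR per ear."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy HRIRs for a source on the listener's left (illustrative values only).
hrir_left = [1.0, 0.0, 0.0, 0.0]
hrir_right = [0.0, 0.0, 0.0, 0.5]  # arrives 3 samples later at half amplitude

mono = [1.0, -1.0, 0.5]
left, right = render_binaural(mono, hrir_left, hrir_right)
print("left :", left)
print("right:", right)
```

In a real simulation the HRIRs would come from a measured set for the chosen direction, and a room impulse response could be convolved in the same way to add the reverberation of the modelled room.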

Annexe A: Reverberation
Reverberation is the persistence of sound after the source has ceased to produce it, occurring in an
enclosed space where reflective surfaces are present.
The sound reaches the listener through the direct path from the source and through multiple
reflections from the surfaces hit by the sound field.
The first step in understanding the characteristics of a reverberant environment is the pointwise
computation of its impulse response.
The impulse response completely describes the properties of a room for a specific
configuration of source and receiver.
Typically, one excites the room with a continuous wide-band signal (reference signal), such as pink
noise, a sine sweep or an MLS (maximum length sequence), and records the room response.
The impulse response is obtained by deconvolving the recorded response with the reference
signal. As expected, the impulse response looks like Figure A.3, where the first sound
front (direct sound) arrives at time 0 ms, followed by the early reflections, which get more and
more numerous and dense and decrease in amplitude, thus defining the reverb tail or late
reflections.
The human ear perceives two distinct sounds only if they are separated by an interval longer than 30-80
ms (it depends on the person). This means that the late reflections appear like a halo with
exponential decay.
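
The deconvolution step can be illustrated with a noise-free toy example. The sketch below uses time-domain long division, which is only one possible technique and is an assumption of this example: practical measurements usually deconvolve in the frequency domain and have to cope with noise. The signals and function names are hypothetical.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sample lists."""
    out = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            out[i + j] += xi * hj
    return out


def deconvolve(recorded, reference, length):
    """Recover the first `length` samples of h from recorded = reference * h.

    Uses the recursion h[n] = (y[n] - sum_{k>=1} x[k] * h[n-k]) / x[0],
    so the first reference sample must be non-zero.
    """
    if reference[0] == 0:
        raise ValueError("first reference sample must be non-zero")
    h = []
    for n in range(length):
        acc = recorded[n] if n < len(recorded) else 0.0
        for k in range(1, min(n, len(reference) - 1) + 1):
            acc -= reference[k] * h[n - k]
        h.append(acc / reference[0])
    return h


reference = [1.0, 0.5, -0.25, 0.125]   # toy excitation signal
room_ir = [1.0, 0.0, 0.6, 0.0, 0.3]    # direct sound plus two reflections
recorded = convolve(reference, room_ir)
estimate = deconvolve(recorded, reference, len(room_ir))
print([round(v, 6) for v in estimate])
```

In this idealised case the estimate matches the toy room response sample by sample, which is the property the measurement procedure relies on.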

Figure A.3 Typical room impulse response

The impulse response changes with the listening position inside the room. The pattern of the early
reflections, that is, the sequence of repeating early reflections, and the arrival of the first
reflection are influenced by the position of source and receiver relative to the reflective surfaces
and by the dimensions of the room (see Figure A.4). The decay time depends on the room
dimensions and on the absorbency of the surfaces.
The delay between the arrival of the direct sound and the reflections has the side effect of
colouring the received sound. This phenomenon is called comb filtering. A comb filter introduces
nulls in the frequency response of the room at a certain frequency and at its odd multiples.
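
As a hedged illustration of comb filtering, the Python sketch below models the simplest case: the direct sound plus a single reflection of equal amplitude delayed by 1 ms. The delay, the gain and the function names are assumptions made for the example.

```python
import cmath
import math


def comb_null_frequencies(delay_s, count):
    """First `count` null frequencies (Hz) for one reflection delayed by delay_s."""
    return [(2 * k + 1) / (2.0 * delay_s) for k in range(count)]


def comb_magnitude(frequency_hz, delay_s, reflection_gain=1.0):
    """Magnitude of 1 + g * exp(-j*2*pi*f*tau): direct sound plus one delayed copy."""
    return abs(1.0 + reflection_gain
               * cmath.exp(-2j * math.pi * frequency_hz * delay_s))


delay = 0.001  # 1 ms between direct sound and reflection (example value)
for f in comb_null_frequencies(delay, 3):
    print(f"null near {f:7.1f} Hz, |H| = {comb_magnitude(f, delay):.3f}")
print(f"peak at  1000.0 Hz, |H| = {comb_magnitude(1000.0, delay):.3f}")
```

With a 1 ms delay the nulls fall near 500 Hz, 1500 Hz and 2500 Hz, while the response doubles at 1 kHz. A weaker reflection (gain below 1) gives shallower notches at the same frequencies.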
Sound coloration is also due to the materials in the room, which absorb the sound differently with
frequency.
When choosing a material for the acoustic correction of a room, it is good practice
to consult the data sheets or the charts listing the absorption coefficients for all the frequency
bands.
All materials (including air) absorb higher frequencies efficiently, since typically these carry a lower
energy relative to lower frequencies.
High frequencies are also more easily diffused, given that their wavelengths (short) are
comparable with the dimensions of the objects (obstacles) in a room.
From the perception point of view, high frequencies are more directional. For this reason, for
instance, the subwoofer in a surround system can be placed almost anywhere in the room
(within reason, depending on room dimensions, crossover frequency and shockwave direction), and for
the same reason it is very common to find the external end (port) of a bass reflex tube in different
positions on a loudspeaker cabinet.
Another effect, when changing the listening position, is that reverb becomes more apparent as one walks away
from the source. There exists a distance, called critical distance, beyond which the reverberant field
has a greater intensity than the direct field:

Figure A.4 A listener receives multiple reflections [13]

\[
d_c = \sqrt{\frac{Q \sum_{n} S_n a_n}{16\pi}}
\]

where Q is a directivity parameter (Q = 1 for an omnidirectional source) and S_n is the area of the n-th surface,
which has the absorption coefficient a_n.

If the absorption coefficients are unknown, an alternative formula for d_c is the following:

\[
d_c = \sqrt{\frac{Q\,V}{100\pi\,RT_{60}}} \tag{1.8}
\]

where V is the room volume [m³] and RT_60 is the estimated reverberation time (see footnote 2) at 60 dB [s].
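
Both expressions for the critical distance are easy to evaluate. The Python sketch below is illustrative only: the room (5 m x 4 m x 3 m), its absorption coefficients and the 0.72 s reverberation time are invented example values, and the function names are hypothetical.

```python
import math


def critical_distance_from_absorption(surfaces, q=1.0):
    """d_c from a list of (area_m2, absorption_coefficient) pairs."""
    total_absorption = sum(area * alpha for area, alpha in surfaces)
    return math.sqrt(q * total_absorption / (16.0 * math.pi))


def critical_distance_from_rt60(volume_m3, rt60_s, q=1.0):
    """d_c from room volume and reverberation time, as in (1.8)."""
    return math.sqrt(q * volume_m3 / (100.0 * math.pi * rt60_s))


# Example room: carpeted floor, plaster ceiling and walls (assumed coefficients).
surfaces = [
    (20.0, 0.30),  # floor
    (20.0, 0.10),  # ceiling
    (54.0, 0.10),  # four walls
]
d_abs = critical_distance_from_absorption(surfaces)
d_rt = critical_distance_from_rt60(volume_m3=60.0, rt60_s=0.72)
print(f"d_c from absorption: {d_abs:.2f} m")
print(f"d_c from RT60:       {d_rt:.2f} m")
```

Both routes give a critical distance of about half a metre for this small room, so a listener a few metres from the source would already be well inside the reverberant field.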

Figure A.5 The reflection order is the number of reflections the sound undergoes before reaching the listener/receiver [13]

Footnote 2. Reverberation time (RT60, RT30, RT20, RT10): typically defined as the time interval the sound energy takes to
decay by 60, 30, 20 or 10 dB after the excitation in the room has ceased.

Annexe B: The human ear

The pinna, the outermost part of the ear, is a cartilaginous structure whose purpose is to convey the
sound, through the auditory canal, towards the eardrum. The sound is transmitted
to the ossicles (malleus, anvil and stirrup), which form a system of levers behaving like an
amplifier. As a result, the force acting on the oval window is three times that acting
on the malleus, and the acoustic pressure is 2 orders of magnitude greater than the pressure
on the eardrum [4]. The cochlea is the centre of the auditory process. It has the shape of a snail
and is about 35 mm long, with a diameter of 2 mm at its external entrance, gradually shrinking.
Inside the cochlea is the basilar membrane, covered in ciliated cells. When the membrane
vibrates, the cilia bend and this movement causes the nerve endings of the corresponding cells to
transmit impulses to the brain, whose number depends on the frequency of the sound wave. In
other words, the cochlea behaves like a filter bank [4]: it performs a spectrum analysis and
transduces frequency information into an electric signal sent to the brain, according to the
position of the resonator involved.

Figure B.6 Ear anatomy [David Darling Encyclopedia of Science] : (1) pinna (2) lobe (3) auditory channel (4) eardrum (5) ossicles
stirrup, anvil, malleus (6) Eustachian tube (7) oval window (8) saccule (9) semicircular canals (10) cochlea

References

1. R. M. Stern, G. J. Brown and D. Wang, Binaural Sound Localization, in Computational Auditory
Scene Analysis: Principles, Algorithms, and Applications, Wiley-IEEE Press, 2006

2. F. L. Wightman and D. J. Kistler, Factors affecting the relative salience of sound localization
cues, in Binaural and Spatial Hearing in Real and Virtual Environments, 1-23, Lawrence
Erlbaum Associates, Mahwah, NJ, 1997

3. B. Carty and V. Lazzarini, Binaural HRTF based spatialisation: new approaches and
implementation, 12th Int. Conference on Digital Audio Effects (DAFx-09), Como, Italy,
September 1-4, 2009

4. A. Uncini, Audio Digitale, 3:80-81, McGraw-Hill, 2006

5. W. H. Hartmann, Localization of sound in rooms, Journal of the Acoustical Society of
America, 74(5):1380-1391, 1983

6. J. F. Culling, Q. Summerfield and D. H. Marshall, Effects of simulated reverberation on the use
of binaural cues and fundamental frequency differences for separating concurrent vowels,
Speech Communication, 14:71-95, 1994

7. G. J. Brown and K. Palomäki, Reverberation, in Computational Auditory Scene Analysis:
Principles, Algorithms, and Applications, Wiley-IEEE Press, 2006

8. J. S. Bradley, Predictors of speech intelligibility in rooms, Journal of the Acoustical Society of
America, 80(3):837-845, 1986

9. R. H. Bolt and A. D. MacDonald, Theory of speech masking by reverberation, Journal of the
Acoustical Society of America, 21(6):577-580, 1949

10. B. Libbey and P. H. Rogers, The effects of overlap-masking on binaural reverberant word
intelligibility, Journal of the Acoustical Society of America, 115:3141-3151, 2004

11. American National Standards Institute, American national psychoacoustical terminology,
S3.20, 1973, American Standards Association

12. J. P. A. Lochner and J. F. Burger, The influence of reflections on auditorium acoustics,
Journal of Sound and Vibration, 1(4):426-454, 1964

13. F. A. Everest, The Master Handbook of Acoustics, McGraw-Hill, 2009