Dissertation
submitted to the Faculty of Science (Mathematisch-Naturwissenschaftliche Fakultät)
of the Eberhard Karls Universität Tübingen
in fulfillment of the requirements for the degree of
Doctor of Natural Sciences
(Dr. rer. nat.)
presented by
M.Sc. Mansoor Hyder
from Naushahro Feroze (Sindh), Pakistan
Tübingen
2011
Date of the oral examination: 14.12.2011
Dean: Prof. Dr. Wolfgang Rosenstiel
First reviewer: Dr.-Ing. Christian Hoene
Second reviewer: Prof. Dr. Thomas Walter
Dedicated
To
My Mother
Nawab Khatoon Depar
My Father
Ghullam Hyder Depar
&
Also
To
Subhan Khatoon Depar
Muhammad Soomar Khan Depar
Rabia Hyder Depar
Shahid Hyder Depar
Abstract
The invention of telephony has brought a significant revolution to our lives and is
undoubtedly considered one of the most important inventions of the modern-day world.
Yet over the last decades, hardly any improvements in audio quality have been achieved.
Telephony still suffers from issues such as low speech intelligibility, poor audio quality
and extraneous noise. To improve the quality of telephony, the use of spatial (or 3D)
audio has been proposed. 3D audio can offer significant advantages, such as enhanced
overall audio and speech quality, since our natural listening ability is inherently three
dimensional. Here, the nature of the Virtual Acoustic Environments (VAEs) used in
most 3D audio simulations plays a very important role in the perception of spatial
audio. Due to the importance of VAEs, there is a need to study various VAE parameters
in order to properly design virtual acoustic rooms for better audio quality, speech
intelligibility and enhanced localization performance.
This thesis introduces a telephony and teleconferencing system supporting three
dimensional audio and customizable virtual acoustic environments. The system consists
of a VoIP based telephone extended by low-delay audio codecs, three dimensional
renderers, and headphones extended by head-tracking sensors.
This thesis also presents a series of experiments conducted to optimize the 3D
telephony system. In the experimental study, various parameters are considered to
validate the speech quality, locatability and speech intelligibility of the teleconferencing
participants. Within two different VAEs, seven different placements of participants were
studied. In addition, eleven sets of user experiments are described in this thesis that
examine the effects of simulated acoustic room properties, virtual sitting arrangements,
reflections of a conference table, the number of concurrent talkers and voice
characteristics. This thesis also presents live conversational tests with three interlocutors
to compare the audio quality of mono, stereo and spatial conversations with and without
head-tracking.
A conceptual and holistic Quality of Experience (QoE) model comprising all domains
of a communication ecosystem and the relationships between QoE and virtual acoustic
environments is also presented. The model is evaluated through user studies and
empirical analysis. Based on this model, a use case study is presented for three
dimensional telephony. The interaction and classification of QoE factors and contextual
aspects are also presented.
Kurzfassung
The invention of the telephone has had a significant influence on modern life and can
undoubtedly be regarded as one of the most important inventions of the modern age.
Yet over the last decades, the speech quality of this invention has hardly improved.
Telephone users still suffer from problems such as poor intelligibility, deficient audio
quality and interfering noise. To remedy these problems, the use of spatial sound (3D
audio) has been proposed. 3D audio offers considerable advantages for audio and speech
quality, since humans naturally hear spatially. The nature of the Virtual Acoustic
Environment (VAE) exerts a strong influence on the perception of spatial sound. For
this reason, it is necessary to examine the influence of different VAE parameters on
speech intelligibility, audio quality and locatability.
This thesis therefore presents a telephone and teleconferencing system that supports
3D audio technologies and individually configurable VAEs. The system consists of a
VoIP softphone extended by low-delay codecs, 3D renderers and headphones with
head-tracking sensors.
Furthermore, this thesis describes a series of experimental measurements for
optimizing the telephone system. The experiments comprise the evaluation of various
parameters describing the speech quality, speech intelligibility and locatability of the
participants of a teleconference. Seven different participant placements were evaluated
in two different VAEs. In eleven further scenarios, the influence of different environment
parameters, participant placements, a conference table, and the number of simultaneous
talkers and their voice characteristics was examined. Additionally, conversational tests
were conducted that measure the influence of mono, stereo and spatial sound, as well as
the use of head-tracking headphones, on audio quality.
Finally, this thesis describes a conceptual, holistic Quality of Experience (QoE)
model that encompasses all domains of a communication ecosystem as well as the
relationships between QoE aspects and VAEs. An evaluation of the model by means of
user studies and empirical analysis is presented. Based on this model, a case study
focusing on 3D telephony is described, and a classification of QoE factors and their
interactions with contextual aspects is presented.
Acknowledgments
The research work for this thesis was conducted at the Wilhelm Schickard Institut für
Informatik (Computer Science) at the Eberhard Karls Universität Tübingen, Germany.
The preparation of this thesis was supported by the Higher Education Commission (HEC)
Pakistan in collaboration with the Deutscher Akademischer Austausch Dienst (DAAD),
Germany.
My enormous thanks and gratitude go to my guide and supervisor Dr.-Ing. Christian
Hoene for his constant support and technical guidance throughout my PhD research.
I am thankful to him for inviting me into his research group to conduct the PhD research
work for the accomplishment of this thesis. He is one of the most wonderful persons I
have ever met in my life. He always kept me motivated to achieve this thesis work and
provided me his kind support and technical insights whenever they were needed. I am
also thankful to all the teachers, coworkers and staff at the Wilhelm Schickard Institut
für Informatik, the University of Tübingen, for their help and support in making this
dissertation possible. Especially, I am thankful to Prof. Dr. Andreas Zell for reviewing
my annual HEC-DAAD reports. I am also thankful to Prof. Dr. Michael Menth for
his kind support. I would also like to thank my second supervisor Prof. Dr. Thomas
Walter for his kind support and time. I would also like to thank my current workmates,
especially Michael Haun, with whom I also published research work in co-authorship,
for always being so helpful and kind. I am also thankful to Olesja Weidmann, Patrick
Schreiner, Stefan König, Mark Schmidt, Michael Höfling, Alfons Martin and Susanna
Uresch for their help and support.
The work in Chapter 7 was achieved in collaboration with Institut Telecom Sud Paris,
Evry, France. I am thankful to Professor Dr. Noel Crespi and M.Sc. Khalil ur Rehman
Laghari for their collaborative work. The outcome of our collaboration has been
formulated as a journal article. I also thank all subjects who participated in the user
studies, without whom this study would not have been possible. I express much
appreciation to my thesis supervisors and examiners for their guidance.
I would also like to acknowledge all my friends and family who supported me in
various ways to achieve this thesis work. I would like to thank all friends who helped me
proofread this thesis. I am also thankful to my friends in Tübingen and in Germany
for their time and valuable discussions. I am thankful to the friends whom I met in
Tübingen, especially Lala Faisal khan Bangash, Mian Irfan Ghani, Zaigham Mahmood,
Zafar Iqbal, Aftab Ali Shah, Iftikhar Alam Khatak, Faisal Shahzad, Kahsif Jilani, Khaver
Saeed, Muhammad Raza, Umer Zeb, Uwe Schmidt and Yasir Niaz Khan for their nice
company in the evenings and on weekends. I am also thankful to my friends who were
living in other cities of Germany for their encouragement and support. Especially, I am
thankful to Azad Ali Wassan, Shahid Hussain Danwar, Syed Saif-ur-Rehman and Jam
Raja Ghazanfar Ali Sahito.
I am also very thankful to my family, especially my mother Nawab Khatoon Depar,
who loved me so much. She passed away while I was pursuing my PhD studies. I love
you Amaan, miss you. I am thankful to my father Ghullam Hyder Depar, who is the
greatest motivational force for me. He always supported me in every way to achieve a
better education and constantly urged me to work hard. Love you Baba Saein.
Also, I express my love and gratitude to Muhammad Soomar Khan Depar and Subhan
Khatoon Depar for their love and prayers; love you both. Many thanks to all my brothers
and sisters for their love and support. I am particularly thankful to my wife Rabia, who
always supported and encouraged me at every stage of my life and particularly supported
me during the PhD research work through her endurance and love. My special thanks to
my son Shahid Mansoor Hyder, who was born in Tübingen, Germany during my PhD
work and to whom I could not give proper time during the last several months. But I
promise him that this will change from now on.
I am really grateful to Allah Azz Wa Jall for his countless blessings on me and his help,
for whatever I am and whatever I have. Ya Allah Azz Wa Jall! Open the portal of
knowledge and wisdom for me, and have mercy on me and all of us! O the One, who is
the most Honorable and Glorious!
Contents
1 Introduction
1.1 Contributions
1.2 Outline
2 Background
2.1 Hearing: Ear the listening organ
2.2 Binaural Hearing
2.2.1 Sound Localization
2.2.2 Sound Localization Cues
2.2.3 Cone of Confusion
2.3 Binaural Technology
2.3.1 3D sound
2.3.2 Ambisonics
2.3.3 Wave-Field Synthesis
2.4 3D Audio recording
2.4.1 Dummy Head
2.4.2 B-format
2.5 3D Audio Reproduction
2.5.1 Head Related Transfer Functions
2.5.1.1 Individualized vs Generic HRTFs
2.5.2 Amplitude Panning
2.6 Acoustics
2.6.1 Virtual Acoustics
2.6.1.1 Image Source Technique
2.6.1.2 Beam Tracing Technique
2.6.2 Reflections Early and Late
2.6.3 Reverberation
2.6.4 Signal-to-Noise Ratio
2.7 Different Head-Tracking Technologies
2.7.1 Virtual Acoustic Environment and 3D Sound Localization
2.7.2 Acoustic-based Trackers
2.7.3 Video-based Tracker
2.7.4 Accelerometer/magnetometer-based tracker
2.7.5 Inertial/magnetometer-based trackers
6 Conversational Tests
6.1 Test design
6.2 Test description
6.2.1 Results
6.3 Summary
8 Conclusions
8.1 Outlook on Future Research
Bibliography
List of Tables
2.1 Basic algorithms of virtual acoustic simulation
C.1 Descriptive statistics for conversational test for mono audio quality
C.2 Descriptive statistics for conversational test for stereo audio quality
C.3 Descriptive statistics for conversational test for spatial audio quality
C.4 Descriptive statistics for conversational test for spatial-HT
Chapter 1
Introduction
The telephone has special significance in our lives and is undoubtedly considered one of
the most important inventions of the modern-day world. The invention of the
telephone [Bell, 1876] has brought a revolution in the way people communicate
personally and professionally. According to [ITU, 2010], there were five billion mobile
cellular subscribers globally, including 940 million subscriptions to 3G services, at the
end of the year 2010. In 2009, fixed-line telephony alone already accounted for
1.19 billion subscribers. Despite the very significant growth in the number of subscribers
to fixed, mobile and VoIP services, the audio quality of calls for telephony and
teleconferencing has not improved.
Humans’ natural listening ability is three dimensional. Human beings perceive sounds
from all distances and directions with spaciousness. Three dimensional listening also
gives humans the ability to locate the origin of auditory events accurately. The
technological requirements for reproducing the human ability of three dimensional
hearing in computational systems, i.e., for generating the same sound at the listener’s
eardrums as a real sound source would have produced, are also known [Kim et al., 2005a].
On the other hand, the literature suggests the advantageous use of 3D audio for
telephony and teleconferencing [Kilgore et al., 2003, Yankelovich et al., 2006, Ahrens
et al., 2010]. 3D audio helps to improve overall audio quality and to overcome problems
such as the “Cocktail Party Effect” reported in [Yankelovich et al., 2004]. However, to
the author’s knowledge, there is as yet no potent product or service in the
telecommunication industry that best utilizes humans’ natural listening ability, which is
inherently three dimensional. Additionally, the virtual acoustic environment, which is
part of most 3D audio simulations, has gained a lot of attention over the years. Studying
virtual acoustic environments is essential because three dimensional audio telephony
and teleconferencing can be further improved by properly selecting virtual acoustic
parameters.
The main focus of this thesis work is to design, develop, test and enhance a three
dimensional telephony and teleconferencing system. With three dimensional telephony,
users should be able to make one-to-one and one-to-many telephone calls with enhanced
audio perception, better understandability of the speech of concurrent talkers and
increased localization performance.
Another aim of this thesis work is to design and study a Virtual Acoustic Environment
(VAE) for a three dimensional telephony and teleconferencing system. 3D telephony
based on a VAE helps the participants of a conference call to spatially separate each
other, to locate concurrent talkers in the virtual acoustic space and to understand speech
with clarity. A VAE also provides the freedom to modify specifications such as the
virtual room size and the conference table size and shape, and to place the call
participants at a specific distance and direction according to their own requirements and
comfort.
Another aim of this thesis work is to present a conceptual and holistic Quality of
Experience (QoE) model comprising all domains of a communication ecosystem, such
as technical aspects, business models, human behavior and contextual aspects, and to
evaluate the relationship between QoE and VAEs through user studies and empirical
analysis. The main contributions of this PhD thesis work are listed in the next section.
1.1 Contributions
The main contributions of this thesis work include:
• The design of a three dimensional telephone system aiming for comfortable and
mobile usage at low costs (Chapter 3).
• The implementation of four head-tracking devices: (1) the Nintendo Wii remote
control, using the VRUI Virtual Reality toolkit by Kreylos [2008]; (2) a simple
keyboard tracker that translates key strokes into translations and rotations;
(3) a tracking simulator to test and display the effects of changes in position and
orientation on sound rendering; and (4) PNI Sensor Corporation’s SpacePoint Fusion
tracker (Chapter 3).
• User studies based on virtual acoustic rooms and on the placement of participants
of a virtual teleconferencing system, analyzing the impact of two different HRTF
sets (with different channels and frequency bands), two room sizes, and different
heights of the listener and talkers on the audio quality, understandability and
locatability of virtual participants (Chapter 4).
• User studies based on virtual acoustic rooms to examine the effects that simulated
acoustic room properties, virtual sitting arrangements, reflections of a conference
table, the number of concurrent talkers and voice characteristics have on the
perception of speech quality, locatability and speech intelligibility in a 3D
teleconferencing system (Chapter 5).
• User studies to obtain subjective scores for conversational tests with three
interlocutors, comparing the audio quality of mono, stereo and spatial sound with
and without head-tracking (Chapter 6).
1.2 Outline
This thesis is organized as follows. Chapter 2 serves as a background study. Design
and implementation of three dimensional telephony has been discussed in Chapter 3.
In Chapter 4, user studies based in virtual acoustic rooms to evaluate the placement of
participants have been presented. In Chapter 5 user studies based on different virtual
acoustic rooms have been discussed. Chapter 7 discusses quality of experience modeling
in communication ecosystem. Also, a case study of three dimensional telephony has
been presented. This study is concluded in Chapter 8.
Chapter 2
Background
This chapter provides a background for various portions of this thesis. In Section 2.1,
a brief overview of the anatomy and the physiology of the human ear is presented.
In Section 2.2, binaural hearing, sound localization, localization cues and the
phenomenon of the cone of confusion are discussed. In Section 2.3, an overview of 3D
sound technologies such as binaural technology, ambisonics and wave field synthesis
is presented. In Section 2.4, a brief overview of 3D audio recording technologies is
given. In Section 2.5, a short introduction to 3D audio reproduction methods such as
Head Related Transfer Functions (HRTFs) and Amplitude Panning (AP) is presented.
Section 2.6 gives a short introduction to acoustics; additionally, virtual acoustics,
virtual acoustic techniques, reflections in virtual acoustic rooms, reverberation and 3D
sound localization in virtual acoustic environments are briefly discussed. In Section 2.7,
a short review of head-tracking technologies is presented. In Section 2.8, speech,
conversational and audio quality assessment methods are briefly discussed, and
subjective and objective testing procedures are also considered. Section 2.9 summarizes
this background.
Comprehensive reviews of the areas mentioned above are available in the literature and
are referred to where appropriate in the text.
2.1 Hearing: Ear the listening organ
The ear is a series of interlinked structures which provides humans with the sense of
hearing. The ear collects sound waves as pressure changes in the air and sends them to
the brain. A depiction of a human ear is provided in Fig. 2.1 (adapted from [Karjalainen,
2011]). The outer part of the ear consists of the pinna and concha, which lead to the
eardrum via the ear canal. The outer ear acoustically filters incoming sound waves,
which vibrate the eardrum.
The middle ear is an air-filled cavity that contains the ossicular chain, consisting of the
small bones called the malleus, incus and stapes, and is bounded by the eardrum. As
reported, “The prime function of the middle ear is to transmit the vibrations of sound in
air gathered at the tympanic membrane to the fluid of the inner ear at the oval
window” [Irwin, 2006].
The inner ear contains the cochlea and the semicircular canals. “The inner ear is an
intricately shaped membranous tube suspended within a bony tube – the
labyrinth” [Irwin, 2006].
The cochlea is a coiled tube, located in the temporal bone of the skull, which is divided
along its length by membranes into three fluid-filled compartments. The vibration of
the eardrum causes pressure waves to travel through the fluid of the cochlea, setting up
traveling waves in the lower basilar membrane, which is approximately 35 mm long.
The organ of Corti sits on the basilar membrane and includes several rows of hair cells,
which are in contact with the tectorial membrane. When the basilar membrane vibrates,
there is a difference in motion between the basilar and tectorial membranes, which
causes the hairs of the hair cells to bend. Bending of the hairs causes the hair cells to
send impulses to the auditory nerve. Put simply, these impulses are understood as sound
by the brain.
For further details on the anatomy and physiology of human hearing readers are
referred to [Pickles, 1988, Irwin, 2006].
2.2 Binaural Hearing
Binaural hearing is defined as the process required to use two ears to perceive the location
of sound sources [Wightman and Kistler, 1997]. The Duplex theory presented by [Strutt,
1907] was the first extensive analysis of the physics of the binaural perception of audio
and this theory is still considered as valid. As [Strutt, 1907] noted, two physical cues
dominate the perceived location of an incoming sound source (Fig. 2.2 and 2.3), sound
arrives slightly earlier in time at the ear which is physically closer to the source and
with somewhat greater intensity. This produces a Interaural Time Difference (ITD)
because sound takes longer time to reach at the ear which is farther from the source. An
Interaural Intensity Difference (IID) is also produced because of the shadowing effect
of the head which prevents some of the incoming energy to reach the ear which is
farther from the source [Cheng and Wakefield, 1999]. Binaural hearing also enables us
to selectively attend to an individual conversation when there are many people having
conversations at the same time which is termed as “Cocktail Party Effect” [Cherry,
1953, Crispien and Ehrenberg, 1995, Brungart et al., 2007]. Conclusively, binaural
hearing underpins our ability both to localize sound sources and to attend selectively to
one talker among many.
2.2.1 Sound Localization
Sound localization is a complex human process. To determine the position of a sound
source, humans normally take advantage of binaural hearing. According to [Blauert,
1997],
“Localization is the law or rule by which the location of an auditory event (e.g., its
direction or distance) is related to a specific attribute or attributes of a sound event, or
of another event that is in some way correlated with the auditory event”.
The position of a sound source is cued by differences in the arrival time of sound and
by differences in sound intensity at the two ears. The following sections describe these
localization cues in detail.
2.2.2 Sound Localization Cues
Normally, the human auditory system utilizes different acoustical cues to achieve static
sound localization. According to [Strutt, 1907, Begault, 1994, Blauert, 1997, Tsakostas
et al., 2007], and as explained in [Hirahara et al., 2011], the first acoustical cue is the
ITD, defined as the difference in arrival times of a sound’s wavefront at the left and
right ears, normally derived from lower frequency components below 1.5 kHz. The
second acoustical cue is the Interaural Intensity Difference (IID), defined as the
amplitude difference generated between the right and left ears by a sound in the free
field, normally derived from higher frequency components above 1.5 kHz. An
illustration of ITD and IID is presented in Figs. 2.2 and 2.3.
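As a quantitative illustration, the standard spherical-head approximation often attributed to Woodworth (a textbook result, not derived in this thesis) estimates the ITD for a source at azimuth \theta as

    \mathrm{ITD}(\theta) \approx \frac{r}{c}\,(\theta + \sin\theta)

where r is the head radius and c is the speed of sound. With r \approx 8.75 cm and c \approx 343 m/s, the maximum ITD is roughly 0.66 ms for a source at \theta = 90°.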
According to [Vorländer, 2008], “At the two ears, the sound signals arrive with
differences in time and amplitude. Sound from a source located at the side of the head
travels a longer time to the contralateral ear1 and suffers frequency-dependent damping
due to diffraction and absorption. Both effects are noticeable as differences between the
ear signals, as interaural time differences and interaural level differences”.
The third acoustical cue is the spectral cue, which is defined as the spectral notches
appearing in the mid-frequency range of around 4 to 14 kHz of the amplitude spectrum
of HRTFs. The three cues discussed above are known as static cues [Blauert, 1997].
1 The contralateral ear is defined as the ear in the shadow zone of the head.
When either the sound source or the listener moves, all these static localization cues
change. Such a change in a static localization cue may be called a dynamic cue. With
dynamic cues, not only moving but also static sounds can be localized, since head
movements tend to turn a physically static sound into a perceptually dynamic
sound [Hirahara et al., 2011].
For further reading readers are referred to [Begault, 1994, Blauert, 1997, Hirahara
et al., 2011].
2.3 Binaural Technology
2.3.1 3D sound
Natural human hearing encounters different sounds every day, from different directions
and distances. Natural hearing is defined as how we hear sounds spatially in everyday
life: with uncovered ears, with our head moving, and in interaction with other sensory
input. Human hearing is inherently three dimensional: we experience not only the
horizontal and vertical direction of a sound but also its distance [Begault, 1994, Ericson
and McKinley, 1997], and this is where we encounter the term 3D sound (Fig. 2.5).
In the literature, equivalent designations for 3D sound include virtual acoustics, binaural
sound and spatial audio. Fundamentally, all of these designations refer to techniques
where the outer ears (the pinnae) are either directly implemented or modeled as digital
filters [Begault, 1994].
3D sound is defined as the simulation of a 3D sound field for a real environment using
various techniques: a “3D sound system uses processes that either complement
or replace spatial attributes that existed originally in association with a given sound
source” [Begault, 1994]. According to [Begault, 1994], 3D sound refers to sound which
lets a listener discern significant spatial cues of a sound source, such as direction,
distance and spaciousness. Generating 3D sound therefore means that one can place a
sound anywhere in three dimensional space: left or right, up or down, near or
far [Begault, 1994, Kim et al., 2004, Low and Babarit, 1998, Lee et al., 1998].
2.3.2 Ambisonics
2.3.3 Wave-Field Synthesis
The Wave-Field Synthesis (WFS) concept was introduced by [Berkhout, 1988]. With
this technology, a sound field with natural temporal and spatial properties can be
generated within a volume or area bounded by arrays of loudspeakers [De Vries and
Boone, 1999].
For further details, readers are referred to [Berkhout, 1988, Berkhout et al., 1993,
Boone et al., 1995, De Vries and Boone, 1999, Brandenburg et al., 2004].
2.4 3D Audio recording
2.4.2 B-format
The B-format is a four-channel recording standard that uses a sound field microphone.
B-format consists of four channels: W, X, Y and Z [Gerzon, 1973]. The W channel
represents the acoustic pressure at a point in space, while the other channels represent
the components of the pressure gradient in the left-right (X), front-back (Y) and
up-down (Z) directions. The X, Y and Z signals stem from figure-of-eight microphones,
whereas the W channel is fed from an omnidirectional microphone [Vorländer, 2008,
Barrett and Berge, 2010]. The directional patterns of the four microphones of a B-format
microphone are presented in Fig. 2.9 (adapted from [Vilkamo, 2008]).
The B-format is a technique of Ambisonics which was developed based on the work
of [Gerzon, 1973, 1974]. Any source material such as synthesized sound or mono
recordings can be positioned or moved within a B-format sound field. For further
reading, please refer to [Malham and Myatt, 1995, Vilkamo, 2008].
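For synthetic source material, the standard first-order B-format encoding equations (a textbook formulation, not specific to this thesis) place a mono signal s(t) at azimuth \theta and elevation \phi as follows:

    W(t) = \tfrac{1}{\sqrt{2}}\, s(t), \quad
    X(t) = s(t)\cos\theta\cos\phi, \quad
    Y(t) = s(t)\sin\theta\cos\phi, \quad
    Z(t) = s(t)\sin\phi

The factor 1/\sqrt{2} on the W channel is the conventional normalization that balances the omnidirectional channel against the three gradient channels.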
2.5 3D Audio Reproduction
The term 3D audio reproduction is defined as recreating the acoustic signals at the ears
of the listener in such a way that these signals are equal to a recorded or synthesized
audio scene [Blauert, 1997]. Headphones are typically used for 3D audio reproduction
with HRTFs. However, the amplitude panning method has also been described in the
following sections to cover 3D audio reproduction using loudspeakers based solution.
3D audio reproduction with HRTFs is of particular interest in the context of this thesis,
since HRTFs were utilized to reproduce 3D audio signals on headphones throughout
this research work.
2.5.1 Head Related Transfer Functions
The filtering effect of the head, torso and pinna on sound traveling from a source to the
eardrum, expressed in the frequency domain, is called the Head Related Transfer
Function (HRTF) [Blauert, 1997]. Digital sounds can be processed with HRTFs to
produce spatial audio signals that help the listener to believe that the sound emanates
from the corresponding virtual source location [Park et al., 2005].
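As a minimal sketch of this processing step, the following Python fragment convolves a mono signal with a left/right pair of head-related impulse responses (the time-domain counterparts of HRTFs). The file names are hypothetical; any measured HRIR set stored as WAV files could be substituted.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import fftconvolve

    # Hypothetical inputs: a mono source and an HRIR pair for one direction.
    fs, mono = wavfile.read("speech_mono.wav")
    _, hrir_left = wavfile.read("hrir_az30_left.wav")
    _, hrir_right = wavfile.read("hrir_az30_right.wav")

    mono = mono.astype(np.float64)
    # Filtering with the left/right impulse responses reproduces the
    # direction-dependent cues at each ear.
    left = fftconvolve(mono, hrir_left.astype(np.float64))
    right = fftconvolve(mono, hrir_right.astype(np.float64))

    binaural = np.stack([left, right], axis=1)
    binaural /= np.max(np.abs(binaural))  # normalize to avoid clipping
    wavfile.write("binaural_out.wav", fs, binaural.astype(np.float32))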
2.5.1.1 Individualized vs Generic HRTFs
HRTFs can be either individualized or generic, since there are anthropometric shape
and size differences among subjects [Wenzel et al., 1993]. Due to these differences, the
literature suggests measuring individual HRTFs experimentally for each
subject [Begault et al., 2001]. Individualized HRTFs are based on measurements that
take each person’s unique physical head properties into account. Measuring HRTFs
requires special equipment, facilities and expertise, which are very difficult to
obtain [Hu et al., 2008].
Furthermore, the literature [Genuit, 1984, Shaw, 1997] suggests that exact HRTFs are
complicated to obtain; rather, their general behavior can be estimated from fairly simple
geometric models of the torso, head and pinna. For understanding and estimating
HRTFs, as suggested by [Genuit, 1984] and reported by [Algazi et al., 2001], a set of 27
anthropometric measurements is used: 17 for the head and torso (Fig. 2.11, adapted
from [Algazi et al., 2001]) and 10 for the pinna (Fig. 2.10, adapted from [Algazi et al.,
2001]). Further, like fingerprints, human pinnae are not identical and vary widely in
shape and size; consequently, HRTFs also vary, which makes it difficult to generalize
the spectral characteristics across large numbers of individuals [Rumsey, 2001].
On the other hand, generic HRTFs are a mathematical combination of multiple
individualized HRTFs. For speech content, it does not matter whether individualized
or generic HRTFs are used [Begault et al., 2001]. Also, if an HRTF is omitted, the
externalization is weak [Begault et al., 2001]; the sound appears to come from
“inside-the-head”.
Non-individualized HRTFs have been cited in the literature [Fisher and Freedman, 1968,
Weinrich, 1982, Bronkhorst, 1995, Møller et al., 1996] as degrading localization
accuracy, decreasing externalization and increasing reversal errors. However, the
mentioned reports are based on full-spectrum noise stimuli. [Bronkhorst, 1995] found
no significant effect of using individualized HRTFs on reversals, whereas the results
reported by [Wenzel et al., 1993] indicated that individualized HRTFs mitigated reversal
confusions. According to [Møller et al., 1996], non-individualized HRTFs resulted in
increased reversals; however, they reported no effect on externalization in their
experimental results, since their experiments were based on speech stimuli.
2.5.2 Amplitude Panning
With amplitude panning, moving or stationary sounds can be positioned in any direction
in the sound field spanned by the loudspeakers (Fig. 2.12, adapted from [Pulkki, 2001b]).
Vector Base Amplitude Panning (VBAP) was introduced by [Pulkki et al., 1996] and is
utilized to position virtual sources in arbitrary 2-D or 3-D loudspeaker setups, where the
same sound signal is applied to a number of loudspeakers with appropriate non-zero
amplitudes (Fig. 2.13, adapted from [Pulkki, 2001b]). VBAP can be generalized to 3-D
loudspeaker setups as a triplet-wise panning method [Pulkki, 1997]. A sound signal is
then applied to one, two, or three loudspeakers simultaneously. VBAP has certain
advantages compared to earlier virtual source positioning methods for arbitrary layouts.
Previous methods either used all loudspeakers to produce virtual sources, which results
in some artifacts, or they used loudspeaker triplets with a non-generalizable 2-D user
interface [Pulkki and Karjalainen, 2001].
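The gain computation at the heart of VBAP is compact. The following sketch is a two-dimensional, pairwise illustration of the vector base formulation, written for this text rather than taken from the thesis implementation: it solves for the two loudspeaker gains that reconstruct the source direction and normalizes them for constant loudness.

    import numpy as np

    def vbap_pair_gains(source_az_deg, spk1_az_deg, spk2_az_deg):
        """Pairwise 2-D VBAP: express the source direction p as a linear
        combination p = g1*l1 + g2*l2 of the loudspeaker unit vectors."""
        def unit(az_deg):
            a = np.radians(az_deg)
            return np.array([np.cos(a), np.sin(a)])
        L = np.column_stack([unit(spk1_az_deg), unit(spk2_az_deg)])
        g = np.linalg.solve(L, unit(source_az_deg))  # invert the vector base
        return g / np.linalg.norm(g)  # normalize for constant loudness

    # Example: a source at 20 degrees between loudspeakers at 0 and 45 degrees.
    print(vbap_pair_gains(20.0, 0.0, 45.0))

For a source inside the loudspeaker pair, both gains come out non-negative; the triplet-wise 3-D case replaces the 2x2 base with a 3x3 matrix of loudspeaker unit vectors.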
2.6 Acoustics
Acoustics is the science concerned with the production, control, transmission, reception
and effects of sound. The term acoustics is derived from the Greek akoustos, meaning
“hearing” [Britannica, 2011]. As described in [Vorländer, 2008], the study of the
generation, transmission, reception, cognition and evaluation of sound waves may be
called acoustics.
The scope of acoustics is twofold. First, it involves sound, which might be generated by
mechanical radiation from natural causes or by human activity. Second, the generated
sound has a psychological influence on the sensation of human hearing, in areas with a
particularly strong association with human listening, such as speech, music, sound
recording and reproduction, telephony and audiology [Pierce, 1989]. Within the scope
of this thesis, acoustics relating to
speech, recording and reproduction, and telephony is of interest. However, we
specifically present it from the perspective of virtual acoustics, which is discussed in
the following subsections.
For the acoustic quality of small to medium-sized rooms, the German standard [Beuth
Verlag, 2004] is a good guideline to follow. This standard contains a detailed description
of acoustic quality requirements for small to medium-sized rooms of up to 5000 m³ in
volume. It also specifies design guidelines for maintaining good acoustic quality for
spoken communication in such rooms, considering three main components: speaker,
transmission and hearing/understanding. For further reading, please refer to [Beuth
Verlag, 2004].
2.6.1 Virtual Acoustics
The term virtual acoustics is often used as a subset of Virtual Reality (VR)
techniques [Burdea and Coiffet, 2003] or as an integration of acoustics into
VR [Vorländer, 2008]; normally it includes simulation of the source, the acoustic space
and the receiver. VR is an environment generated in the computer which the user can
operate and interact with in real time [Vorländer, 2008]. Other definitions include:
digitally processing sounds so that they appear to come from particular locations in three
dimensional space, with the goal of simulating the complex acoustic field experienced
by the listener within a natural environment. This concept is also known as auralization
or three dimensional sound [McGraw-Hill, Dictionary., 2011]. It is also worth
mentioning that two keywords, auralization and rendering, are frequently used in the
field of virtual acoustics. Auralization, in its broad sense, can be defined as the
processing of acoustic effects, primary sound signals or means of sound reinforcement
or sound transmission into an audible result [Vorländer, 2008]. Rendering can be defined
as the process of generating the cues for the respective senses (3D image, 3D audio,
etc.) [Vorländer, 2008].
The first virtual acoustic software was developed and used as early as 1968
by [Krokstad et al., 1968]. Virtual acoustic simulation is normally done with techniques
such as the image source method [Allen and Berkley, 1979, Borish, 1984], ray
tracing [Krokstad et al., 1968] or beam tracing [Funkhouser et al., 1998]. These
techniques are described further in the following subsections. Additionally, basic
algorithms of virtual acoustic simulation are presented in Table 2.1 (adapted
from [Vorländer, 2008]).
Further, relating to virtual acoustics, the literature suggests the importance of early
reflections, which positively enhance the direct sound [Wallach et al., 1949, Haas,
1951]. Additionally, it has been reported [Begault et al., 2001] that reverberation is very
important for improving the subjective realism and externalization achieved in virtual
spatial auditory displays. However, it is reported in the literature [Mershon et al., 1989,
Zahorik, 2002, Shinn-Cunningham, 2001, Begault et al., 2001] that reverberation in
ordinary closed environments is considered a reliable cue for identifying source
distance, but that it also modestly degrades directional perception [Santarelli, 2001] and
speech intelligibility [Houtgast, 1980, Payton et al., 1994]. Reverberation can cause
modest degradation of speech perception in multi-talker situations, where one needs to
concentrate on a talker of choice while ignoring other concurrent talkers [Houtgast,
1980, Shinn-Cunningham et al., 2001].
2.6.1.1 Image Source Technique
The image source technique computes specular reflection paths by considering virtual
sources generated by mirroring the location of the audio source over each polygonal
surface of the environment [Allen and Berkley, 1979, Borish, 1984] (Fig. 2.15, adapted
from [Funkhouser et al., 1998]). A study presented in [Funkhouser et al., 1998] notes
the robustness of the image source method: it guarantees that all specular paths up to a
given order or reverberation time will be found, at the cost of modeling only specular
reflections and an exponential growth of computational complexity.
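A minimal sketch of the mirroring step for first-order reflections in a rectangular (shoebox) room is given below; the uniform absorption coefficient and the 1/r distance attenuation are simplifying assumptions for illustration, not the model used by the renderers discussed later.

    import numpy as np

    SPEED_OF_SOUND = 343.0  # m/s

    def first_order_images(src, room):
        """Image sources for a shoebox room with walls at x=0, x=Lx,
        y=0, y=Ly, z=0, z=Lz; src is the source position, room = (Lx, Ly, Lz)."""
        src = np.asarray(src, dtype=float)
        images = []
        for axis in range(3):
            for wall in (0.0, float(room[axis])):
                img = src.copy()
                img[axis] = 2.0 * wall - src[axis]  # mirror across the wall plane
                images.append(img)
        return images

    def delay_and_gain(image, listener, absorption=0.3):
        d = np.linalg.norm(np.asarray(image) - np.asarray(listener))
        return d / SPEED_OF_SOUND, (1.0 - absorption) / d  # seconds, linear gain

    # Example: a 5 x 4 x 3 m room with source and listener inside it.
    for img in first_order_images([1.0, 2.0, 1.5], (5.0, 4.0, 3.0)):
        print(img, delay_and_gain(img, [4.0, 1.0, 1.2]))

Higher-order reflections are obtained by mirroring the image sources again, which is where the exponential growth in complexity mentioned above originates.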
2.6.1.2 Beam Tracing Technique
The beam tracing technique classifies reflection paths from a source by recursively
tracing pyramidal beams (sets of rays) through the environment [Heckbert and
Hanrahan, 1984].
A beam can be used to represent potential reflections, transmissions and edge
diffractions (Fig. 2.16, adapted from [Kajastila et al., 2007]). The beams are collected
into a forest of tree structures, one tree for each sound source [Kajastila et al., 2007]. A
detailed description of beam tracing is given in [Funkhouser et al., 1998].
2.6.2 Reflections Early and Late
The importance of early reflections can be traced to the work of [Haas, 1951]
and [Wallach et al., 1949], who showed how early reflections are integrated with the
direct sound, making it seem effectively louder even though there are clear time gaps
between them [Kajastila et al., 2007]. Reflections that reach the listener within 100 ms
after the direct sound are referred to as early reflections and support speech
intelligibility. Benefits of early reflections for speech intelligibility have also been
reported by [Bradley et al., 2003]: early reflections arriving within about 50 ms after the
direct sound have the effect of usefully increasing the level of the direct sound, or the
signal-to-noise ratio (S/N), by 7 dB or more.
After the early reflections, denser reflections arrive at the listener from all directions,
so close in time that individual reflections cannot be separated by human listeners; these
are termed late reverberation [Kajastila et al., 2007].
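This split between early and late energy is commonly quantified by the clarity index C50, a standard room-acoustic metric added here for illustration rather than taken from the text above; it compares the energy arriving within the first 50 ms to the energy arriving later:

    C_{50} = 10 \log_{10} \frac{\int_{0}^{50\,\mathrm{ms}} p^{2}(t)\,dt}
                              {\int_{50\,\mathrm{ms}}^{\infty} p^{2}(t)\,dt} \;\mathrm{dB}

Higher C50 values indicate that useful early energy dominates, which is consistent with the intelligibility benefits of early reflections reported above.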
2.6.3 Reverberation
When the original sound in an enclosed space is switched off, the sound normally does
not stop immediately; it persists in the environment for some time and slowly decays
until it is absorbed by the air and walls [Valente et al., 2008]. This persistence of sound
in an enclosed space/room, called reverberation, is of great importance, particularly
when assessing rooms for speech or music performance. The persistence of sound in a
closed space is described by a specific decay time, the time required for a level decrease
of 60 dB, denoted as the reverberation time T [Vorländer, 2008]. Furthermore, according
to [Begault, 1994], the presence of reverberation improves the externalization of virtual
acoustic images; on the other hand, it
can decrease localization accuracy under real and simulated conditions [Hodgson and
Nosal, 2002, Yang and Hodgson, 2006].
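For orientation, the classical Sabine formula (a textbook estimate, not the model of the rendering engine described below) relates the reverberation time to the room volume V (in m³) and the total absorption of the surfaces S_i (in m²) with absorption coefficients \alpha_i:

    RT_{60} \approx 0.161\,\frac{V}{\sum_i S_i \alpha_i}\;\mathrm{s}

For example, a 5 x 4 x 3 m room (V = 60 m³, 94 m² of surfaces) with an average \alpha of 0.3 would give RT60 \approx 0.161 \cdot 60 / 28.2 \approx 0.34 s.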
The reverberation algorithm used by the rendering engine [Uni-Verse, 2007], which we
adopted during the course of this thesis work, was first introduced by [Vaananen et al.,
1997] and has been modified to its current form by [Kajastila et al., 2007]. The
reverberation time implemented in this rendering solution is approximately the same for
all rooms of different volumes and sizes (Fig. 2.17, courtesy of [Kajastila et al., 2007]):
the algorithm cuts off higher/maximum values, if there are any, and only allows
reverberation within a minimum and maximum level, resulting in nearly the same
reverberation time RT60 for all rooms of different sizes. This is verified by comparing
calculated and measured values (Chapter 5 and Table 5.8).
2.6.4 Signal-to-Noise Ratio
The signal-to-noise ratio is a measure used in science to compare the level of a desired
signal to the level of background noise [Hawkins and Yacullo, 1984, Bradley et al., 1999,
wikipedia, 2011]. The signal-to-noise ratio, denoted SNR or S/N, is the ratio of signal
power to noise power. According to [Bradley et al., 1999, Yang and Bradley, 2009],
reflected sound is, along with the S/N ratio, one of the important factors influencing
speech intelligibility in closed environments. It has also been argued that increasing
reflected sound increases both the speech and the noise level. Within the scope of this
thesis, both speech and noise translate into the speech levels of concurrent talkers, and
therefore there is no change in S/N. According to [Hodgson and Nosal, 2002], the
critical factor is the
relative distance of the speech and noise sources (concurrent talkers) from the listener.
In the case where the noise source is closer to the listener than the target speech/talker,
early reflections would increase S/N values and would be expected to improve speech
intelligibility. [Good and Gilkey, 1996] studied the effect of noise on localization and
found that localization accuracy decreases with decreasing SNR. They also found that
azimuthal judgments (left and right) were less influenced than up-down or front-back
judgments.
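For reference, the decibel form of this ratio of signal power P_signal to noise power P_noise is

    \mathrm{SNR} = 10\,\log_{10}\frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}}\;\mathrm{dB}

so equal speech and masker levels, as in the concurrent-talker scenario discussed above, correspond to an SNR of 0 dB.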
2.7 Different Head-Tracking Technologies
Head-tracking, i.e., locating the position and/or orientation of a user, has gained
immense importance in the virtual and augmented reality fields; an increasing trend of
using head-tracking technologies has also been observed in recent years in games [Wang
et al., 2006, Yim et al., 2008] and in robotics [Stiefelhagen et al., 2004]. Head tracking
decreases front-back reversals and in-the-head localization errors [Begault et al., 2001,
Noble, 1987], which are very common during headphone playback and influence
localization [Blauert, 1997, Harma et al., 2004]. Considerable work has been done in
the area of tracking location and orientation based on different technologies, commonly
categorized as magnetic [Zhu and Zhou, 2004, Roetenberg et al., 2007, Auer and Pinz,
1999] (despite great successes, magnetic trackers have inherent weaknesses, e.g.,
latency and jitter [Lenz et al., 1990]), optical [Auer and Pinz, 1999, Chow, 2009], 3D
cylinder head model [Ryu and Kim, 2007], accelerometer [Keir et al., 2007],
gyroscope [Luinge, 2002, Roll et al., 2008], acoustic [Tikander et al., 2003, Karjalainen
et al., 2004] and video based [de Ipin A. et al., 2002, Kourogi et al., 2001] tracking.
It has been observed that it is still very difficult for a single technology to solve all
problems related to positioning and orientation; hence many researchers have taken
advantage of multiple sensors (sensor fusion) to estimate a user’s location and
orientation [Azuma et al., 1999b,a, Hallaway et al., 2004, Tenmoku et al., 2003,
Zeimpekis et al., 2002].
In the following sections, some of the work relating to the above-mentioned
head-tracking technologies is discussed.
2.7.2 Acoustic-based Trackers
Head-tracking based on sound intensity calculations for the orientation of the user was
implemented by [Laitinen, 2008]. Laitinen achieved head-tracking with the help of six
omnidirectional microphones and two fixed sound sources of different frequencies in a
horizontal plane to calculate azimuth, elevation and tilt. The sound sources were utilized
as anchor points, since their positions were known in advance. In this work, the
head-tracking calculations were done in a Cartesian coordinate system. Laitinen found
a directional accuracy in the region of 3-10 degrees, with fluctuations of a few degrees,
which is far from perfect.
The algorithm is able to track orientation when it is combined with a low-pass filter for
the accelerometer data. The algorithm has also been successfully used by its authors in
real-time human-body-motion applications.
2.7.5 Inertial/magnetometer-based trackers
A solution for estimating position and orientation based on an extended Kalman filter
for the fusion of magnetic and inertial sensors was presented by [Schepers et al., 2010].
Normally, changes in position and orientation can be obtained by integrating the
acceleration and angular velocity signals from inertial sensors. In this study, inertial
sensing is fused with magnetic measurements, where the magnetic update is activated
only when the uncertainty in the position or orientation exceeds a predefined
threshold (Fig. 2.21, adapted from [Schepers et al., 2010]).
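A complementary filter is a much simpler relative of such fusion schemes and illustrates the same idea: the gyroscope is accurate over short intervals but drifts, while the magnetometer is drift-free but noisy and slow. The sketch below is a deliberately simplified single-axis illustration, not the extended Kalman filter of [Schepers et al., 2010].

    def fuse_yaw(yaw, gyro_rate, mag_yaw, dt, alpha=0.98):
        """One complementary-filter step: integrate the gyroscope rate for
        short-term responsiveness, then pull the estimate toward the
        magnetometer heading to cancel the slow gyroscope drift."""
        integrated = yaw + gyro_rate * dt  # fast but drifting estimate
        return alpha * integrated + (1.0 - alpha) * mag_yaw

    # Example: 100 Hz updates with a small constant gyroscope bias; the
    # magnetometer heading (assumed noise-free here) holds the estimate near 0.
    yaw = 0.0
    for _ in range(1000):
        yaw = fuse_yaw(yaw, gyro_rate=0.01, mag_yaw=0.0, dt=0.01)
    print(yaw)  # settles at a small bias-induced offset instead of growing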
2.8 Speech/Conversation Audio Quality
In one study, stereophonic audio conferences were evaluated through subjective tests
conducted according to recommendation [ITU-T, 2001a]. Seven audio clips were used
in subjective intelligibility tests with 8 different placements ranging from 1 to 8
participants. The placements included front, rear and mixed hemisphere speaker
positioning, where speakers were counting numbers from 1 to 9 and subjects had to
state the total number of speakers, whether one speaker talked more than once, and the
location of the speaker belonging to a certain number. Additionally, subjective tests on
perception were conducted using pre-recorded audio scripts with conference-call-like
conversations, including monophonic, panned stereophonic, flat stereophonic and
spatial audio recordings. Subjects’ opinion scores were recorded using the MUSHRA
mean opinion score after evaluating each clip. The presented results show that the spatial
mixed hemisphere setup produced the most pleasing listening experience of a
multi-person conversation. [Disch et al., 2004] discussed issues of test methodologies
for multi-channel sound quality assessment and presented test results for stereo-based
and mono-based representations. Additionally, they described the potential behind
spatial audio coding. The listening test method chosen was [ITU-T, 2001a]. In their
studies regarding the localization of amplitude-panned virtual audio sources, [Pulkki
and Karjalainen, 2001, Pulkki, 2001a,c] used both subjective and objective methods for
the evaluation of spatial sound. In the subjective tests, subjects were asked to adjust the
perceived direction of an amplitude-panned virtual source to best match the perceived
direction of a virtual source.
For further details, readers are referred to [Möller, 2000, Best et al., 2006, Raake et al.,
2007, Guéguin et al., 2008, Ahrens et al., 2010, Raake, 2011]. The methods to measure
speech and/or audio quality, subjectively and/or objectively, are described in detail in
the following subsections.
2.8.2 Objective Measures
Objective measures estimate the speech quality of a communication system without any
need for human listeners. These objective measures are based on mathematical models
and are used to supplement subjective test results. Objective measures are classified
into two classes: intrusive and non-intrusive.
2.8.2.1 Intrusive Measures
Intrusive measures are also called input-to-output measures because they base their
measurement on the computation of the distortion between the original speech signal
and the degraded or distorted speech signal. Depending on the domain transformation
used, intrusive objective measures are classified into time, spectral and perceptual
domains [Quackenbush et al., 1988, Itakura, 1975, Itakura and Saito, Kitawaki et al.,
1988, Karjalainen, 1985]. Further examples of intrusive measures are the Bark Spectral
Distortion (BSD) measure developed by [Wang et al., 1992], the Modified and Enhanced
Modified Bark Spectral Distortion (MBSD and EMBSD) measures [Yang et al., 1998],
Perceptual Speech Quality Measurement (PSQM) [Beerends and Stemerdink, 1994] and
PSQM+ [Beerends et al., 1997], Measuring Normalizing Blocks (MNB) [Voran, 1999]
and the Perceptual Analysis Measurement System (PAMS) [Rix and Hollier, 2000].
The International Telecommunication Union Standardization Sector (ITU-T)
recommendation [ITU-T, 2001c] described a standard method for objectively measuring
perceived audio quality in the year 1998; it was last updated in the year 2001. In the
year 1999, KPN Research improved PSQM into PSQM99, which provided more
accurate correlations with subjective test results than classical PSQM and PSQM+.
Meanwhile, ITU-T recognized the significant merits of PSQM99 and PAMS and
combined the merits of both into a new measurement technique for intrusive objective
speech quality assessment called Perceptual Evaluation of Speech Quality (PESQ).
ITU-T approved PESQ under recommendation [ITU-T, 2001b]. PESQ currently
estimates accurately the listening speech quality delivered by wireless, VoIP and fixed
networks and is, in fact, the standard method for automated speech or audio quality
measurement. It can be used in a wide range of measurement applications, such as codec
development and error distortions, equipment selection, equipment optimization and
network monitoring [Rix et al., 2001]. More recently, POLQA (Perceptual Objective
Listening Quality Assessment) has also been selected by the ITU-T to form the new
voice quality testing standard P.863.
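For a sense of how such intrusive measures are applied in practice, the following sketch uses the open-source Python package pesq, one implementation of ITU-T P.862; its availability and interface are assumptions of this example, and the file names are hypothetical.

    from scipy.io import wavfile
    from pesq import pesq

    fs, reference = wavfile.read("reference_16k.wav")  # clean speech, 16 kHz
    _, degraded = wavfile.read("degraded_16k.wav")     # after codec/network

    # 'wb' selects the wideband PESQ mode; the result is a MOS-LQO score.
    score = pesq(fs, reference, degraded, 'wb')
    print(score)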
2.8.2.2 Non-intrusive Measures
Non-intrusive measures, also known as output-based or single-ended measures, use
only the degraded signal and have no access to the original signal; an example is ITU-T
recommendation P.563, “Single-ended method for objective speech quality assessment
in narrow-band telephony applications” [ITU-T, 2004]. P.563 is used for voice quality
measurements in narrow-band telephony applications, such as live network monitoring
and live network end-to-end testing using a digital or analogue connection to the
network, including testing with unknown speech sources at the far-end side. ITU-T
emphasizes that P.563 and PESQ cannot be used to replace subjective testing but can be
applied where listening-only tests would be too expensive or not applicable. It is also
interesting to note that the accuracy of the current P.563 model will always be lower
than that of PESQ [Kim et al., 2006].
2.9 Summary
This chapter has provided the reader with a background on the terms and technologies
used in this thesis. It covered various areas of this thesis work, such as the human
listening organ, binaural hearing, sound localization and its helpful cues, binaural
technology, three dimensional sound, three dimensional audio recording and
reproduction, virtual acoustic environments, a short review of head-tracking
technologies, and speech and audio testing procedures. Readers have also been referred
to comprehensive reviews of the mentioned terms and technologies available in the
literature wherever it was appropriate in the text.
Chapter 3
Design and Implementation of 3D
Telephony
3.1 Introduction
The technical requirements needed to implement headphone-delivered 3D sound are
well known [Begault et al., 2001]. In order to improve sound localization performance,
three factors need to be considered: first, individualized head-related transfer functions
(HRTFs), which describe how the acoustic waves propagate through each listener’s
head [Tan and Gan, 1998, Kim et al., 2004]; second, sound processing (auralization), to
simulate reverberations and reflections within a virtual surrounding individually; and
third, head tracking systems, to follow the movements of the speakers’ and listeners’
heads.
These three factors and their basic concepts have been discussed in detail in Chapter 2.
To summarize the concepts, the following points are presented:
1. HRTFs are used to produce spatial audio signals that help the listener to believe
that the sound emanates from the corresponding virtual source location [Park et al.,
2005] (HRTFs have already been discussed in detail in Section 2.5.1).
2. Auralization is used to simulate the reverberations and reflections of a virtual
surrounding (auralization and virtual acoustics have been discussed in Section 2.6).
3. Head tracking is used to follow the movements of the speakers’ and listeners’
heads (head-tracking technologies have been discussed in Section 2.7).
Based on the three essential requirements described above, a 3D telephone system has
been designed. The following sections discuss the design background and describe the
design.
3.2 Design Background
3.2.1 Classic VoIP Teleconferencing
A classic voice-over-IP teleconferencing system consists of three main components, as
depicted in Figs. 3.1 to 3.3 (based on the concepts presented in [Sinnreich and Johnston,
2001]; courtesy of M. Haun). The first component is the audio input produced by two
or more call participants. These inputs are mixed by the second component, the audio
mixer, and played to the call participants by an audio output component. The
components can be arranged in several ways. While the audio input and output
components usually reside inside a VoIP client or phone on each endpoint of the VoIP
connection, audio stream mixing can occur at different locations within the
network [Sinnreich and Johnston, 2001].
The simplest form of VoIP teleconferencing is to employ a centralized conference
bridge to which each endpoint connects. This bridge can then handle audio mixing in a
centralized way, additionally providing audio transcoding between different endpoints
to satisfy different bandwidth restrictions.
The second possibility places the mixer in one of the conference endpoints. This setup
limits the number of call participants by the available bandwidth and computational
power of the client handling the audio mixing [Sinnreich and Johnston, 2001].
The third possibility employs a network that establishes a full mesh of connections
between all call participants. Each endpoint mixes the incoming audio streams. This
setup minimizes media latency but complicates media synchronization. Finally, a large
teleconference can be realized by using multi-cast conference addresses to enable the
participation of millions of users. Although this provides the most powerful
(Figs. 3.1 to 3.3: classic VoIP teleconferencing arrangements, including a VoIP phone
with an integrated mixer and a PSTN gateway connecting VoIP and PSTN phones.)
To obtain further details about the concepts discussed above, readers are referred
to [Sinnreich and Johnston, 2001].
In the first setup, the complete rendering engine resides on the conference bridge. Due
to the distributed nature of this setup, and caused by the latency between head
movements and changes in the acoustic rendering, the naturalness of the spatial audio
impression is reduced.
by the latency between head movements and changes in the acoustic rendering, the
naturalness of spatial audio impression is reduced.
The second setup corresponds to the meshed setup discussed previously (Fig. 3.5). It
has the complete rendering engines in the users’ endpoints. All incoming audio streams
are sent through the rendering engine before being returned to the playout.
The meshed setup overcomes the problem of the computational burden on a centralized conference server by deploying a separate spatial audio rendering system for each of the call participants. In addition, it allows an individual representation and full control of the virtual environment for each call participant, which can be beneficial if the virtual environment is to be mapped to the participants' real environments. A drawback is that multiple audio and head-tracking streams must be distributed and that the virtual environments must be synchronized to a certain extent, which increases the burden on the network. Also, this setup needs to simulate the virtual acoustic environment multiple times and thus increases the computational demands. However, this is not necessarily a drawback, because the end devices might have enough unused computational resources. Scalability is then achieved because spatial audio teleconferences are no longer limited by a central conference bridge with limited computational resources.
3.3 Design of the 3D telephone system
Based on the three essential requirements described in (Section 3.1) and the background knowledge of classic VoIP teleconferencing and spatial audio teleconferencing requirements (see Section 3.2), we designed a 3D telephone system (Fig. 3.6) aiming for comfortable and mobile usage at low cost while supporting spatial audio. The design extends a VoIP based telephone by low-delay audio codecs, 3D sound renderers, and headphones extended by head-tracking sensors. More precisely, the design of the 3D telephone system consists of:
1. Stereo headsets extended by a head-tracking unit, which follows the movements of ears and mouth. Optionally, sensors can be used to determine the size of the current room and the position of the head in the room. Usually, each participant of the conference call requires one 3D sound capable headset.
2. The microphones of the headsets are coupled to low-delay audio encoders. The audio content is transmitted in mono only, enriched by the sensor data of the 3D sound headset. The sensor data mainly include the relative position and orientation of the mouth.
3. A 3D sound renderer or virtual acoustic server that simulates the virtual acoustic environment and renders the incoming mono streams spatially for each participant.
4. As each participant might sit in their own room having different dimensions (the orientations and movements of the participants may also vary), the teleconferencing system must decide where in the virtual room to place the participants. In (Fig. 3.6), this is displayed in the upper middle box. We also conducted listening-only tests (Chapter 4) to determine suitable placements of the participants.
5. The 3D telephones must be connected via a low-latency network because of the participants' requirements on interactivity. Even more stringent are the requirements on updating the filter parameters after head movements: humans can tolerate a delay of up to 70 ms between a movement and the adapted spatial impression before the 3D sound experience becomes unrealistic [Brungart et al., 2006].
(Figure 3.6: Design of the 3D telephone system. Callees' rooms with headphones, head-tracking units and microphones are connected, via transport protocols over the Internet, to a virtual room that simulates acoustic wave propagation; filter parameters and movement data are exchanged alongside the audio packets.)
3.4 Design Description
achieved by a separate and direct connection between each user and the virtual acoustic
server.
3.5 Implementation
We implemented the system based on the open-source VoIP soft-phone Ekiga [Ekiga, 2010] (details are provided in the next section), which we enhanced by a plug-in to control the virtual environment. As the virtual acoustic server, or rendering engine, we utilized the Uni-Verse acoustic simulation framework [Uni-Verse, 2007] (details are provided in the following subsections). Custom-built software was employed as conference bridge, conference server and mixer. The current prototype system can be installed on any desktop computer or laptop running an Ubuntu/Debian based operating system. In the following, we describe the details of the implementation.
As a VoIP client we used the open-source soft-phone Ekiga [Ekiga, 2010]. We extended it with the Bluetooth SBC codec [Hoene and Hyder, 2010] to support stereo and full-band audio. To connect the VoIP client to the virtual acoustic server, or renderer, for spatial audio rendering, we enhanced Ekiga by a plug-in architecture and a 3DTel plug-in (Fig. 3.9). The 3DTel (or Ekiga) plug-in consists of five main components: Graphical User Interface (GUI), Database Backend, Virtual Reality Backend, Tracking Unit and Rendering Frontend, as shown in Fig. 3.10 (courtesy of M. Haun).
The graphical user interface allows the user to control different parameters related to the choice of the tracking unit and the virtual reality back-end. The tracking unit interface allows the connection of arbitrary tracking devices to modify the users' position and orientation within the virtual world in real time. The virtual reality back-end interface makes it possible to specify various sources of virtual world representations, such as static Vector Markup Language (VML) files [Mathews, 1998] or dynamic streams from a content creation tool or a game server. This back-end interface provides additional means to analyze the provided environment in order to dynamically generate seating plans according to which all call participants can be placed upon call initialization. Finally, the core of the system is the rendering front-end interface. This interface specifies the connection to the rendering engine, to which all audio, position and orientation data as well as the virtual environment are sent and where spatial audio is rendered individually for each avatar from the audio input of all the other call participants. Afterwards, this interface receives the rendered audio streams and either plays them on the user's headphones or sends them across a network using RTP.
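As a rough illustration of that last step (a sketch only; the actual interface code is not reproduced here), a minimal 12-byte RTP header as defined in RFC 3550 can be prepended to each rendered frame before sending it over UDP:

```python
import socket
import struct

def send_rtp(sock, addr, payload, seq, timestamp, ssrc, payload_type=96):
    """Prepend a minimal RTP header (RFC 3550) and send the rendered
    audio frame as a single UDP datagram."""
    header = struct.pack('!BBHII',
                         0x80,                 # version 2, no padding/extension/CSRC
                         payload_type & 0x7F,  # marker bit cleared
                         seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    sock.sendto(header + payload, addr)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_rtp(sock, ('127.0.0.1', 5004), b'\x00' * 320, seq=1, timestamp=0, ssrc=0x3D7E1)
```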
Figure 3.11: A virtual conference room with a listener and two virtual talkers
The acoustic simulation computes the BSP tree by splitting the three-dimensional object space into half-spaces using two-dimensional hyperplanes [Schröder and Lentz, 2006]. To achieve a well-balanced tree, the hyperplanes are chosen according to a heuristic that tries to minimize the number of surfaces on either of the resulting sides of the hyperplane.
The beam tracing method classifies reflection paths from a source to the listener by a set of pyramidal beams, where each beam represents a frustum consisting of an infinite number of rays. When intersecting polygons of the virtual environment are detected, the beam is clipped to remove the shadow region behind the intersecting polygon, and a reflection is modeled by generating a virtual sound source that mirrors the original source at the intersecting polygon.
To build the beam tree, the BSP graph is traversed in a depth-first manner starting
at the cell containing a source and recursively visiting adjacent cells. As the algorithm
traverses a cell boundary into a new cell, the current beam is clipped to include only
the space passing through the transparent polygon boundary and phantom sources are
created at the solid boundaries of the polygon.
From the resulting beam tree structure, reverberation paths can be derived by simply traversing the tree from the listener to all sound sources and collecting surface absorption coefficients and distance information along the way. (Fig. 3.11) shows a virtual room rendered by UVAS with a listener, two sound sources and reverberation paths.
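To illustrate what "collecting absorption coefficients and distance information" yields (a simplified sketch under the assumption of an image-source model with a broadband 1/r distance law, not UVAS code), the delay and gain of a single reverberation path can be derived as follows:

```python
SPEED_OF_SOUND = 343.0  # m/s at room temperature

def path_delay_and_gain(total_distance_m, absorption_coeffs):
    """Propagation delay and a simple broadband gain for one reflection
    path: 1/r distance attenuation times the energy remaining after
    each surface hit (absorption coefficient alpha per surface)."""
    delay_s = total_distance_m / SPEED_OF_SOUND
    gain = 1.0 / max(total_distance_m, 1.0)
    for alpha in absorption_coeffs:
        gain *= 1.0 - alpha
    return delay_s, gain

# A 12 m path with two reflections off concrete (alpha roughly 0.02):
print(path_delay_and_gain(12.0, [0.02, 0.02]))  # ~0.035 s, gain ~0.080
```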
All data received from the tracking device is converted into relative or absolute movements in a Cartesian coordinate system. Translations are represented by Euclidean vector transformations in the case of absolute values, or by affine transformations in the case of relative values. Given a position p0 = (x0, y0, z0), a translation pt = (xt, yt, zt) and a scaling factor s, the new position p1 = (x1, y1, z1) is obtained by calculating p1 = pt · s in the case of absolute values, or p1 = p0 + (pt · s) in the case of relative values. Rotations are represented by unit quaternions. Given an initial position p0 = (x0, y0, z0) and a rotation or = (xr, yr, zr, wr), the new position p1 = (x1, y1, z1) is obtained by calculating p1 = or p0 or*, where or* denotes the conjugate of or.
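The quaternion rotation above can be written out directly; the following helper is an illustrative sketch (assuming the (x, y, z, w) component order used in the text and a unit quaternion), not the plug-in's actual code:

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of two quaternions given as (x, y, z, w)."""
    ax, ay, az, aw = a
    bx, by, bz, bw = b
    return np.array([aw * bx + ax * bw + ay * bz - az * by,
                     aw * by - ax * bz + ay * bw + az * bx,
                     aw * bz + ax * by - ay * bx + az * bw,
                     aw * bw - ax * bx - ay * by - az * bz])

def rotate(p0, o_r):
    """Rotate position p0 = (x, y, z) by unit quaternion o_r:
    p1 = o_r * p0 * conj(o_r)."""
    p = np.array([p0[0], p0[1], p0[2], 0.0])
    conj = np.array([-o_r[0], -o_r[1], -o_r[2], o_r[3]])
    return quat_mul(quat_mul(o_r, p), conj)[:3]

# A 90 degree rotation about the z-axis maps (1, 0, 0) to (0, 1, 0):
s = np.sqrt(0.5)
print(np.round(rotate((1.0, 0.0, 0.0), np.array([0.0, 0.0, s, s])), 3))
```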
The current implementation does not feature any collision detection mechanisms yet. If a tracked position falls outside the virtual environment, the audio signal is lost until the virtual avatar re-enters the environment. An engine to control and limit movements within the virtual environment is a subject for future enhancements.
3.6 Summary
In this chapter, the design and description of the 3D telephony system were presented. Furthermore, two implementation variants of 3D telephony were discussed. Additionally, the main components, namely the VoIP phone client Ekiga with its 3D telephony plug-in and the virtual acoustic server and renderer with its basic components, were presented from our implementation perspective. Head-tracking support for four devices in the current design and implementation was also discussed.
Chapter 4
Experiments on the Placement of
Teleconference Participants
In order to optimize and enhance a 3D audio supported telephony and teleconferencing system to a level acceptable for users and customers, user experiments were conducted to study, in particular, the virtual placement of teleconference participants. In this user study, the focus was on the sound quality, understandability and locatability of virtual participants. Additionally, the occurrence of front/back or elevation localization errors was studied. Front/back reversals and elevation localization errors¹ are commonly seen in 3D audio systems when non-individualized HRTFs are used [Wenzel et al., 1993] (refer to Chapter 2).
In addition, we investigated the trade-off between sound source direction perception
and distance perception. According to [Shinn-Cunningham, 2000], reverberation
degrades perception of the sound source direction, but enhances distance perception.
Also, in this study, azimuth errors (deviations in the horizontal plane), elevation errors (deviations in the vertical plane) and reversal errors (front-back or back-front "confusions"), which are very common in 3D sound reproduction over headphones [Begault, 1994], were evaluated separately.
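These error categories can be made precise with a little geometry. The sketch below is an illustrative classifier for the horizontal plane (the convention of 0 degrees straight ahead and positive azimuth to the right is an assumption, not taken from the thesis): a front/back reversal is a response closer to the target's front/back mirror image than to the target itself.

```python
def classify_azimuth(target_deg, reported_deg):
    """Return the azimuth error in degrees and whether the response
    counts as a front/back reversal (closer to the mirror image
    180 - azimuth, which swaps front and back but keeps left/right)."""
    def ang_diff(a, b):
        return abs((a - b + 180.0) % 360.0 - 180.0)
    direct = ang_diff(reported_deg, target_deg)
    mirrored = ang_diff(reported_deg, 180.0 - target_deg)
    return direct, mirrored < direct

# Target at 30 deg (front right), response at 150 deg (back right):
print(classify_azimuth(30.0, 150.0))  # (120.0, True) -> a reversal
```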
This chapter is organized as follows. To figure out how to position the participants in a conference call, seven formal listening-only tests were conducted (Section 4.1). After presenting the experimental setup and results, the work of this chapter is summarized. Major portions of the user experiments and results presented in this chapter have already been published [Hyder et al., 2009, 2010b].
4.1 Placement of Participants
In order to provide a better teleconferencing solution, it was very important for us to study the positioning arrangement of the participants within the virtual acoustic environment, so that they could not only locate each other properly in three-dimensional space but also understand the speech without degradation. Further, the speech quality should not be impaired by reduced loudness, reverberations or echoes.
¹ Localization error refers to the deviation of the reported position of a sound stimulus from the measured or synthesized target location.
We selected four sets of simulation parameters and used them to judge the seven different placements of participants in the virtual room (in total 22 combinations). In every setup we changed one Uni-Verse parameter at a time and kept the other parameters the same, in order to isolate the effect of each parameter on the sound quality, understandability and locatability achieved with the different virtual placements. We used two different HRTFs, two different room sizes and different heights of listener and talker, and kept the head size constant. The following parameters can be chosen for the acoustic simulation (Table 4.1).
Room dimensions: In our test experiments we used two rooms: a Big Room with dimensions H × W × L = 20 × 20 × 40 m and a Small Room with dimensions H × W × L = 10 × 10 × 20 m.
HRTF: We used two HRTFs in these tests, HRTF-1 and HRTF-2. HRTF-1 has 5 reverberators for 5 frequency bands and HRTF-2 has 10 reverberators for 10 frequency bands.
Head size: Head size refers to the internal distance between the two ears in meters. We kept the head size at its default value of 0.17 in all setups, because we did not notice any difference when changing its value between 0.1 and 0.3 (head size is a Uni-Verse UVSR parameter scalable from 0.1 to 0.3).
Placement: Seven different placements of the talkers and listeners were studied. We
name these placements Talkers in the Corners, Listener in the Corners, Horizontal
Placement, Frontal Placement-1, Frontal Placement-2, Surround Placement-1 and
Surround Placement-2. They are described further in the following sections.
Height: The placement of listeners and talkers in terms of height in the virtual room is summarized in (Table 4.2). We used the same height parameters, called Height-A, for the Default, HRTF-2 and Small Room setups; for Talker Standing we used Height-B.
4.1.1 Sample Design
The samples were processed by the open-source 3D audio rendering engine Uni-
Verse [Kajastila et al., 2007].
The virtual rooms were based on the sample UVAS file “testscene_no_doors.vml”.
The walls of the rooms had the typical acoustic properties of concrete. Based on the results of the acoustic simulation, a sound renderer auralizes the direct sound and the early reflection paths calculated by the room acoustic simulation module. The acoustic simulator transmits the listener, source and image-source information, including position, orientation, visibility and the URL of the sound source, to the sound renderer.
Then the sound renderer applies a minimum-phase HRTF to the sound source. A detailed explanation of the minimum-phase HRTF used can be found in the paper by [Savioja et al., 1999]. The reverberation algorithm used in the implemented system was introduced by [Vaananen et al., 1997] and modified by [Kajastila et al., 2007]. Because the reverberation time (RT) is frequency dependent, the sound renderer uses 10 individual reverberators for 10 frequency bands with separate RTs.
Further parameters used for the sample design, such as positions of listeners and sound
sources are given in the following test descriptions.
4.1.2 User Experiments
User experiments with 32 normal-hearing subjects (29 male, 3 female) were conducted to assess the sound quality, understandability and locatability of the virtual talkers in the implemented system.
The listening-only tests were conducted following the recommendation [ITU-T, 1996] as far as possible, together with an additional in-house tailored test method covering the 3D audio component of the subjective study; its tasks are described at the end of this section. This in-house method was used alongside recommendation P.800, which describes methods for the subjective determination of telephone transmission quality, because no standard is available yet for testing 3D audio supported transmission quality.
Figure 4.1: Acoustic simulations with one listener and two sound sources. (The white lines show the direct beam between sound source and listener. The yellow lines are due to phantom sound sources, plotted as red points. The green lines are reflections of the real sound sources.)
4.1.3 Test 1: Talker in the Corners and Test 2: Listener in the Corners
In the tests Talker in the Corners and Listener in the Corners, we used the Big Room with dimensions H × W × L = 20 × 20 × 40 m.
In the test Talker in the Corners, the listener was positioned at the center of the room at ground level and the talkers were positioned in all eight corners of the room. The listener was facing the wall defined by the corners 5, 6, 7 and 8. We wanted to study (1) whether subjects could locate the sound sources correctly, (2) whether subjects could identify the orientation of the sound, and (3) how subjects judged the quality of the speech. The layout of the virtual acoustic room can be seen in (Fig. 4.2).
In the test Listener in the Corners, the talker position was fixed at the center of the room at ground level, while the listener was placed in one of the eight corners of the room at a time; the listener's orientation remained facing the wall defined by the corners 5, 6, 7 and 8. The layout of the room can be seen in (Fig. 4.2).
57
Chapter 4 Experiments on the Placement of Teleconference Participants
4.1.3.1 Results
These tests were a preliminary subjective study conducted to make sure that listeners attained proper orientation within the simulated virtual acoustic environment. The results indicated that it was very difficult for subjects to correctly identify the virtual talker positions in these tests, and elevation errors [Wenzel et al., 1993] were seen very frequently. No significant results were achieved with respect to correctly identified talker locations in these two tests. Regarding audio quality, however, the Talker in the Corners test achieved a MOS-LQS value (95% confidence interval) of 3.85 ± 0.76 and the Listener in the Corners test a MOS-LQS value (95% CI) of 3.68 ± 0.79. Moreover, subjects attained proper orientation and no orientation errors were found throughout these tests. Thus, we achieved our primary target of proper orientation within the developed virtual acoustic environment.
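The MOS values with 95% confidence intervals reported here and in the following sections can be reproduced from raw opinion scores in the usual way. The sketch below uses the normal approximation with the sample standard deviation, which is an assumption, since the thesis does not state the exact variant used:

```python
import math

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and the half-width of its 95% confidence
    interval (normal approximation, sample standard deviation)."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    return mean, z * math.sqrt(var / n)

mos, ci = mos_with_ci([4, 4, 5, 3, 4, 5, 4, 3])
print(f"MOS-LQS (95% CI): {mos:.2f} +/- {ci:.2f}")
```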
From these tests it could be concluded that the following are possible factors that made it difficult for subjects to properly identify a talker location:
• Non-individualized HRTFs
we do not encounter this kind of talker and listener positions in everyday life. However, it was important for us to examine our virtual acoustic environment with some added difficulties, since we wanted to test the developed environment thoroughly and to start the second phase of testing with a perfect orientation.
(Figure: overview of the test setups Horizontal Placement, Frontal Placement-1, Frontal Placement-2, Surround Placement-1 and Surround Placement-2.)
percent success rate. It is interesting to report that the Small Room setup accumulated the highest MOS score (Table 4.3) among all setups, yet produced the lowest localization success scores. The MOS-LQS value (95% CI) was 4.25 ± 0.63 (Table 4.3).
Speech samples were processed at these talker positions, while the listener position was fixed as shown in the layout (Figure 4.5). During the test, subjects were presented with the samples in randomized order in a one-talker situation; that is, only one talker's speech was processed at a time at one of the positions described above. Subjects were then asked to identify the position of the talker. Within this virtual acoustic room, listener and talker had the same height; details of the listener and talker heights are presented in (Table 4.3).
4.1.5.1 Results
Default and HRTF-2 achieved the highest success scores with 89% each. Small Room produced the lowest result with 64%, and Talker Standing achieved 69%. The main highlights of the Frontal Placement-1 test were thus Default and HRTF-2; these results show the effectiveness of Frontal Placement-1 in combination with the Default and HRTF-2 setups. The MOS-LQS value (95% CI) was 4.07 ± 0.68 (Table 4.3).
In the literature it was also found that the time required to recognize a person is shorter in such cases.
4.1.6.1 Results
Default produced the highest success score with 73%, and Talker Standing accumulated the second highest score with 66%. Small Room had the lowest success score with 38%, and HRTF-2 achieved only 62%. It can safely be concluded that an increase in the number of simultaneous talkers decreases localization performance: Default, which in the earlier Frontal Placement-1 test with its one-talker situation reached nearly 90%, dropped by around 16 percentage points in the two-talker situation. The MOS-LQS value (95% CI) was 3.93 ± 0.68 (Table 4.3).
4.1.7.1 Results
4.1.8.1 Results
In Surround Placement-2, Default produced the highest success score with 46%, HRTF-2 the second highest with 41%, and Small Room the lowest with 30%. The two-talker situation caused an overall reduction in scores. Additionally, subjects faced frequent front/back confusions in Surround Placement-2. Therefore, it can safely be concluded that, for teleconferencing solutions, placing the listener at the center of the talker positions is not suitable at all. The MOS-LQS value (95% CI) was 4.12 ± 0.61 (Table 4.3).
4.2 Summary
The quality of conference calls can be significantly enhanced if the telephones do not reproduce the speech in mono but instead use stereo headphones and spatial audio rendering. Then one can identify participants by locating them, and one can listen to one specific talker even if multiple talkers speak at the same time.
Listening-only tests using normal stereo headphones have shown that listeners can locate the origin of sounds and the position of talkers quite well. At the same time, the speech quality is only slightly reduced by adding reverberations, echoes and HRTF-related filters. No subject complained about difficulties in understanding the talkers or about any extra effort required to concentrate on the talkers during the tests.
The test results revealed that speech localization performance is good when the speech source is placed at the same height as the listener and poor when it is placed vertically below or above the listener. In the listening-only tests, subjects seemed quite sure about the speech orientation. The speech quality remained very good throughout all tests, and there were no impairments even with two echoes and reverberations.
The same holds for two simultaneous sound sources: each source could be clearly heard and distinguished during the tests. The summary of MOS scores confirms the speech quality (Table 4.3).
Small Room accumulated the highest MOS score (Table 4.3) in the Horizontal Placement test but did not produce better localization results than Default and HRTF-2. It can safely be concluded that smaller rooms produce better speech quality but not better localization scores.
Front/back reversals and elevation localization errors were commonly seen throughout the listening-only tests. Possible reasons for the front/back reversals are the use of non-individualized HRTFs and the fact that our tests were done without any head-tracking system installed.
The Default setup, employing an HRTF with five reverberators for five frequency bands, produced better results than the HRTF-2 (ten reverberators for ten frequency bands), Small Room and Talker Standing setups.
Chapter 5
Assessing Virtual Teleconference
Rooms
5.1 Introduction
The 3D audio simulations of the 3D telephony and teleconferencing system are based on a virtual acoustic environment, and properly choosing this environment is essential for further improving the system. This chapter describes a series of experiments and examines the effects that simulated virtual acoustic room properties, virtual sitting arrangements, reflections off a conference table, the number of concurrent talkers and voice characteristics have on the perception of speech quality, locatability and speech intelligibility in a 3D teleconferencing system. In particular, the tests were designed to answer the following questions: To what extent are multiple-talker localization performance and subjective speech quality ratings influenced by the size of the virtual conference room? What are the results when a conference table is simulated, and what is the overall impact of changing the conference table size? What results are achieved when the number of simultaneous talkers increases? Do different voice types have an influence on the easiness of locating simultaneous talkers? What are the results when the talker position density increases?
Also, to author’s knowledge, there is hardly any literature available regarding
simulating a conference table in the virtual acoustic conferencing rooms to study its
impact on overall speech intelligibility. Additionally, it has been reported by [Jeub
et al., 2009] that reflections of the conference table can cause decrease in the speech
intelligibility. We experimented with different properties of conference table to study its
impact on speech intelligibility in particular.
Also, we know that changes in the room properties (a change in volume) and changes in the source-to-receiver configuration (distance or orientation) cause changes in the direct-to-reverberant ratio at the receiver, which aids sound source distance perception [Vesa, 2009]. We experimented with different room properties and different source-to-receiver configurations to study the near and far perception of talkers in these placement tests. Additionally, room size is another important factor that needs to be studied, in order to determine which room size allows listeners of the teleconference to easily understand multiple talkers and locate them in space.
The remainder of this chapter is structured as follows: (Section 5.2) lists related and ongoing research on 3D audio, spatial audio teleconferencing systems and the quality assessment of such systems. (Section 5.3) discusses the methodology, setup and execution of the listening-only tests presented in this chapter, listing the utilized testing scenarios, procedures and terms. Afterwards, the results of these tests are presented in detail in (Section 5.4). Finally, the chapter concludes with a summary of the obtained results in (Section 5.5).
5.2 Related Work
Teleconferences suffer from many well known problems. For example, the listener
performance in multi-talker scenarios decreases in terms of understanding speech,
locating talkers and concentrating on a talker of choice as there is an increase in auditory
scene complexity [Brungart et al., 2007]. If binaural or even 3D audio is incorporated in
teleconferencing systems, the quality of teleconferences can be increased [Yankelovich
et al., 2006, Begault, 1994].
Multiple 3D audio teleconference systems have been implemented. In [Hughes, 2008], Hughes presented a 3D audio teleconferencing system called Senate. In [Reynolds et al., 2009], a distribution model for headphone-based spatialized audio conferences was presented. [Herre et al., 2010] described a combination of Spatial Audio Object Coding and Directional Audio Coding technologies for interactive teleconferencing. In [Ahrens et al., 2008], the Sound Renderer Framework, which can be used to render 3D audio for teleconferences, was presented.
Spatial audio teleconferencing systems under development are far from mass market usage, as their quality of experience does not fulfill all user demands yet. Consequently, it is very important to measure the quality of existing systems to understand how to improve them. In [Kilgore and Chignell, 2006, Kilgore, 2009], experimental research was presented to determine whether the combination of spatialization and a simple visual representation of a voice's location helps in recognizing completely unfamiliar voices. The test results show that localization easiness benefits from coupling spatial audio to a visual interface only with a large number of voices, in this case eight, but not with four voices. In her work "Audio Conferencing Enhancements", [Vesterinen, 2006] tested performance differences between 3D, monophonic and stereophonic audio conferences through subjective tests. The presented results show that spatially mixed hemispherical audio produced the most pleasing listening experience of a multi-person conversation.
The impact of spatialized audio and video on the user experience in multi-way video conferences using proprietary software was explored in [Inkpen et al., 2010]. Their study did not reveal any significant differences between mono audio and spatialized audio. The results of other studies [Kilgore et al., 2003, Yankelovich et al., 2006, Hyder et al., 2010b], however, showed a positive influence of spatial audio. Because of these contradictory research results, we see it as an important task to improve spatial audio conferencing, as different spatial teleconferencing systems may perform significantly differently.
In our research review it was also found that auditory selective attention listening tasks employ dichotic [Hillyard et al., 1973], Interaural Level Difference (ILD) and/or Interaural Time Difference (ITD) [Darwin and Hukin, 1999, Shinn-Cunningham and Ihlefeld, 2004] presentations. In [Spring, 2007], HRTF presentations were utilized, with stimuli of four simultaneous talkers presented to the listener for selective attention tasks. Subjects were asked to concentrate on one story told by one of the talkers while ignoring the other three stories. The average of correct responses reported was 58%, ranging from 18% to 84%. In comparison, our work includes the presentation of stimuli containing four simultaneous talkers placed at different locations in the virtual acoustic environment; the tasks included identifying the mixed-gender talkers, understanding the speech and locating every concurrent talker in virtual space.
5.3 User Experiments
In order to enhance our 3D Telephony system we conducted formal listening-only tests
to measure localization performance, localization easiness, spatial and overall speech
quality of different virtual teleconferencing scenarios.
To measure localization performance, each test participant was presented with a map of possible talker locations. Then, the actual location of each talker was compared to the location selected by the test participant. Localization easiness described the subjectively perceived effort required by test participants to localize a talker, while spatial quality described how well the participant could perceive that talkers were spatially separated,
and overall speech quality referred to the perceived speech quality as compared to a
real life conversation. Localization easiness, spatial and overall speech quality were
measured using discrete Mean Opinion Score - Listening Quality Scale Wide-band
(MOS-LQSW) scores with the values 1 (bad), 2 (poor), 3 (fair), 4 (good) and 5 (excellent).
The MOS-LQSW values were named MOS-LQSW LE for localization easiness, MOS-
LQSW SQ for spatial quality and MOS-LQSW OQ for overall speech quality.
During the tests the five parameters voice type, number of concurrent talkers, table
size, talker position density and room size were modified. The influence of each
parameter was evaluated by comparing a specially designed test setup consisting of a
series of two tests to a given reference test.
User experiments were conducted with 31 paid subjects, 13 female and 18 male, according to [ITU-T, 1996]. All test participants were aged between 20 and 45 years, with an average age of 27 years. Eight of the 31 participants had earlier experience with listening-only tests, and all subjects indicated a good to professional level of computer proficiency. The average time taken by the subjects to complete all tasks was 62 minutes. Each subject participated in 11 different tests contained in 5 different setups plus one reference test, thereby assessing quality and localization information on 71 audio samples. Thus, 2,201 (31 × 71) audio sample assessments were collected in total.
All audio samples consisted of anechoic speech samples taken from [ITU-T, 1998]. They were processed by, and recorded from, the open-source 3D audio rendering engine Uni-Verse [Kajastila et al., 2007] at a sampling rate of 16 kHz.
A screenshot of Uni-Verse's rendering engine is shown in (Fig. 5.1); further details about the usage of the Uni-Verse framework can be found in (Chapter 3). The speech samples were recorded using three different male and three different female voices, each speaking four sentences in American English. Table 5.1 lists all samples used during the experiments as well as their durations. Human speech samples were used as sound sources because they directly correspond to the target application, a multi-party teleconferencing system.
All tests were conducted in a quiet listening room on a computer using a specially
designed user interface as shown in (Fig. 5.2). Before the tests were conducted, each
participant received an introduction into the testing environment and instructions about
the tasks to be accomplished during the tests. Every test was preceded by a learning phase
during which the participants were presented reference samples with their accompanying
correct locations. In the training phase, all samples were presented in the same linear
order to each participant and could be played up to three times using the provided play
button, before moving on to the next sample by pressing the next button. To enable
participants to distinguish the different talkers contained in each sample, each talker was
represented by a number as well as its spoken text.
Each participant was asked a series of questions to be answered for each talker contained within each sample. First, the locations of all talkers had to be determined by selecting a location from a map of possible talker locations. Secondly, localization easiness, spatial quality and overall speech quality had to be rated using the previously described discrete scores MOS-LQSW LE, MOS-LQSW SQ and MOS-LQSW OQ.
5.3.1 Experimental Design
All tests were performed in cubic virtual conference rooms of varying dimensions. The walls of the rooms had the typical acoustic properties of concrete. A schematic overview of the virtual test environment and all measured parameters is shown in (Fig. 5.3 and 5.4).
A round conference table with the acoustic properties of wood was placed at the center of the room at a height of h_table = 0.75 m above the floor. The table had a variable radius of 2, 3 or 4 meters, depending on the test.
Either 5, 7 or 9 participants were distributed equally around the table. All participants were placed at a distance of d_part = 0.25 m from the table and at a height of h_part = 1.25 m above the floor.
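Under these assumptions, the participant positions follow from simple circle geometry. The sketch below is illustrative only (the coordinate origin at the table center is an assumption): it distributes n participants evenly on a circle just outside the table edge.

```python
import math

def seat_positions(n_participants, table_radius_m, d_part=0.25, h_part=1.25):
    """Place participants equally around a round table: each sits d_part
    outside the table edge at height h_part above the floor."""
    r = table_radius_m + d_part
    return [(r * math.cos(2.0 * math.pi * k / n_participants),
             r * math.sin(2.0 * math.pi * k / n_participants),
             h_part)
            for k in range(n_participants)]

# Five participants around the 2 m reference table:
for pos in seat_positions(5, 2.0):
    print(tuple(round(c, 2) for c in pos))
```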
In each test, one of the participants always represented the listener and was placed at a fixed position. To simulate the listener, a generic HRTF for five frequency bands was assumed, due to the good experiences obtained in our previous studies (Chapter 4). All other participants represented talkers, whose positions and numbers were varied in the different setups, with at least 2 and at most 4 participants talking concurrently. Additionally, the distribution of male and female talkers was varied to examine the influence of the different voice types on localization performance and subjective speech quality.
Besides a reference test setup, we tested five different setups, each varying one of the above-mentioned parameters at a time compared to the reference test. (Table 5.2) lists all setups and their respective parameters. The setups were called Voice Type, Number of Simultaneous Talkers, Listener-to-Sound Source Distance, Talker Position Density and Sound Source-to-Wall Distance; they are described in the following sections.
5.3.1.1 Reference Test
The reference test is based on processed speech signals with an average length of 14.38 s, spoken simultaneously by two male talkers from four possible locations distributed around the table. The virtual conference room has a size of 20 × 20 × 20 m³, and the radius of the table is set to 2 m. Sound source positions are labeled relative to the listener location as 1-NearLeft, 2-FarLeft, 3-FarRight and 4-NearRight; the position of the listener is labeled Listener. The listener and all sound sources face the center of the table. Within the reference test, six samples with different combinations of voice-to-position assignments were recorded. The total number of samples assessed for this test is 186.
Name | Room dimension | Participants | Table radius | Simultaneous talkers | Voice type
Reference | 20 × 20 × 20 m³ | 5 | 2 m | 2 | m/m
Voice Type | 20 × 20 × 20 m³ | 5 | 2 m | 2 | f/f or m/f
Number of Simultaneous Talkers | 20 × 20 × 20 m³ | 5 | 2 m | 3 or 4 | m/m/m or f/f/f; m/f/m/f
Listener-to-Sound Source Distance | 20 × 20 × 20 m³ | 5 | 3 m or 4 m | 2 | m/m
Talker Position Density | 20 × 20 × 20 m³ | 7 or 9 | 2 m | 2 | m/m
Sound Source-to-Wall Distance | 15 × 15 × 15 m³ or 10 × 10 × 10 m³ | 5 | 2 m | 2 | m/m
Table 5.2: Test setups and parameters
5.4 Results
Figure 5.5: Localization correctness vs. MOS-LQSW LE ratings (Voice Type)

S.No | Test parameter (relative speech frequency) | 2 talkers located correctly | 1 talker located correctly | 0 talkers located correctly
1 | Male/Male | 46% | 35% | 19%
2 | Female/Female | 37% | 23% | 40%
3 | Male/Female | 61% | 31% | 8%
Table 5.3: Talker localization distribution (Voice Type)
Figure 5.6: Talker localization vs. MOS-LQSW LE ratings (Number of Simultaneous Talkers)
talkers, one out of four talkers was located correctly; only 6% of the time could no talker be located correctly. The MOS ratings were similar to those found in Number of Simultaneous Talkers-1, only the MOS-LQSW LE (95% CI) was slightly better at 3.14 ± 0.13.
5.4.4 Listener-to-Sound Source Distance
The results of Listener-to-Sound Source Distance show that a larger table leads to better localization performance. Listener-to-Sound Source Distance-1 employed a table radius of 3 m. Here, an overall talker localization correctness of 71% was achieved, compared to 64% in the reference test, as shown in (Fig. 5.7). In 57% of the cases both talkers were located correctly, one of two in 28%, and none in 15% of all cases (Table 5.5). Misperception occurred in a manner similar to the reference test, while all MOS scores were slightly higher, at 3.72 ± 0.10 (MOS-LQSW LE, 95% CI), 3.68 ± 0.09 (MOS-LQSW SQ, 95% CI) and 3.75 ± 0.09 (MOS-LQSW OQ, 95% CI).
Using a radius of 4 m for the virtual conference table, Listener-to-Sound Source Distance-2 yielded 75% overall correctly located talkers: in 59% of the cases both talkers were located correctly, in 31% only one of two, and in 10% none. All MOS scores for this test were within the confidence intervals of Listener-to-Sound Source Distance-1.
Figure 5.8: Localization correctness vs. MOS-LQSW LE ratings (Sound Source Density)
In Sound Source Density-2, each talker could be placed at one of eight possible locations. Here, only 37% overall talker localization correctness was achieved: in 17% of all cases both talkers were located correctly, in 41% only one, and in 42% none of the talkers were located correctly. Misperception occurred between 5-FarRight and
The smaller room, with a volume of 1000 m³, also exhibited a correctly located talker ratio of 72%: both talkers could be located in 58% of the cases, one talker in 30% and none of the talkers in 12%. Again, misperception was found to be similar to the reference test, and the MOS ratings were nearly equal to those of Sound Source-to-Wall Distance-1 and the reference test.
5.5 Summary
As shown by the results listed in Section 5.4, each of the measured parameters has a
substantial influence on talker localization performance.
The results of the Voice Type setup clearly show that participants were able to locate two simultaneous talkers more often when the presented stimuli were of different genders, as previously assumed, and that two male talkers were easier to locate than two female talkers. The first finding can be explained by the fact that it is much easier to distinguish two voices if their pitches differ greatly; for the second, a possible explanation is that the male voices showed greater differences in voice pitch and hence were easier to differentiate than the female voices. But since the subjective localization easiness ratings do not show any significant differences between the reference test and Voice Type-1/2, one can assume that the reasons are not that obvious. Another explanation is that the experiments were performed by more male than female participants. Both tests achieved subjective MOS quality ratings at an acceptable level.
It could also be shown that an increasing number of participants leads to higher localization correctness ratios, which partly contradicts the preliminary assumptions made in (Section 5.3.1.3). Although this result seems counter-intuitive, one has to keep in mind that the number of possible talker locations was kept constant while the number of concurrent talkers increased; hence the talker-to-location ratio increased with the number of concurrent talkers. Therefore, participants were able to directly compare all concurrent talkers, and the error of misperceiving a talker location as an empty location was minimized. Subjects reported that the spatial separation of all simultaneous talkers helped them to determine the corresponding locations to a good extent, although the echoes and reverberations of three simultaneous talkers made it difficult to absorb the situation for a longer period, resulting in significantly lower MOS-LQSW LE ratings for three simultaneous talkers.
In the medium-sized room, having dimensions of 15 × 15 × 15 m and an average reverberation start delay time of 89–94 ms, a localization accuracy of 71.77% was achieved.
(Figure: average reverberation start delay time comparison for three virtual acoustic rooms with two sound sources.)
No significant difference was found between the results of the smaller and the medium-sized room. In the bigger room, having dimensions of 20 × 20 × 20 m and an average reverberation start delay time of 107–124 ms, a localization accuracy of 63.44% was achieved. This accuracy was significantly lower than in the small and medium-sized rooms. This difference in localization results suggests the importance of early reflections, which were found to be lower in value for the smaller rooms (Table 5.8). In [Bradley et al., 2003] it was reported that a room volume reduction from 1777 m³ to 1092 m³ resulted in an increase of up to 3 dB in the benefit taken from early reflections. Their work also included a reduction of the ceiling height from 10 m to 7 m; lowering the ceiling height produced lower reverberation times.
…through changing the length of the early reflections, using the average values of the longest early reflections for each source separately [Kajastila et al., 2007].
Chapter 6
Conversational Tests
A conversation may be defined as the alternating adoption of the roles of listener and talker by the conversation partners interacting with each other [Richards, 1973, Guéguin et al., 2008]. The International Telecommunication Union Standardization Sector (ITU-T) recommendation [ITU-T, 2007] describes methods and procedures for conducting conversational tests to evaluate subjective communication quality. Based on [ITU-T, 2007], a pair of subjects takes part in the conversational test by talking and listening interactively and, at the end of the test, votes using MOS quality scores.
Normally, subjective test results obtained from pairs of subjects do not properly reflect teleconferencing requirements. In typical teleconferencing situations it can be assumed that the number of participants is larger than two. Additionally, it is quite likely that more than two persons start talking at the same time during the conversation [Yankelovich et al., 2006]. Unfortunately, there is no standard available yet that covers methods and procedures for conducting multi-party conversational tests reflecting proper teleconferencing situations.
In his recent work, [Raake, 2011] presented a conversation test method for assessing conference quality with three participants. Raake submitted his work to the International Telecommunication Union (Study Group 12), proposing the method as a potential appendix to recommendation [ITU-T, 2007] or as a new, stand-alone recommendation.
In order to evaluate and optimize the 3D telephony system with respect to conversational audio quality, we conducted conversational tests with three participants, comparing four audio qualities: mono, stereo, spatial, and spatial with head-tracking. The results are presented in the following sections.
6.1 Test design
The test layout for a conversational test is presented in (Fig. 6.1); a detailed explanation of each component shown in this layout can be found in (Chapter 3). This was a three-participant conversational test, meaning that only three participants could take part in this subjective study at any given time. Each participant sat in a separate quiet room during the tests.
The conversational test participants were connected with each other through the conference bridge using the Ekiga VoIP client. The conference bridge provided overall control of the call among the participants during the test. For spatial audio, however, the conference bridge, after establishing a connection, forwarded all individual audio streams to a virtual acoustic server for rendering. Additionally, a separate connection between each client and the virtual acoustic server was used for transferring each user's position and orientation for head-tracking purposes.
6.2 Test description
In the conversational tests, 23 paid subjects (9 female, 14 male, average age 30) participated. Subjects voted using MOS quality scores for mono audio, stereo audio, spatial audio and spatial audio with head-tracking separately.
Four conversational test scenarios were developed to test each audio quality separately. Each scenario lasted five minutes. At the end of each scenario, we asked the subjects nine questions based on the recommendation [ITU-T, 2007]. A summary of the questions asked in the conversational tests is provided in (Table 6.1); details can be found in (Appendix C). The headphones used in the conversational tests were Sennheiser PC-230. Head-tracking was achieved with PNI Sensor Corporation's SpacePoint gaming tracker.
Table 6.1: Summary of the conversational test questions
1. How would you assess the sound quality of the other person's voice?
2. How well did you understand what the other person was telling you?
3. What level of effort did you need to understand what the other person was telling you?
4. How would you assess your level of effort to converse back and forth during the conversation?
5. How annoying was it for you when all partners were talking?
6. What is your opinion of the connection you have just been using?
8. How easy was it for you to determine the direction of a conversational partner's speech in the listening environment?
The SpacePoint tracker offers nine axes of motion tracking (a 3-axis magnetometer, a 3-axis gyroscope and a 3-axis accelerometer), driven by PNI's motion-tracking engine [SpacePoint, PNI., 2011].
Subjects were provided with 10 discussion topics and were asked to select one topic unanimously for each test scenario. The discussion topics included sports, student affairs at universities, music, food and the European financial crisis.
6.2.1 Results
In the category sound quality of the conversational partner's voice, stereo audio performed well, yielding a MOS rating (95% CI) of 4.45 ± 0.28. Interestingly, however, spatial audio and spatial audio with head-tracking yielded lower MOS ratings (95% CI) of 3.90 ± 0.50 and 3.75 ± 0.43, respectively (Table 6.2).
Question | Mono (MOS ± CI) | Stereo (MOS ± CI) | Spatial (MOS ± CI) | Spatial-HT (MOS ± CI)
1 | 3.95 ± 0.32 | 4.45 ± 0.28 | 3.90 ± 0.50 | 3.75 ± 0.43
2 | 4.65 ± 0.35 | 4.80 ± 0.19 | 4.40 ± 0.53 | 4.35 ± 0.38
3 | 4.70 ± 0.22 | 4.75 ± 0.21 | 4.20 ± 0.58 | 4.35 ± 0.41
4 | 4.70 ± 0.22 | 4.65 ± 0.23 | 4.45 ± 0.47 | 4.25 ± 0.40
5 | 4.35 ± 0.31 | 4.50 ± 0.36 | 4.35 ± 0.41 | 4.15 ± 0.38
6 | 4.10 ± 0.37 | 4.30 ± 0.31 | 3.75 ± 0.48 | 3.80 ± 0.39
7 | 4.30 ± 0.34 | 4.30 ± 0.27 | 3.85 ± 0.49 | 3.90 ± 0.40
8 | 4.55 ± 0.39 | 4.65 ± 0.27 | 4.15 ± 0.55 | 4.05 ± 0.47
9 | 4.05 ± 0.49 | 4.25 ± 0.34 | 3.70 ± 0.53 | 3.80 ± 0.45
Table 6.2: Comparison of conversational quality MOS values with 95% CI for the nine questions (mono, stereo, spatial, and spatial with head-tracking)
(Figure: per-question MOS ratings for the mono, stereo and spatial conditions over questions 1 to 9.)
6.3 Summary
Conversational tests were performed among three interlocutors (three conference participants at a time) to optimize the 3D telephony system. Since the conversational tests were done in real time, it was of interest to check to what extent our teleconferencing solution performs well and which audio qualities the participants prefer, because it is not easy to listen to and understand three simultaneous talkers even in real life: [Stifelman, 1994] states that listening to three simultaneous audio streams is cognitively difficult, even in face-to-face situations.
In the conversational test results it was found that stereo audio surpassed the other audio qualities, namely mono and spatial with and without head-tracking. The respective MOS ratings (95% CI) were 4.57 ± 0.27 for stereo, 4.37 ± 0.33 for mono, 4.08 ± 0.50 for spatial and 4.04 ± 0.41 for spatial with head-tracking. We had expected better MOS scores for spatial audio with and without head-tracking, since spatial audio is the more natural representation of sound, but the test participants' perception was the opposite of our expectation. The reason why spatial audio with and without head-tracking did not score better than mono and stereo may be that users are more accustomed to mono and stereo audio from their everyday use of communication solutions (VoIP, land-line phones and mobile phones); the participants' preferences appeared to be based on the audio quality experiences they encounter in their daily use of communication channels. Importantly, no participant complained about the spatial audio quality with or without head-tracking; rather, the participants reported a completely new teleconferencing experience while conversing with their partners. We can safely argue that the acceptance of spatial audio quality (also with head-tracking) among users and customers can be further assessed once they are offered spatial audio conferencing services for talking to more than three partners at a time. In the near future, spatial audio conferencing with more than three participants through 3D telephony will be possible.
Chapter 7
Investigating Virtual Acoustic
Environments & QoE Relationship
Quality of experience (QoE) is an assessment based on human perception, feeling and behavior. A communication ecosystem, on the other hand, represents the interaction among various domains, such as technical aspects, business models, human behavior and contextual aspects. The main contribution of this chapter is to present a conceptual and holistic QoE model comprising all domains of a communication ecosystem and to evaluate QoE-context relationships through user studies and empirical analysis. The virtual acoustic environment is a subcategory of the contextual model; it comprises the virtual rooms and the different voice types present in them. We present findings of user studies analyzing the impact of a virtual acoustic environment on QoE. Furthermore, using a statistical approach, QoE terms, their analysis and their validation in two different test scenarios have been benchmarked. In the first scenario, the investigation shows a strong correlation between the virtual rooms and three QoE factors, namely localization performance, spatial audio quality and overall audio quality, and a moderate correlation with localization easiness. The investigation also led to the discovery that simultaneous mixed-gender talkers in a conference call secure better QoE scores.
7.1 Introduction
Along with rapid technological advances, there has been a proliferation of new and innovative systems, services, applications and end-user devices. Network management concepts are also evolving, and the autonomic network management paradigm aspires to bring human-like intelligence to telecommunication management tasks [Laghari et al., 2009]. Thanks to these technical advancements, the fulfillment of customer demands and user experience requirements has also come into focus and is becoming a main differentiator for the effectiveness of telecom operators and service providers. To understand human quality requirements, the notion of Quality of Experience (QoE) is used, since it provides an assessment of human expectations, feelings, perceptions and cognition with respect to a particular product, service or application [Kilkki, 2008, Laghari et al., 2011]. Traditionally, a technology-centric approach based on QoS parameters has been employed to ensure quality and better performance for end users. However, QoE expands this horizon, as it tries to capture people's aesthetic and hedonic
needs. The International Telecommunication Union (ITU-T) defines QoE [ITU, 2007] as "the overall acceptability of an application or service, as perceived subjectively by the end-user". We define QoE as a blueprint of all human quality requirements and experiences arising from the interaction of a person with technology and with business entities in a particular context. QoE comprises human subjective and objective factors developed in a particular context.
To understand the QoE concept, at first, it is pertinent to know and understand the
communication ecosystem. Human behavior, business, technological and contextual
aspects constitute a communication ecosystem. The term ecosystem has been used in
various fields; in ecology [Dictionary, 2011] it is defined as, "a system involving the
interaction between a community of living organisms in a particular area and its non-
living environment". Similarly, a communication ecosystem could be defined as, "the
systematic interaction of people, technology and a business in a particular context”.
In a communication ecosystem, different actors interact with each other and they may
have different approaches. For instance, technical people try to provide a better user
experience by assuring network and service performance based on Quality of Service
(QoS) models. Business people develop economic models and strategies to assess
the profit, cost and customer churn rate. Psychologists and social scientists analyze
human attitude, intentions and cognition to understand human behavior in a particular
context. All actors of a communication ecosystem may have different vocabularies,
semantics and models, but to get a holistic and unified view of human needs and
behavioral requirements, these different approaches in business, technology, psychology
and cognitive science should be integrated into one framework. In a communication
ecosystem, where these domains interact with each other, it would be interesting to
converge and combine these different models to understand how human behavior is
actually shaped in a communication ecosystem. The QoE notion is thus a converging
factor that combines the influences of all these aspects to produce a blueprint of human
aesthetic and hedonic needs.
For a communication ecosystem, Kilkki's QoE model [Kilkki, 2008] proposes a simple and intuitive interaction between the various actors. Kilkki presents a generic interaction between a person, technology and business. However, referring to Kilkki's framework for analyzing a communication ecosystem (Fig. 7.1, adapted from [Kilkki, 2008]), we argue that his framework neither classifies QoE factors nor includes any contextual aspects. We therefore extend Kilkki's work by adding contextual aspects to the model and by defining the taxonomy of each domain in a communication ecosystem.
ITU-T [G-1080, 2008] proposes a QoE model that classifies QoE factors into two parts: one related to subjective human components or emotions, and the other to objective QoS parameters. Additionally, [G-1080, 2008] considers technology-centric parameters to be objective factors. We propose objective QoE factors based on human physiology, cognitive science and psycho-physics, because cognitive science
and mental models can be utilized to obtain precise quantitative information about human
performance [ITU, 2007]. A consolidated QoE based communication ecosystem has also
been proposed with extended concepts as described in a later section.
3D Telephony was selected as a case study. It consists of a 3D audio telephone and a teleconferencing system. Classic teleconferencing often suffers from issues such as low intelligibility and a limited ability of the participants to discern unfamiliar interlocutors. 3D Telephony is a possible solution to address the shortcomings of traditional teleconferencing services: it provides a virtual acoustic environment, and 3D sound improves the quality of experience of a teleconferencing service. To evaluate the 3D Telephony system [Hyder et al., 2010b,a], user studies were conducted following ITU-T's P.800 standard.
This chapter is divided into two main contributions. First, a theoretical framework for a consolidated QoE model and its taxonomy is presented. In the second half of the chapter, an experimental setup and the results of subjective studies are presented. Additionally, the results presented in this chapter help us to analyze the relationship between the virtual acoustic environment and the QoE (Fig. 7.2).
The chapter is organized as follows. In Section 7.2 we present related work. In Section 7.3 we discuss our proposal for a consolidated QoE model for a communication ecosystem. In Section 7.4 we present a use case study based on 3D Telephony and the methodology adopted for our user studies. In Section 7.5 we present test
results and discuss our findings. Conclusions and an outline of future work are presented in the last section of the chapter.
7.2 Background
7.2.1 QoS and QoE
QoE is considered an extension of the QoS concept; most audio telephony services, such as VoIP services, are still assessed based on Quality of Service (QoS) parameters [Bai and Ito, 2006, Radhakrishnan and Larijani, 2010]. Existing QoS metrics, such as packet loss rate, jitter, delay and throughput, typically indicate the impact on the audio quality level from the network point of view and do not directly reflect the user's experience. Consequently, these QoS parameters fail to capture the subjective and objective aspects associated with human perception and cognition.
QoE approaches have been introduced to overcome the limitations of current QoS-aware multimedia networking schemes as far as human perception and subjective aspects are concerned [Takahashi et al., 2008]. QoE applicability scenarios, requirements, evaluations and assessment methodologies in multimedia systems have been investigated by several researchers and working groups, such as the International Telecommunication Union – Telecommunication Standardization Sector (ITU-T) [G-1080, 2008], and the European Technical Committee for Speech, Transmission, Planning, and Quality of Service [ETSI, 2009]. The ITU-T proposed the E-Model [ITU-T, 2003a] to assess the quality of experience indirectly from network traffic patterns. Furthermore, the ITU-T recommended the use of the Perceptual Speech Quality Measure (PSQM) in its recommendation P.861 [ITU-T, 1994b], but it was recognized as having certain limitations in specific application areas. It was replaced by P.862, known as Perceptual Evaluation of Speech Quality (PESQ) [ITU-T, 2001b].
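As a rough illustration of how such an indirect, parameter-based estimate works, the following Python sketch combines a simplified R-factor computation with the standard R-to-MOS conversion of ITU-T G.107; the default rating of 93.2, the delay-impairment approximation and the example impairment values are common simplifications and assumptions, not the full recommendation.

    def e_model_mos(delay_ms, ie_eff=0.0):
        """Simplified E-model sketch: map one-way delay and an equipment/loss
        impairment (Ie_eff) to an estimated MOS via the G.107 R-to-MOS curve."""
        # Delay impairment Id: small linear term plus an extra penalty
        # above roughly 177 ms (a widely used approximation).
        i_d = 0.024 * delay_ms
        if delay_ms > 177.3:
            i_d += 0.11 * (delay_ms - 177.3)
        r = 93.2 - i_d - ie_eff          # default rating minus impairments
        r = max(0.0, min(100.0, r))
        if r <= 0.0:
            return 1.0
        # Standard R -> MOS conversion from ITU-T G.107
        return 1.0 + 0.035 * r + 7e-6 * r * (r - 60.0) * (100.0 - r)

    # Example: 150 ms one-way delay and a hypothetical Ie_eff of 11
    print(round(e_model_mos(150.0, ie_eff=11.0), 2))   # roughly 3.97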
7.3 QoE Based Model for a Communication Ecosystem

7.3.1.1 Subjective Human Factors
Subjective human factors are qualitative in nature and are normally assessed empirically. Some common subjective QoE factors are perceptions, feelings, ease of
use, joy of use, satisfaction, etc. These factors are normally obtained through surveys,
customer interviews, and ethnographic field studies [Cooper et al., 2007]. For more information on subjective studies, ITU-T provides the P.800 recommendation [ITU-T, 1996]. In marketing and social psychology, psychological models are normally used to understand human intentions and behavior. One widely recognized model is the Technology Acceptance Model (TAM) [Davis, 1986], which is a derivative of the Theory of Reasoned Action (TRA) [Sheppard et al., 1988]. The TAM is a simple model to help
understand human intention and behavior towards the adoption of a particular product or
service. Over time, the TAM model has been revised and advanced by other models, such
as the Theory of Planned Behavior (TPB), the Decomposed Theory of Planned Behavior
(DTPB), and the Unified Theory of Acceptance and Use of Technology (UTAUT) [Al-
Qeisi, 2009], etc. These approaches are used to understand the subjective acceptability
of any product or service by end users or customers. These psychological models can
also be utilized for capturing subjective human factors.
7.3.1.2 Objective Human Factors
Objective human factors are quantitative in nature and are related to human physiology,
to psycho-physical aspects and to cognition. Some examples of human objective factors
are the human audio-visual system, brain waves, heart rate, blood volume pressure,
memory, attention, language, task performance and human reaction time. The influence
of biology and of the cognitive system on human behavior or decision making is
normally investigated in cognitive psychology, behavioral neuroscience or in biological
psychology. Audio-visual systems have received increased attention with the innovation
and development of teleconferencing, computer games, and virtual reality systems. The
use of psycho-physical aspects and physiology could contribute to amassing significant data about the human biological state. Quantitative data answer questions such as "how much", "how many" or "where" [Cooper et al., 2007]. These factors can be gathered
and evaluated through subjective testing and/or via quantitative research.
The line between subjective and objective human factors suggests that they may be
interdependent and could possibly be inferred from each other through some mechanism,
e.g., a change in human biological and cognitive parameters could also influence human
subjective perceptions and feelings or vice versa.
7.3.1.3 Human Entity
The human entity category provides information about a person, such as his or her roles (e.g., customer, user) and characteristics (e.g., age, gender). The roles can be divided into three main categories: user, customer and group. In the current work, we
focus more on user and customer roles. A customer is the entity/person who subscribes
to a service and is a legal owner of that service; however, he or she may or may not be
the primary user of that service. A user is the individual who actually uses a service. The
line between the user and customer boxes indicates the possibility that their roles can
interchange. A customer who is paying for an on-line telephony service may be stricter
about quality than a user who is using a free on-line audio chat service. In [Laghari et al.,
2010] a customer experience model was presented to specifically understand customer
experience requirements. In addition to human entity roles, it is also possible that people in different age, gender or demographic groups have different QoE requirements. This sort of differentiation of human roles and characteristics helps researchers to better understand QoE requirements and to document them with much more precision.
7.3.2 Context Domain

7.3.2.1 Contextual Entity
(iii) Social context describes the social aspects of the contextual entity. The
social context usually contains interpersonal relations such as the social associations,
connections, or affiliations that may exist between two or among many people. For
instance, social relations can contain information about friends, family, neighbors, co-
workers, etc.
7.3.2.2 Contextual Characteristics
Each contextual entity may have specific characteristics and parametric specifications, for example, GPS data for a location, the echoes and reverberations of teleconferencing rooms, or the size of a virtual teleconferencing room. Changes in
contextual aspects have the tendency to influence human behavior. A person participating
in a teleconference or a telephony call who is sitting in a quiet room has different QoE
requirements than a person conducting a call or conference while standing in a railway
station, at a bus stop or in a cafeteria. To provide improved customization and better user
experience, the technological domain should be agile enough to adapt to the needs of a
user/customer as appropriate to their changing context. Context-aware applications and
systems are being developed to cater to the needs of real contexts. In the area of virtual context, a user has more freedom to shape his or her context according to his or her own needs. For example, in a 3D virtual acoustic environment for teleconferencing services, end users can vary the size of the virtual teleconferencing room and/or place the participants of a teleconference anywhere in the virtual acoustic environment that suits their needs. Thus, it becomes very interesting to investigate the significance of the impact of contextual aspects on QoE, and how contextual information could be exploited by the technological and business domains to develop services with better user experience and customized business models.
7.3.3 Business Domain
7.3.3.1 Business Entity
A business entity represents service providers, network operators, marketplace owners
and/or device vendors. Most business entities have customer/user service touch points that customers reach in order to subscribe to a service that fulfills their intended goals or to report a service problem. This interaction between customer/user and provider can be direct or indirect (on-line), but in both cases the interaction experience develops positive or negative feelings, or possibly a combination of both.
7.3.3.2 Business Characteristics
A business entity has certain properties such as business model and strategies,
which basically define the direction of its business. Business characteristics include
advertisement, pricing, promotion and brand image. To avoid a high customer/user churn rate and bad word-of-mouth, business characteristics should be mapped to QoE so that they fulfill customer expectations. Furthermore, there should be an alignment between business and technical characteristics to create an integrated
7.4 A use case study - 3D Telephony
Localization Performance (LP): LP measures how correctly listeners could locate the positions of the concurrent talkers in a virtual teleconferencing room. LP data are real quantitative data based on the actual performance of listeners; they represent the listener's ability to locate either both talkers correctly, only one, or neither, in a virtual acoustic environment with the help of a map. LP data are presented as percentage values.
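As a minimal sketch of how such percentage values can be derived from raw trial outcomes (the response encoding below is an assumption for illustration, not the actual data format of the study):

    from collections import Counter

    def localization_performance(trial_outcomes):
        """LP as percentages: per trial, the listener located both concurrent
        talkers correctly (2), only one (1), or neither (0)."""
        n = len(trial_outcomes)
        counts = Counter(trial_outcomes)
        return {label: 100.0 * counts[k] / n
                for k, label in ((2, "both correct"), (1, "one correct"), (0, "neither"))}

    # Hypothetical outcomes of 20 trials for one listener
    trials = [2, 2, 1, 2, 0, 2, 1, 2, 2, 2, 1, 2, 2, 0, 2, 2, 1, 2, 2, 2]
    print(localization_performance(trials))
    # {'both correct': 70.0, 'one correct': 20.0, 'neither': 10.0}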
We define three subjective human factors: Localization Easiness, Spatial Audio Quality and Overall Audio Quality. To obtain measures of these subjective human factors, subjects were asked to give their opinion ratings on a five-point MOS scale.
Localization Easiness (LE): LE represents human perception and feelings about localizing talkers. We define LE as an assessment of a listener's feeling of how easily the concurrent talkers can be located in the VAE.
Spatial Audio Quality (SAQ): This factor is also a perception- and feeling-related measure.
(iii) What is the actual performance of listeners in correctly locating the talkers at their
positions? (iv) How is the 3D audio quality rated by the subjects? (v) Is there any
difference in the listeners' perception and performance with respect to voice type and
virtual room size?
To validate this model and investigate the relationship between the QoE and a virtual
acoustic environment, we conducted user studies based on the following methodology.
7.4.2 Methodology
The methodology adopted for the tests has already been described in detail in Section 5.3. The scenarios and sub-scenarios for the current user study were selected based on the following considerations.
Virtual Room Size: In this scenario, we analyzed how varying virtual room sizes and sound source/talker-to-wall distances impact the QoE factors, and measured how participants' opinions and performance vary with room size. To determine the effect of room size and sound source/talker-to-wall distance on all QoE scores, this test used three different rooms with dimensions of 10³ m³, 15³ m³ and 20³ m³. The average lengths of the presented stimuli were 14.38 s, 14.65 s and 14.43 s, respectively, for the three tests.
Voice Type: In this scenario, our goal was to test the impact of relative and absolute differences in voice types (such as two concurrent male, female or mixed gender talkers) on the QoE. The three tests within this setup were Voice Type-1: two simultaneous female talkers with an average signal length of 13.03 s; Voice Type-2: two mixed gender talkers with an average signal length of 14.42 s; and Voice Type-3: two concurrent male talkers with speech signals of an average length of 14.38 s, each presented from four possible locations distributed around the table.
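For analysis scripts, the two scenarios and their sub-tests can be encoded as structured test conditions; the following sketch is a hypothetical encoding using the values given above:

    # Hypothetical encoding of the two scenarios; room edge lengths in metres,
    # mean stimulus lengths in seconds (values as reported above).
    SCENARIOS = {
        "virtual_room_size": [
            {"room_edge_m": 10, "mean_stimulus_s": 14.38},
            {"room_edge_m": 15, "mean_stimulus_s": 14.65},
            {"room_edge_m": 20, "mean_stimulus_s": 14.43},
        ],
        "voice_type": [
            {"talkers": ("female", "female"), "mean_stimulus_s": 13.03},
            {"talkers": ("female", "male"), "mean_stimulus_s": 14.42},
            {"talkers": ("male", "male"), "mean_stimulus_s": 14.38},
        ],
    }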
Summary of the Cronbach alpha test results, which verify the reliability and internal consistency of the QoE factors: all results are well above the 0.6 threshold, which indicates a high level of reliability for the construct variables.
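Cronbach's alpha for k items follows the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of the total score); the sketch below computes it for a hypothetical subjects-by-items matrix of five-point ratings:

    import numpy as np

    def cronbach_alpha(ratings):
        """Cronbach's alpha for an (n_subjects x n_items) rating matrix."""
        ratings = np.asarray(ratings, dtype=float)
        k = ratings.shape[1]
        item_vars = ratings.var(axis=0, ddof=1).sum()   # sum of item variances
        total_var = ratings.sum(axis=1).var(ddof=1)     # variance of total scores
        return k / (k - 1) * (1.0 - item_vars / total_var)

    # Hypothetical five-point ratings from six subjects on three items
    ratings = [[4, 4, 5], [3, 4, 4], [5, 5, 5], [2, 3, 3], [4, 4, 4], [3, 3, 4]]
    print(round(cronbach_alpha(ratings), 2))   # 0.93, well above the 0.6 threshold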
7.5 Results & Discussions
7.5.2 Discussion
In this section, the results for the two main scenarios, based on the virtual room size and the voice types of the participants, are presented.
7.5.2.1 Experiment I: QoE Factors and Virtual Room Size
In this experiment, the QoE factors are analyzed based on changes in the size of a virtual
teleconferencing room. The results (Table 7.2 and Fig. 7.6) suggest that there is a very small decrease in localization performance when we switch from a small room (10³ m³) to a medium-sized room (15³ m³). However, when we switch to a big room (20³ m³), a sudden decrease in localization performance can be observed.
Figure 7.6: Quality Scores Comparison for Different Virtual Acoustic Rooms
Additionally, relating to the spatial audio quality and overall audio quality experience
in virtual teleconference rooms, results show that both the subjective spatial audio quality
and the overall audio quality MOS scores gradually improve with an increase in the
size of a virtual room. In contrast to the LP, a strong positive correlation is found
for both SAQ (0.94) and OAQ (0.98). This implies that the localization performance of the test participants decreases with increasing room size, while the spatial and overall audio quality increase with the virtual room size. One possible factor for this result could be the echoes and reverberations, since they are longer in larger rooms. As reported in [Mershon et al., 1989, Zahorik, 2002, Shinn-Cunningham, 2001, Begault et al., 2001], reverberation in acoustic environments is considered to be a reliable cue in identifying sound source distance, but it also modestly degrades sound source directional perception [Santarelli, 2001] and speech intelligibility [Houtgast].
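The correlations reported here are plain Pearson coefficients between room size and the mean scores per room; a minimal sketch (with illustrative MOS values, not the measured data) is:

    import numpy as np

    room_edge = np.array([10.0, 15.0, 20.0])   # room edge length in metres
    saq_mos = np.array([3.4, 3.7, 3.9])        # illustrative mean SAQ MOS per room
    r = np.corrcoef(room_edge, saq_mos)[0, 1]
    print(f"Pearson r = {r:.2f}")              # close to 1 for a monotonic increase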
7.5.2.2 Experiment II: QoE Factors and Voice Type
The highest localization easiness is achieved with mixed gender voice types. Regarding the impact of voice type on spatial audio quality and overall audio quality, the results in Table 7.2 indicate that the highest subjective spatial audio quality and overall audio quality MOS scores were achieved with mixed gender voice types and the lowest with female voice types. It is safe to conclude that mixed gender voice types achieve better LP values and MOS scores, since two voices of different genders can be distinguished much more easily than two voices of the same gender.
7.6 Conclusion
In this chapter, a consolidated Quality of Experience (QoE) model has been presented, which is based on the various domains or actors and develops a holistic and integrated view of QoE in a communication ecosystem. The model shows the interaction between human, business, technology and contextual aspects. To evaluate this model, we focused on studying the influence of contextual aspects on QoE. The influence of technological and business parameters may be considered in future work.
The case study was based on the central idea of this thesis, the 3D telephony and teleconferencing system. In particular, an applied QoE model for 3D telephony was constructed to study the influence of the virtual acoustic environment on the user/customer quality of experience. Further, a methodology designed to act as a framework for conducting useful user studies was presented. In the user study, the impact of changing characteristics and contextual aspects within the 3D telephony system on user QoE factors, such as localization performance, localization easiness, spatial audio quality and overall audio quality, was assessed. According to the results, it is safe to conclude that contextual aspects do influence QoE constructs. It was found that changes in the virtual room sizes and in the voice types of concurrent talkers produced different values/scores for the QoE factors. Additionally, this study suggests that a medium-sized (15³ m³) teleconferencing room and mixed gender voice types provide the optimal quality of experience in a 3D telephony-based virtual acoustic environment.
Chapter 8
Conclusions
As an outcome of this thesis work, a 3D audio telephony and teleconferencing system, called 3D telephony, has been realized. The 3D telephony system is based on a customizable virtual acoustic environment. In order to optimize 3D telephony, a series of subjective experimental studies was conducted, and the empirical analysis of the results has been presented.
The first user study comprised four sets of seven different placements of the participants of the conference call, used to judge audio quality, understandability, locatability and the occurrence of front/back or elevation localization errors. In this user study, the localization performance of the subjects was measured and perceptual quality scores were obtained by varying the HRTFs, the virtual acoustic room sizes and the heights of the listener and the talkers. When the listening tests were carried out by placing the talkers in the corners of the virtual acoustic rooms, it was found that most of the time the subjects were unsure about the talker locations; they frequently made front/back and up/down localization errors. On the other hand, when the placement of talkers and listener was made horizontal, with the listener at the center of the room and the talkers to the left, right, front and back of the listener, the localization performance improved. In particular, left/right localization was nearly perfect; however, nearly 50 percent front/back localization errors were still observed. Changing the placement from horizontal to frontal proved even more effective: frontal placement yielded better localization scores than the competing placements of talkers and listeners. With frontal placement, listeners found it very easy to locate the virtual talkers, and their success rate in locating them remained nearly perfect.
For further optimization of the 3D telephony and teleconferencing solution, subjective experiments were conducted that helped to understand how to select a proper virtual acoustic environment for teleconferencing. This second subjective study was based on eleven sets of user experiments examining the effect that simulated virtual acoustic room properties, virtual sitting arrangements, reflections of a conference table, the number of concurrent talkers and voice characteristics have on the perception of audio quality, locatability and speech intelligibility. It was identified that two simultaneous talkers of different genders were more often localized correctly and achieved better perceptual quality scores. It was also identified that increasing the number of simultaneous talkers raised the localization correctness ratios, but at the cost of lower quality scores. Furthermore, increasing the table size within the virtual acoustic environment increased the overall localization scores; however, no major difference in quality scores was found among the different table sizes used for the experiments. It was also identified that an increase in talker density decreases localization performance and quality scores. Finally, it was found that increasing the volume and size of the virtual acoustic room brought a positive change in the quality scores, whereas localization performance did not follow this trend: localization scores were better with smaller room sizes.
To optimize the 3D telephony solution further, conversational tests with three interlocutors were conducted. The virtual acoustic environment for the conversational tests was selected on the basis of the successful results obtained in the earlier subjective studies. Through the conversational tests we obtained subjective opinions comparing audio qualities such as mono, stereo, spatial and spatial with head-tracking. It was identified that spatial audio performed slightly worse than mono and stereo, but overall the conversational tests based on spatial audio produced satisfying results. From the conversational tests it was concluded that, to clearly observe the advantages of spatial audio for conversational quality, future conversational tests should be conducted with at least four interlocutors.
Further, a QoE model for a communication ecosystem, based on its various domains or actors, has been presented. The presented QoE model develops a holistic and integrated view of QoE in a communication ecosystem and shows the interaction among human, business, technology and contextual aspects. To evaluate this model, the influence of contextual aspects on the user Quality of Experience has been studied. In addition, an applied user QoE model for 3D telephony was constructed to study in particular the influence of the Virtual Acoustic Environment (VAE) on the user Quality of Experience (QoE). Through the user studies, the impact of contextual aspects on QoE factors was assessed. It was found that changes in context bring changes in the QoE constructs.
8.1 Outlook on Future Research
Based on the knowledge gained about how to design virtual acoustic environments, an important item of future work is to further optimize virtual acoustic rooms so that they are perceived as close to real conference rooms. To optimize virtual acoustic rooms to a further extent, a standard for the acoustic quality of rooms such as [Beuth Verlag, 2004] could be followed. This standard applies to small and medium-sized rooms and aims to ensure good acoustic quality precisely for spoken communication in such rooms. We can take advantage of this standard and apply its design guidelines to virtual acoustic rooms of different volumes and sizes.
Further, future work may include conversational tests based on at least four interlocutors to optimize the 3D telephony solution to a further extent. Through such conversational tests we may compare different audio qualities such as mono, stereo, spatial with and without head-tracking, and different
Appendix A
Research Papers
A.1 Conference Papers/Technical Reports
1. Mansoor Hyder, Michael Haun, Christian Hoene, “Measurements of Sound
Localization Performance and Speech Quality in the Context of 3D Audio
Conference Calls”, In International Conference on Acoustics, NAG/DAGA, March
2009, Rotterdam, Netherlands.
3. Mansoor Hyder, Michael Haun, and Christian Hoene, “Placing the Participants
of a Spatial Audio Conference Call”, In IEEE Consumer Communications
and Networking Conference - Multimedia Communication and Services (CCNC
2010), January 2010, Las Vegas, USA.
5. Christian Hoene and Mansoor Hyder, “Optimally Using the Bluetooth Subband
Audio Codec (SBC) Over Wireless links and on the Internet”, In 35th Annual
IEEE Conference on Local Computer Networks (LCN) (LCN 2010), October
2010, Denver, Colorado, USA.
7. Mansoor Hyder, Khalil ur Rehman, Christian Hoene, “Are QoE requirements for
Multimedia different for men and women?”, Second International Multi Topic
Conference, IMTIC 2012, March 28-30, 2012, Jamshoro, Pakistan.
Appendix B
Summary of Contributions
I would like to summarize the contributions of my colleagues, collaborators and students to this PhD thesis. The basic scientific idea of this research work belongs to Dr.-Ing. Christian Hoene. He contributed many useful comments while conducting the research work, selecting testing parameters and writing up research results and publications. Michael Haun, Olesja Weidmann and Jonas Leidig did their Diplom and undergraduate theses under my supervision; their implementation and research contributions were important to achieving this thesis work. I initiated the basic research idea from scratch, conducted and implemented all experimental user studies, obtained data through subjective/user studies, analyzed the research results, wrote and presented research articles, and guided and supervised students in accomplishing their Diplom and undergraduate thesis work. All this eventually helped me to accomplish this thesis. I also initiated a collaborative work with Institut Telecom SudParis, Paris, France on the idea of "An Investigation into the Relationship Between Perceived Quality-of-Experience and Virtual Acoustic Environments: the Case of 3D Audio Telephony". I worked with Prof. Noel Crespi and his student Mr. Khalil ur Rehman Laghari on this collaboration. We developed a consolidated Quality of Experience model for a communication ecosystem and also developed an applied user Quality of Experience model for the 3D telephony system. Chapter 7 of this thesis is based on our collaborative work. Later on, we (the collaborative partners) wrote a journal article together, which was accepted for publication in the Journal of Universal Computer Science.
Appendix C
Conversational Tests
C.1 Questionnaire for Conversational Tests
Please provide your feedback on the call quality you have just experienced by answering the following questions:
(1) How would you assess the sound quality of the other person’s voice?
• No distortion at all, natural
• Minimal distortion
• Moderate distortion
• Considerable distortion
• Severe distortion
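For the evaluation, such verbal categories are conventionally coded onto a five-to-one scale and averaged into a mean opinion score; the following sketch illustrates this coding (the numeric mapping is the conventional P.800-style one and is an assumption for illustration):

    # Conventional 5-to-1 coding of the verbal distortion scale (assumed mapping)
    DISTORTION_SCALE = {
        "No distortion at all, natural": 5,
        "Minimal distortion": 4,
        "Moderate distortion": 3,
        "Considerable distortion": 2,
        "Severe distortion": 1,
    }

    def mean_opinion_score(answers):
        """Average the coded ratings of all subjects for one condition."""
        return sum(DISTORTION_SCALE[a] for a in answers) / len(answers)

    print(mean_opinion_score(["Minimal distortion", "Moderate distortion",
                              "No distortion at all, natural"]))   # 4.0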
(2) How well did you understand what the other person was telling you?
• No loss of understanding
(3) What level of effort did you need to understand what the other person was
telling you?
• No special effort required
(4) How would you assess your level of effort to converse back and forth during the
conversation?
• No special effort
(5) How annoying was it for you when all partners were talking?
• No annoyance
• Minimal annoyance
• Moderate annoyance
• Considerable annoyance
• Severe annoyance
(6) What is your opinion of the connection you have just been using?
• Excellent
• Good
• Fair Quality
• Poor Quality
• Bad Quality
(8) How easy was it for you to determine the direction of the conversational partner's speech in the listening environment?
• No special effort required
• Good
• Fair Quality
• Poor Quality
• Bad Quality
C.2 Conversational Tests Descriptive Statistics
Table C.1: Descriptive statistics for the conversational test for mono audio quality
Table C.2: Descriptive statistics for the conversational test for stereo audio quality
Table C.3: Descriptive statistics for the conversational test for spatial audio quality
Table C.4: Descriptive statistics for the conversational test for spatial audio with head-tracking (spatial-HT)
Bibliography
J. Ahrens, M. Geier, and S. Spors. The soundscape renderer: A unified spatial audio
reproduction framework for arbitrary rendering methods. In Audio Engineering
Society Convention 124, 5 2008.
J. Ahrens, M. Geier, A. Raake, and C. Schlegel. Listening and conversational quality
of spatial audio conferencing. In Audio Engineering Society Conference: 40th
International Conference: Spatial Audio: Sense the Sound of Space, 10 2010.
K. Al-Qeisi. Analyzing the use of the UTAUT model in explaining an online behaviour: Internet banking adoption. 2009.
V. Algazi, R. Duda, D. Thompson, and C. Avendano. The CIPIC HRTF database. In Applications of Signal Processing to Audio and Acoustics, 2001 IEEE Workshop on the, pages 99–102. IEEE, 2001.
J. Allen and D. Berkley. Image method for efficiently simulating small-room acoustics.
J. Acoust. Soc. Am, 65(4):943–950, 1979.
T. Auer and A. Pinz. The integration of optical and magnetic tracking for multi-user
augmented reality. Computers & Graphics, 23(6):805–808, 1999.
R. Azuma, B. Hof, H. Neely III, R. Sarfaty, M. Daily, G. Bishop, V. Chi, G. Welch,
U. Neumann, S. You, et al. Making augmented reality work outdoors requires hybrid
tracking. In Augmented Reality: placing artificial objects in real scenes: proceedings
of IWAR-98, pages 219–224, 1999a.
R. Azuma, B. Hoff, H. Neely III, and R. Sarfaty. A motion-stabilized outdoor augmented
reality system. IEEE Virtual Reality, 1999. Proceedings., pages 252–259, 1999b.
E. Bachmann, X. Yun, and C. Peterson. An investigation of the effects of magnetic
variations on inertial magnetic orientation sensors. IEEE Robotics and Automation
Magazine, pages 76–87, 2007.
Y. Bai and M. Ito. A study for providing better quality of service to VoIP users. 2006.
ISSN 1550-445X.
J. Baldis. Effects of spatial audio on memory, comprehension, and preference during
desktop conferences. In Proceedings of the SIGCHI conference on Human factors in
computing systems, pages 166–173. ACM, 2001.
N. Barrett and S. Berge. A new method for B-Format to binaural transcoding. In 40th AES International Conference, Tokyo, Japan, pages 8–10, 2010.
D. Begault. 3-D sound for virtual reality and multimedia. Academic Press Professional,
Inc., San Diego, CA, USA, 1994. ISBN 0-12-084735-3.
A. Berkhout, D. De Vries, and P. Vogel. Acoustic control by wave field synthesis. Journal
of Acoustical Society of America, 93:2764–2764, 1993.
Beuth Verlag. Hörsamkeit von kleinen und mittleren Räumen (Acoustical quality in small to medium-sized rooms). Beuth Verlag GmbH, 2004.
J. Borish. Extension of the image model to arbitrary polyhedra. The Journal of the
Acoustical Society of America, 75:1827, 1984.
J. Bradley, R. Reich, and S. Norcross. On the combined effects of signal-to-noise ratio
and room acoustics on speech intelligibility. The Journal of the Acoustical Society of
America, 106:1820, 1999.
J. Bradley, H. Sato, and M. Picard. On the importance of early reflections for speech in
rooms. The Journal of the Acoustical Society of America, 113:3233, 2003.
K. Brandenburg, S. Brix, and T. Sporer. Wave field synthesis: From research
to applications. In Proceedings of 12th European Signal Processing Conference
(EUSIPCO), Vienna, Austria, 2004.
Encyclopædia Britannica. Acoustics. Oct. 2011. URL http://www.britannica.com/EBchecked/topic/4044/acoustics.
A. Bronkhorst. Localization of real and virtual sound sources. The Journal of the
Acoustical Society of America, 98:2542, 1995.
D. Brungart, A. J. Kordik, and B. D. Simpson. Effects of headtracker latency in virtual
audio displays. JAE, 54(1/2):32–44, Feb. 2006.
D. Brungart, B. Simpson, C. Bundesen, S. Kyllingsbaek, A. Burton, and A. Megreya.
Cocktail party listening in a dynamic multitalker environment. Perception and
Psychophysics, 69(1):79, 2007.
G. Burdea and P. Coiffet. Virtual reality technology. Presence: Teleoperators & Virtual
Environments, 12(6):663–664, 2003.
M. Burkhard and R. Sachs. Anthropometric manikin for acoustic research. The Journal
of the Acoustical Society of America, 58:214, 1975.
C. Cheng and G. Wakefield. Introduction to Head-Related Transfer Functions (HRTFs):
Representations of HRTFs in Time, Frequency, and Space. J Audio Eng Soc, 49(4):
231, 2001.
C. I. Cheng and G. H. Wakefield. Introduction to head-related transfer functions
(HRTFs): Representations of HRTFs in time, frequency, and space. In AES
Convention: 107,, Sept. 1999.
E. Cherry. Some experiments on the recognition of speech, with one and with two ears.
Journal of the acoustical society of America, 25(5):975–979, 1953.
Y. Chow. Low-cost multiple degrees-of-freedom optical tracking for 3d interaction
in head-mounted display virtual reality. International Journal of Recent Trends in
Engineering, 1(1):52–56, 2009.
K. Crispien and T. Ehrenberg. Evaluation of the "Cocktail Party Effect" for Multiple
Speech Stimuli within a Spatial Auditory Display. Journal of the Audio Engineering
Society, 43(11):932–941, 1995.
D. De Vries and M. Boone. Wave field synthesis and analysis using array technology. In
Applications of Signal Processing to Audio and Acoustics, 1999 IEEE Workshop on,
pages 15–18. IEEE, 1999.
A. Dey. Understanding and using context. Personal and ubiquitous computing, 5(1):4–7,
2001.
A. Dictionary. The American Heritage Science Dictionary, July 2011. URL http://dictionary.reference.com/.
M. Ericson and R. McKinley. Binaural and Spatial Hearing in Real and Virtual
Environments, chapter The intelligibility of multiple talkers separated spatially in
noise, pages 701–724. Erlbaum, Mahwah, NJ, r. h. gilkey and t. r. anderson edition,
1997.
S. ETSI. European technical committee for speech, transmission, planning, and quality
of service, 2009.
J. Fajardo, F. Liberal, and N. Bilbao. Study of the impact of UMTS Best Effort
parameters on QoE of VoIP services. In Autonomic and Autonomous Systems, 2009.
ICAS’09. Fifth International Conference on, pages 142–147. IEEE, 2009.
H. Fisher and S. Freedman. The role of the pinna in auditory localization. Journal of
Auditory research, 1968. ISSN 0021-9177.
M. Gerzon. Ambisonics. Part two: Studio techniques. Studio Sound, 17(8):24–26, 1975.
M. Good and R. Gilkey. Sound localization in noise: The effect of signal-to-noise ratio.
The Journal of the Acoustical Society of America, 99:1108, 1996.
H. Haas. Über den Einfluss des Einfachechos auf die Hörsamkeit von Sprache [On the influence of a single echo on the audibility of speech]. Acustica, 1(2):49–62, 1951.
J. Hair, R. Anderson, R. L. Tatham, and W. Black. Multivariate Data Analysis with Readings. Englewood Cliffs, NJ: Prentice Hall, 1998.
D. Hallaway, S. Feiner, and T. Høllerer. Bridging the gaps: Hybrid tracking for adaptive
mobile augmented reality. Applied Artificial Intelligence, 18(6):477–500, 2004.
D. Hawkins and W. Yacullo. Signal-to-noise ratio advantage of binaural hearing aids and
directional microphones under different levels of reverberation. Journal of Speech and
Hearing Disorders, 49(3):278, 1984.
C. Hoene and M. Hyder. Optimally using the bluetooth subband codec. In Local
Computer Networks (LCN), 2010 IEEE 35th Conference on, pages 356–359. IEEE,
2010.
H. Hu, L. Chen, and Z.-Y. Wu. The estimation of personalized HRTFs in individual VAS. In Fourth International Conference on Natural Computation, pages 203–207. IEEE, 2008.
P. Hughes. Spatial audio conferencing. In ITU-T Workshop: "From Speech to
Audio:bandwidth extension,binaural perception", Lannion France, 2008.
C. Huygens. Traité de la lumière. published in Leyden, 1690.
M. Hyder, M. Haun, and C. Hoene. Measurements of sound localization performance and speech quality in the context of 3D audio conference calls. In International Conference on Acoustics, NAG/DAGA, Rotterdam, Netherlands, Mar. 2009.
M. Hyder, M. Haun, O. Weidmann, and C. Hoene. Assessing virtual teleconferencing rooms. In 129th Audio Engineering Society Convention, San Francisco, CA, USA, Nov. 2010a.
M. Hyder, M. Haun, and C. Hoene. Placing the participants of a spatial audio conference
call. In IEEE Consumer Communications and Networking Conference - Multimedia
Communication and Services (CCNC 2010), Las Vegas, USA, Jan. 2010b.
K. Inkpen, R. Hegde, M. Czerwinski, and Z. Zhang. Exploring spatialized audio &
video for distributed conversations. In Proceedings of the 2010 ACM conference on
Computer supported cooperative work, pages 95–98. ACM, 2010.
J. Irwin. Basic anatomy and physiology of the ear. Infection and hearing impairment,
page 1, 2006.
F. Itakura. Minimum prediction residual principle applied to speech recognition. IEEE
Transactions on Acoustics, Speech and Signal Processing, 23(1):67–72, 1975.
F. Itakura and S. Saito. Analysis synthesis telephony based on the maximum likelihood
method. Repts. 6th Int. Congr. Acoustics, pages 17–20.
T. ITU. Definition of Quality of Experience (QoE). TD 109rev2 (PLEN/12), Jan. 2007.
T. ITU. International Telecommunication Union, ITU-T webpage, Dec. 2010. URL http://www.itu.int.
R. ITU-T. E.800: Terms and definitions related to quality of service and network
performance including dependability. ITU-T Recommendation, 1994a.
R. ITU-T. P.861: Objective quality measurement of telephone-band (300–3400 Hz) speech codecs. ITU-T Recommendation, 1994b.
R. ITU-T. P.800: Methods for subjective determination of transmission quality. ITU-T
Recommendation, 1996.
R. ITU-T. BS.1534: Method for the subjective assessment of intermediate sound quality (MUSHRA). ITU-T Recommendation, 2001a.
R. ITU-T. P.862: Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. ITU-T Recommendation, 2001b.
R. ITU-T. G.107: The E-model, a computational model for use in transmission planning. ITU-T Recommendation, 2003a.
M. Jeub, M. Schäfer, and P. Vary. A binaural room impulse response database for the
evaluation of dereverberation algorithms. In Proceedings of the 16th international
conference on Digital Signal Processing, pages 550–554. Institute of Electrical and
Electronics Engineers Inc., 2009.
M. Karjalainen. A new auditory model for the evaluation of sound quality of audio
systems. In Proc. ICASSP, volume 85, pages 608–611, 1985.
M. Karjalainen. Structure and function of hearing, chapter 05, Aug. 2011. URL http://www.acoustics.hut.fi/teaching.
M. Karjalainen, M. Tikander, and A. Harma. Head-tracking and subject positioning using
binaural headset microphones and common modulation anchor sources. In Acoustics,
Speech, and Signal Processing, 2004. Proceedings.(ICASSP’04). IEEE International
Conference on, volume 4, pages iv–101. IEEE, 2004.
C. Kim, S. Ahn, I. Kim, and H. Kim. 3-dimensional voice communication system for
two user groups. Advanced Communication Technology, 2005, ICACT 2005. The 7th
International Conference on, 1:100–105, 0-0 2005a.
D. Kim, M. Tarraf, L. Technol, and N. Whippany. Enhanced perceptual model for non-
intrusive speech quality assessment. volume 1, 2006.
H. Kim, D. Jee, M. Park, and S. Yoon, B.and Choi. The real-time implementation of 3D
sound system using DSP. In IEEE 60th Vehicular Technology Conference (VTC2004),
volume 7, pages 4798–4800, Sept. 2004.
J. Kim, S. Kim, Y. Kim, J. Lee, and S.-i. Park. New HRTFs (head related transfer functions) for 3D audio applications. AES Convention: 118, May 2005b.
A. Krokstad, S. Strom, and S. Sørsdal. Calculating the acoustical room response by the
use of a ray tracing technique. Journal of Sound and Vibration, 8(1):118–125, 1968.
K. Laghari, I. Yahya Ben, and N. Crespi. Towards a service delivery based on customer
experience ontology: shift from service to experience. In Proceedings of the 5th
IEEE international conference on Modelling autonomic communication environments,
pages 51–61. Springer-Verlag, 2010.
M. Laitinen. Binaural reproduction for directional audio coding. PhD thesis, Helsinki University of Technology, 2008.
C. Low and L. Babarit. Distributed 3D audio rendering. Computer Networks and ISDN
Systems, 30:407–415, 1998.
B. Mathews. Vector markup language (vml). World Wide Web Consortium Note 13-
May-1998, May 1998. URL http://www.w3.org/TR/1998/NOTE-VML-19980513.
W. Noble. Auditory localization in the vertical plane: Accuracy and constraint on bodily
movement. The Journal of the Acoustical Society of America, 82:1631, 1987.
V. Pulkki. Virtual sound source positioning using vector base amplitude panning. Journal
of the Audio Engineering Society, 45(6):456–466, 1997.
V. Pulkki, J. Huopaniemi, and T. Huotilainen. Dsp tool for 8-channel audio mixing. In
Proc. Nordic Acoustical Meeting, volume 96, pages 307–314, 1996.
A. Raake. 3CTS: 3-party conversational test scenarios for conference assessment. ITU-T, Study Group 12, Contribution 201, Jan. 2011.
A. Rix and M. Hollier. Perceptual analysis measurement system for robust end-to-end
speech quality assessment. volume 3, pages 1515–1518, 2000.
W. Ryu and D. Kim. Real-time 3D Head Tracking and Head Gesture Recognition. pages
169–172, 2007.
D. Schröder and T. Lentz. Real-Time Processing of Image Sources Using Binary Space
Partitioning. Journal of the Audio Engineering Society, 54(7/8):604–619, 2006. ISSN
0004-7554.
E. Shaw. External ear response and sound localization. Localization of sound: Theory
and applications, pages 30–41, 1982.
E. Shaw. Acoustical features of the human external ear. Mahwah, NJ: Lawrence
Erlbaum, 1997.
H. Sinnreich and A. Johnston. Internet communications using SIP: delivering VoIP and
multimedia services with Session Initiation Protocol. John Wiley & Sons, Inc., 2001.
SpacePoint, PNI. SpacePoint 9-axis sensor system, Oct. 2011. URL http://www.pnicorp.com/products/spacepoint-gaming.
D. Spring. Selection of information in auditory virtual reality. PhD thesis, Otto-von-
Guericke-Universität Magdeburg, Universitätsbibliothek, 2007.
A. Takahashi, D. Hands, and V. Barriac. Standardization activities in the ITU for a QoE assessment of IPTV. Communications Magazine, IEEE, 46(2):78–84, 2008.
C. Uni-Verse. Uni-Verse - FP6 project. Uni-Verse Consortium, Mar. 2007. URL http://www.uni-verse.org/.
R. Vaananen, V. Valimaki, J. Huopaniemi, and M. Karjalainen. Efficient and parametric
reverberator for room acoustics modeling. In ICMC 97, pages 200–203, 1997.
S. Vesa. Binaural sound source distance learning in rooms. IEEE Trans Audio Speech
Language Process, 17(8):1498–1507, 2009.
E. Von Hornbostel and M. Wertheimer. Über die Wahrnehmung der Schallrichtung [On
the perception of the direction of sound]. Akademie der Wissenschaften, 1920.
H. Wallach. The role of head movements and vestibular and visual cues in sound
localization. Journal of Experimental Psychology, 27(4):339, 1940.
E. M. Wenzel. The role of system latency in multi-sensory virtual displays for space
applications. In Proceedings of HCI International 2001, New Orleans, LA, pages
619–623, Aug. 2001.
F. Wightman and D. Kistler. Monaural sound localization revisited. The Journal of the
Acoustical Society of America, 101:1050, 1997.
F. Wightman and D. Kistler. Resolution of front–back ambiguity in spatial hearing by
listener and source movement. The Journal of the Acoustical Society of America, 105:
2841, 1999.
Wikipedia. Signal-to-noise ratio. Wikipedia, The Free Encyclopedia, Oct. 2011. URL http://en.wikipedia.org/wiki/Signal-to-noise_ratio.
W. Yang and J. Bradley. Effects of room acoustics on the intelligibility of speech in
classrooms. Journal of the Acoustical Society of America, 125(2):1–12, 2009.
W. Yang and M. Hodgson. Auralization study of optimum reverberation times for speech
intelligibility for normal and hearing-impaired listeners in classrooms with diffuse
sound fields. The Journal of the Acoustical Society of America, 120:801, 2006.
W. Yang, M. Benbouchta, and R. Yantorno. Performance of the modified bark
spectral distortion as an objective speech quality measure. In IEEE International
Conference on Acoustics Speech and Signal Processing, volume 1. Institute of
Electrical Engineers-INC (IEE), 1998.
N. Yankelovich, W. Walker, P. Roberts, M. Wessler, J. Kaplan, and J. Provino. Meeting
central: making distributed meetings more effective. In Proceedings of the 2004 ACM
conference on Computer supported cooperative work, pages 419–428. ACM, 2004.
N. Yankelovich, J. Kaplan, J. Provino, M. Wessler, and J. M. DiMicco. Improving
audio conferencing: are two ears better than one? In Proceedings of the 2006
20th anniversary conference on Computer supported cooperative work (CSCW ’06),
pages 333–342, New York, NY, USA, 2006. ACM. ISBN 1-59593-249-6. doi:
http://doi.acm.org/10.1145/1180875.1180926.
J. Yim, E. Qiu, and T. Graham. Experience in the design and development of a game
based on head-tracking input. In Proceedings of the 2008 Conference on Future Play:
Research, Play, Share, pages 236–239. ACM, 2008.
X. Yun, E. Bachmann, and R. McGhee. A simplified quaternion-based algorithm for
orientation estimation from earth gravity and magnetic field measurements. IEEE
Transactions on Instrumentation and Measurement, 57(3):638–650, 2008.
P. Zahorik. Assessing auditory distance perception using virtual acoustics. The Journal
of the Acoustical Society of America, 111:1832, 2002.
R. Zhu and Z. Zhou. A Real-Time Articulated Human Motion Tracking Using Tri-
Axis Inertial/Magnetic Sensors Package. IEEE Transactions on Neural Systems and
Rehabilitations Engineering, 12(2):295, 2004.