Dissertation
submitted to the Faculty of Science (Mathematisch-Naturwissenschaftliche Fakultät)
of the Eberhard Karls Universität Tübingen
in fulfillment of the requirements for the degree of
Doctor of Natural Sciences
(Dr. rer. nat.)
presented by
M.Sc. Mansoor Hyder
from Naushahro Feroze (Sindh), Pakistan
Tübingen
2011
Date of the oral examination: 14.12.2011
Dean: Prof. Dr. Wolfgang Rosenstiel
First reviewer: Dr.-Ing. Christian Hoene
Second reviewer: Prof. Dr. Thomas Walter
Dedicated
To
My Mother
Nawab Khatoon Depar
My Father
Ghullam Hyder Depar
&
Also
To
Subhan Khatoon Depar
Muhammad Soomar Khan Depar
Rabia Hyder Depar
Shahid Hyder Depar
Abstract
The invention of telephony has brought a significant revolution to our lives and is
undoubtedly considered one of the most important inventions of the modern-day world.
Yet over the last decades, hardly any improvements in audio quality have been achieved.
Telephony still suffers from issues such as low speech intelligibility, poor audio quality
and extraneous noise. To improve the quality of telephony, the use of spatial (or 3D)
audio has been proposed. 3D audio can offer significant advantages, such as enhanced
overall audio and speech quality, since our natural listening ability is inherently three
dimensional. Here, the nature of the Virtual Acoustic Environments (VAEs) used in
most 3D audio simulations plays a very important role in the perception of spatial
audio. Due to the importance of VAEs, there is a need to study various VAE parameters
in order to properly design virtual acoustic rooms for better audio quality, speech
intelligibility and enhanced localization performance.
This thesis introduces a telephony and teleconferencing system supporting three
dimensional audio and customizable virtual acoustic environments. The system consists
of a VoIP based telephone extended by low-delay audio codecs, three dimensional
renderers, and headphones extended by head-tracking sensors.
This thesis also presents a series of experiments conducted to optimize the 3D
telephony system. In the experimental study, various parameters are considered to
validate the speech quality, locatability and speech intelligibility of the teleconferencing
participants. Within two different VAEs, seven different placements of participants were
studied. In addition, eleven sets of user experiments are described in this thesis that
examine the effects of simulated acoustic room properties, virtual sitting arrangements,
reflections of a conference table, the number of concurrent talkers and voice
characteristics. This thesis also presents live conversational tests with three interlocutors
to compare the audio quality of mono, stereo and spatial conversations with and without
head-tracking.
A conceptual and holistic Quality of Experience (QoE) model comprising all domains
of a communication ecosystem and the relationships between QoE and virtual acoustic
environments is also presented. The model is evaluated through user studies and
empirical analysis. Based on this model, a use case study is presented for three
dimensional telephony. The interaction and classification of QoE factors and contextual
aspects are also presented.
Kurzfassung
The invention of the telephone has had a significant influence on modern life and can
undoubtedly be regarded as one of the most important inventions of the modern age.
Yet over the last decades, the speech quality of this invention has hardly improved.
Telephone users still suffer from problems such as poor intelligibility, deficient audio
quality and interfering noise. To remedy these problems, the use of spatial sound (3D
audio) has been proposed. 3D audio offers considerable advantages for audio and speech
quality, since humans naturally hear spatially. The nature of the Virtual Acoustic
Environment (VAE) exerts a strong influence on the perception of spatial sound. For
this reason, it is necessary to examine the influence of different VAE parameters on
speech intelligibility, audio quality and locatability.
This thesis therefore presents a telephone and teleconferencing system that supports
3D audio technologies and individually configurable VAEs. The system consists of a
VoIP softphone extended by low-delay codecs, 3D renderers and headphones with
head-tracking sensors.
Furthermore, this thesis describes a series of experimental measurements for
optimizing the telephone system. The experiments comprise the evaluation of various
parameters describing the speech quality, speech intelligibility and locatability of the
participants of a teleconference. Seven different participant placements were evaluated
in two different VAEs. In eleven further scenarios, the influence of different environment
parameters, participant placements, a conference table, and the number of simultaneous
talkers and their voice characteristics was examined. Additionally, conversational tests
were conducted that measure the influence of mono, stereo and spatial sound, as well as
the use of head-tracking headphones, on audio quality.
Finally, this thesis describes a conceptual, holistic Quality of Experience (QoE)
model that encompasses all domains of a communication ecosystem as well as the
relationships between QoE aspects and VAEs. An evaluation of the model by means of
user studies and empirical analysis is presented. Based on this model, a case study
focusing on 3D telephony is described, and a classification of QoE factors and their
interactions with contextual aspects is presented.
Acknowledgments
The research work for this thesis was conducted at the Wilhelm Schickard Institut für
Informatik (Computer Science) at the Eberhard Karls Universität Tübingen, Germany.
The preparation of this thesis was supported by the Higher Education Commission (HEC)
Pakistan in collaboration with the Deutscher Akademischer Austausch Dienst (DAAD),
Germany.
My enormous thanks and gratitude go to my guide and supervisor Dr.-Ing. Christian
Hoene for his constant support and technical guidance throughout my PhD research.
I am thankful to him for inviting me into his research group to conduct the PhD research
work for the accomplishment of this thesis. He is one of the most wonderful persons I
have ever met in my life. He always kept me motivated to achieve this thesis work and
provided me his kind support and technical insights whenever they were needed. I am
also thankful to all the teachers, coworkers and staff at the Wilhelm Schickard Institut
für Informatik, the University of Tübingen, for their help and support in making this
dissertation possible. Especially, I am thankful to Prof. Dr. Andreas Zell for reviewing
my annual HEC-DAAD reports. I am also thankful to Prof. Dr. Michael Menth for
his kind support. I would also like to thank my second supervisor Prof. Dr. Thomas
Walter for his kind support and time. I would also like to thank my current workmates,
especially Michael Haun, with whom I also published research work in co-authorship,
for always being so helpful and kind. I am also thankful to Olesja Weidmann, Patrick
Schreiner, Stefan König, Mark Schmidt, Michael Höfling, Alfons Martin and Susanna
Uresch for their help and support.
The work in Chapter 7 was achieved in collaboration with Institut Telecom Sud Paris,
Evry, France. I am thankful to Professor Dr. Noel Crespi and M.Sc. Khalil ur Rehman
Laghari for their collaborative work. The outcome of our collaboration has been
formulated as a journal article. I also thank all subjects who participated in the user
studies, without whom this study would not have been possible. I express much
appreciation to my thesis supervisors and examiners for their guidance.
I would also like to acknowledge all my friends and family who supported me in
various ways to achieve this thesis work. I would like to thank all friends who helped me
proofread this thesis. I am also thankful to my friends in Tübingen and in Germany
for their time and valuable discussions. I am thankful to the friends whom I met in
Tübingen, especially Lala Faisal khan Bangash, Mian Irfan Ghani, Zaigham Mahmood,
Zafar Iqbal, Aftab Ali Shah, Iftikhar Alam Khatak, Faisal Shahzad, Kahsif Jilani, Khaver
Saeed, Muhammad Raza, Umer Zeb, Uwe Schmidt and Yasir Niaz Khan for their nice
company in the evenings and on weekends. I am also thankful to my friends who were
living in other cities of Germany for their encouragement and support. Especially, I am
thankful to Azad Ali Wassan, Shahid Hussain Danwar, Syed Saif-ur-Rehman and Jam
Raja Ghazanfar Ali Sahito.
I am also very thankful to my family, especially my mother Nawab Khatoon Depar,
who loved me so much. She passed away while I was pursuing my PhD studies. I love
you Amaan, miss you. I am thankful to my father Ghullam Hyder Depar, who is the
greatest motivational force for me. He always supported me in every way to achieve a
better education and constantly urged me to work hard. Love you Baba Saein.
Also, I express my love and gratitude to Muhammad Soomar Khan Depar and Subhan
Khatoon Depar for their love and prayers; love you both. Many thanks to all my brothers
and sisters for their love and support. I am particularly thankful to my wife Rabia, who
always supported and encouraged me at every stage of my life and particularly supported
me during the PhD research work through her endurance and love. My special thanks to
my son Shahid Mansoor Hyder, who was born in Tübingen, Germany during my PhD
work and to whom I could not give proper time during the last several months. But I
promise him that this will change from now on.
I am really grateful to Allah Azz Wa Jall for his countless blessings on me and his help,
for whatever I am and whatever I have. Ya Allah Azz Wa Jall! Open the portal of
knowledge and wisdom for me, and have mercy on me and all of us! O the One, who is
the most Honorable and Glorious!
Contents
1 Introduction
1.1 Contributions
1.2 Outline
2 Background
2.1 Hearing: Ear the listening organ
2.2 Binaural Hearing
2.2.1 Sound Localization
2.2.2 Sound Localization Cues
2.2.3 Cone of Confusion
2.3 Binaural Technology
2.3.1 3D sound
2.3.2 Ambisonics
2.3.3 Wave-Field Synthesis
2.4 3D Audio recording
2.4.1 Dummy Head
2.4.2 B-format
2.5 3D Audio Reproduction
2.5.1 Head Related Transfer Functions
2.5.1.1 Individualized vs Generic HRTFs
2.5.2 Amplitude Panning
2.6 Acoustics
2.6.1 Virtual Acoustics
2.6.1.1 Image Source Technique
2.6.1.2 Beam Tracing Technique
2.6.2 Reflections Early and Late
2.6.3 Reverberation
2.6.4 Signal-to-Noise Ratio
2.7 Different Head-Tracking Technologies
2.7.1 Virtual Acoustic Environment and 3D Sound Localization
2.7.2 Acoustic-based Trackers
2.7.3 Video-based Tracker
2.7.4 Accelerometer/magnetometer-based tracker
2.7.5 Inertial/magnetometer-based trackers
6 Conversational Tests
6.1 Test design
6.2 Test description
6.2.1 Results
6.3 Summary
8 Conclusions
8.1 Outlook on Future Research
Bibliography
List of Tables
2.1 Basic algorithms of virtual acoustic simulation
C.1 Descriptive statistics for conversational test for mono audio quality
C.2 Descriptive statistics for conversational test for stereo audio quality
C.3 Descriptive statistics for conversational test for spatial audio quality
C.4 Descriptive statistics for conversational test for spatial-HT
Chapter 1
Introduction
The telephone has special significance in our lives and is undoubtedly considered one of
the most important inventions of the modern-day world. The invention of the
telephone [Bell, 1876] has brought a revolution in the way people communicate
personally and professionally. According to [ITU, 2010], there were five billion mobile
cellular subscribers globally, including 940 million subscriptions to 3G services, at the
end of the year 2010. In 2009, fixed-line telephony alone already accounted for
1.19 billion subscribers. Despite the very significant growth in the number of subscribers
to fixed, mobile and VoIP services, the audio quality of calls for telephony and
teleconferencing has not improved.
Humans’ natural listening ability is three dimensional. Human beings perceive sounds
from all distances and directions with spaciousness. Three dimensional listening also
gives humans the ability to locate the origin of auditory events accurately. The
technological requirements for reproducing the human ability of three dimensional
hearing in computational systems, i.e., for generating the same sound at the listener’s
eardrums as a real sound source would have produced, are also known [Kim et al., 2005a].
On the other hand, the literature suggests the advantageous use of 3D audio for
telephony and teleconferencing [Kilgore et al., 2003, Yankelovich et al., 2006, Ahrens
et al., 2010]. 3D audio helps to improve overall audio quality and to overcome problems
such as the “Cocktail Party Effect” reported in [Yankelovich et al., 2004]. However, to
the author’s knowledge, there is as yet no potent product or service in the
telecommunication industry that best utilizes humans’ natural listening ability, which is
inherently three dimensional. Additionally, the virtual acoustic environment, which is
part of most 3D audio simulations, has gained a lot of attention over the years. Studying
virtual acoustic environments is essential because three dimensional audio telephony
and teleconferencing can be further improved by properly selecting virtual acoustic
parameters.
The main focus of this thesis work is to design, develop, test and enhance a three
dimensional telephony and teleconferencing system. With three dimensional telephony,
users should be able to make one-to-one and one-to-many telephone calls with enhanced
audio perception, better understandability of the speech of concurrent talkers and
increased localization performance.
Another aim of this thesis work is to design and study a Virtual Acoustic Environment
(VAE) for a three dimensional telephony and teleconferencing system. 3D telephony
based on a VAE helps the participants of a conference call to spatially separate each
other, to locate concurrent talkers in the virtual acoustic space and to understand speech
with clarity. A VAE also provides the freedom to modify specifications such as the
virtual room size and the conference table size and shape, and to place the call
participants at a specific distance and direction according to their own requirements and
comfort.
Another aim of this thesis work is to present a conceptual and holistic Quality of
Experience (QoE) model comprising all domains of a communication ecosystem, such
as technical aspects, business models, human behavior and contextual aspects, and to
evaluate the relationship between QoE and VAEs through user studies and empirical
analysis. The main contributions of this PhD thesis work are listed in the next section.
1.1 Contributions
The main contributions of this thesis work include:
• The design of a three dimensional telephone system aiming for comfortable and
mobile usage at low costs (Chapter 3).
• The implementation of four head-tracking devices: (1) the Nintendo Wii remote
control, using the VRUI Virtual Reality toolkit by Kreylos [2008]; (2) a simple
keyboard tracker that translates key strokes into translations and rotations;
(3) a tracking simulator to test and display the effects of changes in position and
orientation on sound rendering; and (4) PNI Sensor Corporation’s SpacePoint Fusion
tracker (Chapter 3).
• User studies based on virtual acoustic rooms and on the placement of participants
of a virtual teleconferencing system, analyzing the impact of two different HRTF
sets (with different channels and frequency bands), two room sizes, and different
heights of the listener and talkers on the audio quality, understandability and
locatability of virtual participants (Chapter 4).
• User studies based on virtual acoustic rooms to examine the effects that simulated
acoustic room properties, virtual sitting arrangements, reflections of a conference
table, the number of concurrent talkers and voice characteristics have on the
perception of speech quality, locatability and speech intelligibility in a 3D
teleconferencing system (Chapter 5).
• User studies to obtain subjective scores for conversational tests with three
interlocutors, comparing the audio quality of mono, stereo and spatial sound with
and without head-tracking (Chapter 6).
1.2 Outline
This thesis is organized as follows. Chapter 2 serves as a background study. Design
and implementation of three dimensional telephony has been discussed in Chapter 3.
In Chapter 4, user studies based in virtual acoustic rooms to evaluate the placement of
participants have been presented. In Chapter 5 user studies based on different virtual
acoustic rooms have been discussed. Chapter 7 discusses quality of experience modeling
in communication ecosystem. Also, a case study of three dimensional telephony has
been presented. This study is concluded in Chapter 8.
Chapter 2
Background
This chapter provides a background for various portions of this thesis. In Section 2.1,
a brief overview of the anatomy and the physiology of the human ear is presented.
In Section 2.2, binaural hearing, sound localization, localization cues and the
phenomenon of the cone of confusion are discussed. In Section 2.3, an overview of 3D
sound technologies such as binaural technology, ambisonics and wave field synthesis
is presented. In Section 2.4, a brief overview of 3D audio recording technologies is
given. In Section 2.5, a short introduction to 3D audio reproduction methods such as
Head Related Transfer Functions (HRTFs) and Amplitude Panning (AP) is presented.
Section 2.6 gives a short introduction to acoustics; additionally, virtual acoustics,
virtual acoustic techniques, reflections in virtual acoustic rooms, reverberation and 3D
sound localization in virtual acoustic environments are briefly discussed. In Section 2.7,
a short review of head-tracking technologies is presented. In Section 2.8, speech,
conversational and audio quality assessment methods are briefly discussed, and
subjective and objective testing procedures are also considered. Section 2.9 summarizes
this background.
Comprehensive reviews of the areas mentioned above are available in the literature and
are referred to where appropriate in the text.
2.1 Hearing: Ear the listening organ
The ear is a series of interlinked structures which provides humans with the sense of
hearing. The ear collects sound waves as pressure changes in the air and sends them to
the brain. A depiction of a human ear is provided in Fig. 2.1 (adapted from [Karjalainen,
2011]). The outer part of the ear consists of the pinna and concha, which lead to the
eardrum via the ear canal. The outer ear acoustically filters incoming sound waves,
which vibrate the eardrum.
The middle ear is an air-filled cavity that contains the ossicular chain, consisting of the
small bones called the malleus, incus and stapes, and is bounded by the eardrum. As
reported, “The prime function of the middle ear is to transmit the vibrations of sound in
air gathered at the tympanic membrane to the fluid of the inner ear at the oval
window” [Irwin, 2006].
The inner ear contains the cochlea and the semicircular canals. “The inner ear is an
intricately shaped membranous tube suspended within a bony tube – the
labyrinth” [Irwin, 2006].
The cochlea is a coiled tube, located in the temporal bone of the skull, which is divided
along its length by membranes into three fluid-filled compartments. The vibration of
the eardrum causes pressure waves to travel through the fluid of the cochlea, setting up
traveling waves in the lower basilar membrane, which is approximately 35 mm long.
The organ of Corti sits on the basilar membrane and includes several rows of hair cells,
which are in contact with the tectorial membrane. When the basilar membrane vibrates,
there is a difference in motion between the basilar and tectorial membranes, which
causes the hairs of the hair cells to bend. Bending of the hairs causes the hair cells to
send impulses to the auditory nerve. Put simply, these impulses are understood as sound
by the brain.
For further details on the anatomy and physiology of human hearing readers are
referred to [Pickles, 1988, Irwin, 2006].
2.2 Binaural Hearing
Binaural hearing is defined as the process required to use two ears to perceive the location
of sound sources [Wightman and Kistler, 1997]. The Duplex theory presented by [Strutt,
1907] was the first extensive analysis of the physics of the binaural perception of audio
and this theory is still considered as valid. As [Strutt, 1907] noted, two physical cues
dominate the perceived location of an incoming sound source (Fig. 2.2 and 2.3), sound
arrives slightly earlier in time at the ear which is physically closer to the source and
with somewhat greater intensity. This produces a Interaural Time Difference (ITD)
because sound takes longer time to reach at the ear which is farther from the source. An
Interaural Intensity Difference (IID) is also produced because of the shadowing effect
of the head which prevents some of the incoming energy to reach the ear which is
farther from the source [Cheng and Wakefield, 1999]. Binaural hearing also enables us
to selectively attend to an individual conversation when there are many people having
conversations at the same time which is termed as “Cocktail Party Effect” [Cherry,
1953, Crispien and Ehrenberg, 1995, Brungart et al., 2007]. Conclusively, binaural
hearing underpins our ability both to localize sound sources and to attend selectively to
one talker among many.
2.2.1 Sound Localization
Sound localization is a complex human process. To determine the position of a sound
source, humans normally take advantage of binaural hearing. According to [Blauert,
1997],
“Localization is the law or rule by which the location of an auditory event (e.g., its
direction or distance) is related to a specific attribute or attributes of a sound event, or
of another event that is in some way correlated with the auditory event”.
The position of a sound source is cued by differences in the arrival time of sound and
by differences in sound intensity at the two ears. The following sections describe these
localization cues in detail.
2.2.2 Sound Localization Cues
Normally, the human auditory system utilizes different acoustical cues to achieve static
sound localization. According to [Strutt, 1907, Begault, 1994, Blauert, 1997, Tsakostas
et al., 2007], and as explained in [Hirahara et al., 2011], the first acoustical cue is the
ITD, defined as the difference in arrival times of a sound’s wavefront at the left and
right ears, normally derived from lower frequency components below 1.5 kHz. The
second acoustical cue is the Interaural Intensity Difference (IID), defined as the
amplitude difference generated between the right and left ears by a sound in the free
field, normally derived from higher frequency components above 1.5 kHz. An
illustration of ITD and IID is presented in Figs. 2.2 and 2.3.
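As a quantitative illustration, the standard spherical-head approximation often attributed to Woodworth (a textbook result, not derived in this thesis) estimates the ITD for a source at azimuth \theta as

    \mathrm{ITD}(\theta) \approx \frac{r}{c}\,(\theta + \sin\theta)

where r is the head radius and c is the speed of sound. With r \approx 8.75 cm and c \approx 343 m/s, the maximum ITD is roughly 0.66 ms for a source at \theta = 90°.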
According to [Vorländer, 2008], “At the two ears, the sound signals arrive with
differences in time and amplitude. Sound from a source located at the side of the head
travels a longer time to the contralateral ear1 and suffers frequency-dependent damping
due to diffraction and absorption. Both effects are noticeable as differences between the
ear signals, as interaural time differences and interaural level differences”.
The third acoustical cue is the spectral cue, which is defined as the spectral notches
appearing in the mid-frequency range of around 4 to 14 kHz of the amplitude spectrum
of HRTFs. The three cues discussed above are known as static cues [Blauert, 1997].
1 The contralateral ear is defined as the ear in the shadow zone of the head.
When either the sound source or the listener moves, all these static localization cues
change. Such a change in a static localization cue may be called a dynamic cue. With
dynamic cues, not only moving but also static sounds can be localized, since head
movements tend to turn a physically static sound into a perceptually dynamic
sound [Hirahara et al., 2011].
For further reading readers are referred to [Begault, 1994, Blauert, 1997, Hirahara
et al., 2011].
2.3 Binaural Technology
2.3.1 3D sound
Natural human hearing encounters different sounds every day, from different directions
and distances. Natural hearing is defined as how we hear sounds spatially in everyday
life: with uncovered ears, with our head moving, and in interaction with other sensory
input. Human hearing is inherently three dimensional: we experience not only the
horizontal and vertical direction of a sound but also its distance [Begault, 1994, Ericson
and McKinley, 1997], and this is where we encounter the term 3D sound (Fig. 2.5).
In the literature, equivalent designations for 3D sound include virtual acoustics, binaural
sound and spatial audio. Fundamentally, all of these designations refer to techniques
where the outer ears (the pinnae) are either directly implemented or modeled as digital
filters [Begault, 1994].
3D sound is defined as the simulation of a 3D sound field for a real environment using
various techniques: a “3D sound system uses processes that either complement
or replace spatial attributes that existed originally in association with a given sound
source” [Begault, 1994]. According to [Begault, 1994], 3D sound refers to sound which
lets a listener discern significant spatial cues of a sound source, such as direction,
distance and spaciousness. Generating 3D sound therefore means that one can place a
sound anywhere in three dimensional space: left or right, up or down, near or
far [Begault, 1994, Kim et al., 2004, Low and Babarit, 1998, Lee et al., 1998].
2.3.2 Ambisonics
2.3.3 Wave-Field Synthesis
The Wave-Field Synthesis (WFS) concept was introduced by [Berkhout, 1988]. With
this technology, a sound field with natural temporal and spatial properties can be
generated within a volume or area bounded by arrays of loudspeakers [De Vries and
Boone, 1999].
For further details, readers are referred to [Berkhout, 1988, Berkhout et al., 1993,
Boone et al., 1995, De Vries and Boone, 1999, Brandenburg et al., 2004].
2.4 3D Audio recording
2.4.2 B-format
The B-format is a four-channel recording standard that uses a sound field microphone.
B-format consists of four channels: W, X, Y and Z [Gerzon, 1973]. The W channel
represents the acoustic pressure at a point in space, while the other channels represent
the components of the pressure gradient in the left-right (X), front-back (Y) and
up-down (Z) directions. The X, Y and Z signals stem from figure-of-eight microphones,
whereas the W channel is fed from an omnidirectional microphone [Vorländer, 2008,
Barrett and Berge, 2010]. The directional patterns of the four microphones of a B-format
microphone are presented in Fig. 2.9 (adapted from [Vilkamo, 2008]).
The B-format is a technique of Ambisonics which was developed based on the work
of [Gerzon, 1973, 1974]. Any source material such as synthesized sound or mono
recordings can be positioned or moved within a B-format sound field. For further
reading, please refer to [Malham and Myatt, 1995, Vilkamo, 2008].
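For synthetic source material, the standard first-order B-format encoding equations (a textbook formulation, not specific to this thesis) place a mono signal s(t) at azimuth \theta and elevation \phi as follows:

    W(t) = \tfrac{1}{\sqrt{2}}\, s(t), \quad
    X(t) = s(t)\cos\theta\cos\phi, \quad
    Y(t) = s(t)\sin\theta\cos\phi, \quad
    Z(t) = s(t)\sin\phi

The factor 1/\sqrt{2} on the W channel is the conventional normalization that balances the omnidirectional channel against the three gradient channels.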
2.5 3D Audio Reproduction
The term 3D audio reproduction is defined as recreating the acoustic signals at the ears
of the listener in such a way that these signals are equal to a recorded or synthesized
audio scene [Blauert, 1997]. Headphones are typically used for 3D audio reproduction
with HRTFs. However, the amplitude panning method has also been described in the
following sections to cover 3D audio reproduction using loudspeakers based solution.
3D audio reproduction with HRTFs is of particular interest in the context of this thesis,
since HRTFs were utilized to reproduce 3D audio signals on headphones throughout
this research work.
2.5.1 Head Related Transfer Functions
The filtering effect of the head, torso and pinna on sound traveling from a source to the
eardrum, expressed in the frequency domain, is called the Head Related Transfer
Function (HRTF) [Blauert, 1997]. Digital sounds can be processed with HRTFs to
produce spatial audio signals that help the listener to believe that the sound emanates
from the corresponding virtual source location [Park et al., 2005].
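As a minimal sketch of this processing step, the following Python fragment convolves a mono signal with a left/right pair of head-related impulse responses (the time-domain counterparts of HRTFs). The file names are hypothetical; any measured HRIR set stored as WAV files could be substituted.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import fftconvolve

    # Hypothetical inputs: a mono source and an HRIR pair for one direction.
    fs, mono = wavfile.read("speech_mono.wav")
    _, hrir_left = wavfile.read("hrir_az30_left.wav")
    _, hrir_right = wavfile.read("hrir_az30_right.wav")

    mono = mono.astype(np.float64)
    # Filtering with the left/right impulse responses reproduces the
    # direction-dependent cues at each ear.
    left = fftconvolve(mono, hrir_left.astype(np.float64))
    right = fftconvolve(mono, hrir_right.astype(np.float64))

    binaural = np.stack([left, right], axis=1)
    binaural /= np.max(np.abs(binaural))  # normalize to avoid clipping
    wavfile.write("binaural_out.wav", fs, binaural.astype(np.float32))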
2.5.1.1 Individualized vs Generic HRTFs
HRTFs can be either individualized or generic, since there are anthropometric shape
and size differences among subjects [Wenzel et al., 1993]. Due to these differences, the
literature suggests measuring individual HRTFs experimentally for each
subject [Begault et al., 2001]. Individualized HRTFs are based on measurements that
take each person’s unique physical head properties into account. Measuring HRTFs
requires special equipment, facilities and expertise, which are very difficult to
obtain [Hu et al., 2008].
Furthermore, the literature [Genuit, 1984, Shaw, 1997] suggests that exact HRTFs are
complicated to obtain; rather, their general behavior can be estimated from fairly simple
geometric models of the torso, head and pinna. For understanding and estimating
HRTFs, as suggested by [Genuit, 1984] and reported by [Algazi et al., 2001], a set of 27
anthropometric measurements is used: 17 for the head and torso (Fig. 2.11, adapted
from [Algazi et al., 2001]) and 10 for the pinna (Fig. 2.10, adapted from [Algazi et al.,
2001]). Further, like fingerprints, human pinnae are not identical and vary widely in
shape and size; consequently, HRTFs also vary, which makes it difficult to generalize
the spectral characteristics across large numbers of individuals [Rumsey, 2001].
On the other hand, generic HRTFs are a mathematical combination of multiple
individualized HRTFs. For speech content, it does not matter whether individualized
or generic HRTFs are used [Begault et al., 2001]. Also, if an HRTF is omitted, the
externalization is weak [Begault et al., 2001]; the sound appears to come from
“inside-the-head”.
Non-individualized HRTFs have been cited in the literature [Fisher and Freedman, 1968,
Weinrich, 1982, Bronkhorst, 1995, Møller et al., 1996] as degrading localization
accuracy, decreasing externalization and increasing reversal errors. However, the
mentioned reports are based on full-spectrum noise stimuli. [Bronkhorst, 1995] found
no significant effect of using individualized HRTFs on reversals, whereas the results
reported by [Wenzel et al., 1993] indicated that individualized HRTFs mitigated reversal
confusions. According to [Møller et al., 1996], non-individualized HRTFs resulted in
increased reversals; however, they reported no effect on externalization in their
experimental results, since their experiments were based on speech stimuli.
2.5.2 Amplitude Panning
With amplitude panning, moving or stationary sounds can be positioned in any direction
in the sound field spanned by the loudspeakers (Fig. 2.12, adapted from [Pulkki, 2001b]).
Vector Base Amplitude Panning (VBAP) was introduced by [Pulkki et al., 1996] and is
utilized to position virtual sources in arbitrary 2-D or 3-D loudspeaker setups, where the
same sound signal is applied to a number of loudspeakers with appropriate non-zero
amplitudes (Fig. 2.13, adapted from [Pulkki, 2001b]). VBAP can be generalized to 3-D
loudspeaker setups as a triplet-wise panning method [Pulkki, 1997]. A sound signal is
then applied to one, two, or three loudspeakers simultaneously. VBAP has certain
advantages compared to earlier virtual source positioning methods for arbitrary layouts.
Previous methods either used all loudspeakers to produce virtual sources, which results
in some artifacts, or they used loudspeaker triplets with a non-generalizable 2-D user
interface [Pulkki and Karjalainen, 2001].
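The gain computation at the heart of VBAP is compact. The following sketch is a two-dimensional, pairwise illustration of the vector base formulation, written for this text rather than taken from the thesis implementation: it solves for the two loudspeaker gains that reconstruct the source direction and normalizes them for constant loudness.

    import numpy as np

    def vbap_pair_gains(source_az_deg, spk1_az_deg, spk2_az_deg):
        """Pairwise 2-D VBAP: express the source direction p as a linear
        combination p = g1*l1 + g2*l2 of the loudspeaker unit vectors."""
        def unit(az_deg):
            a = np.radians(az_deg)
            return np.array([np.cos(a), np.sin(a)])
        L = np.column_stack([unit(spk1_az_deg), unit(spk2_az_deg)])
        g = np.linalg.solve(L, unit(source_az_deg))  # invert the vector base
        return g / np.linalg.norm(g)  # normalize for constant loudness

    # Example: a source at 20 degrees between loudspeakers at 0 and 45 degrees.
    print(vbap_pair_gains(20.0, 0.0, 45.0))

For a source inside the loudspeaker pair, both gains come out non-negative; the triplet-wise 3-D case replaces the 2x2 base with a 3x3 matrix of loudspeaker unit vectors.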
2.6 Acoustics
Acoustics is the science concerned with the production, control, transmission, reception
and effects of sound. The term acoustics is derived from the Greek akoustos, meaning
“hearing” [Britannica, 2011]. As described in [Vorländer, 2008], the study of the
generation, transmission, reception, cognition and evaluation of sound waves may be
called acoustics.
The scope of acoustics is twofold. First, it involves sound, which might be generated by
mechanical radiation from natural causes or by human activity. Second, the generated
sound has a psychological influence on the sensation of human hearing, in areas with a
particularly strong association with human listening, such as speech, music, sound
recording and reproduction, telephony and audiology [Pierce, 1989]. Within the scope
of this thesis, acoustics relating to
speech, recording and reproduction, and telephony is of interest. However, we
specifically present it from the perspective of virtual acoustics, which is discussed in
the following subsections.
For the acoustic quality of small to medium-sized rooms, the German standard [Beuth
Verlag, 2004] is a good guideline to follow. This standard contains a detailed description
of acoustic quality requirements for small to medium-sized rooms of up to 5000 m³ in
volume. It also specifies design guidelines for maintaining good acoustic quality for
spoken communication in such rooms, considering three main components: speaker,
transmission and hearing/understanding. For further reading, please refer to [Beuth
Verlag, 2004].
2.6.1 Virtual Acoustics
The term virtual acoustics is often used as a subset of Virtual Reality (VR)
techniques [Burdea and Coiffet, 2003] or as an integration of acoustics into
VR [Vorländer, 2008]; normally it includes simulation of the source, the acoustic space
and the receiver. VR is an environment generated in the computer which the user can
operate and interact with in real time [Vorländer, 2008]. Other definitions include:
digitally processing sounds so that they appear to come from particular locations in three
dimensional space, with the goal of simulating the complex acoustic field experienced
by the listener within a natural environment. This concept is also known as auralization
or three dimensional sound [McGraw-Hill, Dictionary., 2011]. It is also worth
mentioning that two keywords, auralization and rendering, are frequently used in the
field of virtual acoustics. Auralization, in its broad sense, can be defined as the
processing of acoustic effects, primary sound signals or means of sound reinforcement
or sound transmission into an audible result [Vorländer, 2008]. Rendering can be defined
as the process of generating the cues for the respective senses (3D image, 3D audio,
etc.) [Vorländer, 2008].
The first virtual acoustic software was developed and used as early as 1968
by [Krokstad et al., 1968]. Virtual acoustic simulation is normally done with techniques
such as the image source method [Allen and Berkley, 1979, Borish, 1984], ray
tracing [Krokstad et al., 1968] or beam tracing [Funkhouser et al., 1998]. These
techniques are described further in the following subsections. Additionally, basic
algorithms of virtual acoustic simulation are presented in Table 2.1 (adapted
from [Vorländer, 2008]).
Further, relating to virtual acoustics, the literature suggests the importance of early
reflections, which positively enhance the direct sound [Wallach et al., 1949, Haas,
1951]. Additionally, it has been reported [Begault et al., 2001] that reverberation is very
important for improving the subjective realism and externalization achieved in virtual
spatial auditory displays. However, it is reported in the literature [Mershon et al., 1989,
Zahorik, 2002, Shinn-Cunningham, 2001, Begault et al., 2001] that reverberation in
ordinary closed environments is considered a reliable cue for identifying source
distance, but that it also modestly degrades directional perception [Santarelli, 2001] and
speech intelligibility [Houtgast, 1980, Payton et al., 1994]. Reverberation can cause
modest degradation of speech perception in multi-talker situations, where one needs to
concentrate on a talker of choice while ignoring other concurrent talkers [Houtgast,
1980, Shinn-Cunningham et al., 2001].
2.6.1.1 Image Source Technique
The image source technique computes specular reflection paths by considering virtual
sources generated by mirroring the location of the audio source over each polygonal
surface of the environment [Allen and Berkley, 1979, Borish, 1984] (Fig. 2.15, adapted
from [Funkhouser et al., 1998]). A study presented in [Funkhouser et al., 1998] notes
the robustness of the image source method: it guarantees that all specular paths up to a
given order or reverberation time will be found, at the cost of modeling only specular
reflections and an exponential growth of computational complexity.
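A minimal sketch of the mirroring step for first-order reflections in a rectangular (shoebox) room is given below; the uniform absorption coefficient and the 1/r distance attenuation are simplifying assumptions for illustration, not the model used by the renderers discussed later.

    import numpy as np

    SPEED_OF_SOUND = 343.0  # m/s

    def first_order_images(src, room):
        """Image sources for a shoebox room with walls at x=0, x=Lx,
        y=0, y=Ly, z=0, z=Lz; src is the source position, room = (Lx, Ly, Lz)."""
        src = np.asarray(src, dtype=float)
        images = []
        for axis in range(3):
            for wall in (0.0, float(room[axis])):
                img = src.copy()
                img[axis] = 2.0 * wall - src[axis]  # mirror across the wall plane
                images.append(img)
        return images

    def delay_and_gain(image, listener, absorption=0.3):
        d = np.linalg.norm(np.asarray(image) - np.asarray(listener))
        return d / SPEED_OF_SOUND, (1.0 - absorption) / d  # seconds, linear gain

    # Example: a 5 x 4 x 3 m room with source and listener inside it.
    for img in first_order_images([1.0, 2.0, 1.5], (5.0, 4.0, 3.0)):
        print(img, delay_and_gain(img, [4.0, 1.0, 1.2]))

Higher-order reflections are obtained by mirroring the image sources again, which is where the exponential growth in complexity mentioned above originates.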
2.6.1.2 Beam Tracing Technique
The beam tracing technique classifies reflection paths from a source by recursively
tracing pyramidal beams (sets of rays) through the environment [Heckbert and
Hanrahan, 1984].
A beam can be used to represent potential reflections, transmissions and edge
diffractions (Fig. 2.16, adapted from [Kajastila et al., 2007]). The beams are collected
into a forest of tree structures, one tree for each sound source [Kajastila et al., 2007]. A
detailed description of beam tracing is given in [Funkhouser et al., 1998].
2.6.2 Reflections Early and Late
The importance of early reflections can be traced to the work of [Haas, 1951]
and [Wallach et al., 1949], who showed how early reflections are integrated with the
direct sound, making it seem effectively louder even though there are clear time gaps
between them [Kajastila et al., 2007]. Reflections that reach the listener within 100 ms
after the direct sound are referred to as early reflections and support speech
intelligibility. Benefits of early reflections for speech intelligibility have also been
reported by [Bradley et al., 2003]: early reflections arriving within about 50 ms after the
direct sound have the effect of usefully increasing the level of the direct sound, or the
signal-to-noise ratio (S/N), by 7 dB or more.
After the early reflections, denser reflections arrive at the listener from all directions,
so close in time that individual reflections cannot be separated by human listeners; these
are termed late reverberation [Kajastila et al., 2007].
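This split between early and late energy is commonly quantified by the clarity index C50, a standard room-acoustic metric added here for illustration rather than taken from the text above; it compares the energy arriving within the first 50 ms to the energy arriving later:

    C_{50} = 10 \log_{10} \frac{\int_{0}^{50\,\mathrm{ms}} p^{2}(t)\,dt}
                              {\int_{50\,\mathrm{ms}}^{\infty} p^{2}(t)\,dt} \;\mathrm{dB}

Higher C50 values indicate that useful early energy dominates, which is consistent with the intelligibility benefits of early reflections reported above.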
2.6.3 Reverberation
When the original sound in an enclosed space is switched off, the sound normally does
not stop immediately; it persists in the environment for some time and slowly decays
until it is absorbed by the air and walls [Valente et al., 2008]. This persistence of sound
in an enclosed space/room, called reverberation, is of great importance, particularly
when assessing rooms for speech or music performance. The persistence of sound in a
closed space is described by a specific decay time, the time required for a level decrease
of 60 dB, denoted as the reverberation time T [Vorländer, 2008]. Furthermore, according
to [Begault, 1994], the presence of reverberation improves the externalization of virtual
acoustic images; on the other hand, it
can decrease localization accuracy under real and simulated conditions [Hodgson and
Nosal, 2002, Yang and Hodgson, 2006].
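For orientation, the classical Sabine formula (a textbook estimate, not the model of the rendering engine described below) relates the reverberation time to the room volume V (in m³) and the total absorption of the surfaces S_i (in m²) with absorption coefficients \alpha_i:

    RT_{60} \approx 0.161\,\frac{V}{\sum_i S_i \alpha_i}\;\mathrm{s}

For example, a 5 x 4 x 3 m room (V = 60 m³, 94 m² of surfaces) with an average \alpha of 0.3 would give RT60 \approx 0.161 \cdot 60 / 28.2 \approx 0.34 s.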
The reverberation algorithm used by the rendering engine [Uni-Verse, 2007], which we
adopted during the course of this thesis work, was first introduced by [Vaananen et al.,
1997] and has been modified to its current form by [Kajastila et al., 2007]. The
reverberation time implemented in this rendering solution is approximately the same for
all rooms of different volumes and sizes (Fig. 2.17, courtesy of [Kajastila et al., 2007]):
the algorithm cuts off higher/maximum values, if there are any, and only allows
reverberation within a minimum and maximum level, resulting in nearly the same
reverberation time RT60 for all rooms of different sizes. This is verified by comparing
calculated and measured values (Chapter 5 and Table 5.8).
2.6.4 Signal-to-Noise Ratio
The signal-to-noise ratio is a measure used in science to compare the level of a desired
signal to the level of background noise [Hawkins and Yacullo, 1984, Bradley et al., 1999,
wikipedia, 2011]. The signal-to-noise ratio, denoted SNR or S/N, is the ratio of signal
power to noise power. According to [Bradley et al., 1999, Yang and Bradley, 2009],
reflected sound is, along with the S/N ratio, one of the important factors influencing
speech intelligibility in closed environments. It has also been argued that increasing
reflected sound increases both the speech and the noise level. Within the scope of this
thesis, both speech and noise translate into the speech levels of concurrent talkers, and
therefore there is no change in S/N. According to [Hodgson and Nosal, 2002], the
critical factor is the
relative distance of the speech and noise sources (concurrent talkers) from the listener.
In the case where the noise source is closer to the listener than the target speech/talker,
early reflections would increase S/N values and would be expected to improve speech
intelligibility. [Good and Gilkey, 1996] studied the effect of noise on localization and
found that localization accuracy decreases with decreasing SNR. They also found that
azimuthal judgments (left and right) were less influenced than up-down or front-back
judgments.
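For reference, the decibel form of this ratio of signal power P_signal to noise power P_noise is

    \mathrm{SNR} = 10\,\log_{10}\frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}}\;\mathrm{dB}

so equal speech and masker levels, as in the concurrent-talker scenario discussed above, correspond to an SNR of 0 dB.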
2.7 Different Head-Tracking Technologies
Head-tracking, i.e., locating the position and/or orientation of a user, has gained
immense importance in the virtual and augmented reality fields; an increasing trend of
using head-tracking technologies has also been observed in recent years in games [Wang
et al., 2006, Yim et al., 2008] and in robotics [Stiefelhagen et al., 2004]. Head tracking
decreases front-back reversals and in-the-head localization errors [Begault et al., 2001,
Noble, 1987], which are very common during headphone playback and influence
localization [Blauert, 1997, Harma et al., 2004]. Considerable work has been done in
the area of tracking location and orientation based on different technologies, commonly
categorized as magnetic [Zhu and Zhou, 2004, Roetenberg et al., 2007, Auer and Pinz,
1999] (despite great successes, magnetic trackers have inherent weaknesses, e.g.,
latency and jitter [Lenz et al., 1990]), optical [Auer and Pinz, 1999, Chow, 2009], 3D
cylinder head model [Ryu and Kim, 2007], accelerometer [Keir et al., 2007],
gyroscope [Luinge, 2002, Roll et al., 2008], acoustic [Tikander et al., 2003, Karjalainen
et al., 2004] and video based [de Ipin A. et al., 2002, Kourogi et al., 2001] tracking.
It has been observed that it is still very difficult for a single technology to solve all
problems related to positioning and orientation; hence many researchers have taken
advantage of multiple sensors (sensor fusion) to estimate a user’s location and
orientation [Azuma et al., 1999b,a, Hallaway et al., 2004, Tenmoku et al., 2003,
Zeimpekis et al., 2002].
In the following sections, some of the work relating to the above-mentioned
head-tracking technologies is discussed.
2.7.2 Acoustic-based Trackers
Head-tracking based on sound intensity calculations for the orientation of the user was
implemented by [Laitinen, 2008]. Laitinen achieved head-tracking with the help of six
omnidirectional microphones and two fixed sound sources of different frequencies in a
horizontal plane to calculate azimuth, elevation and tilt. The sound sources were utilized
as anchor points, since their positions were known in advance. In this work, the
head-tracking calculations were done in a Cartesian coordinate system. Laitinen found
a directional accuracy in the region of 3-10 degrees, with fluctuations of a few degrees,
which is far from perfect.
The algorithm is able to track orientation when it is combined with a low-pass filter for
the accelerometer data. The algorithm has also been successfully used by its authors in
real-time human-body-motion applications.
2.7.5 Inertial/magnetometer-based trackers
A solution for estimating position and orientation based on an extended Kalman filter
for the fusion of magnetic and inertial sensors was presented by [Schepers et al., 2010].
Normally, changes in position and orientation can be obtained by integrating the
acceleration and angular velocity signals from inertial sensors. In this study, inertial
sensing is fused with magnetic measurements, where the magnetic update is activated
only when the uncertainty in the position or orientation exceeds a predefined
threshold (Fig. 2.21, adapted from [Schepers et al., 2010]).
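A complementary filter is a much simpler relative of such fusion schemes and illustrates the same idea: the gyroscope is accurate over short intervals but drifts, while the magnetometer is drift-free but noisy and slow. The sketch below is a deliberately simplified single-axis illustration, not the extended Kalman filter of [Schepers et al., 2010].

    def fuse_yaw(yaw, gyro_rate, mag_yaw, dt, alpha=0.98):
        """One complementary-filter step: integrate the gyroscope rate for
        short-term responsiveness, then pull the estimate toward the
        magnetometer heading to cancel the slow gyroscope drift."""
        integrated = yaw + gyro_rate * dt  # fast but drifting estimate
        return alpha * integrated + (1.0 - alpha) * mag_yaw

    # Example: 100 Hz updates with a small constant gyroscope bias; the
    # magnetometer heading (assumed noise-free here) holds the estimate near 0.
    yaw = 0.0
    for _ in range(1000):
        yaw = fuse_yaw(yaw, gyro_rate=0.01, mag_yaw=0.0, dt=0.01)
    print(yaw)  # settles at a small bias-induced offset instead of growing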
2.8 Speech/Conversation Audio Quality
In one study, stereophonic audio conferences were evaluated through subjective tests
conducted according to recommendation [ITU-T, 2001a]. Seven audio clips were used
in subjective intelligibility tests with 8 different placements ranging from 1 to 8
participants. The placements included front, rear and mixed hemisphere speaker
positioning, where speakers were counting numbers from 1 to 9 and subjects had to
state the total number of speakers, whether one speaker talked more than once, and the
location of the speaker belonging to a certain number. Additionally, subjective tests on
perception were conducted using pre-recorded audio scripts with conference-call-like
conversations, including monophonic, panned stereophonic, flat stereophonic and
spatial audio recordings. Subjects’ opinion scores were recorded using the MUSHRA
mean opinion score after evaluating each clip. The presented results show that the spatial
mixed hemisphere setup produced the most pleasing listening experience of a
multi-person conversation. [Disch et al., 2004] discussed issues of test methodologies
for multi-channel sound quality assessment and presented test results for stereo-based
and mono-based representations. Additionally, they described the potential behind
spatial audio coding. The listening test method chosen was [ITU-T, 2001a]. In their
studies regarding the localization of amplitude-panned virtual audio sources, [Pulkki
and Karjalainen, 2001, Pulkki, 2001a,c] used both subjective and objective methods for
the evaluation of spatial sound. In the subjective tests, subjects were asked to adjust the
perceived direction of an amplitude-panned virtual source to best match the perceived
direction of a virtual source.
For further details, readers are referred to [Möller, 2000, Best et al., 2006, Raake et al.,
2007, Guéguin et al., 2008, Ahrens et al., 2010, Raake, 2011]. The methods to measure
speech and/or audio quality, subjectively and/or objectively, are described in detail in
the following subsections.
2.8.2 Objective Measures
Objective measures estimate the speech quality of a communication system without any
need for human listeners. These objective measures are based on mathematical models
and are used to supplement subjective test results. Objective measures are classified
into two classes: intrusive and non-intrusive.
2.8.2.1 Intrusive Measures
Intrusive measures are also called input-to-output measures because they base their
measurement on the computation of the distortion between the original speech signal
and the degraded or distorted speech signal. Depending on the domain transformation
used, intrusive objective measures are classified into time, spectral and perceptual
domains [Quackenbush et al., 1988, Itakura, 1975, Itakura and Saito, Kitawaki et al.,
1988, Karjalainen, 1985]. Further examples of intrusive measures are the Bark Spectral
Distortion (BSD) measure developed by [Wang et al., 1992], the Modified and Enhanced
Modified Bark Spectral Distortion (MBSD and EMBSD) measures [Yang et al., 1998],
Perceptual Speech Quality Measurement (PSQM) [Beerends and Stemerdink, 1994] and
PSQM+ [Beerends et al., 1997], Measuring Normalizing Blocks (MNB) [Voran, 1999]
and the Perceptual Analysis Measurement System (PAMS) [Rix and Hollier, 2000].
The International Telecommunication Union Standardization Sector (ITU-T)
recommendation [ITU-T, 2001c] described a standard method for objectively measuring
perceived audio quality in the year 1998; it was last updated in the year 2001. In the
year 1999, KPN Research improved PSQM into PSQM99, which provided more
accurate correlations with subjective test results than classical PSQM and PSQM+.
Meanwhile, ITU-T recognized the significant merits of PSQM99 and PAMS and
combined the merits of both into a new measurement technique for intrusive objective
speech quality assessment called Perceptual Evaluation of Speech Quality (PESQ).
ITU-T approved PESQ under recommendation [ITU-T, 2001b]. PESQ currently
estimates accurately the listening speech quality delivered by wireless, VoIP and fixed
networks and is, in fact, the standard method for automated speech or audio quality
measurement. It can be used in a wide range of measurement applications, such as codec
development and error distortions, equipment selection, equipment optimization and
network monitoring [Rix et al., 2001]. More recently, POLQA (Perceptual Objective
Listening Quality Assessment) has also been selected by the ITU-T to form the new
voice quality testing standard P.863.
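For a sense of how such intrusive measures are applied in practice, the following sketch uses the open-source Python package pesq, one implementation of ITU-T P.862; its availability and interface are assumptions of this example, and the file names are hypothetical.

    from scipy.io import wavfile
    from pesq import pesq

    fs, reference = wavfile.read("reference_16k.wav")  # clean speech, 16 kHz
    _, degraded = wavfile.read("degraded_16k.wav")     # after codec/network

    # 'wb' selects the wideband PESQ mode; the result is a MOS-LQO score.
    score = pesq(fs, reference, degraded, 'wb')
    print(score)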
2.8.2.2 Non-intrusive Measures
Non-intrusive measures, also known as output-based or single-ended measures, use
only the degraded signal and have no access to the original signal; an example is ITU-T
recommendation P.563, “Single-ended method for objective speech quality assessment
in narrow-band telephony applications” [ITU-T, 2004]. P.563 is used for voice quality
measurements in narrow-band telephony applications, such as live network monitoring
and live network end-to-end testing using a digital or analogue connection to the
network, including testing with unknown speech sources at the far-end side. ITU-T
emphasizes that P.563 and PESQ cannot be used to replace subjective testing but can be
applied where listening-only tests would be too expensive or not applicable. It is also
interesting to note that the accuracy of the current P.563 model will always be lower
than that of PESQ [Kim et al., 2006].
2.9 Summary
This chapter has provided the reader with a background on the terms and technologies
used in this thesis. It covered various areas of this thesis work, such as the human
listening organ, binaural hearing, sound localization and its helpful cues, binaural
technology, three dimensional sound, three dimensional audio recording and
reproduction, virtual acoustic environments, a short review of head-tracking
technologies, and speech and audio testing procedures. Readers have also been referred
to comprehensive reviews of the mentioned terms and technologies available in the
literature wherever it was appropriate in the text.
Chapter 3
Design and Implementation of 3D
Telephony
3.1 Introduction
The technical requirements needed to implement headphone-delivered 3D sound are
well known [Begault et al., 2001]. In order to improve sound localization performance,
three factors need to be considered: first, individualized head-related transfer functions
(HRTFs), which describe how the acoustic waves propagate through each listener’s
head [Tan and Gan, 1998, Kim et al., 2004]; second, sound processing (auralization), to
simulate reverberations and reflections within a virtual surrounding individually; and
third, head tracking systems, to follow the movements of the speakers’ and listeners’
heads.
These three factors and their basic concepts have been discussed in detail in Chapter 2.
To summarize the concepts, the following points are presented:
1. HRTFs are used to produce spatial audio signals that help the listener to believe
that the sound emanates from the corresponding virtual source location [Park et al.,
2005] (HRTFs have already been discussed in detail in Section 2.5.1).
2. Auralization is used to simulate the reverberations and reflections of a virtual
surrounding (auralization and virtual acoustics have been discussed in Section 2.6).
3. Head tracking is used to follow the movements of the speakers’ and listeners’
heads (head-tracking technologies have been discussed in Section 2.7).
Based on the three essential requirements described above, a 3D telephone system has
been designed. The following sections discuss the design background and describe the
design.
3.2 Design Background
3.2.1 Classic VoIP Teleconferencing
A classic voice-over-IP teleconferencing system consists of three main components, as
depicted in Figs. 3.1 to 3.3 (based on the concepts presented in [Sinnreich and Johnston,
2001]; courtesy of M. Haun). The first component is the audio input produced by two
or more call participants. These inputs are mixed by the second component, the audio
mixer, and played to the call participants by an audio output component. The
components can be arranged in several ways. While the audio input and output
components usually reside inside a VoIP client or phone on each endpoint of the VoIP
connection, audio stream mixing can occur at different locations within the
network [Sinnreich and Johnston, 2001].
The simplest form of VoIP teleconferencing is to employ a centralized conference
bridge to which each endpoint connects. This bridge can then handle audio mixing in a
centralized way, additionally providing audio transcoding between different endpoints
to satisfy different bandwidth restrictions.
The second possibility places the mixer in one of the conference endpoints. This setup
limits the number of call participants by the available bandwidth and computational
power of the client handling the audio mixing [Sinnreich and Johnston, 2001].
The third possibility employs a network that establishes a full mesh of connections
between all call participants. Each endpoint mixes the incoming audio streams. This
setup minimizes media latency but complicates media synchronization. Finally, a large
teleconference can be realized by using multi-cast conference addresses to enable the
participation of millions of users. Although this provides the most powerful
(Figs. 3.1 to 3.3: classic VoIP teleconferencing arrangements, including a VoIP phone
with an integrated mixer and a PSTN gateway connecting VoIP and PSTN phones.)
To obtain further details about the concepts discussed above, readers are referred
to [Sinnreich and Johnston, 2001].
In the first setup, the complete rendering engine resides on the conference bridge. Due
to the distributed nature of this setup, and caused by the latency between head
movements and changes in the acoustic rendering, the naturalness of the spatial audio
impression is reduced.
by the latency between head movements and changes in the acoustic rendering, the
naturalness of spatial audio impression is reduced.
The second setup corresponds to the meshed setup discussed previously (Fig. 3.5). It
has the complete rendering engines in the users’ endpoints. All incoming audio streams
are sent through the rendering engine before being returned to the playout.
The meshed setup overcomes the problem of the computational burden on a centralized conference server by deploying a separate spatial audio rendering system for each of the call participants. In addition, it allows an individual representation and full control of the virtual environment for each call participant, which can be beneficial if the virtual environment is to be mapped to the participants' real environments. A drawback is that multiple audio and head-tracking streams must be distributed and that the virtual environments must be synchronized to a certain extent, which increases the burden on the network. Also, this setup needs to simulate the virtual acoustic environment multiple times and thus increases the computational demands. However, this is not necessarily a drawback, because the end devices might have enough unused computational resources. Scalability is then achieved because spatial audio teleconferences are no longer limited by a central conference bridge with limited computational resources.
3.3 Design of the 3D telephone system
Based on the three essential requirements described in (Section 3.1) and the background knowledge of classic VoIP teleconferencing and spatial audio teleconferencing requirements (see Section 3.2), we designed a 3D telephone system (Fig. 3.6) aiming for comfortable and mobile usage at low cost while supporting spatial audio. The design extends a VoIP based telephone by low-delay audio codecs, 3D sound renderers, and headphones extended by head-tracking sensors. More precisely, the design of the 3D telephone system consists of:
1. Stereo headsets extended by a head-tracking unit, which follows the movements of ears and mouth. Optionally, sensors can be used to determine the size of the current room and the position of the head in the room. Usually, each participant of the conference call requires one 3D sound capable headset.
2. The microphones of the headsets are coupled to low-delay audio encoders. The audio content is transmitted in mono only, enriched by the sensor data of the 3D sound headset. The sensor data mainly include the relative position and orientation of the mouth.
3. A 3D sound renderer or virtual acoustic server that simulates the virtual acoustic environment and renders the incoming mono streams spatially for each participant.
4. As each participant might sit in their own room having different dimensions (the orientations and movements of the participants may also vary), the teleconferencing system must decide where in the virtual room to place the participants. In (Fig. 3.6), this is displayed in the upper middle box. We also conducted listening-only tests (Chapter 4) to determine suitable placements of the participants.
5. The 3D telephones must be connected via a low-latency network because of the participants' requirements on interactivity. Even more stringent are the requirements on updating the filter parameters after head movements: humans can tolerate a delay of up to 70 ms between a movement and the adapted spatial impression before the 3D sound experience becomes unrealistic [Brungart et al., 2006].
(Figure 3.6: Design of the 3D telephone system. Callees' rooms with headphones, head-tracking units and microphones are connected, via transport protocols over the Internet, to a virtual room that simulates acoustic wave propagation; filter parameters and movement data are exchanged alongside the audio packets.)
3.4 Design Description
achieved by a separate and direct connection between each user and the virtual acoustic
server.
3.5 Implementation
We implemented the system based on the open-source VoIP soft-phone Ekiga [Ekiga, 2010] (details are provided in the next section), which we enhanced by a plug-in to control the virtual environment. As the virtual acoustic server, or rendering engine, we utilized the Uni-Verse acoustic simulation framework [Uni-Verse, 2007] (details are provided in the following subsections). Custom-built software was employed as conference bridge, conference server and mixer. The current prototype system can be installed on any desktop computer or laptop running an Ubuntu/Debian based operating system. In the following, we describe the details of the implementation.
As a VoIP client we used the open-source soft-phone Ekiga [Ekiga, 2010]. We extended it with the Bluetooth SBC codec [Hoene and Hyder, 2010] to support stereo and full-band audio. To connect the VoIP client to the virtual acoustic server, or renderer, for spatial audio rendering, we enhanced Ekiga by a plug-in architecture and a 3DTel plug-in (Fig. 3.9). The 3DTel (or Ekiga) plug-in consists of five main components: Graphical User Interface (GUI), Database Backend, Virtual Reality Backend, Tracking Unit and Rendering Frontend, as shown in Fig. 3.10 (courtesy of M. Haun).
The graphical user interface allows the user to control different parameters related to the choice of the tracking unit and the virtual reality back-end. The tracking unit interface allows the connection of arbitrary tracking devices to modify the users' position and orientation within the virtual world in real time. The virtual reality back-end interface makes it possible to specify various sources of virtual world representations, such as static Vector Markup Language (VML) files [Mathews, 1998] or dynamic streams from a content creation tool or a game server. This back-end interface provides additional means to analyze the provided environment in order to dynamically generate seating plans according to which all call participants can be placed upon call initialization. Finally, the core of the system is the rendering front-end interface. This interface specifies the connection to the rendering engine, to which all audio, position and orientation data as well as the virtual environment are sent and where spatial audio is rendered individually for each avatar from the audio input of all the other call participants. Afterwards, this interface receives the rendered audio streams and either plays them on the user's headphones or sends them across a network using RTP.
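As a rough illustration of that last step (a sketch only; the actual interface code is not reproduced here), a minimal 12-byte RTP header as defined in RFC 3550 can be prepended to each rendered frame before sending it over UDP:

```python
import socket
import struct

def send_rtp(sock, addr, payload, seq, timestamp, ssrc, payload_type=96):
    """Prepend a minimal RTP header (RFC 3550) and send the rendered
    audio frame as a single UDP datagram."""
    header = struct.pack('!BBHII',
                         0x80,                 # version 2, no padding/extension/CSRC
                         payload_type & 0x7F,  # marker bit cleared
                         seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    sock.sendto(header + payload, addr)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_rtp(sock, ('127.0.0.1', 5004), b'\x00' * 320, seq=1, timestamp=0, ssrc=0x3D7E1)
```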
Figure 3.11: A virtual conference room with a listener and two virtual talkers
The acoustic simulation computes the BSP tree by splitting the three-dimensional object space into half-spaces using two-dimensional hyperplanes [Schröder and Lentz, 2006]. To achieve a well-balanced tree, the hyperplanes are chosen according to a heuristic that tries to minimize the number of surfaces on either of the resulting sides of the hyperplane.
The beam tracing method classifies reflection paths from a source to the listener by a set of pyramidal beams, where each beam represents a frustum consisting of an infinite number of rays. When intersecting polygons of the virtual environment are detected, the beam is clipped to remove the shadow region behind the intersecting polygon, and a reflection is modeled by generating a virtual sound source that mirrors the original source at the intersecting polygon.
To build the beam tree, the BSP graph is traversed in a depth-first manner starting
at the cell containing a source and recursively visiting adjacent cells. As the algorithm
traverses a cell boundary into a new cell, the current beam is clipped to include only
the space passing through the transparent polygon boundary and phantom sources are
created at the solid boundaries of the polygon.
From the resulting beam tree structure, reverberation paths can be derived by simply traversing the tree from the listener to all sound sources and collecting surface absorption coefficients and distance information along the way. (Fig. 3.11) shows a virtual room rendered by UVAS with a listener, two sound sources and reverberation paths.
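To illustrate what "collecting absorption coefficients and distance information" yields (a simplified sketch under the assumption of an image-source model with a broadband 1/r distance law, not UVAS code), the delay and gain of a single reverberation path can be derived as follows:

```python
SPEED_OF_SOUND = 343.0  # m/s at room temperature

def path_delay_and_gain(total_distance_m, absorption_coeffs):
    """Propagation delay and a simple broadband gain for one reflection
    path: 1/r distance attenuation times the energy remaining after
    each surface hit (absorption coefficient alpha per surface)."""
    delay_s = total_distance_m / SPEED_OF_SOUND
    gain = 1.0 / max(total_distance_m, 1.0)
    for alpha in absorption_coeffs:
        gain *= 1.0 - alpha
    return delay_s, gain

# A 12 m path with two reflections off concrete (alpha roughly 0.02):
print(path_delay_and_gain(12.0, [0.02, 0.02]))  # ~0.035 s, gain ~0.080
```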
All data received from the tracking device is converted into relative or absolute movements in a Cartesian coordinate system. Translations are represented by Euclidean vector transformations in the case of absolute values, or by affine transformations in the case of relative values. Given a position p0 = (x0, y0, z0), a translation pt = (xt, yt, zt) and a scaling factor s, the new position p1 = (x1, y1, z1) is obtained by calculating p1 = pt · s in the case of absolute values, or p1 = p0 + (pt · s) in the case of relative values. Rotations are represented by unit quaternions. Given an initial position p0 = (x0, y0, z0) and a rotation or = (xr, yr, zr, wr), the new position p1 = (x1, y1, z1) is obtained by calculating p1 = or p0 or*, where or* denotes the conjugate of or.
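The quaternion rotation above can be written out directly; the following helper is an illustrative sketch (assuming the (x, y, z, w) component order used in the text and a unit quaternion), not the plug-in's actual code:

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of two quaternions given as (x, y, z, w)."""
    ax, ay, az, aw = a
    bx, by, bz, bw = b
    return np.array([aw * bx + ax * bw + ay * bz - az * by,
                     aw * by - ax * bz + ay * bw + az * bx,
                     aw * bz + ax * by - ay * bx + az * bw,
                     aw * bw - ax * bx - ay * by - az * bz])

def rotate(p0, o_r):
    """Rotate position p0 = (x, y, z) by unit quaternion o_r:
    p1 = o_r * p0 * conj(o_r)."""
    p = np.array([p0[0], p0[1], p0[2], 0.0])
    conj = np.array([-o_r[0], -o_r[1], -o_r[2], o_r[3]])
    return quat_mul(quat_mul(o_r, p), conj)[:3]

# A 90 degree rotation about the z-axis maps (1, 0, 0) to (0, 1, 0):
s = np.sqrt(0.5)
print(np.round(rotate((1.0, 0.0, 0.0), np.array([0.0, 0.0, s, s])), 3))
```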
The current implementation does not feature any collision detection mechanisms yet. If a tracked position falls outside the virtual environment, the audio signal is lost until the virtual avatar re-enters the environment. An engine to control and limit movements within the virtual environment is a subject for future enhancements.
3.6 Summary
In this chapter, the design and description of the 3D telephony system were presented. Furthermore, two implementation variants of 3D telephony were discussed. Additionally, the main components, namely the VoIP phone client Ekiga with its 3D telephony plug-in and the virtual acoustic server and renderer with its basic components, were presented from our implementation perspective. Head-tracking support for four devices in the current design and implementation was also discussed.
Chapter 4
Experiments on the Placement of
Teleconference Participants
In order to optimize and enhance a 3D audio supported telephony and teleconferencing system to a level acceptable for users and customers, user experiments were conducted to study, in particular, the virtual placement of teleconference participants. In this user study, the focus was on the sound quality, understandability and locatability of virtual participants. Additionally, the occurrence of front/back or elevation localization errors was studied. Front/back reversals and elevation localization errors¹ are commonly seen in 3D audio systems when non-individualized HRTFs are used [Wenzel et al., 1993] (refer to Chapter 2).
In addition, we investigated the trade-off between sound source direction perception
and distance perception. According to [Shinn-Cunningham, 2000], reverberation
degrades perception of the sound source direction, but enhances distance perception.
Also, in this study, azimuth errors (deviations in the horizontal plane), elevation errors (deviations in the vertical plane) and reversal errors (front-back or back-front "confusions"), which are very common in 3D sound reproduction over headphones [Begault, 1994], were evaluated separately.
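These error categories can be made precise with a little geometry. The sketch below is an illustrative classifier for the horizontal plane (the convention of 0 degrees straight ahead and positive azimuth to the right is an assumption, not taken from the thesis): a front/back reversal is a response closer to the target's front/back mirror image than to the target itself.

```python
def classify_azimuth(target_deg, reported_deg):
    """Return the azimuth error in degrees and whether the response
    counts as a front/back reversal (closer to the mirror image
    180 - azimuth, which swaps front and back but keeps left/right)."""
    def ang_diff(a, b):
        return abs((a - b + 180.0) % 360.0 - 180.0)
    direct = ang_diff(reported_deg, target_deg)
    mirrored = ang_diff(reported_deg, 180.0 - target_deg)
    return direct, mirrored < direct

# Target at 30 deg (front right), response at 150 deg (back right):
print(classify_azimuth(30.0, 150.0))  # (120.0, True) -> a reversal
```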
This chapter is organized as follows. To figure out how to position the participants in a conference call, seven formal listening-only tests were conducted (Section 4.1). After presenting the experimental setup and results, the work of this chapter is summarized. Major portions of the user experiments and results presented in this chapter have already been published [Hyder et al., 2009, 2010b].
4.1 Placement of Participants
In order to provide a better teleconferencing solution, it was very important for us to study the positioning arrangement of the participants within the virtual acoustic environment, so that they could not only locate each other properly in three-dimensional space but also understand the speech without degradation. Further, the speech quality should not be impaired by reduced loudness, reverberations or echoes.
¹ Localization error refers to the deviation of the reported position of a sound stimulus from the measured or synthesized target location.
We selected four sets of simulation parameters and used them to judge the seven different placements of participants in the virtual room (in total 22 combinations). In every setup we changed one Uni-Verse parameter at a time and kept the other parameters the same, in order to isolate the effect of each parameter on the sound quality, understandability and locatability achieved with the different virtual placements. We used two different HRTFs, two different room sizes and different heights of listener and talker, and kept the head size constant. The following parameters can be chosen for the acoustic simulation (Table 4.1).
Room dimensions: In our test experiments we used two rooms: a Big Room with dimensions H × W × L = 20 × 20 × 40 m and a Small Room with dimensions H × W × L = 10 × 10 × 20 m.
HRTF: We used two HRTFs in these tests, HRTF-1 and HRTF-2. HRTF-1 has 5 reverberators for 5 frequency bands and HRTF-2 has 10 reverberators for 10 frequency bands.
Head size: Head size refers to the internal distance between the two ears in meters. We kept the head size at its default value of 0.17 in all setups, because we did not notice any difference when changing its value between 0.1 and 0.3 (head size is a Uni-Verse UVSR parameter scalable from 0.1 to 0.3).
Placement: Seven different placements of the talkers and listeners were studied. We
name these placements Talkers in the Corners, Listener in the Corners, Horizontal
Placement, Frontal Placement-1, Frontal Placement-2, Surround Placement-1 and
Surround Placement-2. They are described further in the following sections.
Height: The placement of listeners and talkers in terms of height in the virtual room is summarized in (Table 4.2). We used the same height parameters, called Height-A, for the Default, HRTF-2 and Small Room setups; for Talker Standing we used Height-B.
4.1.1 Sample Design
The samples were processed by the open-source 3D audio rendering engine Uni-
Verse [Kajastila et al., 2007].
The virtual rooms were based on the sample UVAS file “testscene_no_doors.vml”.
The walls of the rooms had the typical acoustic properties of concrete. Based on the results of the acoustic simulation, a sound renderer auralizes the direct sound and the early reflection paths calculated by the room acoustic simulation module. The acoustic simulator transmits the listener, source and image-source information, including position, orientation, visibility and the URL of the sound source, to the sound renderer.
Then the sound renderer applies a minimum-phase HRTF to the sound source. A detailed explanation of the minimum-phase HRTF used can be found in the paper by [Savioja et al., 1999]. The reverberation algorithm used in the implemented system was introduced by [Vaananen et al., 1997] and modified by [Kajastila et al., 2007]. Because the reverberation time (RT) is frequency dependent, the sound renderer uses 10 individual reverberators for 10 frequency bands with separate RTs.
Further parameters used for the sample design, such as positions of listeners and sound
sources are given in the following test descriptions.
4.1.2 User Experiments
User experiments with 32 normal-hearing subjects (29 male, 3 female) were conducted to assess the sound quality, understandability and locatability of the virtual talkers in the implemented system.
The listening-only tests were conducted following the recommendation [ITU-T, 1996] as far as possible, together with an additional in-house tailored test method covering the 3D audio component of the subjective study; its tasks are described at the end of this section. This in-house method was used alongside recommendation P.800, which describes methods for the subjective determination of telephone transmission quality, because no standard is available yet for testing 3D audio supported transmission quality.
Figure 4.1: Acoustic simulations with one listener and two sound sources. (The white lines show the direct beam between sound source and listener. The yellow lines are due to phantom sound sources, plotted as red points. The green lines are reflections of the real sound sources.)
4.1.3 Test 1: Talker in the Corners and Test 2: Listener in the Corners
In the tests Talker in the Corners and Listener in the Corners, we used the Big Room with dimensions H × W × L = 20 × 20 × 40 m.
In the test Talker in the Corners, the listener was positioned at the center of the room at ground level and the talkers were positioned in all eight corners of the room. The listener was facing the wall defined by the corners 5, 6, 7 and 8. We wanted to study (1) whether subjects could locate the sound sources correctly, (2) whether subjects could identify the orientation of the sound, and (3) how subjects judged the quality of the speech. The layout of the virtual acoustic room can be seen in (Fig. 4.2).
In the test Listener in the Corners, the talker position was fixed at the center of the room at ground level, while the listener was placed in one of the eight corners of the room at a time; the listener's orientation remained facing the wall defined by the corners 5, 6, 7 and 8. The layout of the room can be seen in (Fig. 4.2).
57
Chapter 4 Experiments on the Placement of Teleconference Participants
4.1.3.1 Results
These tests were a preliminary subjective study conducted to make sure that listeners attained proper orientation within the simulated virtual acoustic environment. The results indicated that it was very difficult for subjects to correctly identify the virtual talker positions in these tests, and elevation errors [Wenzel et al., 1993] were seen very frequently. No significant results were achieved with respect to correctly identified talker locations in these two tests. Regarding audio quality, however, the Talker in the Corners test achieved a MOS-LQS value (95% confidence interval) of 3.85 ± 0.76 and the Listener in the Corners test a MOS-LQS value (95% CI) of 3.68 ± 0.79. Moreover, subjects attained proper orientation and no orientation errors were found throughout these tests. Thus, we achieved our primary target of proper orientation within the developed virtual acoustic environment.
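The MOS values with 95% confidence intervals reported here and in the following sections can be reproduced from raw opinion scores in the usual way. The sketch below uses the normal approximation with the sample standard deviation, which is an assumption, since the thesis does not state the exact variant used:

```python
import math

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and the half-width of its 95% confidence
    interval (normal approximation, sample standard deviation)."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    return mean, z * math.sqrt(var / n)

mos, ci = mos_with_ci([4, 4, 5, 3, 4, 5, 4, 3])
print(f"MOS-LQS (95% CI): {mos:.2f} +/- {ci:.2f}")
```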
From these tests it could be concluded that the following are possible factors that made it difficult for subjects to properly identify a talker location:
• Non-individualized HRTFs
we do not encounter this kind of talker and listener positions in everyday life. However, it was important for us to examine our virtual acoustic environment with some added difficulties, since we wanted to test the developed environment thoroughly and to start the second phase of testing with a perfect orientation.
(Figure: overview of the test setups Horizontal Placement, Frontal Placement-1, Frontal Placement-2, Surround Placement-1 and Surround Placement-2.)
percent success rate. It is interesting to report that the Small Room setup accumulated the highest MOS score (Table 4.3) among all setups, yet produced the lowest localization success scores. The MOS-LQS value (95% CI) was 4.25 ± 0.63 (Table 4.3).
Speech samples were processed at these talker positions, while the listener position was fixed as shown in the layout (Figure 4.5). During the test, subjects were presented with the samples in randomized order in a one-talker situation; that is, only one talker's speech was processed at a time at one of the positions described above. Subjects were then asked to identify the position of the talker. Within this virtual acoustic room, listener and talker had the same height; details of the listener and talker heights are presented in (Table 4.3).
4.1.5.1 Results
Default and HRTF-2 achieved the highest success scores with 89% each. Small Room produced the lowest result with 64%, and Talker Standing achieved 69%. The main highlights of the Frontal Placement-1 test were thus Default and HRTF-2; these results show the effectiveness of Frontal Placement-1 in combination with the Default and HRTF-2 setups. The MOS-LQS value (95% CI) was 4.07 ± 0.68 (Table 4.3).
In the literature it was also found that the time required to recognize a person is shorter in such cases.
4.1.6.1 Results
Default produced the highest success score with 73%, and Talker Standing accumulated the second highest score with 66%. Small Room had the lowest success score with 38%, and HRTF-2 achieved only 62%. It can safely be concluded that an increase in the number of simultaneous talkers decreases localization performance: Default, which in the earlier Frontal Placement-1 test with its one-talker situation reached nearly 90%, dropped by around 16 percentage points in the two-talker situation. The MOS-LQS value (95% CI) was 3.93 ± 0.68 (Table 4.3).
4.1.7.1 Results
4.1.8.1 Results
In Surround Placement-2, Default produced the highest success score with 46%, HRTF-2 the second highest with 41%, and Small Room the lowest with 30%. The two-talker situation caused an overall reduction in scores. Additionally, subjects faced frequent front/back confusions in Surround Placement-2. Therefore, it can safely be concluded that, for teleconferencing solutions, placing the listener at the center of the talker positions is not suitable at all. The MOS-LQS value (95% CI) was 4.12 ± 0.61 (Table 4.3).
4.2 Summary
The quality of conference calls can be significantly enhanced if the telephones do not reproduce the speech in mono but instead use stereo headphones and spatial audio rendering. Then one can identify participants by locating them, and one can listen to one specific talker even if multiple talkers speak at the same time.
Listening-only tests using normal stereo headphones have shown that listeners can locate the origin of sounds and the position of talkers quite well. At the same time, the speech quality is only slightly reduced by adding reverberations, echoes and HRTF-related filters. No subject complained about difficulties in understanding the talkers or about any extra effort required to concentrate on the talkers during the tests.
The test results revealed that speech localization performance is good when the speech source is placed at the same height as the listener and poor when it is placed vertically below or above the listener. In the listening-only tests, subjects seemed quite sure about the speech orientation. The speech quality remained very good throughout all tests, and there were no impairments even with two echoes and reverberations.
The same holds for two simultaneous sound sources: each source could be clearly heard and distinguished during the tests. The summary of MOS scores confirms the speech quality (Table 4.3).
Small Room accumulated the highest MOS score (Table 4.3) in the Horizontal Placement test but did not produce better localization results than Default and HRTF-2. It can safely be concluded that smaller rooms produce better speech quality but not better localization scores.
Front/back reversals and elevation localization errors were commonly seen throughout the listening-only tests. Possible reasons for the front/back reversals are the use of non-individualized HRTFs and the fact that our tests were done without any head-tracking system installed.
The Default setup, employing an HRTF with five reverberators for five frequency bands, produced better results than the HRTF-2 (ten reverberators for ten frequency bands), Small Room and Talker Standing setups.
Chapter 5
Assessing Virtual Teleconference
Rooms
5.1 Introduction
The 3D audio simulations of the 3D telephony and teleconferencing system are based on a virtual acoustic environment, and properly choosing this environment is essential for further improving the system. This chapter describes a series of experiments and examines the effects that simulated virtual acoustic room properties, virtual sitting arrangements, reflections off a conference table, the number of concurrent talkers and voice characteristics have on the perception of speech quality, locatability and speech intelligibility in a 3D teleconferencing system. In particular, the tests were designed to answer the following questions: To what extent are multiple-talker localization performance and subjective speech quality ratings influenced by the size of the virtual conference room? What are the results when a conference table is simulated, and what is the overall impact of changing the conference table size? What results are achieved when the number of simultaneous talkers increases? Do different voice types have an influence on the easiness of locating simultaneous talkers? What are the results when the talker position density increases?
Also, to author’s knowledge, there is hardly any literature available regarding
simulating a conference table in the virtual acoustic conferencing rooms to study its
impact on overall speech intelligibility. Additionally, it has been reported by [Jeub
et al., 2009] that reflections of the conference table can cause decrease in the speech
intelligibility. We experimented with different properties of conference table to study its
impact on speech intelligibility in particular.
Also, we know that changes in the room properties (a change in volume) and changes in the source-to-receiver configuration (distance or orientation) cause changes in the direct-to-reverberant ratio at the receiver, which aids sound source distance perception [Vesa, 2009]. We experimented with different room properties and different source-to-receiver configurations to study the near and far perception of talkers in these placement tests. Additionally, room size is another important factor that needs to be studied, in order to determine which room size allows listeners of the teleconference to easily understand multiple talkers and locate them in space.
The remainder of this chapter is structured as follows: (Section 5.2) lists related and ongoing research on 3D audio, spatial audio teleconferencing systems and the quality assessment of such systems. (Section 5.3) discusses the methodology, setup and execution of the listening-only tests presented in this chapter, listing the utilized testing scenarios, procedures and terms. Afterwards, the results of these tests are presented in detail in (Section 5.4). Finally, the chapter concludes with a summary of the obtained results in (Section 5.5).
5.2 Related Work
Teleconferences suffer from many well known problems. For example, the listener
performance in multi-talker scenarios decreases in terms of understanding speech,
locating talkers and concentrating on a talker of choice as there is an increase in auditory
scene complexity [Brungart et al., 2007]. If binaural or even 3D audio is incorporated in
teleconferencing systems, the quality of teleconferences can be increased [Yankelovich
et al., 2006, Begault, 1994].
Multiple 3D audio teleconference systems have been implemented. In [Hughes, 2008], Hughes presented a 3D audio teleconferencing system called Senate. In [Reynolds et al., 2009], a distribution model for headphone-based spatialized audio conferences was presented. [Herre et al., 2010] described a combination of Spatial Audio Object Coding and Directional Audio Coding technologies for interactive teleconferencing. In [Ahrens et al., 2008], the Sound Renderer Framework, which can be used to render 3D audio for teleconferences, was presented.
Spatial audio teleconferencing systems under development are far from mass market usage, as their quality of experience does not fulfill all user demands yet. Consequently, it is very important to measure the quality of existing systems to understand how to improve them. In [Kilgore and Chignell, 2006, Kilgore, 2009], experimental research was presented to determine whether the combination of spatialization and a simple visual representation of a voice's location helps in recognizing completely unfamiliar voices. The test results show that localization easiness benefits from coupling spatial audio to a visual interface only with a large number of voices, in this case eight, but not with four voices. In her work "Audio Conferencing Enhancements", [Vesterinen, 2006] tested performance differences between 3D, monophonic and stereophonic audio conferences through subjective tests. The presented results show that spatially mixed hemispherical audio produced the most pleasing listening experience of a multi-person conversation.
The impact of spatialized audio and video on the user experience in multi-way video conferences using proprietary software was explored in [Inkpen et al., 2010]. Their study did not reveal any significant differences between mono audio and spatialized audio. The results of other studies [Kilgore et al., 2003, Yankelovich et al., 2006, Hyder et al., 2010b], however, showed a positive influence of spatial audio. Because of these contradictory research results, we see it as an important task to improve spatial audio conferencing, as different spatial teleconferencing systems may perform significantly differently.
In our research review it was also found that auditory selective attention listening tasks employ dichotic [Hillyard et al., 1973], Interaural Level Difference (ILD) and/or Interaural Time Difference (ITD) [Darwin and Hukin, 1999, Shinn-Cunningham and Ihlefeld, 2004] presentations. In [Spring, 2007], HRTF presentations were utilized, with stimuli of four simultaneous talkers presented to the listener for selective attention tasks. Subjects were asked to concentrate on one story told by one of the talkers while ignoring the other three stories. The average of correct responses reported was 58%, ranging from 18% to 84%. In comparison, our work includes the presentation of stimuli containing four simultaneous talkers placed at different locations in the virtual acoustic environment; the tasks included identifying the mixed-gender talkers, understanding the speech and locating every concurrent talker in virtual space.
5.3 User Experiments
In order to enhance our 3D Telephony system we conducted formal listening-only tests
to measure localization performance, localization easiness, spatial and overall speech
quality of different virtual teleconferencing scenarios.
To measure localization performance, each test participant was presented with a map of possible talker locations. Then, the actual location of each talker was compared to the location selected by the test participant. Localization easiness described the subjectively perceived effort required by test participants to localize a talker, while spatial quality described how well the participant could perceive that talkers were spatially separated,
and overall speech quality referred to the perceived speech quality as compared to a
real life conversation. Localization easiness, spatial and overall speech quality were
measured using discrete Mean Opinion Score - Listening Quality Scale Wide-band
(MOS-LQSW) scores with the values 1 (bad), 2 (poor), 3 (fair), 4 (good) and 5 (excellent).
The MOS-LQSW values were named MOS-LQSW LE for localization easiness, MOS-
LQSW SQ for spatial quality and MOS-LQSW OQ for overall speech quality.
During the tests the five parameters voice type, number of concurrent talkers, table
size, talker position density and room size were modified. The influence of each
parameter was evaluated by comparing a specially designed test setup consisting of a
series of two tests to a given reference test.
User experiments were conducted with 31 paid subjects, 13 female and 18 male, according to [ITU-T, 1996]. All test participants were aged between 20 and 45 years, with an average age of 27 years. Eight of the 31 participants had earlier experience with listening-only tests, and all subjects indicated a good to professional level of computer proficiency. The average time taken by the subjects to complete all tasks was 62 minutes. Each subject participated in 11 different tests contained in 5 different setups plus one reference test, thereby assessing quality and localization information on 71 audio samples. Thus, 2,201 (31 × 71) audio sample assessments were collected in total.
All audio samples consisted of anechoic speech samples taken from [ITU-T, 1998]. They were processed by, and recorded from, the open-source 3D audio rendering engine Uni-Verse [Kajastila et al., 2007] at a sampling rate of 16 kHz.
A screenshot of Uni-Verse's rendering engine is shown in (Fig. 5.1); further details about the usage of the Uni-Verse framework can be found in (Chapter 3). The speech samples were recorded using three different male and three different female voices, each speaking four sentences in American English. Table 5.1 lists all samples used during the experiments as well as their durations. Human speech samples were used as sound sources because they directly correspond to the target application, a multi-party teleconferencing system.
All tests were conducted in a quiet listening room on a computer using a specially
designed user interface as shown in (Fig. 5.2). Before the tests were conducted, each
participant received an introduction into the testing environment and instructions about
the tasks to be accomplished during the tests. Every test was preceded by a learning phase
during which the participants were presented reference samples with their accompanying
correct locations. In the training phase, all samples were presented in the same linear
order to each participant and could be played up to three times using the provided play
button, before moving on to the next sample by pressing the next button. To enable
participants to distinguish the different talkers contained in each sample, each talker was
represented by a number as well as its spoken text.
Each participant was asked a series of questions to be answered for each talker contained within each sample. First, the locations of all talkers had to be determined by selecting a location from a map of possible talker locations. Secondly, localization easiness, spatial quality and overall speech quality had to be rated using the previously described discrete scores MOS-LQSW LE, MOS-LQSW SQ and MOS-LQSW OQ.
5.3.1 Experimental Design
All tests were performed in cubic virtual conference rooms of varying dimensions. The walls of the rooms had the typical acoustic properties of concrete. A schematic overview of the virtual test environment and all measured parameters is shown in (Fig. 5.3 and 5.4).
A round conference table with the acoustic properties of wood was placed at the center of the room at a height of h_table = 0.75 m above the floor. The table had a variable radius of 2, 3 or 4 meters, depending on the test.
Either 5, 7 or 9 participants were distributed equally around the table. All participants were placed at a distance of d_part = 0.25 m from the table and at a height of h_part = 1.25 m above the floor.
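Under these assumptions, the participant positions follow from simple circle geometry. The sketch below is illustrative only (the coordinate origin at the table center is an assumption): it distributes n participants evenly on a circle just outside the table edge.

```python
import math

def seat_positions(n_participants, table_radius_m, d_part=0.25, h_part=1.25):
    """Place participants equally around a round table: each sits d_part
    outside the table edge at height h_part above the floor."""
    r = table_radius_m + d_part
    return [(r * math.cos(2.0 * math.pi * k / n_participants),
             r * math.sin(2.0 * math.pi * k / n_participants),
             h_part)
            for k in range(n_participants)]

# Five participants around the 2 m reference table:
for pos in seat_positions(5, 2.0):
    print(tuple(round(c, 2) for c in pos))
```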
In each test, one of the participants always represented the listener and was placed at a fixed position. To simulate the listener, a generic HRTF for five frequency bands was assumed, due to the good experiences obtained in our previous studies (Chapter 4). All other participants represented talkers, whose positions and numbers were varied in the different setups, with at least 2 and at most 4 participants talking concurrently. Additionally, the distribution of male and female talkers was varied to examine the influence of the different voice types on localization performance and subjective speech quality.
Besides a reference test setup, we tested five different setups, each varying one of the above-mentioned parameters at a time compared to the reference test. (Table 5.2) lists all setups and their respective parameters. The setups were called Voice Type, Number of Simultaneous Talkers, Listener-to-Sound Source Distance, Talker Position Density and Sound Source-to-Wall Distance; they are described in the following sections.
5.3.1.1 Reference Test
The reference test is based on processed speech signals with an average length of 14.38 s, spoken simultaneously by two male talkers from four possible locations distributed around the table. The virtual conference room has a size of 20 × 20 × 20 m³, and the radius of the table is set to 2 m. Sound source positions are labeled relative to the listener location as 1-NearLeft, 2-FarLeft, 3-FarRight and 4-NearRight; the position of the listener is labeled Listener. The listener and all sound sources face the center of the table. Within the reference test, six samples with different combinations of voice-to-position assignments were recorded. The total number of samples assessed for this test is 186.
Name | Room dimension | Participants | Table radius | Simultaneous talkers | Voice type
Reference | 20 × 20 × 20 m³ | 5 | 2 m | 2 | m/m
Voice Type | 20 × 20 × 20 m³ | 5 | 2 m | 2 | f/f or m/f
Number of Simultaneous Talkers | 20 × 20 × 20 m³ | 5 | 2 m | 3 or 4 | m/m/m or f/f/f; m/f/m/f
Listener-to-Sound Source Distance | 20 × 20 × 20 m³ | 5 | 3 m or 4 m | 2 | m/m
Talker Position Density | 20 × 20 × 20 m³ | 7 or 9 | 2 m | 2 | m/m
Sound Source-to-Wall Distance | 15 × 15 × 15 m³ or 10 × 10 × 10 m³ | 5 | 2 m | 2 | m/m
Table 5.2: Test setups and parameters
5.4 Results
Figure 5.5: Localization correctness vs. MOS-LQSW LE ratings (Voice Type)

S.No | Test parameter (relative speech frequency) | 2 talkers located correctly | 1 talker located correctly | 0 talkers located correctly
1 | Male/Male | 46% | 35% | 19%
2 | Female/Female | 37% | 23% | 40%
3 | Male/Female | 61% | 31% | 8%
Table 5.3: Talker localization distribution (Voice Type)
Figure 5.6: Talker localization vs. MOS-LQSW LE ratings (Number of Simultaneous Talkers)
talkers, one out of four talkers was located correctly; only 6% of the time could no talker be located correctly. The MOS ratings were similar to those found in Number of Simultaneous Talkers-1, only the MOS-LQSW LE (95% CI) was slightly better at 3.14 ± 0.13.
5.4.4 Listener-to-Sound Source Distance
The results of Listener-to-Sound Source Distance show that a larger table leads to better localization performance. Listener-to-Sound Source Distance-1 employed a table radius of 3 m. Here, an overall talker localization correctness of 71% was achieved, compared to 64% in the reference test, as shown in (Fig. 5.7). In 57% of the cases both talkers were located correctly, one of two in 28%, and none in 15% of all cases (Table 5.5). Misperception occurred in a manner similar to the reference test, while all MOS scores were slightly higher, at 3.72 ± 0.10 (MOS-LQSW LE, 95% CI), 3.68 ± 0.09 (MOS-LQSW SQ, 95% CI) and 3.75 ± 0.09 (MOS-LQSW OQ, 95% CI).
Using a radius of 4 m for the virtual conference table, Listener-to-Sound Source Distance-2 yielded 75% overall correctly located talkers: in 59% of the cases both talkers were located correctly, in 31% only one of two, and in 10% none. All MOS scores for this test were within the confidence intervals of Listener-to-Sound Source Distance-1.
Figure 5.8: Localization correctness vs. MOS-LQSW LE ratings (Sound Source Density)
In Sound Source Density-2, each talker could be placed at one of eight possible locations. Here, only 37% overall talker localization correctness was achieved: in 17% of all cases both talkers were located correctly, in 41% only one, and in 42% none of the talkers were located correctly. Misperception occurred between 5-FarRight and
The smaller room, with a volume of 1000 m³, also exhibited a correctly located talker ratio of 72%: both talkers could be located in 58% of the cases, one talker in 30% and none of the talkers in 12%. Again, misperception was found to be similar to the reference test, and the MOS ratings were nearly equal to those of Sound Source-to-Wall Distance-1 and the reference test.
5.5 Summary
As shown by the results listed in Section 5.4, each of the measured parameters has a
substantial influence on talker localization performance.
The results of the Voice Type setup clearly show that participants were able to locate two simultaneous talkers more often when the presented stimuli were of different genders, as previously assumed, and that two male talkers were easier to locate than two female talkers. The first finding can be explained by the fact that it is much easier to distinguish two voices if their pitches differ greatly; for the second, a possible explanation is that the male voices showed greater differences in voice pitch and hence were easier to differentiate than the female voices. But since the subjective localization easiness ratings do not show any significant differences between the reference test and Voice Type-1/2, one can assume that the reasons are not that obvious. Another explanation is that the experiments were performed by more male than female participants. Both tests achieved subjective MOS quality ratings at an acceptable level.
It could also be shown that an increasing number of participants leads to higher localization correctness ratios, which partly contradicts the preliminary assumptions made in (Section 5.3.1.3). Although this result seems counter-intuitive, one has to keep in mind that the number of possible talker locations was kept constant while the number of concurrent talkers increased; hence the talker-to-location ratio increased with the number of concurrent talkers. Therefore, participants were able to directly compare all concurrent talkers, and the error of misperceiving a talker location as an empty location was minimized. Subjects reported that the spatial separation of all simultaneous talkers helped them to determine the corresponding locations to a good extent, although the echoes and reverberations of three simultaneous talkers made it difficult to absorb the situation for a longer period, resulting in significantly lower MOS-LQSW LE ratings for three simultaneous talkers.
In the medium-sized room, having dimensions of 15 × 15 × 15 m and an average reverberation start delay time of 89–94 ms, a localization accuracy of 71.77% was achieved.
(Figure: average reverberation start delay time comparison for three virtual acoustic rooms with two sound sources.)
No significant difference was found between the results of the smaller and the medium-sized room. In the bigger room, having dimensions of 20 × 20 × 20 m and an average reverberation start delay time of 107–124 ms, a localization accuracy of 63.44% was achieved. This accuracy was significantly lower than in the small and medium-sized rooms. This difference in localization results suggests the importance of early reflections, which were found to be lower in value for the smaller rooms (Table 5.8). In [Bradley et al., 2003] it was reported that a room volume reduction from 1777 m³ to 1092 m³ resulted in an increase of up to 3 dB in the benefit taken from early reflections. Their work also included a reduction of the ceiling height from 10 m to 7 m; lowering the ceiling height produced lower reverberation times.
…through changing the length of the early reflections, using the average values of the longest early reflections for each source separately [Kajastila et al., 2007].
Chapter 6
Conversational Tests
A conversation may be defined as the alternating adoption of the roles of listener and talker by the conversation partners interacting with each other [Richards, 1973, Guéguin et al., 2008]. The International Telecommunication Union Standardization Sector (ITU-T) recommendation [ITU-T, 2007] describes methods and procedures for conducting conversational tests to evaluate subjective communication quality. Based on [ITU-T, 2007], a pair of subjects takes part in the conversational test by talking and listening interactively and, at the end of the test, votes using MOS quality scores.
Normally, subjective test results obtained from pairs of subjects do not properly reflect teleconferencing requirements. In typical teleconferencing situations it can be assumed that the number of participants is larger than two. Additionally, it is quite likely that more than two persons start talking at the same time during the conversation [Yankelovich et al., 2006]. Unfortunately, there is no standard available yet that covers methods and procedures for conducting multi-party conversational tests reflecting proper teleconferencing situations.
In his recent work, [Raake, 2011] presented a conversation test method for assessing conference quality with three participants. Raake submitted his work to the International Telecommunication Union (Study Group 12), proposing the method as a potential appendix to recommendation [ITU-T, 2007] or as a new, stand-alone recommendation.
In order to evaluate and optimize the 3D telephony system with respect to conversational audio quality, we conducted conversational tests with three participants, comparing four audio qualities: mono, stereo, spatial, and spatial with head-tracking. The results are presented in the following sections.
6.1 Test design
The test layout for a conversational test is presented in (Fig. 6.1); a detailed explanation of each component shown in this layout can be found in (Chapter 3). This was a three-participant conversational test, meaning that only three participants could take part in this subjective study at any given time. Each participant sat in a separate quiet room during the tests.
The conversational test participants were connected with each other through the conference bridge using the Ekiga VoIP client. The conference bridge provided overall control of the call among the participants during the test. For spatial audio, however, the conference bridge, after establishing a connection, forwarded all individual audio streams to a virtual acoustic server for rendering. Additionally, a separate connection between each client and the virtual acoustic server was used for transferring each user's position and orientation for head-tracking purposes.
6.2 Test description
In the conversational tests, 23 paid subjects (9 female, 14 male, average age 30) participated. Subjects voted using MOS quality scores for mono audio, stereo audio, spatial audio and spatial audio with head-tracking separately.
Four conversational test scenarios were developed to test each audio quality separately. Each scenario lasted five minutes. At the end of each scenario, we asked the subjects nine questions based on the recommendation [ITU-T, 2007]. A summary of the questions asked in the conversational tests is provided in (Table 6.1); details can be found in (Appendix C). The headphones used in the conversational tests were Sennheiser PC-230. Head-tracking was achieved with PNI Sensor Corporation's SpacePoint gaming tracker.
Table 6.1: Summary of the conversational test questions
1. How would you assess the sound quality of the other person's voice?
2. How well did you understand what the other person was telling you?
3. What level of effort did you need to understand what the other person was telling you?
4. How would you assess your level of effort to converse back and forth during the conversation?
5. How annoying was it for you when all partners were talking?
6. What is your opinion of the connection you have just been using?
8. How easy was it for you to determine the direction of a conversational partner's speech in the listening environment?
The SpacePoint tracker offers nine axes of motion tracking (a 3-axis magnetometer, a 3-axis gyroscope and a 3-axis accelerometer), driven by PNI's motion-tracking engine [SpacePoint, PNI., 2011].
Subjects were provided with 10 discussion topics and were asked to select one topic unanimously for each test scenario. The discussion topics included sports, student affairs at universities, music, food and the European financial crisis.
6.2.1 Results
In the category sound quality of the conversational partner's voice, stereo audio performed well, yielding a MOS rating (95% CI) of 4.45 ± 0.28. Interestingly, however, spatial audio and spatial audio with head-tracking yielded lower MOS ratings (95% CI) of 3.90 ± 0.50 and 3.75 ± 0.43, respectively (Table 6.2).
Question | Mono (MOS ± CI) | Stereo (MOS ± CI) | Spatial (MOS ± CI) | Spatial-HT (MOS ± CI)
1 | 3.95 ± 0.32 | 4.45 ± 0.28 | 3.90 ± 0.50 | 3.75 ± 0.43
2 | 4.65 ± 0.35 | 4.80 ± 0.19 | 4.40 ± 0.53 | 4.35 ± 0.38
3 | 4.70 ± 0.22 | 4.75 ± 0.21 | 4.20 ± 0.58 | 4.35 ± 0.41
4 | 4.70 ± 0.22 | 4.65 ± 0.23 | 4.45 ± 0.47 | 4.25 ± 0.40
5 | 4.35 ± 0.31 | 4.50 ± 0.36 | 4.35 ± 0.41 | 4.15 ± 0.38
6 | 4.10 ± 0.37 | 4.30 ± 0.31 | 3.75 ± 0.48 | 3.80 ± 0.39
7 | 4.30 ± 0.34 | 4.30 ± 0.27 | 3.85 ± 0.49 | 3.90 ± 0.40
8 | 4.55 ± 0.39 | 4.65 ± 0.27 | 4.15 ± 0.55 | 4.05 ± 0.47
9 | 4.05 ± 0.49 | 4.25 ± 0.34 | 3.70 ± 0.53 | 3.80 ± 0.45
Table 6.2: Comparison of conversational quality MOS values with 95% CI for the nine questions (mono, stereo, spatial, and spatial with head-tracking)
(Figure: per-question MOS ratings for the mono, stereo and spatial conditions over questions 1 to 9.)
6.3 Summary
Conversational tests were performed among three interlocutors (three conference participants at a time) to optimize the 3D telephony system. Since the conversational tests were done in real time, it was of interest to check to what extent our teleconferencing solution performs well and which audio qualities the participants prefer, because it is not easy to listen to and understand three simultaneous talkers even in real life: [Stifelman, 1994] states that listening to three simultaneous audio streams is cognitively difficult, even in face-to-face situations.
In the conversational test results it was found that stereo audio surpassed the other audio qualities, namely mono and spatial with and without head-tracking. The respective MOS ratings (95% CI) were 4.57 ± 0.27 for stereo, 4.37 ± 0.33 for mono, 4.08 ± 0.50 for spatial and 4.04 ± 0.41 for spatial with head-tracking. We had expected better MOS scores for spatial audio with and without head-tracking, since spatial audio is the more natural representation of sound, but the test participants' perception was the opposite of our expectation. The reason why spatial audio with and without head-tracking did not score better than mono and stereo may be that users are more accustomed to mono and stereo audio from their everyday use of communication solutions (VoIP, land-line phones and mobile phones); the participants' preferences appeared to be based on the audio quality experiences they encounter in their daily use of communication channels. Importantly, no participant complained about the spatial audio quality with or without head-tracking; rather, the participants reported a completely new teleconferencing experience while conversing with their partners. We can safely argue that the acceptance of spatial audio quality (also with head-tracking) among users and customers can be further assessed once they are offered spatial audio conferencing services for talking to more than three partners at a time. In the near future, spatial audio conferencing with more than three participants through 3D telephony will be possible.
Chapter 7
Investigating Virtual Acoustic
Environments & QoE Relationship
Quality of experience (QoE) is an assessment based on human perception, feeling and behavior. A communication ecosystem, on the other hand, represents the interaction among various domains, such as technical aspects, business models, human behavior and contextual aspects. The main contribution of this chapter is to present a conceptual and holistic QoE model comprising all domains of a communication ecosystem and to evaluate QoE-context relationships through user studies and empirical analysis. The virtual acoustic environment is a subcategory of the contextual model; it comprises the virtual rooms and the different voice types present in them. We present findings of user studies analyzing the impact of a virtual acoustic environment on QoE. Furthermore, using a statistical approach, QoE terms, their analysis and their validation in two different test scenarios have been benchmarked. In the first scenario, the investigation shows a strong correlation between the virtual rooms and three QoE factors, namely localization performance, spatial audio quality and overall audio quality, and a moderate correlation with localization easiness. The investigation also led to the discovery that simultaneous mixed-gender talkers in a conference call secure better QoE scores.
7.1 Introduction
Along with rapid technological advances, there has been a proliferation of new and innovative systems, services, applications and end-user devices. Network management concepts are also evolving, and the autonomic network management paradigm aspires to bring human-like intelligence to telecommunication management tasks [Laghari et al., 2009]. Thanks to these technical advancements, the fulfillment of customer demands and user experience requirements has also come into focus and is becoming a main differentiator for the effectiveness of telecom operators and service providers. To understand human quality requirements, the notion of Quality of Experience (QoE) is used, since it provides an assessment of human expectations, feelings, perceptions and cognition with respect to a particular product, service or application [Kilkki, 2008, Laghari et al., 2011]. Traditionally, a technology-centric approach based on QoS parameters has been employed to ensure quality and better performance for end users. However, QoE expands this horizon, as it tries to capture people's aesthetic and hedonic
needs. The International Telecommunication Union (ITU-T) defines QoE [ITU, 2007] as "the overall acceptability of an application or service, as perceived subjectively by the end-user". We define QoE as a blueprint of all human quality requirements and experiences arising from the interaction of a person with technology and with business entities in a particular context. QoE comprises human subjective and objective factors developed in a particular context.
To understand the QoE concept, at first, it is pertinent to know and understand the
communication ecosystem. Human behavior, business, technological and contextual
aspects constitute a communication ecosystem. The term ecosystem has been used in
various fields; in ecology [Dictionary, 2011] it is defined as, "a system involving the
interaction between a community of living organisms in a particular area and its non-
living environment". Similarly, a communication ecosystem could be defined as, "the
systematic interaction of people, technology and a business in a particular context”.
In a communication ecosystem, different actors interact with each other and they may
have different approaches. For instance, technical people try to provide a better user
experience by assuring network and service performance based on Quality of Service
(QoS) models. Business people develop economic models and strategies to assess
the profit, cost and customer churn rate. Psychologists and social scientists analyze
human attitude, intentions and cognition to understand human behavior in a particular
context. All actors of a communication ecosystem may have different vocabularies,
semantics and models, but to get a holistic and unified view of human needs and
behavioral requirements, these different approaches in business, technology, psychology
and cognitive science should be integrated into one framework. In a communication
ecosystem, where these domains interact with each other, it would be interesting to
converge and combine these different models to understand how human behavior is
actually shaped in a communication ecosystem. The QoE notion is thus a converging
factor that combines the influences of all these aspects to produce a blueprint of human
aesthetic and hedonic needs.
For a communication ecosystem, Kilkki's QoE model [Kilkki, 2008] proposes a simple and intuitive interaction between the various actors. Kilkki presents a generic interaction between a person, technology and business. However, referring to Kilkki's framework for analyzing a communication ecosystem (Fig. 7.1, adapted from [Kilkki, 2008]), we argue that his framework neither classifies QoE factors nor includes any contextual aspects. We therefore extend Kilkki's work by adding contextual aspects to the model and by defining the taxonomy of each domain in a communication ecosystem.
ITU-T [G-1080, 2008] proposes a QoE model that classifies QoE factors into two parts: one related to subjective human components or emotions, and the other to objective QoS parameters. Additionally, [G-1080, 2008] considers technology-centric parameters to be objective factors. We propose objective QoE factors based on human physiology, cognitive science and psycho-physics, because cognitive science
and mental models can be utilized to obtain precise quantitative information about human
performance [ITU, 2007]. A consolidated QoE based communication ecosystem has also
been proposed with extended concepts as described in a later section.
3D Telephony was selected as a case study. It consists of a 3D audio telephone and a teleconferencing system. Classic teleconferencing often suffers from issues such as low intelligibility and a limited ability of the participants to discern unfamiliar interlocutors. 3D Telephony is a possible solution to address the shortcomings of traditional teleconferencing services: it provides a virtual acoustic environment, and 3D sound improves the quality of experience of a teleconferencing service. To evaluate the 3D Telephony system [Hyder et al., 2010b,a], user studies were conducted following ITU-T's P.800 standard.
This chapter is divided into two main contributions. First, a theoretical framework for a consolidated QoE model and its taxonomy is presented. In the second half of the chapter, an experimental setup and the results of subjective studies are presented. Additionally, the results presented in this chapter help us to analyze the relationship between the virtual acoustic environment and the QoE (Fig. 7.2).
The chapter is organized as follows. In Section 7.2 we present related work. In Section 7.3 we discuss our proposal for a consolidated QoE model for a communication ecosystem. In Section 7.4 we present a use case study based on 3D Telephony and the methodology adopted for our user studies. In Section 7.5 we present test
results and discuss our findings. Conclusions and an outline of future work are presented in the last section of the chapter.
7.2 Background
7.2.1 QoS and QoE
QoE is considered an extension of the QoS concept; most audio telephony services, such as VoIP services, are still assessed based on Quality of Service (QoS) parameters [Bai and Ito, 2006, Radhakrishnan and Larijani, 2010]. Existing QoS metrics, such as packet loss rate, jitter, delay and throughput, typically indicate the impact on the audio quality level from the network point of view and do not directly reflect the user's experience. Consequently, these QoS parameters fail to capture the subjective and objective aspects associated with human perception and cognition.
QoE approaches have been introduced to overcome the limitations of current QoS-aware multimedia networking schemes as far as human perception and subjective aspects are concerned [Takahashi et al., 2008]. QoE applicability scenarios, requirements, evaluations and assessment methodologies in multimedia systems have been investigated by several researchers and working groups, such as the International Telecommunication Union – Telecommunication Standardization Sector (ITU-T) [G-1080, 2008], and the European Technical Committee for Speech, Transmission, Planning, and Quality of Service [ETSI, 2009]. The ITU-T proposed the E-Model [ITU-T, 2003a] to assess the quality of experience indirectly from network traffic patterns. Furthermore, the ITU-T recommended the use of the Perceptual Speech Quality Measure (PSQM) in its recommendation P.861 [ITU-T, 1994b], but it was recognized as having certain limitations in specific application areas. It was replaced by P.862, known as Perceptual Evaluation of Speech Quality (PESQ) [ITU-T, 2001b].
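As a rough illustration of how such an indirect, parameter-based estimate works, the following Python sketch combines a simplified R-factor computation with the standard R-to-MOS conversion of ITU-T G.107; the default rating of 93.2, the delay-impairment approximation and the example impairment values are common simplifications and assumptions, not the full recommendation.

    def e_model_mos(delay_ms, ie_eff=0.0):
        """Simplified E-model sketch: map one-way delay and an equipment/loss
        impairment (Ie_eff) to an estimated MOS via the G.107 R-to-MOS curve."""
        # Delay impairment Id: small linear term plus an extra penalty
        # above roughly 177 ms (a widely used approximation).
        i_d = 0.024 * delay_ms
        if delay_ms > 177.3:
            i_d += 0.11 * (delay_ms - 177.3)
        r = 93.2 - i_d - ie_eff          # default rating minus impairments
        r = max(0.0, min(100.0, r))
        if r <= 0.0:
            return 1.0
        # Standard R -> MOS conversion from ITU-T G.107
        return 1.0 + 0.035 * r + 7e-6 * r * (r - 60.0) * (100.0 - r)

    # Example: 150 ms one-way delay and a hypothetical Ie_eff of 11
    print(round(e_model_mos(150.0, ie_eff=11.0), 2))   # roughly 3.97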
7.3 QoE Based Model for a Communication Ecosystem

7.3.1.1 Subjective Human Factors
Subjective human factors are qualitative in nature and are normally assessed empirically. Some common subjective QoE factors are perceptions, feelings, ease of
use, joy of use, satisfaction, etc. These factors are normally obtained through surveys,
customer interviews, and ethnographic field studies [Cooper et al., 2007]. For more information on subjective studies, ITU-T provides the P.800 recommendation [ITU-T, 1996]. In marketing and social psychology, psychological models are normally used to understand human intentions and behavior. One widely recognized model is the Technology Acceptance Model (TAM) [Davis, 1986], which is a derivative of the Theory of Reasoned Action (TRA) [Sheppard et al., 1988]. The TAM is a simple model to help
understand human intention and behavior towards the adoption of a particular product or
service. Over time, the TAM model has been revised and advanced by other models, such
as the Theory of Planned Behavior (TPB), the Decomposed Theory of Planned Behavior
(DTPB), and the Unified Theory of Acceptance and Use of Technology (UTAUT) [Al-
Qeisi, 2009], etc. These approaches are used to understand the subjective acceptability
of any product or service by end users or customers. These psychological models can
also be utilized for capturing subjective human factors.
7.3.1.2 Objective Human Factors
Objective human factors are quantitative in nature and are related to human physiology,
to psycho-physical aspects and to cognition. Some examples of human objective factors
are the human audio-visual system, brain waves, heart rate, blood volume pressure,
memory, attention, language, task performance and human reaction time. The influence
of biology and of the cognitive system on human behavior or decision making is
normally investigated in cognitive psychology, behavioral neuroscience or in biological
psychology. Audio-visual systems have received increased attention with the innovation
and development of teleconferencing, computer games, and virtual reality systems. The
use of psycho-physical aspects and physiology could contribute to amassing significant data about the human biological state. Quantitative data answer questions such as "how much", "how many" or "where" [Cooper et al., 2007]. These factors can be gathered
and evaluated through subjective testing and/or via quantitative research.
The line between subjective and objective human factors suggests that they may be
interdependent and could possibly be inferred from each other through some mechanism,
e.g., a change in human biological and cognitive parameters could also influence human
subjective perceptions and feelings or vice versa.
7.3.1.3 Human Entity
The human entity category provides information about a person, such as his or her roles (e.g., customer, user) and characteristics (e.g., age, gender). The roles can be divided into three main categories: user, customer and group. In the current work, we
focus more on user and customer roles. A customer is the entity/person who subscribes
to a service and is a legal owner of that service; however, he or she may or may not be
the primary user of that service. A user is the individual who actually uses a service. The
line between the user and customer boxes indicates the possibility that their roles can
interchange. A customer who is paying for an on-line telephony service may be stricter
about quality than a user who is using a free on-line audio chat service. In [Laghari et al.,
2010] a customer experience model was presented to specifically understand customer
experience requirements. In addition to human entity roles, it is also possible that people in different age, gender or demographic groups have different QoE requirements. This sort of differentiation of human roles and characteristics helps researchers to better understand QoE requirements and to document them with much more precision.
7.3.2 Context Domain

7.3.2.1 Contextual Entity
(iii) Social context describes the social aspects of the contextual entity. The
social context usually contains interpersonal relations such as the social associations,
connections, or affiliations that may exist between two or among many people. For
instance, social relations can contain information about friends, family, neighbors, co-
workers, etc.
7.3.2.2 Contextual Characteristics
Each contextual entity may have specific characteristics and parametric specifications, for example, GPS data for a location, the echoes and reverberations of teleconferencing rooms, or the size of a virtual teleconferencing room. Changes in
contextual aspects have the tendency to influence human behavior. A person participating
in a teleconference or a telephony call who is sitting in a quiet room has different QoE
requirements than a person conducting a call or conference while standing in a railway
station, at a bus stop or in a cafeteria. To provide improved customization and better user
experience, the technological domain should be agile enough to adapt to the needs of a
user/customer as appropriate to their changing context. Context-aware applications and
systems are being developed to cater to the needs of real contexts. In the area of virtual context, a user has more freedom to shape his or her context according to his or her own needs. For example, in a 3D virtual acoustic environment for teleconferencing services, end users can vary the size of the virtual teleconferencing room and/or place the participants of a teleconference anywhere in the virtual acoustic environment that suits their needs. Thus, it becomes very interesting to investigate the significance of the impact of contextual aspects on QoE, and how contextual information could be exploited by the technological and business domains to develop services with better user experience and customized business models.
7.3.3 Business Domain
7.3.3.1 Business Entity
A business entity represents service providers, network operators, marketplace owners
and/or device vendors. Most business entities have customer/user service touch points that customers reach in order to subscribe to a service that fulfills their intended goals or to report a service problem. This interaction between customer/user and provider can be direct or indirect (on-line), but in both cases the interaction experience develops positive or negative feelings, or possibly a combination of both.
7.3.3.2 Business Characteristics
A business entity has certain properties such as business model and strategies,
which basically define the direction of its business. Business characteristics include
advertisement, pricing, promotion and brand image. To avoid a high customer/user churn rate and bad word-of-mouth, business characteristics should be mapped to QoE so that they fulfill customer expectations. Furthermore, there should be an alignment between business and technical characteristics to create an integrated
7.4 A use case study - 3D Telephony
Localization Performance (LP): LP measures how correctly listeners could locate the positions of the concurrent talkers in a virtual teleconferencing room. LP data are real quantitative data based on the actual performance of listeners; they represent the listener's ability to locate either both talkers correctly, only one, or neither, in a virtual acoustic environment with the help of a map. LP data are presented as percentage values.
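As a minimal sketch of how such percentage values can be derived from raw trial outcomes (the response encoding below is an assumption for illustration, not the actual data format of the study):

    from collections import Counter

    def localization_performance(trial_outcomes):
        """LP as percentages: per trial, the listener located both concurrent
        talkers correctly (2), only one (1), or neither (0)."""
        n = len(trial_outcomes)
        counts = Counter(trial_outcomes)
        return {label: 100.0 * counts[k] / n
                for k, label in ((2, "both correct"), (1, "one correct"), (0, "neither"))}

    # Hypothetical outcomes of 20 trials for one listener
    trials = [2, 2, 1, 2, 0, 2, 1, 2, 2, 2, 1, 2, 2, 0, 2, 2, 1, 2, 2, 2]
    print(localization_performance(trials))
    # {'both correct': 70.0, 'one correct': 20.0, 'neither': 10.0}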
We define three subjective human factors: Localization Easiness, Spatial Audio Quality and Overall Audio Quality. To obtain measures of these subjective human factors, subjects were asked to give their opinion ratings on a five-point MOS scale.
Localization Easiness (LE): LE represents human perception and feelings about localizing talkers. We define LE as an assessment of a listener's feeling of how easily the concurrent talkers can be located in the VAE.
Spatial Audio Quality (SAQ): This factor is also a perception- and feeling-related measure.
(iii) What is the actual performance of listeners in correctly locating the talkers at their
positions? (iv) How is the 3D audio quality rated by the subjects? (v) Is there any
difference in the listeners' perception and performance with respect to voice type and
virtual room size?
To validate this model and investigate the relationship between the QoE and a virtual
acoustic environment, we conducted user studies based on the following methodology.
7.4.2 Methodology
The methodology adopted for the tests has already been described in detail in Section 5.3. The scenarios and sub-scenarios for the current user study were selected based on the following considerations.
Virtual Room Size: In this scenario, we analyzed how varying virtual room sizes and sound source/talker-to-wall distances impact the QoE factors, and measured how participants' opinions and performance vary with room size. To determine the effect of room size and sound source/talker-to-wall distance on all QoE scores, this test used three different rooms with dimensions of 10³ m³, 15³ m³ and 20³ m³. The average lengths of the presented stimuli were 14.38 s, 14.65 s and 14.43 s, respectively, for the three tests.
Voice Type: In this scenario, our goal was to test the impact of relative and absolute differences in voice types (such as two concurrent male, female or mixed gender talkers) on the QoE. The three tests within this setup were Voice Type-1: two simultaneous female talkers with an average signal length of 13.03 s; Voice Type-2: two mixed gender talkers with an average signal length of 14.42 s; and Voice Type-3: two concurrent male talkers with speech signals of an average length of 14.38 s, each presented from four possible locations distributed around the table.
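For analysis scripts, the two scenarios and their sub-tests can be encoded as structured test conditions; the following sketch is a hypothetical encoding using the values given above:

    # Hypothetical encoding of the two scenarios; room edge lengths in metres,
    # mean stimulus lengths in seconds (values as reported above).
    SCENARIOS = {
        "virtual_room_size": [
            {"room_edge_m": 10, "mean_stimulus_s": 14.38},
            {"room_edge_m": 15, "mean_stimulus_s": 14.65},
            {"room_edge_m": 20, "mean_stimulus_s": 14.43},
        ],
        "voice_type": [
            {"talkers": ("female", "female"), "mean_stimulus_s": 13.03},
            {"talkers": ("female", "male"), "mean_stimulus_s": 14.42},
            {"talkers": ("male", "male"), "mean_stimulus_s": 14.38},
        ],
    }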
Summary of the Cronbach alpha test results, which verify the reliability and internal consistency of the QoE factors: all results are well above the 0.6 threshold, which indicates a high level of reliability for the construct variables.
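Cronbach's alpha for k items follows the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of the total score); the sketch below computes it for a hypothetical subjects-by-items matrix of five-point ratings:

    import numpy as np

    def cronbach_alpha(ratings):
        """Cronbach's alpha for an (n_subjects x n_items) rating matrix."""
        ratings = np.asarray(ratings, dtype=float)
        k = ratings.shape[1]
        item_vars = ratings.var(axis=0, ddof=1).sum()   # sum of item variances
        total_var = ratings.sum(axis=1).var(ddof=1)     # variance of total scores
        return k / (k - 1) * (1.0 - item_vars / total_var)

    # Hypothetical five-point ratings from six subjects on three items
    ratings = [[4, 4, 5], [3, 4, 4], [5, 5, 5], [2, 3, 3], [4, 4, 4], [3, 3, 4]]
    print(round(cronbach_alpha(ratings), 2))   # 0.93, well above the 0.6 threshold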
7.5 Results & Discussions
7.5.2 Discussion
In this section, the results for the two main scenarios, based on the virtual room size and the voice types of the participants, are presented.
7.5.2.1 Experiment I: QoE Factors and Virtual Room Size
In this experiment, the QoE factors are analyzed based on changes in the size of a virtual
teleconferencing room. The results (Table 7.2 and Fig. 7.6) suggest that there is a very small decrease in localization performance when we switch from a small room (10³ m³) to a medium-sized room (15³ m³). However, when we switch to a big room (20³ m³), a sudden decrease in localization performance can be observed.
Figure 7.6: Quality Scores Comparison for Different Virtual Acoustic Rooms
Additionally, relating to the spatial audio quality and overall audio quality experience
in virtual teleconference rooms, results show that both the subjective spatial audio quality
and the overall audio quality MOS scores gradually improve with an increase in the
size of a virtual room. In contrast to the LP, a strong positive correlation is found
for both SAQ (0.94) and OAQ (0.98). This implies that the localization performance of the test participants decreases with increasing room size, while the spatial and overall audio quality increase with the virtual room size. One possible factor for this result could be the echoes and reverberations, since they are longer in larger rooms. As reported in [Mershon et al., 1989, Zahorik, 2002, Shinn-Cunningham, 2001, Begault et al., 2001], reverberation in acoustic environments is considered to be a reliable cue in identifying sound source distance, but it also modestly degrades sound source directional perception [Santarelli, 2001] and speech intelligibility [Houtgast].
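The correlations reported here are plain Pearson coefficients between room size and the mean scores per room; a minimal sketch (with illustrative MOS values, not the measured data) is:

    import numpy as np

    room_edge = np.array([10.0, 15.0, 20.0])   # room edge length in metres
    saq_mos = np.array([3.4, 3.7, 3.9])        # illustrative mean SAQ MOS per room
    r = np.corrcoef(room_edge, saq_mos)[0, 1]
    print(f"Pearson r = {r:.2f}")              # close to 1 for a monotonic increase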
7.5.2.2 Experiment II: QoE Factors and Voice Type
The highest localization easiness is achieved with mixed gender voice types. Regarding the impact of voice type on spatial audio quality and overall audio quality, the results in Table 7.2 indicate that the highest subjective spatial audio quality and overall audio quality MOS scores were achieved with mixed gender voice types and the lowest with female voice types. It is safe to conclude that mixed gender voice types achieve better LP values and MOS scores, since two voices of different genders can be distinguished much more easily than two voices of the same gender.
7.6 Conclusion
In this chapter, a consolidated Quality of Experience (QoE) model has been presented, which is based on the various domains or actors and develops a holistic and integrated view of QoE in a communication ecosystem. The model shows the interaction between human, business, technology and contextual aspects. To evaluate this model, we focused on studying the influence of contextual aspects on QoE. The influence of technological and business parameters may be considered in future work.
The case study was based on the central idea of this thesis, the 3D telephony and teleconferencing system. In particular, an applied QoE model for 3D telephony was constructed to study the influence of the virtual acoustic environment on the user/customer quality of experience. Further, a methodology designed to act as a framework for conducting useful user studies was presented. In the user study, the impact of changing characteristics and contextual aspects within the 3D telephony system on user QoE factors, such as localization performance, localization easiness, spatial audio quality and overall audio quality, was assessed. According to the results, it is safe to conclude that contextual aspects do influence QoE constructs. It was found that changes in the virtual room sizes and in the voice types of concurrent talkers produced different values/scores for the QoE factors. Additionally, this study suggests that a medium-sized (15³ m³) teleconferencing room and mixed gender voice types provide the optimal quality of experience in a 3D telephony-based virtual acoustic environment.
Chapter 8
Conclusions
As an outcome of this thesis work, a 3D audio telephony and teleconferencing system, called 3D telephony, has been realized. The 3D telephony system is based on a customizable virtual acoustic environment. In order to optimize 3D telephony, a series of subjective experimental studies was conducted, and the empirical analysis of the results has been presented.
The first user study comprised four sets of seven different placements of the participants of the conference call, used to judge audio quality, understandability, locatability and the occurrence of front/back or elevation localization errors. In this user study, the localization performance of the subjects was measured and perceptual quality scores were obtained by varying the HRTFs, the virtual acoustic room sizes and the heights of the listener and the talkers. When the listening tests were carried out by placing the talkers in the corners of the virtual acoustic rooms, it was found that most of the time the subjects were unsure about the talker locations; they frequently made front/back and up/down localization errors. On the other hand, when the placement of talkers and listener was made horizontal, with the listener at the center of the room and the talkers to the left, right, front and back of the listener, the localization performance improved. In particular, left/right localization was nearly perfect; however, nearly 50 percent front/back localization errors were still observed. Changing the placement from horizontal to frontal proved even more effective: frontal placement yielded better localization scores than the competing placements of talkers and listeners. With frontal placement, listeners found it very easy to locate the virtual talkers, and their success rate in locating them remained nearly perfect.
For further optimization of the 3D telephony and teleconferencing solution, subjective experiments were conducted that helped to understand how to select a proper virtual acoustic environment for teleconferencing. This second subjective study was based on eleven sets of user experiments examining the effect that simulated virtual acoustic room properties, virtual sitting arrangements, reflections of a conference table, the number of concurrent talkers and voice characteristics have on the perception of audio quality, locatability and speech intelligibility. It was identified that two simultaneous talkers of different genders were more often localized correctly and achieved better perceptual quality scores. It was also identified that increasing the number of simultaneous talkers raised the localization correctness ratios, but at the cost of lower quality scores. Furthermore, increasing the table size within the virtual acoustic environment increased the overall localization scores; however, no major difference in quality scores was found among the different table sizes used for the experiments. It was also identified that an increase in talker density decreases localization performance and quality scores. Finally, it was found that increasing the volume and size of the virtual acoustic room brought a positive change in the quality scores, whereas localization performance did not follow this trend: localization scores were better with smaller room sizes.
To optimize the 3D telephony solution further, conversational tests with three interlocutors were conducted. The virtual acoustic environment for the conversational tests was selected on the basis of the successful results obtained in the earlier subjective studies. Through the conversational tests we obtained subjective opinions comparing audio qualities such as mono, stereo, spatial and spatial with head-tracking. It was identified that spatial audio performed slightly worse than mono and stereo, but overall the conversational tests based on spatial audio produced satisfying results. From the conversational tests it was concluded that, to clearly observe the advantages of spatial audio for conversational quality, future conversational tests should be conducted with at least four interlocutors.
Further, a QoE model for a communication ecosystem, based on its various domains or actors, has been presented. The presented QoE model develops a holistic and integrated view of QoE in a communication ecosystem and shows the interaction among human, business, technology and contextual aspects. To evaluate this model, the influence of contextual aspects on the user Quality of Experience has been studied. In addition, an applied user QoE model for 3D telephony was constructed to study in particular the influence of the Virtual Acoustic Environment (VAE) on the user Quality of Experience (QoE). Through the user studies, the impact of contextual aspects on QoE factors was assessed. It was found that changes in context bring changes in the QoE constructs.
8.1 Outlook on Future Research
Based on the knowledge gained about how to design virtual acoustic environments, an important item of future work is to further optimize virtual acoustic rooms so that they are perceived as close to real conference rooms. To optimize virtual acoustic rooms to a further extent, a standard for the acoustic quality of rooms such as [Beuth Verlag, 2004] could be followed. This standard applies to small and medium-sized rooms and aims to ensure good acoustic quality precisely for spoken communication in such rooms. We can take advantage of this standard and apply its design guidelines to virtual acoustic rooms of different volumes and sizes.
Further, future work may include conversational tests based on at least four interlocutors to optimize the 3D telephony solution to a further extent. Through such conversational tests we may compare different audio qualities such as mono, stereo, spatial with and without head-tracking, and different
Appendix A
Research Papers
A.1 Conference Papers/Technical Reports
1. Mansoor Hyder, Michael Haun, Christian Hoene, “Measurements of Sound
Localization Performance and Speech Quality in the Context of 3D Audio
Conference Calls”, In International Conference on Acoustics, NAG/DAGA, March
2009, Rotterdam, Netherlands.
3. Mansoor Hyder, Michael Haun, and Christian Hoene, “Placing the Participants
of a Spatial Audio Conference Call”, In IEEE Consumer Communications
and Networking Conference - Multimedia Communication and Services (CCNC
2010), January 2010, Las Vegas, USA.
5. Christian Hoene and Mansoor Hyder, “Optimally Using the Bluetooth Subband
Audio Codec (SBC) Over Wireless links and on the Internet”, In 35th Annual
IEEE Conference on Local Computer Networks (LCN) (LCN 2010), October
2010, Denver, Colorado, USA.
7. Mansoor Hyder, Khalil ur Rehman, Christian Hoene, “Are QoE requirements for
Multimedia different for men and women?”, Second International Multi Topic
Conference, IMTIC 2012, March 28-30, 2012, Jamshoro, Pakistan.
Appendix B
Summary of Contributions
I would like to summarize the contributions of my colleagues, collaborators and students to this PhD thesis. The basic scientific idea of this research work belongs to Dr.-Ing. Christian Hoene. He contributed many useful comments while conducting the research work, selecting testing parameters and writing up research results and publications. Michael Haun, Olesja Weidmann and Jonas Leidig did their Diplom and undergraduate theses under my supervision; their implementation and research contributions were important to achieving this thesis work. I initiated the basic research idea from scratch, conducted and implemented all experimental user studies, obtained data through subjective/user studies, analyzed the research results, wrote and presented research articles, and guided and supervised students in accomplishing their Diplom and undergraduate thesis work. All this eventually helped me to accomplish this thesis. I also initiated a collaborative work with Institut Telecom SudParis, Paris, France on the idea of "An Investigation into the Relationship Between Perceived Quality-of-Experience and Virtual Acoustic Environments: the Case of 3D Audio Telephony". I worked with Prof. Noel Crespi and his student Mr. Khalil ur Rehman Laghari on this collaboration. We developed a consolidated Quality of Experience model for a communication ecosystem and also developed an applied user Quality of Experience model for the 3D telephony system. Chapter 7 of this thesis is based on our collaborative work. Later on, we (the collaborative partners) wrote a journal article together, which was accepted for publication in the Journal of Universal Computer Science.
Appendix C
Conversational Tests
C.1 Questionnaire for Conversational Tests
Please provide your feedback on the call quality you have just experienced by answering the following questions:
(1) How would you assess the sound quality of the other person’s voice?
• No distortion at all, natural
• Minimal distortion
• Moderate distortion
• Considerable distortion
• Severe distortion
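For the evaluation, such verbal categories are conventionally coded onto a five-to-one scale and averaged into a mean opinion score; the following sketch illustrates this coding (the numeric mapping is the conventional P.800-style one and is an assumption for illustration):

    # Conventional 5-to-1 coding of the verbal distortion scale (assumed mapping)
    DISTORTION_SCALE = {
        "No distortion at all, natural": 5,
        "Minimal distortion": 4,
        "Moderate distortion": 3,
        "Considerable distortion": 2,
        "Severe distortion": 1,
    }

    def mean_opinion_score(answers):
        """Average the coded ratings of all subjects for one condition."""
        return sum(DISTORTION_SCALE[a] for a in answers) / len(answers)

    print(mean_opinion_score(["Minimal distortion", "Moderate distortion",
                              "No distortion at all, natural"]))   # 4.0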
(2) How well did you understand what the other person was telling you?
• No loss of understanding
(3) What level of effort did you need to understand what the other person was
telling you?
• No special effort required
(4) How would you assess your level of effort to converse back and forth during the
conversation?
• No special effort
(5) How annoying was it for you when all partners were talking?
• No annoyance
• Minimal annoyance
• Moderate annoyance
• Considerable annoyance
• Severe annoyance
(6) What is your opinion of the connection you have just been using?
• Excellent
• Good
• Fair Quality
• Poor Quality
• Bad Quality
(8) How easy was it for you to determine the direction of the conversational partner's speech in the listening environment?
• No special effort required
• Good
• Fair Quality
• Poor Quality
• Bad Quality
C.2 Conversational Tests Descriptive Statistics
Table C.1: Descriptive statistics for the conversational test for mono audio quality
Table C.2: Descriptive statistics for the conversational test for stereo audio quality
Table C.3: Descriptive statistics for the conversational test for spatial audio quality
Table C.4: Descriptive statistics for the conversational test for spatial audio with head-tracking (spatial-HT)
Bibliography
J. Ahrens, M. Geier, and S. Spors. The soundscape renderer: A unified spatial audio
reproduction framework for arbitrary rendering methods. In Audio Engineering
Society Convention 124, 5 2008.
J. Ahrens, M. Geier, A. Raake, and C. Schlegel. Listening and conversational quality
of spatial audio conferencing. In Audio Engineering Society Conference: 40th
International Conference: Spatial Audio: Sense the Sound of Space, 10 2010.
K. Al-Qeisi. Analyzing the use of the UTAUT model in explaining an online behaviour: Internet banking adoption. 2009.
V. Algazi, R. Duda, D. Thompson, and C. Avendano. The CIPIC HRTF database. In Applications of Signal Processing to Audio and Acoustics, 2001 IEEE Workshop on the, pages 99–102. IEEE, 2001.
J. Allen and D. Berkley. Image method for efficiently simulating small-room acoustics.
J. Acoust. Soc. Am, 65(4):943–950, 1979.
T. Auer and A. Pinz. The integration of optical and magnetic tracking for multi-user
augmented reality. Computers & Graphics, 23(6):805–808, 1999.
R. Azuma, B. Hof, H. Neely III, R. Sarfaty, M. Daily, G. Bishop, V. Chi, G. Welch,
U. Neumann, S. You, et al. Making augmented reality work outdoors requires hybrid
tracking. In Augmented Reality: placing artificial objects in real scenes: proceedings
of IWAR-98, pages 219–224, 1999a.
R. Azuma, B. Hoff, H. Neely III, and R. Sarfaty. A motion-stabilized outdoor augmented
reality system. IEEE Virtual Reality, 1999. Proceedings., pages 252–259, 1999b.
E. Bachmann, X. Yun, and C. Peterson. An investigation of the effects of magnetic
variations on inertial magnetic orientation sensors. IEEE Robotics and Automation
Magazine, pages 76–87, 2007.
Y. Bai and M. Ito. A study for providing better quality of service to VoIP users. 2006.
ISSN 1550-445X.
J. Baldis. Effects of spatial audio on memory, comprehension, and preference during
desktop conferences. In Proceedings of the SIGCHI conference on Human factors in
computing systems, pages 166–173. ACM, 2001.
N. Barrett and S. Berge. A new method for B-Format to binaural transcoding. In 40th AES International Conference, Tokyo, Japan, pages 8–10, 2010.
D. Begault. 3-D sound for virtual reality and multimedia. Academic Press Professional,
Inc., San Diego, CA, USA, 1994. ISBN 0-12-084735-3.
A. Berkhout, D. De Vries, and P. Vogel. Acoustic control by wave field synthesis. Journal
of Acoustical Society of America, 93:2764–2764, 1993.
Beuth Verlag. Hörsamkeit von kleinen und mittleren Räumen (Acoustical quality in small to medium-sized rooms). Beuth Verlag GmbH, 2004.
J. Borish. Extension of the image model to arbitrary polyhedra. The Journal of the
Acoustical Society of America, 75:1827, 1984.
J. Bradley, R. Reich, and S. Norcross. On the combined effects of signal-to-noise ratio
and room acoustics on speech intelligibility. The Journal of the Acoustical Society of
America, 106:1820, 1999.
J. Bradley, H. Sato, and M. Picard. On the importance of early reflections for speech in
rooms. The Journal of the Acoustical Society of America, 113:3233, 2003.
K. Brandenburg, S. Brix, and T. Sporer. Wave field synthesis: From research
to applications. In Proceedings of 12th European Signal Processing Conference
(EUSIPCO), Vienna, Austria, 2004.
Encyclopædia Britannica. Acoustics. Oct. 2011. URL http://www.britannica.com/EBchecked/topic/4044/acoustics.
A. Bronkhorst. Localization of real and virtual sound sources. The Journal of the
Acoustical Society of America, 98:2542, 1995.
D. Brungart, A. J. Kordik, and B. D. Simpson. Effects of headtracker latency in virtual
audio displays. JAE, 54(1/2):32–44, Feb. 2006.
D. Brungart, B. Simpson, C. Bundesen, S. Kyllingsbaek, A. Burton, and A. Megreya.
Cocktail party listening in a dynamic multitalker environment. Perception and
Psychophysics, 69(1):79, 2007.
G. Burdea and P. Coiffet. Virtual reality technology. Presence: Teleoperators & Virtual
Environments, 12(6):663–664, 2003.
M. Burkhard and R. Sachs. Anthropometric manikin for acoustic research. The Journal
of the Acoustical Society of America, 58:214, 1975.
C. Cheng and G. Wakefield. Introduction to Head-Related Transfer Functions (HRTFs):
Representations of HRTFs in Time, Frequency, and Space. J Audio Eng Soc, 49(4):
231, 2001.
C. I. Cheng and G. H. Wakefield. Introduction to head-related transfer functions
(HRTFs): Representations of HRTFs in time, frequency, and space. In AES
Convention: 107,, Sept. 1999.
E. Cherry. Some experiments on the recognition of speech, with one and with two ears.
Journal of the acoustical society of America, 25(5):975–979, 1953.
Y. Chow. Low-cost multiple degrees-of-freedom optical tracking for 3d interaction
in head-mounted display virtual reality. International Journal of Recent Trends in
Engineering, 1(1):52–56, 2009.
K. Crispien and T. Ehrenberg. Evaluation of the "Cocktail Party Effect" for Multiple
Speech Stimuli within a Spatial Auditory Display. Journal of the Audio Engineering
Society, 43(11):932–941, 1995.
D. De Vries and M. Boone. Wave field synthesis and analysis using array technology. In
Applications of Signal Processing to Audio and Acoustics, 1999 IEEE Workshop on,
pages 15–18. IEEE, 1999.
A. Dey. Understanding and using context. Personal and ubiquitous computing, 5(1):4–7,
2001.
A. Dictionary. The American Heritage Science Dictionary, July 2011. URL http://dictionary.reference.com/.
M. Ericson and R. McKinley. Binaural and Spatial Hearing in Real and Virtual
Environments, chapter The intelligibility of multiple talkers separated spatially in
noise, pages 701–724. Erlbaum, Mahwah, NJ, r. h. gilkey and t. r. anderson edition,
1997.
S. ETSI. European technical committee for speech, transmission, planning, and quality
of service, 2009.
J. Fajardo, F. Liberal, and N. Bilbao. Study of the impact of UMTS Best Effort
parameters on QoE of VoIP services. In Autonomic and Autonomous Systems, 2009.
ICAS’09. Fifth International Conference on, pages 142–147. IEEE, 2009.
H. Fisher and S. Freedman. The role of the pinna in auditory localization. Journal of
Auditory research, 1968. ISSN 0021-9177.
M. Gerzon. Ambisonics. Part two: Studio techniques. Studio Sound, 17(8):24–26, 1975.
M. Good and R. Gilkey. Sound localization in noise: The effect of signal-to-noise ratio.
The Journal of the Acoustical Society of America, 99:1108, 1996.
H. Haas. Über den Einfluss des Einfachechos auf die Hörsamkeit von Sprache [On the influence of a single echo on the audibility of speech]. Acustica, 1(2):49–62, 1951.
J. Hair, R. Anderson, R. L. Tatham, and W. Black. Multivariate Data Analysis with Readings. Englewood Cliffs, NJ: Prentice Hall, 1998.
D. Hallaway, S. Feiner, and T. Høllerer. Bridging the gaps: Hybrid tracking for adaptive
mobile augmented reality. Applied Artificial Intelligence, 18(6):477–500, 2004.
D. Hawkins and W. Yacullo. Signal-to-noise ratio advantage of binaural hearing aids and
directional microphones under different levels of reverberation. Journal of Speech and
Hearing Disorders, 49(3):278, 1984.
C. Hoene and M. Hyder. Optimally using the bluetooth subband codec. In Local
Computer Networks (LCN), 2010 IEEE 35th Conference on, pages 356–359. IEEE,
2010.
H. Hu, L. Chen, and Z.-Y. Wu. The estimation of personalized HRTFs in individual VAS. In Fourth International Conference on Natural Computation, pages 203–207. IEEE, 2008.
P. Hughes. Spatial audio conferencing. In ITU-T Workshop: "From Speech to
Audio:bandwidth extension,binaural perception", Lannion France, 2008.
C. Huygens. Traité de la lumière. published in Leyden, 1690.
M. Hyder, M. Haun, and C. Hoene. Measurements of sound localization performance and speech quality in the context of 3D audio conference calls. In International Conference on Acoustics, NAG/DAGA, Rotterdam, Netherlands, Mar. 2009.
M. Hyder, M. Haun, O. Weidmann, and C. Hoene. Assessing virtual teleconferencing rooms. In 129th Audio Engineering Society Convention, San Francisco, CA, USA, Nov. 2010a.
M. Hyder, M. Haun, and C. Hoene. Placing the participants of a spatial audio conference
call. In IEEE Consumer Communications and Networking Conference - Multimedia
Communication and Services (CCNC 2010), Las Vegas, USA, Jan. 2010b.
K. Inkpen, R. Hegde, M. Czerwinski, and Z. Zhang. Exploring spatialized audio &
video for distributed conversations. In Proceedings of the 2010 ACM conference on
Computer supported cooperative work, pages 95–98. ACM, 2010.
J. Irwin. Basic anatomy and physiology of the ear. Infection and hearing impairment,
page 1, 2006.
F. Itakura. Minimum prediction residual principle applied to speech recognition. IEEE
Transactions on Acoustics, Speech and Signal Processing, 23(1):67–72, 1975.
F. Itakura and S. Saito. Analysis synthesis telephony based on the maximum likelihood
method. Repts. 6th Int. Congr. Acoustics, pages 17–20.
T. ITU. Definition of Quality of Experience (QoE). TD 109rev2 (PLEN/12), Jan. 2007.
T. ITU. International Telecommunication Union, ITU-T webpage, Dec. 2010. URL http://www.itu.int.
R. ITU-T. E.800: Terms and definitions related to quality of service and network
performance including dependability. ITU-T Recommendation, 1994a.
R. ITU-T. P.861: Objective quality measurement of telephone-band (300–3400 Hz) speech codecs. ITU-T Recommendation, 1994b.
R. ITU-T. P.800: Methods for subjective determination of transmission quality. ITU-T
Recommendation, 1996.
R. ITU-T. BS.1534: Method for the subjective assessment of intermediate sound quality (MUSHRA). ITU-T Recommendation, 2001a.
R. ITU-T. P.862: Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. ITU-T Recommendation, 2001b.
R. ITU-T. G.107: The E-model, a computational model for use in transmission planning. ITU-T Recommendation, 2003a.
M. Jeub, M. Schäfer, and P. Vary. A binaural room impulse response database for the
evaluation of dereverberation algorithms. In Proceedings of the 16th international
conference on Digital Signal Processing, pages 550–554. Institute of Electrical and
Electronics Engineers Inc., 2009.
M. Karjalainen. A new auditory model for the evaluation of sound quality of audio
systems. In Proc. ICASSP, volume 85, pages 608–611, 1985.
M. Karjalainen. Structure and function of hearing, chapter 05, Aug. 2011. URL http://www.acoustics.hut.fi/teaching.
M. Karjalainen, M. Tikander, and A. Harma. Head-tracking and subject positioning using
binaural headset microphones and common modulation anchor sources. In Acoustics,
Speech, and Signal Processing, 2004. Proceedings.(ICASSP’04). IEEE International
Conference on, volume 4, pages iv–101. IEEE, 2004.
C. Kim, S. Ahn, I. Kim, and H. Kim. 3-dimensional voice communication system for
two user groups. Advanced Communication Technology, 2005, ICACT 2005. The 7th
International Conference on, 1:100–105, 0-0 2005a.
D. Kim, M. Tarraf, L. Technol, and N. Whippany. Enhanced perceptual model for non-
intrusive speech quality assessment. volume 1, 2006.
H. Kim, D. Jee, M. Park, and S. Yoon, B.and Choi. The real-time implementation of 3D
sound system using DSP. In IEEE 60th Vehicular Technology Conference (VTC2004),
volume 7, pages 4798–4800, Sept. 2004.
J. Kim, S. Kim, Y. Kim, J. Lee, and S.-i. Park. New HRTFs (head related transfer functions) for 3D audio applications. AES Convention: 118, May 2005b.
A. Krokstad, S. Strom, and S. Sørsdal. Calculating the acoustical room response by the
use of a ray tracing technique. Journal of Sound and Vibration, 8(1):118–125, 1968.
K. Laghari, I. Yahya Ben, and N. Crespi. Towards a service delivery based on customer
experience ontology: shift from service to experience. In Proceedings of the 5th
IEEE international conference on Modelling autonomic communication environments,
pages 51–61. Springer-Verlag, 2010.
M. Laitinen. Binaural reproduction for directional audio coding. PhD thesis, Helsinki University of Technology, 2008.
C. Low and L. Babarit. Distributed 3D audio rendering. Computer Networks and ISDN
Systems, 30:407–415, 1998.
B. Mathews. Vector markup language (vml). World Wide Web Consortium Note 13-
May-1998, May 1998. URL http://www.w3.org/TR/1998/NOTE-VML-19980513.
W. Noble. Auditory localization in the vertical plane: Accuracy and constraint on bodily
movement. The Journal of the Acoustical Society of America, 82:1631, 1987.
V. Pulkki. Virtual sound source positioning using vector base amplitude panning. Journal
of the Audio Engineering Society, 45(6):456–466, 1997.
V. Pulkki, J. Huopaniemi, and T. Huotilainen. Dsp tool for 8-channel audio mixing. In
Proc. Nordic Acoustical Meeting, volume 96, pages 307–314, 1996.
A. Raake. 3CTS: 3-party conversational test scenarios for conference assessment. ITU-T, Study Group 12, Contribution 201, Jan. 2011.
A. Rix and M. Hollier. Perceptual analysis measurement system for robust end-to-end
speech quality assessment. volume 3, pages 1515–1518, 2000.
W. Ryu and D. Kim. Real-time 3D Head Tracking and Head Gesture Recognition. pages
169–172, 2007.
D. Schröder and T. Lentz. Real-Time Processing of Image Sources Using Binary Space
Partitioning. Journal of the Audio Engineering Society, 54(7/8):604–619, 2006. ISSN
0004-7554.
E. Shaw. External ear response and sound localization. Localization of sound: Theory
and applications, pages 30–41, 1982.
E. Shaw. Acoustical features of the human external ear. Mahwah, NJ: Lawrence
Erlbaum, 1997.
H. Sinnreich and A. Johnston. Internet communications using SIP: delivering VoIP and
multimedia services with Session Initiation Protocol. John Wiley & Sons, Inc., 2001.
SpacePoint, PNI. SpacePoint 9-axis sensor system, Oct. 2011. URL http://www.pnicorp.com/products/spacepoint-gaming.
D. Spring. Selection of information in auditory virtual reality. PhD thesis, Otto-von-
Guericke-Universität Magdeburg, Universitätsbibliothek, 2007.
A. Takahashi, D. Hands, and V. Barriac. Standardization activities in the ITU for a QoE assessment of IPTV. Communications Magazine, IEEE, 46(2):78–84, 2008.
C. Uni-Verse. Uni-Verse - FP6 project. Uni-Verse Consortium, Mar. 2007. URL http://www.uni-verse.org/.
R. Vaananen, V. Valimaki, J. Huopaniemi, and M. Karjalainen. Efficient and parametric
reverberator for room acoustics modeling. In ICMC 97, pages 200–203, 1997.
S. Vesa. Binaural sound source distance learning in rooms. IEEE Trans Audio Speech
Language Process, 17(8):1498–1507, 2009.
E. Von Hornbostel and M. Wertheimer. Über die Wahrnehmung der Schallrichtung [On
the perception of the direction of sound]. Akademie der Wissenschaften, 1920.
H. Wallach. The role of head movements and vestibular and visual cues in sound
localization. Journal of Experimental Psychology, 27(4):339, 1940.
E. M. Wenzel. The role of system latency in multi-sensory virtual displays for space
applications. In Proceedings of HCI International 2001, New Orleans, LA, pages
619–623, Aug. 2001.
F. Wightman and D. Kistler. Monaural sound localization revisited. The Journal of the
Acoustical Society of America, 101:1050, 1997.
F. Wightman and D. Kistler. Resolution of front–back ambiguity in spatial hearing by
listener and source movement. The Journal of the Acoustical Society of America, 105:
2841, 1999.
Wikipedia. Signal-to-noise ratio. Wikipedia, The Free Encyclopedia, Oct. 2011. URL http://en.wikipedia.org/wiki/Signal-to-noise_ratio.
W. Yang and J. Bradley. Effects of room acoustics on the intelligibility of speech in
classrooms. Journal of the Acoustical Society of America, 125(2):1–12, 2009.
W. Yang and M. Hodgson. Auralization study of optimum reverberation times for speech
intelligibility for normal and hearing-impaired listeners in classrooms with diffuse
sound fields. The Journal of the Acoustical Society of America, 120:801, 2006.
W. Yang, M. Benbouchta, and R. Yantorno. Performance of the modified bark
spectral distortion as an objective speech quality measure. In IEEE International
Conference on Acoustics Speech and Signal Processing, volume 1. Institute of
Electrical Engineers-INC (IEE), 1998.
N. Yankelovich, W. Walker, P. Roberts, M. Wessler, J. Kaplan, and J. Provino. Meeting
central: making distributed meetings more effective. In Proceedings of the 2004 ACM
conference on Computer supported cooperative work, pages 419–428. ACM, 2004.
N. Yankelovich, J. Kaplan, J. Provino, M. Wessler, and J. M. DiMicco. Improving
audio conferencing: are two ears better than one? In Proceedings of the 2006
20th anniversary conference on Computer supported cooperative work (CSCW ’06),
pages 333–342, New York, NY, USA, 2006. ACM. ISBN 1-59593-249-6. doi:
http://doi.acm.org/10.1145/1180875.1180926.
J. Yim, E. Qiu, and T. Graham. Experience in the design and development of a game
based on head-tracking input. In Proceedings of the 2008 Conference on Future Play:
Research, Play, Share, pages 236–239. ACM, 2008.
X. Yun, E. Bachmann, and R. McGhee. A simplified quaternion-based algorithm for
orientation estimation from earth gravity and magnetic field measurements. IEEE
Transactions on Instrumentation and Measurement, 57(3):638–650, 2008.
P. Zahorik. Assessing auditory distance perception using virtual acoustics. The Journal
of the Acoustical Society of America, 111:1832, 2002.
R. Zhu and Z. Zhou. A Real-Time Articulated Human Motion Tracking Using Tri-
Axis Inertial/Magnetic Sensors Package. IEEE Transactions on Neural Systems and
Rehabilitations Engineering, 12(2):295, 2004.