P. 1
Ph.D_Thesis_2009_Hofbauer_K

Ph.D_Thesis_2009_Hofbauer_K

|Views: 79|Likes:
Published by shervinsh

More info:

Published by: shervinsh on Dec 08, 2011
Copyright:Attribution Non-commercial

Availability:

Read on Scribd mobile: iPhone, iPad and Android.
download as PDF, TXT or read online from Scribd
See more
See less

11/11/2012

pdf

text

original

Graz University of Technology

Doctoral Thesis
Speech Watermarking
and Air Traffic Control
Konrad Hofbauer
Faculty of Electrical and Information Engineering
Graz University of Technology, Austria
Advisor: Prof. DI Dr. Gernot Kubin (TU Graz)
Co-Advisor: Prof. DDr. W. Bastiaan Kleijn (KTH Stockholm)
Graz, March 6, 2009
ii
iv
Abstract
Air traffic control (ATC) voice radio communication between aircraft pilots and con-
trollers is subject to technical and functional constraints owing to the legacy radio
system currently in use worldwide. This thesis investigates the embedding of digital
side information, so called watermarks, into speech signals. Applied to the ATC voice
radio, a watermarking system could overcome existing limitations, and ultimately
increase safety, security and efficiency in ATC. In contrast to conventional watermark-
ing methods, this field of application allows embedding of the data in perceptually
irrelevant signal components. We show that the resulting theoretical watermark ca-
pacity far exceeds the capacity of conventional watermarking channels. Based on
this finding, we present a general purpose blind speech watermarking algorithm that
embeds watermark data in the phase of non-voiced speech segments by replacing
the excitation signal of an autoregressive signal representation. Our implementation
embeds the watermark in a subband of narrowband speech at a watermark bit rate
of the order of magnitude of 500 bit/s. The system is evaluated experimentally using
an ATC speech corpus and radio channel measurement results, both of which were
produced specifically for this purpose and constitute contributions on their own. The
adaptive equalization based watermark detector is able to recover the watermark data
in the presence of channels with non-linear phase, time-variant bandpass filtering,
amplitude gain modulation, desynchronization and additive noise. In the aeronautical
application the scheme is highly robust and outperforms current state-of-the-art speech
watermarking methods. The theoretical results show that the performance could be
increased even further by incorporating perceptual masking.
Keywords: Information hiding, digital watermarking, speech watermarking, legacy
system enhancement, watermark capacity, watermark synchronization, air traffic
control, voice radio, analog radio channel.
v
vi
Zusammenfassung
Die Sprechfunk-Kommunikation zwischen Piloten und Fluglotsen unterliegt techni-
schen und funktionalen Einschränkungen, die sich aus dem althergebrachten und welt-
weit eingesetzten analogen Sprechfunk-System ergeben. Diese Arbeit beschäftigt sich
mit der Einbettung digitaler Seiteninformation, so genannter Wasserzeichen, in Sprach-
signale. Angewandt auf den analogen Flugfunk könnte ein solches Wasserzeichen-
System bestehende Einschränkungen aufheben und letztendlich die Effizienz sowie
die aktive und passive Sicherheit in der Flugsicherung steigern. Im Gegensatz zu
herkömmlichen Wasserzeichen-Verfahren erlaubt dieser Einsatzbereich ein Einbetten
der Daten in Signalanteilen, die vom menschlichen Gehör nicht wahrgenommen wer-
den können. Es zeigt sich, dass dies zu einer erheblichen Steigerung der Kapazität
des verdeckten Datenkanals führt. Basierend auf dieser Erkenntnis wird ein univer-
selles blindes Wasserzeichen-Verfahren vorgestellt, das die Daten in die Phase von
stimmlosen Sprachlauten einbettet, indem das Anregungssignal eines autoregressiven
Signalmodells durch ein Wasserzeichen-Signal ersetzt wird. Die Implementierung
bettet die Seiteninformationen mit einer Bitrate von circa 500 bit/s in einem Teilband
des schmalbandigen Sprachsignals ein. Zur experimentellen Validierung des Systems
wurden Flugfunkkanal-Messungen durchgeführt und ein für die Flugsicherung spe-
zifischer Sprachkorpus erstellt. Diese beiden Datensammlungen stellen jeweils einen
eigenständigen Beitrag dar. Der Wasserzeichen-Detektor basiert auf adaptiver Entzer-
rung und detektiert die eingebetteten Daten sogar bei einer Übertragung des Signals
über Kanäle mit zeitvarianter Kanalverstärkung, nicht-linearer Phase, zeitvarianter
Bandpass-Filterung, Desynchronisierung und additivem Rauschen. Das Verfahren ist
hochrobust und übertrifft in der Flugfunk-Anwendung leistungsmäßig die modernsten
existierenden Wasserzeichen-Verfahren. Die vorliegenden theoretischen Ergebnisse
zeigen, dass durch eine Berücksichtigung von gehörbezogenen Maskierungsmodellen
die Leistungsfähigkeit weiter gesteigert werden kann.
Stichwörter: Digitale Wasserzeichen, Einbettung digitaler Seiteninformationen, Sprach-
wasserzeichen, Wasserzeichenkanalkapazität, Wasserzeichensynchronisierung, Flug-
sicherung, Flugfunk, Sprechfunk, analoger Sprechfunkkanal
vii
viii
Acknowledgments
Many people have contributed to my journey towards this dissertation. It is with great
pleasure that I take this opportunity to acknowledge the support I have received.
I am most deeply indebted to my mentor Horst Hering at Eurocontrol. The basic
idea of this thesis, the application of watermarking to ATC, is a brainchild of yours,
and without your initiative and persistence this entire project would have never been
possible. Your keen interest and belief in my work and your continuing support and
encouragement has always meant a lot to me. Many thanks also to Prof. Vu Duong,
Marc Bourgois and Martina Jürgens of the former INO team at Eurocontrol for the
continuing support.
I am equally indebted to my advisors Prof. Gernot Kubin and Prof. Bastiaan Kleijn.
Your knowledge, creativity and keen intuitive insights, your dedication, profession-
alism and hard-working nature, as much as your hospitality, flexibility, patience and
placidness have truly impressed and inspired me. Your guidance not only had a major
impact on the outcome of this work, but also contributed greatly to my intellectual and
personal growth.
The journey towards my degree had also been a journey throughout Europe, and
I had the chance to work at and experience three different laboratories. A deserved
thanks is given to my colleagues of the SPSC lab at TU Graz, the SIP group at KTH
Stockholm, and the former INO team at Eurocontrol, for the meaningful discussions
and for creating an enjoyable working environment. The names are too many to
mention, but I am happy to give my warm thanks to all my former office mates,
namely Tuan, Sarwar, Erhard and many more at SPSC, Martin, Daniel, Mario and Eri
at INO, and Ermin and Gustav at SIP.
My deepest thanks to my parents Konrad and Annemarie who always encouraged
and supported me throughout the many years of my studies. Finally, my warmest
thanks to my beloved girlfriend Agata for your incredible love and support.
ix
x
Contents
1. Introduction 1
1.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2. Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3. Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4. List of Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
I. Speech Watermarking 9
2. Background and Related Work 11
2.1. Digital Watermarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2. Related Work—General Watermarking . . . . . . . . . . . . . . . . . . . 12
2.3. Related Work—Speech Watermarking . . . . . . . . . . . . . . . . . . . . 14
2.4. Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3. Watermark Capacity in Speech 17
3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2. Watermarking Based on Ideal Costa Scheme . . . . . . . . . . . . . . . . 18
3.3. Watermarking Based on Auditory Masking . . . . . . . . . . . . . . . . . 19
3.4. Watermarking Based on Phase Modulation . . . . . . . . . . . . . . . . . 23
3.5. Experimental Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.6. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4. Watermarking Non-Voiced Speech 47
4.1. Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2. Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3. Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
xi
Contents
5. Watermark Synchronization 69
5.1. Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2. Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3. Experimental Results and Discussion . . . . . . . . . . . . . . . . . . . . 78
5.4. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
II. Air Traffic Control 85
6. Speech Watermarking for Air Traffic Control 87
6.1. Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2. High-Level System Requirements . . . . . . . . . . . . . . . . . . . . . . . 89
6.3. Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . 91
7. ATC Simulation Speech Corpus 95
7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.2. ATCOSIM Recording and Processing . . . . . . . . . . . . . . . . . . . . 97
7.3. Orthographic Transcription . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.4. ATCOSIM Structure, Validation and Distribution . . . . . . . . . . . . . 105
7.5. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8. Voice Radio Channel Measurements 111
8.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
8.2. Measurement System for the Aeronautical Voice Radio Channel . . . . . 112
8.3. Conducted Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
8.4. The TUG-EEC-Channels Database . . . . . . . . . . . . . . . . . . . . . . . 121
8.5. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
9. Data Model and Parameter Estimation 125
9.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
9.2. Proposed Data and Channel Model . . . . . . . . . . . . . . . . . . . . . 128
9.3. Parameter Estimation Implementation . . . . . . . . . . . . . . . . . . . . 129
9.4. Experimental Results and Discussion . . . . . . . . . . . . . . . . . . . . 137
9.5. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
10. Experimental Watermark Robustness Evaluation 147
10.1. Filtering Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
10.2. Gain Modulation Robustness . . . . . . . . . . . . . . . . . . . . . . . . . 149
10.3. Desynchronization Robustness . . . . . . . . . . . . . . . . . . . . . . . . 151
10.4. Noise Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
10.5. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
11. Conclusions 157
xii
Contents
Appendix 161
A. ATCOSIM Transcription Format Specification 161
A.1. Transcription Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
A.2. Amendments to the Transcription Format . . . . . . . . . . . . . . . . . . 165
B. Aeronautical Radio Channel Modeling and Simulation 169
B.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
B.2. Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
B.3. Radio Channel Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
B.4. Aeronautical Voice Channel Simulation . . . . . . . . . . . . . . . . . . . 179
B.5. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
B.6. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
C. Complementary Figures 189
Bibliography 193
xiii
Contents
xiv
List of Figures
1.1. Speech watermarking system for the air/ground voice radio. . . . . . . 2
2.1. Generic watermarking model. . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2. Overview of different watermarking principles. . . . . . . . . . . . . . . 13
3.1. Quantization-based watermarking as an additive process. . . . . . . . . 19
3.2. Quantization-based watermarking as sphere packing problem. . . . . . 20
3.3. Masking threshold for a signal consisting of three sinusoids. . . . . . . . 21
3.4. Masking threshold power to signal power ratio for a single utterance. . 22
3.5. The joint density f
X
a
,X
b
(a, b) integrated over an interval ∆a. . . . . . . . 28
3.6. Comparison of the derived phase modulation watermark capacities. . . 38
3.7. Difference between the derived phase modulation watermark capacities. 38
3.8. Speech signal with original and randomized STFT phase coefficients. . 40
3.9. Randomization of the phase of masked DFT coefficients. . . . . . . . . . 41
3.10. Speech signal with original and randomized ELT coefficients. . . . . . . 43
3.11. Comparison of derived watermark capacities in speech. . . . . . . . . . 45
4.1. Watermark embedding in non-voiced speech. . . . . . . . . . . . . . . . 49
4.2. Frame and block structure of the embedded signal. . . . . . . . . . . . . 51
4.3. Watermark detector including synchronization, equalization and water-
mark detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4. Generation of a passband watermark signal for a telephony channel. . 58
4.5. Pulse shapes for zero ISI between data-carrying samples. . . . . . . . . 58
4.6. Magnitude response of the bandpass, IRS and aeronautical channel filter. 61
4.7. System robustness in the presence of various channel attacks. . . . . . 62
5.1. Synchronization system based on spectral line method. . . . . . . . . . 74
5.2. Second-order all digital phase-locked loop. . . . . . . . . . . . . . . . . 75
5.3. Timing phase synchronization sensitivity. . . . . . . . . . . . . . . . . . . 79
xv
List of Figures
5.4. Phase estimation error and bit error ratio for various types of node offsets. 80
5.5. Comparison between ideal and implemented synchronization. . . . . . 81
5.6. Overall system robustness in the presence of a timing phase offset. . . 82
5.7. Frame grid detection using short signal segments. . . . . . . . . . . . . 82
5.8. Robustness of active frame detection. . . . . . . . . . . . . . . . . . . . . 83
6.1. Identification of the transmitting aircraft. . . . . . . . . . . . . . . . . . . 91
6.2. Speech watermarking system for the aeronautical voice radio. . . . . . 92
6.3. Research focus within the field of data communications. . . . . . . . . . 93
7.1. Control room and controller working position (recording site). . . . . . 100
7.2. A short speech segment with push-to-talk signal. . . . . . . . . . . . . . 101
7.3. Screen-shot of the transcription tool TableTrans. . . . . . . . . . . . . . . 103
8.1. A basic audio channel measurement system. . . . . . . . . . . . . . . . . 112
8.2. Overview of the proposed baseband channel measurement system. . . 115
8.3. Complete voice channel measurement system on-board an aircraft. . . 116
8.4. Overview and circuit diagram of measurement system and interface box. 117
8.5. General aviation aircraft and onboard transceiver used for measurements.119
8.6. Ground-based transceivers used for measurements. . . . . . . . . . . . . 120
8.7. Frequency response of the static back-to-back voice radio channel. . . . 122
8.8. Evolution of different measured frequency bins over time. . . . . . . . . 123
8.9. Visualization of the aircraft track based on the recorded GPS data. . . . 123
9.1. Time-variation of the estimated impulse responses. . . . . . . . . . . . . 127
9.2. Separation between voice radio channel model and measurement errors. 128
9.3. Proposed data and channel model. . . . . . . . . . . . . . . . . . . . . . 129
9.4. Two periodicity measures as a function of the resampling factor
f
2
f
0
. . . . 131
9.5. LTI filter plus noise channel model. . . . . . . . . . . . . . . . . . . . . . 132
9.6. Estimation error to output ratio E
y
. . . . . . . . . . . . . . . . . . . . . . 135
9.7. Signal and noise estimation errors. . . . . . . . . . . . . . . . . . . . . . . 136
9.8. Overview of estimation procedure. . . . . . . . . . . . . . . . . . . . . . 137
9.9. Implementation of the gain normalization. . . . . . . . . . . . . . . . . . 137
9.10. Estimated frequency responses. . . . . . . . . . . . . . . . . . . . . . . . . 139
9.11. DFT magnitude spectrum of the estimation error signal. . . . . . . . . . 141
9.12. DC component of the received signal recording. . . . . . . . . . . . . . 142
9.13. DFT magnitude spectrum of the time-variant gain ˆ g. . . . . . . . . . . . 144
10.1. System robustness in the presence of various transmission channel filters. 149
10.2. Embedding scheme robustness against sinusoidal gain modulation. . . 150
10.3. System robustness in the presence of gain modulation and gain control. 150
10.4. Watermark embedding scheme robustness against AWGN. . . . . . . . 152
10.5. Enhanced robustness and intelligibility by dynamic range compression. 153
11.1. Relationship among the different chapters of this thesis. . . . . . . . . . 157
xvi
List of Figures
B.1. Signal spectra of the baseband and the HF signal. . . . . . . . . . . . . . 171
B.2. Multipath propagation in an aeronautical radio scenario. . . . . . . . . . 172
B.3. Probability density functions and PSDs for Rayleigh and Rice channels. 174
B.4. Tapped delay line channel model. . . . . . . . . . . . . . . . . . . . . . . 178
B.5. Sinusoidal AM signal in equivalent complex baseband. . . . . . . . . . 180
B.6. Received signal at different processing stages. . . . . . . . . . . . . . . . 182
B.7. Power of line-of-sight and scattered components. . . . . . . . . . . . . . 182
B.8. Received signal using the Generic Channel Simulator. . . . . . . . . . . . 183
B.9. Received signal using reference channel model with Jakes PSD. . . . . 184
B.10. Discrete Gaussian Doppler PSD. . . . . . . . . . . . . . . . . . . . . . . . 185
B.11. Received signal using reference channel model with Gaussian PSD. . . 185
B.12. Time-variant path gain of a flat fading channel. . . . . . . . . . . . . . . 185
B.13. Received signal before and after an automatic gain control. . . . . . . . 187
C.1. Noise robustness curves similar to Figure 10.4. . . . . . . . . . . . . . . . 190
C.2. Multiband compressor and limiter settings. . . . . . . . . . . . . . . . . 191
xvii
List of Figures
xviii
List of Tables
4.1. Transmission Channel and Hidden Data Channel Model . . . . . . . . . 50
4.2. Comparison With Reported Performance of Other Schemes . . . . . . . 65
4.3. Causes for Capacity Gap and Attributed Rate Losses . . . . . . . . . . . 66
7.1. Feature Comparison of Existing ATC-Related Speech Corpora . . . . . . 98
7.2. Examples of Controller Utterance Transcriptions . . . . . . . . . . . . . . 104
7.3. Key Figures of the ATCOSIM Corpus . . . . . . . . . . . . . . . . . . . . 106
9.1. Filter Estimation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
9.2. Filter and Noise Estimation Results . . . . . . . . . . . . . . . . . . . . . 140
9.3. Gain Modulation Frequency and Amplitude . . . . . . . . . . . . . . . . 145
xix
List of Tables
xx
Chapter 1
Introduction
1.1. Motivation
The advances in modern information technology revolutionized the way we com-
municate. Broadband Internet access and mobile wireless communication, be it for
voice, data, or video transmission, are ubiquitous and seemingly available anytime and
anywhere. Unfortunately, this is not entirely correct. The world is full of legacies, and
it is not always possible to deploy or use state-of-the-art communication technologies.
The problem addressed in this thesis is the voice radio communication between
aircraft and civil air traffic control (ATC) operators. Still today, this communication
is based on pre-World War II technology using amplitude-modulation analog radio
transmissions on shared broadcast channels. While technology has rapidly advanced
and satellite communication facilitates broadband Internet access in the passenger
cabin, the air/ground voice radio for pilots and controller communication has not been
changed since its introduction in the early forties of the last century.
The large technological gap between communication systems available in the cabin
for passenger services and in the cockpit for air traffic control purposes is not likely to
disappear in the foreseeable future. This is due to a wide range of reasons, including
an enormous base of airborne and ground legacy systems worldwide, long life-cycles
for aeronautical systems, difficult technological requirements (such as large system
scales and safety-critical compatibility, availability and robustness requirements), inter-
national regulations, and, last but not least, a lack of common political willingness or
determination.
The ATC voice radio lacks basic features that are taken for granted in any modern
communication system, such as caller identification or selective calling. It is commonly
agreed that such features could improve safety, security and efficiency in air traffic
control. Since, despite its advantages, there is no foreseeable time frame for the
1
Chapter 1. Introduction
Aircraft System
Watermark
Embedder
TX
Data
(i.e. Identification)
ATC System
RX
Data
(i.e. Identification)
Watermark
Detector
ATC Display
T
A
G
Figure 1.1.: Speech watermarking system for the air/ground voice radio.
introduction of digital ATC voice radio communication, it is attractive to retrofit or
extend the current analog system with additional features.
Many potential features, such as caller identification, authentication, selective calling,
call routing, or position reporting, can conceptually be reduced to a transmission
of digital side information that accompanies the transmitted analog speech signal.
In order to be legacy system compatible and to avoid the necessity of a separate
transmitter, such side information should be transmitted in-band within the speech
channel as shown in Figure 1.1. On the receiving end, such side information should
not be noticeable also with unequipped legacy receivers.
The imperceptible embedding of side information into another signal, often referred
to as information (or data) hiding or digital watermarking, constitutes the core of this
thesis. We deal with the general problem of imperceptibly embedding digital data into
speech, and consider the special case of embedding into ATC speech that is transmitted
over an analog aeronautical radio channel.
Watermarking speech signals with a high embedded data rate is a difficult problem
due to the narrow bandwidth of speech. The transmission of the watermarked speech
over a poor-quality aeronautical radio channel poses an additional difficulty due to
narrow bandwidth, noise corruption, fading and time-variance of the channel.
Compared to image, video and audio watermarking, there is relatively little prior
work in the field of speech watermarking. However, the production and perception of
speech is well understood, and one can draw from a large pool of knowledge gained
in the context of speech coding.
This thesis covers a wide range of aspects of the aforementioned problem. It
considers the full communication chain from source to sink, including the particular
characteristics of the input speech, the embedding of the watermark, the characteristics
of the aeronautical radio channel, and the detection of the watermark data in the
degraded signal. Solutions are provided for a wide range of sub-problems.
The first part of this thesis deals with speech watermarking in a general context.
It investigates the theoretical watermark capacity in speech signals and proposes a
new speech watermarking method and a complete system implementation, which
outperform the current state-of-the-art methods.
2
1.2. Thesis Outline
The second part presents contributions to the domain of air traffic control. In
particular, it presents an empirical study of the aeronautical voice radio channel, a
database of ATC operator speech, and an evaluation of the robustness of the proposed
watermarking system in the aeronautical application.
1.2. Thesis Outline
This thesis is divided into two parts.
Part I considers the general problem of embedding digital side information in speech
signals. After a short review of related work in Chapter 2, the following three
chapters address the issues of watermark capacity estimation, robust and high-rate
watermarking, and watermark synchronization.
Chapter 3 investigates the theoretical watermark capacity in speech given an additive
white Gaussian noise transmission channel. Starting from the general capacity of the
ideal Costa scheme, it shows that a large improvement in theoretical watermark capacity
is possible if the application at hand allows watermarking in perceptually irrelevant
signal components. Different ways to derive approximate and exact expressions for
the watermark capacity when modulating the signal’s DFT phase are presented and
compared to an estimation of the capacity when modulating host signal frequency
components that are perceptually masked. Parameters required for the experimental
comparison of both methods are estimated from a database of ATC speech signals,
which is presented in Chapter 7.
Chapter 4 presents a blind speech watermarking algorithm that embeds the water-
mark data in the phase of non-voiced speech by replacing the excitation signal of an
autoregressive speech signal representation. The watermark signal is embedded in a
frequency subband, which facilitates robustness against bandpass filtering channels.
We derive several sets of pulse shapes that prevent intersymbol interference and that al-
low the creation of the passband watermark signal by simple filtering. A marker-based
synchronization scheme robustly detects the location of the embedded watermark data
without the occurrence of insertions or deletions.
In light of the potential application to analog aeronautical voice radio communication,
we present experimental results for embedding a watermark in narrowband speech
at a bit rate of 450 bit/s. The adaptive equalization-based watermark detector not
only compensates for the vocal tract filtering, but also recovers the watermark data in
the presence of non-linear phase and bandpass filtering, amplitude modulation and
additive noise, making the watermarking scheme highly robust.
Chapter 5 discusses different aspects of the synchronization between watermark em-
bedder and detector. We examine the issues of timing recovery and bit synchronization,
the synchronization between the synthesis and the analysis systems, as well as the
3
Chapter 1. Introduction
data frame synchronization. Bit synchronization and synthesis/analysis synchroniza-
tion are not an issue when using the adaptive equalization-based watermark detector
of Chapter 4. For the simpler linear prediction-based detector we present a timing
recovery mechanism based on the spectral line method which achieves near-optimal
performance.
Using a fixed frame grid and the embedding of preambles, the information-carrying
frames are detected in the presence of preamble bit errors with a ratio of up to 10 %.
Evaluated with the full watermarking system, the active frame detection performs near-
optimal with the overall bit error ratio increasing by less than 0.5 %-points compared
to ideal synchronization.
Part II of this thesis contains contributions to the domain of air traffic control (ATC).
After a brief overview of the application of speech watermarking in ATC and related
work in Chapter 6, the subsequent chapters present a radio channel measurement sys-
tem, database and model, an ATC speech corpus, and an evaluation of the robustness
of the proposed watermarking application.
Chapter 7 presents the ATCOSIM Air Traffic Control Simulation Speech corpus, a
speech database of ATC operator speech. ATCOSIM is a contribution to ATC-related
speech corpora. It consists of ten hours of speech data, which were recorded during
ATC real-time simulations. The database includes orthographic transcriptions and
additional information on speakers and recording sessions. The corpus is publicly
available and provided free of charge. Possible applications of the corpus are, among
others, ATC language studies, speech recognition and speaker identification, the design
of ATC speech transmission systems, as well as listening tests within the ATC domain.
Chapter 8 presents a system for measuring time-variant channel impulse responses
and a database of such measurements for the aeronautical voice radio channel. Maxi-
mum length sequences (MLS) are transmitted over the voice channel with a standard
aeronautical radio and the received signals are recorded. For the purpose of synchro-
nization, the transmitted and received signals are recorded in parallel to GPS-based
timing signals. The flight path of the aircraft is accurately tracked. A collection of
recordings of MLS transmissions is generated during flights with a general aviation
aircraft. The measurements cover a wide range of typical flight situations as well as
static back-to-back calibrations. The resulting database is available under a public
license free of charge.
Chapter 9 proposes a data model to describe the data in the TUG-EEC-Channels
database, and a corresponding estimation method. The model is derived from various
effects that can be observed in the database, such as different filter responses, a
time-variant gain, a sampling frequency offset, a DC offset and additive noise. To
estimate the model parameters, we compare six well-established FIR filter identification
techniques and conclude that best results are obtained using the method of Least
4
1.3. Contributions
Squares. We also provide simple methods to estimate and compensate the sampling
frequency offset and the time-variant gain.
The data model achieves a fit with the measured data down to an error of -40 dB,
with the modeling error being smaller than the channel’s noise. Applying the model
to select parts of the database, we conclude that the measured channel is frequency-
nonselective. The data contains a small amount of gain modulation (flat fading). Its
source could not be conclusively established, but several factors indicate that it is not a
result of radio channel fading. The observed noise levels are in a range from 40 dB to
23 dB in terms of SNR.
Chapter 10 shows the robustness of the proposed watermarking method in the aero-
nautical application using the channel model derived in Chapter 9. We experimentally
demonstrate the robustness of the method against filtering, desynchronization, gain
modulation and additive noise. Furthermore we show that pre-processing of the speech
signal with a dynamic range controller can improve the watermark robustness as well
as the intelligibility of the received speech.
Chapter 11 concludes both parts of this thesis and suggests directions for future
research.
1.3. Contributions
The main contributions of this thesis are:
• a novel speech watermarking scheme that outperforms current state-of-the-art
methods and is robust against AWGN, gain modulation, time-variant filtering
and desynchronization
– presented in Chapter 4 and Chapter 10
– published in [1][2][3]
• methods for watermark synchronization on the sampling, bit and frame level
– presented in Chapter 5
– published in [1][2]
• watermark capacity estimations for phase modulation based and frequency
masking based speech watermarking
– presented in Chapter 3
– recent results, unpublished
• a battery-powered portable measurement system for mobile audio channels
– presented in Chapter 8
– published in [4]
• a database of aeronautical voice radio channel measurements
5
Chapter 1. Introduction
– presented in Chapter 8
– published in [4]
– publicly available at http://www.spsc.tugraz.at/TUG-EEC-Channels
• a data model and an estimation technique for measured aeronautical voice radio
channel data
– presented in Chapter 9
– recent results, unpublished
• an air traffic control simulation speech corpus
– presented in Chapter 7
– published in [5]
– publicly available at http://www.spsc.tugraz.at/ATCOSIM
1.4. List of Publications
First-Author Publications
For the following papers, the author of this thesis did the major part of the theoretical
work, conducted all experiments, and did the major part in writing the paper.
[1] K. Hofbauer, G. Kubin, and W. B. Kleijn, “Speech watermarking for analog
flat-fading bandpass channels,” IEEE Transactions on Audio, Speech, and Language
Processing, 2009, revised and resubmitted.
[2] K. Hofbauer and H. Hering, “Noise robust speech watermarking with bit syn-
chronisation for the aeronautical radio,” in Information Hiding, ser. Lecture Notes
in Computer Science. Springer, 2007, vol. 4567/2008, pp. 252–266.
[3] K. Hofbauer and G. Kubin, “High-rate data embedding in unvoiced speech,” in
Proceedings of the International Conference on Spoken Language Processing (INTER-
SPEECH), Pittsburgh, PY, USA, Sep. 2006, pp. 241–244.
[4] K. Hofbauer, H. Hering, and G. Kubin, “A measurement system and the TUG-
EEC-Channels database for the aeronautical voice radio,” in Proceedings of the
IEEE Vehicular Technology Conference (VTC), Montreal, Canada, Sep. 2006, pp. 1–5.
[5] K. Hofbauer, S. Petrik, and H. Hering, “The ATCOSIM corpus of non-prompted
clean air traffic control speech,” in Proceedings of the International Conference on
Language Resources and Evaluation (LREC), Marrakech, Morocco, May 2008.
[6] K. Hofbauer and G. Kubin, “Aeronautical voice radio channel modelling and
simulation—a tutorial review,” in Proceedings of the International Conference on
Research in Air Transportation (ICRAT), Belgrade, Serbia, Jul. 2006.
[7] K. Hofbauer, H. Hering, and G. Kubin, “Speech watermarking for the VHF radio
channel,” in Proceedings of the EUROCONTROL Innovative Research Workshop (INO),
Brétigny-sur-Orge, France, Dec. 2005, pp. 215–220.
6
1.4. List of Publications
[8] K. Hofbauer and H. Hering, “Digital signatures for the analogue radio,” in
Proceedings of the NASA Integrated Communications Navigation and Surveillance
Conference (ICNS), Fairfax, VA, USA, 2005.
[9] K. Hofbauer, “Advanced speech watermarking for secure aircraft identification,”
in Proceedings of the EUROCONTROL Innovative Research Workshop (INO), Brétigny-
sur-Orge, France, Dec. 2004.
Co-Author Publications
For the following paper, the author of this thesis helped with and guided the theoretical
work, the experiments, and the writing of the paper.
[10] M. Gruber and K. Hofbauer, “A comparison of estimation methods for the VHF
voice radio channel,” in Proceedings of the CEAS European Air and Space Conference
(Deutscher Luft- und Raumfahrtkongress), Berlin, Germany, Sep. 2007.
The work of the following two papers is not included in this thesis. The author helped
with the implementation of the experiments and the writing of the papers.
[11] H. Hering and K. Hofbauer, “From analogue broadcast radio towards end-to-end
communication,” in Proceedings of the AIAA Aviation Technology, Integration, and
Operations Conference (ATIO), Anchorage, Alaska, USA, Sep. 2008.
[12] H. Hering and K. Hofbauer, “Towards selective addressing of aircraft with voice
radio watermarks,” in Proceedings of the AIAA Aviation Technology, Integration, and
Operations Conference (ATIO), Belfast, Northern Ireland, Sep. 2007.
7
Chapter 1. Introduction
8
Part I.
Speech Watermarking
9
Chapter 2
Background and Related Work
2.1. Digital Watermarking
Watermarking is the process of altering a work or data stream to embed an additional
message that is not or hardly perceptible. In digital watermarking, the work or data
stream to be altered is called the ‘host signal’, and may be an image, video, audio, text
or speech signal. A generic model for digital watermarking is shown in Figure 2.1. The
‘watermark’ is the embedded message and is expected to cause minimal perceptual
degradation to the host signal.
1
Any modification to the watermarked signal in
between the watermark embedding and the watermark detection constitutes a ‘channel
attack’ and might render the watermark undetectable. The watermark capacity is
the maximum information rate of the embedded message given the host signal, the
allowed perceptual distortion induced by the watermark, and the watermark detection
errors induced by the channel attack.
A brief and very accessible introduction to watermarking including a presentation
of early schemes can be found in [14]. An exhaustive treatment of the underlying
concepts of watermarking is provided in [13].
Practical applications for digital watermarking range from copy prevention and
traitor tracing to broadcast monitoring, archiving, and legacy system enhancement.
We only consider the case of blind watermarking, where the original host signal is
not available at the watermark detector. Most watermarking methods are subject to a
fundamental trade-off between watermark capacity, perceptual fidelity and robustness
against intentional or unintentional channel attacks. The selection of an appropriate
operating point is highly application-dependent.
1
In [13], the term ‘watermarking’ is used only if the embedded message is related to the host signal.
Otherwise, the term ‘overt embedded communications’ is used if the existence of an embedded
message is known, or ‘steganography’ if the existence of the embedded message is concealed.
11
Chapter 2. Background and Related Work
Payload
Message
Host Signal
Watermark
Embedder
Channel
Attacks
Received
Host Signal
Watermark
Detector
Detected
Payload
Figure 2.1.: Generic watermarking model.
2.2. Related Work—General Watermarking
In recent years, digital watermarking techniques achieved significant progress [13, 15].
Early methods considered watermarking as a purely additive process (Figure 2.2a),
with potential spectral shaping of the watermark signal (Figure 2.2b), or an additive
embedding of the watermark in a transform domain of the host signal (Figure 2.2c).
As an example, spread spectrum watermarking adds a pseudo-random signal vector
representing a watermark message to a transformation of the original host signal. This
type of system was used in many practical implementations and proved to be highly
robust, but suffered from the inherent interference between the watermark and the
host signal [16, 17, 18, 19].
Costa’s work on communication with side information [20] enabled a breach with
traditional additive watermarking. It allows to consider watermarking as a commu-
nication problem, with the host signal being side information that is available to the
embedder but not to the detector. The watermark channel capacity then becomes
C
W, Costa
=
1
2
log
2
_
1 +
σ
2
W
σ
2
N
_
(2.1)
and depends solely on the watermark-to-attack-noise power ratio
σ
2
W
σ
2
N
, assuming in-
dependent identically distributed (IID) host data, a power-constrained watermark
signal and an additive white Gaussian noise (AWGN) channel attack [20]. The capacity
is independent of the host signal power σ
2
H
, and thus watermarking without host
signal interference is possible [21, 22]. Striving to approach this capacity with practical
schemes, quantization-based methods embed the information by requantizing the
signal (Figure 2.2d) or a transform domain representation (Figure 2.2e) using different
vector quantizers or dither signals that depend on the watermark data. These methods
are often referred to as quantization index modulation (QIM). Reducing the embedding
dimensionality of distortion-compensated QIM to one (sample-wise embedding) and
using scalar uniform quantizers in combination with a watermark scale factor results in
the popular scalar Costa scheme (SCS) [23]. SCS is optimally robust against the AWGN
attack, but is vulnerable to amplitude scaling. A number of methods, for example
the rational dither modulation scheme [24], have been proposed to counteract this
vulnerability. However, quantization-based methods are still sensitive to many simple
signal processing operations such as linear filtering, non-linear operations, mixing or
resampling, which are the subjects of numerous recent investigations (e.g. [25], [26]).
12
2.2. Related Work—General Watermarking
w
Q
ŝ s
(d)
s ŝ
w
T 1/T
s ŝ e
w
ê
H
s ŝ
w
(a) (c) (b)
(e)
T 1/T
s ŝ
Q
e
w
ê
(g) (f)
ŝ e
T 1/T
s
w
ê
Mod.
w
ŝ s
1/T T
ê
Figure 2.2.: A watermarked signal ˆ s(n) is generated from the original speech signal
s(n) and the watermark data signal w(n) by (a) adding the data to s(n), (b)
adding adaptively filtered data to s(n), (c) adding the data to a transform
domain signal representation e(n), (d) quantizing s(n) according to the
data, (e) quantizing the transform e(n) according to the data, (f) modulating
e(n) with the data, and, as proposed in this work, (g) replacing parts of
e(n) by the data.
13
Chapter 2. Background and Related Work
2.3. Related Work—Speech Watermarking
Most watermarking methods use a perceptual model to determine the permissible
amount of embedding-induced distortion. Many audio watermarking algorithms use
auditory masking, and most often frequency masking, as the perceptual model for
watermark embedding. While this is a valid approach also for speech signals, we
claim that capacity remains unused by not considering the specific properties of speech
and the way it is perceived. It is thus indispensable to tailor a speech watermarking
algorithm to the specific signal.
In speech watermarking, as in audio watermarking, spread-spectrum type systems
used to be the method of choice for robust embedding [27, 28]. Various quantization-
based methods have been proposed in recent years, applying QIM or SCS to autore-
gressive model parameters [29, 30], the pitch period [31], discrete Hartley transform
coefficients [32], and the linear prediction residual [33]. The achievable bit rates range
from a few bits to a few hundred bits per second, with varying robustness against
different types of attacks. In a generalization of QIM, some methods modulate the
speech signal or one of its components according to the watermark data (Figure 2.2f)
by estimating and adjusting the polarity of speech syllables [34], by modulating the
frequency of selected partials of a sinusoidal speech or audio model [35, 36], or by
modifying the linear prediction model coefficients of the speech signal using long anal-
ysis and synthesis windows [37]. Taking quantization and modulation one step further,
one can even replace certain speech components by a perceptually similar watermark
signal (Figure 2.2g). Methods have been proposed that exchange time-frequency com-
ponents of audio signals above 5 kHz that have strong noise-like properties by a spread
spectrum sequence [38], or replace signal components that are below a perceptual
masking threshold by a watermark signal [39, 40].
The above-mentioned algorithms are limited either in terms of their embedding
capacity or in their robustness against the transmission channel attacks expected in
the aeronautical application. This is due to the fact that the methods are designed
for particular channel attacks (such as perceptual speech coding), focus on security
considerations concerning hostile channel attacks, or fail to thoroughly exploit state-of-
the-art watermarking theory or the characteristic features of speech perception.
2.4. Our Approach
In our approach, we combine the watermarking theory presented in the preceding
sections with a well-known principle of speech perception, leading to a substantial
improvement in theoretical capacity. From this we develop in Chapter 4 a practical
watermarking scheme that is based on the above concept of replacing signal compo-
nents (see Section 2.3) and using common speech coding and digital communications
techniques.
The watermark power σ
2
W
in (2.1) can be interpreted as a mean squared error (MSE)
host signal distortion constraint and essentially determines the watermark capacity.
However, it has previously been shown that MSE is not a suitable distortion measure for
14
2.4. Our Approach
speech signals [41], as for example flipping the polarity of the signal results in a large
MSE but in zero perceptual distortion. It is a long-known fact that, in particular for
non-voiced speech (denoting all speech that is not voiced, comprising unvoiced speech
and pauses) and blocks of short duration, the ear is insensitive to the signal’s phase
[42]. This effect is also exploited in audio coding for perceptual noise substitution [43].
Thus, instead of the MSE distortion (or watermark power σ
2
W
) constraint, we propose
to constrain the watermarked signal to have the same power spectral density (PSD) as
the host signal (with power σ
2
H
) but allow arbitrary modification (or replacement) of
the phase. Compared to MSE this PSD constraint is far less restrictive and results in a
watermark channel capacity
C
W, Phase

1
4
log
2
_
σ
2
H
σ
2
N
_
+
1
4
log
2
_

e
_
. (2.2)
This high SNR approximation is derived in Section 3.4.2. Note that in contrast to (2.1)
the capacity is no longer determined by the MSE distortion to channel noise ratio
σ
2
W
σ
2
N
,
but by the host signal to channel noise ratio or SNR
σ
2
H
σ
2
N
, and the watermark signal has
the same power as the host signal.
While watermarking in the phase domain is not a completely new idea, previously
proposed methods either require the availability of the original host signal at the
detector [44], are not suitable for narrowband speech [45, 46], or are restricted to
relatively subtle phase modifications [14, 47, 48]. Also, the large theoretical watermark
capacity given by (2.2) has not been recognized before.
To obtain a perceptually transparent and practical implementation of our phase
embedding approach, we assume an autoregressive (AR) speech signal model and
constrain the AR model spectrum and the temporal envelope to remain unchanged.
There is a one-to-one mapping between the signal phase and the model’s excitation,
which allows us to modify the signal phase by modifying the excitation. Applying the
concept of replacing certain signal components of Section 2.3, we exchange the model
excitation by a watermark signal. We do so in non-voiced speech only, since in voiced
speech certain phase spectrum changes are in fact audible [49, 50].
15
Chapter 2. Background and Related Work
16
Chapter 3
Watermark Capacity in Speech
This chapter investigates the theoretical watermark capacity in speech given
an additive white Gaussian noise transmission channel. Starting from the
general capacity of the ideal Costa scheme, it shows that a large improve-
ment in theoretical watermark capacity is possible if the application at
hand allows watermarking in perceptually irrelevant signal components.
Different ways to derive approximate and exact expressions for the water-
mark capacity when modulating the signal’s DFT phase are presented and
compared to an estimation of the capacity when modulating host signal
frequency components that are perceptually masked. Parameters required
for the experimental comparison of both methods are estimated from a
database of ATC speech signals.
This chapter presents recent results that have not yet been published.
17
Chapter 3. Watermark Capacity in Speech
3.1. Introduction
In this chapter, we aim to estimate the watermark capacity in speech given an ad-
ditive white Gaussian noise (AWGN) channel attack. Using well-established speech
perception properties, we show that the watermark capacity in speech far exceeds the
watermark capacity of conventional quantization-based watermarking. We achieve this
improvement by proposing a clear distinction between watermarking in perceptually
relevant signal components (or domains) and watermarking in perceptually irrelevant
signal components.
Given that in the past most watermarking research was driven by the need of the
multimedia content industry for copyright protection systems, many algorithms aim at
a robustness against perceptual coding attacks. Since perceptual coding aims to remove
perceptually irrelevant signal components, watermarking must occur in perceptually
relevant components. However, there is a limit to how much perceptually relevant
components can be altered without degrading perceptual quality to an unacceptable
level. It is within this context where quantization-based watermarking and its capacity
evolved, and its capacity is reviewed in Section 3.2.
In watermarking for analog legacy system enhancement, robustness against per-
ceptual coding is not a concern. Besides watermarking in perceptually relevant
components, watermarking in perceptually irrelevant signal components is possible,
which opens a door to significant capacity improvements. A similar discussion can
be found in [13], but otherwise the important distinction between watermarking in
perceptually relevant or irrelevant components is seldom brought forward.
Given the aeronautical application presented in Chapter 6, we focus on watermarking
in perceptually irrelevant components and derive the watermark capacities in speech
on the basis of frequency masking (Section 3.3) and phase perception (Section 3.4). An
experimental comparison of the different capacities is performed in Section 3.5.
In the remainder of this chapter, all capacities C are given in bits per sample (or
bits per independent symbol, equivalently). Thus, the capacity specifies how many
watermark bits can be embedded in an independent host signal sample or symbol. For
a band-limited channel, the maximum number of independent symbols per second
(the maximum symbol rate or Nyquist rate) is twice the channel bandwidth (e.g., [51]).
Consequently, the maximum number of watermark bits transmitted per second is C
times twice the channel bandwidth in Hz.
3.2. Watermarking Based on Ideal Costa Scheme
As a basis for comparison, we start with a brief review of the capacity of watermarking
in perceptually relevant signal components and the Ideal Costa Scheme (ICS) [13]. The
ICS, also termed ‘dirty paper watermarking’, ‘watermarking with side information’
or ‘informed embedding’ is the underlying principle for most quantization-based
methods, and it is possible to derive an achievable upper bound for the watermark
capacity.
In quantization-based watermarking the host signal is quantized using different vec-
18
3.3. Watermarking Based on Auditory Masking
Q
ŝ s
Watermark
Signal Gen.
s ŝ
m
w
m
Figure 3.1.: Quantization-based watermarking as a host-signal-adaptive additive pro-
cess.
tor quantizers that depend on the watermark data. Figure 3.1 depicts the quantization
as an addition of a host signal dependent watermark, which has a power constraint σ
2
W
that corresponds to an MSE criterion for the distortion of the host signal. We review in
the following the watermark capacity derivation given this MSE distortion criterion.
Assuming a white Gaussian host signal vector s of length L with variance σ
2
H
, the
watermarked signal ˆ s must lie within an (L −1)-sphere W centered around s with
volume
V
W
=
π
L/2
r
L
W
Γ(
L
2
+1)
and radius r
W
=
_

2
W
. The Gamma function Γ is defined as
Γ(z +1) = zΓ(z) = z
ˆ

0
t
z−1
e
−t
dt.
The transmitted signal is subject to AWGN with variance σ
2
N
, and in high dimensions
the received signal vector lies with high probability inside a hypersphere N with radius
r
N
=
_

2
N
centered at ˆ s [52]. Consequently, all possible watermarked signals within
W lie after transmission within a hypersphere Y with radius r
Y
=
_
L(σ
2
W
+ σ
2
N
) (see
Figure 3.2). The number of distinguishable watermark messages given the channel
noise is thus
V
Y/V
W
, which roughly means the number of ‘noise spheres’ N that fit into
the ‘allowed-distortion sphere’ W. The watermark capacity evaluates to [51, 52, 13]
C
ICS
=
1
L
log
2
_
V
Y
V
W
_
=
1
2
log
2
_
1 +
σ
2
W
σ
2
N
_
bit/sample (3.1)
and is inherently limited by the MSE distortion constraint.
3.3. Watermarking Based on Auditory Masking
Auditory masking describes the psychoacoustical principle that some sounds are not
perceived in the temporal or spectral vicinity of other sounds [53]. In particular,
frequency masking (or simultaneous masking) describes the effect that a signal is not
19
Chapter 3. Watermark Capacity in Speech
ŝ
s
w
s
i
s
i+1
Sphere Y with radius r
Y
Sphere W with radius r
W
Sphere N with radius r
N
Reconstruction point
of quantization cell
Figure 3.2.: Quantization-based watermarking and the Ideal Costa Scheme shown as
sphere packing problem. Each reconstruction point ( ) within the sphere
W denotes a different watermark message.
audible in the presence of a simultaneous louder masker signal at nearby frequencies.
Figure 3.3 shows the spectrum of a signal consisting of three sinusoids, and the
masking threshold (or masking curve) derived using the van-de-Par masking model
[54]. According to the model, sinusoidal components below the masking threshold are
not audible because they are masked by the components above the masking threshold.
Perceptual models in general, and auditory masking in particular, are commonly
used in practical audio watermarking schemes, and the principles are well understood
[13, Ch. 8]. Besides more traditional approaches such as the spectral shaping of spread
spectrum watermarks (e.g., [55, 28]), the principle of auditory masking is exploited
either by varying the quantization step size or embedding strength in one way or the
other (e.g., [56, 57]), or by removing masked components and inserting a watermark
signal in replacement (e.g., [39]).
The remainder of this subsection derives an estimate of the theoretical watermark
capacity that can be achieved based on frequency masking.
3.3.1. Theory
In quantization-based watermarking, the watermark signal corresponds to the quanti-
zation error induced by the watermarking process, and the watermark power σ
2
W
in
(3.1) represents a MSE distortion criterion on the speech signal. If the watermark signal
(or quantization error) is white, the watermark is perceived as background noise and
the permissible level (the watermark power σ
2
W
) is application-dependent. However,
the watermark signal is inaudible if it is kept below the masking threshold. One can in-
crease the watermark power to σ
2
W,mask
, and thus the watermark capacity, by spectrally
shaping the watermark signal according to the signal-dependent masking threshold.
According to the masking model, the watermark signal is not audible as long as the
spectral envelope of the watermark signal is at or below the masking threshold. The
same approach is frequently applied in perceptual speech and audio coding: The
20
3.3. Watermarking Based on Auditory Masking
0 500 1000 1500 2000 2500 3000 3500 4000
0
10
20
30
40
50
60
70
Frequency (in Hz)
D
F
T

m
a
g
n
i
t
u
d
e

(
i
n

d
B
S
P
L
)


Signal spectrum
Masking curve
Figure 3.3.: Masking threshold for a signal consisting of three sinusoids, the second of
them being masked and not audible.
quantization step-size is chosen adaptively such that the resulting quantization noise
remains below the masking threshold [58, 59]. Using (3.1), the watermark capacity
results in
C
Mask
=
1
2
log
2
_
1 +
σ
2
W,mask
σ
2
N
_
bit/sample (3.2)
with typically σ
2
W,mask
¸ σ
2
W
.
3.3.2. Experimental Results
We estimate the permissible watermark power σ
2
W,mask
and, thus, the watermark
capacity, with a small experiment. Using ten randomly selected utterances (one per
speaker) of the ATCOSIM corpus (see Chapter 7), we calculate the masking thresholds
in frames of 30 ms (overlapping by 50%). In a frequency band from 100 Hz to 8000 Hz
we measure the average power σ
2
H
of the speech signal as well as the permissible power
σ
2
W,mask
of the watermark signal, with the spectral envelope of the watermark signal set
to a factor κ below the signal-dependent masking threshold. The masking thresholds
are calculated based on [54], using a listening-threshold-in-quiet as provided in [60],
and a listening level of 82.5 dB
SPL
, which corresponds to a comfortable speech signal
level for normal listeners [61].
While the used masking model describes the masking of sinusoids by sinusoidal
or white noise maskers, in this experiment the maskee is a wideband noise signal,
which would not be entirely masked. To compensate for this shortcoming, the maskee
signal is scaled to be a factor κ below the calculated masking threshold. The maximum
value of κ at which the maskee signal is still inaudible was determined in an informal
experiment. We generated a noise signal that is uncorrelated to the speech signal
21
Chapter 3. Watermark Capacity in Speech
(a) Time-Domain Speech Signal
0 1 2 3 4 5 6
−1
0
1
A
m
p
l
i
t
u
d
e
Time (in s)
(b) Masked-to-signal power ratio
σ
2
W,mask/σ
2
H
0 1 2 3 4 5 6
−16
−14
−12
−10
−8
M
a
s
k
e
d

t
o

s
i
g
n
a
l
p
o
w
e
r

r
a
t
i
o

(
i
n

d
B
)
Time (in s)
Figure 3.4.: Scaled masking threshold power to signal power ratio
σ
2
W,mask/σ
2
H
using
κ = −12 dB for the utterance “swissair eight zero six contact rhein radar
one two seven decimal three seven”.
and has a spectral envelope that corresponds to the masking threshold of the speech
signal. This can be done by either setting the DFT magnitude of the watermark signal
to the masking threshold and randomizing the DFT phase, or by generating in the
DFT domain a complex white Gaussian noise with unit variance and multiplying this
with the masking threshold. The resulting noise signal was added to the speech signal
with varying gains κ, and the maximum value of κ at which the noise is still masked
was found to be κ ≈ −12 dB, corresponding to one fourth in amplitude or one 16th in
terms of power.
The masked-to-signal power ratio
σ
2
W,mask/σ
2
H
for all frames of an utterance is shown in
Figure 3.4. Averaged over all frames and utterances, the permissible watermark power
is σ
2
W,mask
≈ κσ
2
H
. The estimated average watermark capacity using (3.2) then results in
C
Mask

1
2
log
2
_
1 + κ
σ
2
H
σ
2
N
_

1
2
log
2
_
1 +
1
16
σ
2
H
σ
2
N
_
. (3.3)
Compared to (3.1), C
Mask
does not depend on an MSE distortion criterion but is given
by the transmission channel’s signal-to-noise ratio (SNR)
σ
2
H
σ
2
N
.
22
3.4. Watermarking Based on Phase Modulation
3.4. Watermarking Based on Phase Modulation
We consider the case of watermarking by manipulating the phase spectrum of unvoiced
speech, while keeping the power spectrum unchanged. This means that instead of an
MSE distortion constraint we constrain ˆ s to have the same power spectrum as s, but
allow arbitrary modification of the phase.
The following subsections present four different ways to derive the watermark
capacity: First, two high SNR approximations are presented, one using the information
theoretic definition of capacity, and one using sphere packing. Then, two exact solutions
are derived, one based on the joint probability density function (PDF) between input
and output phase angle, and one based on the conditional PDF of the complex output
symbol given the complex input symbol.
3.4.1. Preliminaries
3.4.1.1. Definitions of Phase
The notion of a signal’s phase is frequently used in literature. However, there is no
standard definition, and the intended meaning of phase widely varies depending
on the underlying concepts. In general, the phase of a signal is a quantity that is
closely linked to a certain signal representation. Depending on the most suitable
representation, we deal in the remainder of this thesis with three different ‘phases’,
which are interrelated but nevertheless represent different quantities. The three phase
definitions have in common that they represent a signal property that can be modified
without modifying some sort of a spectral envelope of the signal.
AR Model Phase Primarily, we consider a particular realization of a white Gaus-
sian excitation signal of an autoregressive (AR) speech signal model as defined in
Section 4.1.1 as the phase of a signal. It was previously shown that for unvoiced
speech the ear is not able to distinguish between different realizations of the Gaussian
excitation process as long as the spectral and temporal envelope of the speech signal is
maintained [42]. This notion of phase is used in our practical watermarking scheme of
Chapter 4.
DFT Phase The DFT phase denotes the argument (or phase angle) of the complex
coefficient of discrete Fourier transform (DFT) domain signal model as defined in
Section 3.4.1.2. A mapping of an AR model phase to a DFT phase requires a short-term
Fourier transform (STFT) of the signal using a well-defined transform size, window
length and overlap. The following capacity derivations consider only a single DFT
frame of length L of the speech signal.
MLT and ELT Phase In Section 3.4.8 modulated and extended lapped transform
(MLT and ELT) signal representations are applied. In contrast to the DFT coefficients,
the lapped transform coefficients are real and no direct notion of phase exists. Inter-
preting lapped transforms as filterbanks, we consider the variances of the subband
23
Chapter 3. Watermark Capacity in Speech
signals within a short window as the quantities that determine the spectral envelope,
and consider the particular realizations of the subband signals as the phase of the
signal. Thus, modifying the MLT or ELT phase of a signal means replacing the orig-
inal subband signals by different subband signals with identical short-term power
envelopes.
3.4.1.2. DFT Domain Signal Model
For all capacity derivations of Section 3.4 (except Section 3.4.8) we consider the discrete
Fourier transform (DFT) of a single frame of length L of an unvoiced speech signal,
and modify the DFT phase while keeping the DFT magnitude constant. We use a
1

L
weighting for the DFT and its inverse, which makes the DFT a unitary transform. Each
frame is assumed to have L complex coefficients
˜
X, of which only
L
2
are independent.
1
The tilde mark (˜) denotes complex variables, e.g., ˜ x = re

= a + jb, whereas boldface
letters denote real valued vectors, with s, ˆ s and n representing a single frame of the
host signal, the watermarked signal and the channel noise, respectively. Considering
the real and imaginary parts of
˜
X as separate dimensions, the DFT representation has
2L real coefficients R, where the L odd-numbered dimensions represent the real part
R
rl
and and the L even-numbered dimensions represent the imaginary part R
im
of the
complex DFT coefficients
˜
X = R
rl
+ jR
im
. Assuming a white host signal with variance
σ
2
H
, each complex DFT coefficient
˜
X has variance σ
2
H
, and each real coefficient R has
variance
σ
2
H
2
.
3.4.1.3. Watermarking Model
In order to facilitate theoretical watermark capacity derivations, we model the AR
model phase replacement performed in [42] and Chapter 4 by replacing the phase of
a DFT domain signal representation. An input phase angle Φ, which represents the
watermark signal, is imposed onto a DFT coefficient of the host signal s. While its orig-
inal magnitude r
H
is maintained, the phase angle is replaced by the watermark phase
angle Φ, resulting in the watermarked DFT coefficient
˜
X = r
H
e

. The watermarked
signal is subject to AWGN n with DFT coefficient
˜
N, phase angle Ψ and variance σ
2
N
,
resulting in the noisy watermarked signal ˆ s +n with DFT coefficient
˜
Y =
˜
X +
˜
N and
output phase angle Θ.
3.4.1.4. PDFs of the Random Variables
In the following paragraphs, we derive the PDFs of the involved random variables
(RV).
1
For L even, the coefficients representing DC and Nyquist frequency are real-valued, and
L
2
+1 coeffi-
cients are independent. For odd L, the DC coefficient is real, no Nyquist frequency coefficient exists,
and
L+1
2
coefficients are independent. As a simplification we assume in the following
L
2
independent
complex coefficients, which for large L leads to a negligible difference in the capacity estimates.
24
3.4. Watermarking Based on Phase Modulation
Input Phase Φ (scalar RV) Given the DFT domain signal model, the input phase
angle is restricted to the interval [0, 2π[. To maximize its differential entropy h, which
maximizes the watermark channel capacity C, Φ is assumed to be uniformly distributed
with PDF
f
Φ
(ϕ) =
_
1

0 ≤ ϕ < 2π
0 else
and differential entropy
h(Φ) = log
2
(2π).
Noise Coefficient
˜
N (complex RV) Each complex DFT coefficient
˜
N
c
, the subscript
c denoting Cartesian coordinates, of a complex zero-mean WGN signal with variance
σ
2
N
is characterized by
˜
N ∼ (^(0, σ
2
N
),
f
˜
N
c
( ˜ x) =
1
πσ
2
N
exp
_

[ ˜ x[
2
σ
2
N
_
with [ ˜ x[
2
= a
2
+ b
2
= r
2
, and (3.4)
h(
˜
N) = log
2
(πeσ
2
N
).
But f
˜
N
( ˜ x) is really just a short-hand notation for the joint PDF f
N
a
,N
b
(a, b) of the
independent marginal densities
N
a
∼ ^(0,
σ
2
N
2
) =
1
_
πσ
2
N
exp
_

a
2
σ
2
N
_
N
b
∼ ^(0,
σ
2
N
2
) =
1
_
πσ
2
N
exp
_

b
2
σ
2
N
_
f
˜
N
c
( ˜ x) = f
N
a
,N
b
(a, b) = f
N
a
(a) f
N
b
(b)
and h(N
a
) = h(N
b
) =
1
2
log
2
(πeσ
2
N
). Note that h(
˜
N) = h(N
a
) + h(N
b
), since the
joint differential entropy is the sum of the differential entropies if the variables are
independent.
The density f
˜
N
c
( ˜ x) is not the density in polar coordinates, i.e. f
˜
N
p
(r, ϕ) ,= f
˜
N
c
( ˜ x),
because a transformation of variables is required. This results in [62, p. 201 ff.][63]
f
N
ϕ
(ϕ) =
_
1

0 ≤ ϕ < 2π
0 else
f
N
r
(r) =
r
σ
2
exp
_

r
2

2
_
=
2r
σ
2
N
exp
_

r
2
σ
2
N
_
(3.5)
25
Chapter 3. Watermark Capacity in Speech
where f
N
r
(r) is the Rayleigh distribution with parameter σ (in our case with σ
2
=
σ
2
N
2
)
and mean and variance
µ
N
r
= σ
_
π
2
=
σ
N
2

π
σ
2
N
r
= (2 −
π
2

2
= (1 −
π
4

2
N
.
In this special case, comparing (3.4) and (3.5) results in f
N
r
(r) = 2πr f
˜
N
c
( ˜ x). Also, N
r
and N
ϕ
are independent, and their joint distribution is [62, p. 258]
f
˜
N
p
( ˜ x) = f
N
r
,N
ϕ
(r, ϕ) = f
N
r
(r) f
N
ϕ
(ϕ) =
r
πσ
2
N
exp
_

r
2
σ
2
N
_
(3.6)
and f
˜
N
p
( ˜ x) = r f
˜
N
c
(r). This polar representation can be derived from (3.4) by consid-
ering the polar coordinates (r, ϕ) as a vector function (or transform) of the Cartesian
coordinates (a, b), with
r =
_
a
2
+ b
2
ϕ = arctan
b
a
and the inverse transformation
a = r cos ϕ
b = r sin ϕ.
The Jacobian determinant J of the transform is
J
N
=
∂(a, b)
∂(r, ϕ)
=
¸
¸
¸
¸
cos ϕ −r sin ϕ
sin ϕ r cos ϕ
¸
¸
¸
¸
= r
and the joint density is
f
N
r
,N
ϕ
(r, ϕ) = J
N
f
N
a
,N
b
(r cos ϕ, r sin ϕ) =
r
πσ
2
N
exp
_

r
2
σ
2
N
_
.
Signal Coefficient
˜
X (complex RV) We consider a single complex DFT coefficient
˜
X = r
H
e

with r
H
being a deterministic constant. In polar coordinates, the PDFs of the indepen-
dent RVs X
ϕ
and X
r
are
f
X
ϕ
(ϕ) =
_
1

0 ≤ ϕ < 2π
0 else
f
X
r
(r) = δ(r −r
H
)
f
X
r
,X
ϕ
(r, ϕ) =
_
1

δ(r −r
H
) 0 ≤ ϕ < 2π
0 else
26
3.4. Watermarking Based on Phase Modulation
and describe a two-dimensional circularly symmetric distribution [63].
With the transform
a = r cos ϕ
b = r sin ϕ
and the inverse transform
r =
_
a
2
+ b
2
ϕ = arctan
b
a
and its Jacobian
J
X
=
∂(r, ϕ)
∂(a, b)
=
¸
¸
¸
¸
¸
∂r
∂a
∂r
∂b
∂ϕ
∂a
∂ϕ
∂b
¸
¸
¸
¸
¸
=
1

a
2
+ b
2
the joint density in Cartesian coordinates is
f
X
a
,X
b
(a, b) = J
X
f
X
r
,X
ϕ
_
_
a
2
+ b
2
, tan
b
a
_
=
1


a
2
+ b
2
δ
_
_
a
2
+ b
2
−r
H
_
=
=
1
2πr
H
δ
_
_
a
2
+ b
2
−r
H
_
(3.7)
and the dependent marginal densities in Cartesian coordinates are
f
X
a
(a) =
_
_
_
1
πr
H
_
1 −(
a
r
H
)
2
_

1
2
for [a[ ≤ r
H
0 else
(3.8)
f
X
b
(b) =
_
_
_
1
πr
H
_
1 −(
b
r
H
)
2
_

1
2
for [b[ ≤ r
H
0 else.
The marginal densities are derived as follows. The arc length ∆c within an infinites-
imally small strip of width ∆a as shown in Figure 3.5 can be approximated by the
length of the tangent within the strip. Using
sin β =
A
r
H
=
f
∆c
=
_
c
2
−(∆a)
2
∆c
the arc length results in
∆c =
_
1 −
_
A
r
H
_
2
_

1
2
∆a.
The probability T that a point lies within the strip ∆a is the arc length ∆c times the
density
1
2πr
H
of the arc, times a factor of 2 to also account for the lower semicircle,
T =
1
πr
H
∆c,
27
Chapter 3. Watermark Capacity in Speech
A
Δa
Δa
f
Δc
r
H
β
β
a
b
Figure 3.5.: The joint density f
X
a
,X
b
(a, b) integrated over an interval ∆a.
and the probability density
T
∆a
gives in the limit ∆a → 0 the marginal PDF
f
X
a
(a) =
ˆ

b=−∞
f
X
a
,X
b
(a, b)db =
_
¸
_
¸
_
1
πr
H
_
1 −
_
a
r
H
_
2
_

1
2
for [a[ ≤ r
H
0 else.
The marginal density integrates to 1 since
ˆ

a=−∞
f
X
a
(a)da =
1
πr
H
ˆ
r
H
a=−r
H
_
1 −
_
a
r
H
_
2
_

1
2
da =
=
1
πr
H
_
r
H
arcsin
_
a
r
H
__
r
H
−r
H
= 1.
Noisy Signal Coefficient
˜
Y (complex RV) The noise-corrupted watermarked coeffi-
cient
˜
Y is a sum of two independent complex RV, with
˜
Y =
˜
X +
˜
N = r
Y
e

, and
f
˜
Y
( ˜ x) = f
˜
Y
(r) = f
˜
X
( ˜ x) ∗∗f
˜
N
( ˜ x)
where ∗∗ denotes 2D-convolution. In Cartesian coordinates the summation of the two
complex RVs can be carried out on a by component basis, with
Y
a
= X
a
+ N
a
Y
b
= X
b
+ N
b
.
28
3.4. Watermarking Based on Phase Modulation
Using (3.8), the density of the first sum is
f
Y
a
(a) = f
X
a
(a) ∗ f
N
a
(a) =
ˆ

m=−∞
f
X
a
(m) f
N
a
(a −m)dm =
=
ˆ

m=−∞
f
X
a
(a −m) f
N
a
(m)dm,
which evaluates to
f
Y
a
(a) =
1
πr
H
1
_
πσ
2
N
ˆ
m
_
1 −
_
m
r
H
_
2
_

1
2
exp
_

(m−a)
2
σ
2
N
_
dm.
No analytic solution to this integral is known. The expression for f
Y
b
(b) is equivalent.
Since X
a
and X
b
are dependent, Y
a
and Y
b
are also dependent, and the joint density
is not the product of the marginal densities. It appears difficult to obtain closed-form
expressions for the joint and conditional PDFs of
˜
X and
˜
Y. However, as shown in
Section 3.4.5 and Section 3.4.6, the conditional PDF f
˜
Y[
˜
X
( ˜ y[ ˜ x) can be easily stated given
the circular arrangement of
˜
X and
˜
Y.
Output Phase Θ (scalar RV) The output phase Θ of the observed watermarked
signal
˜
Y is
Θ = arg(
˜
Y) = arctan
Y
b
Y
a
= Y
ϕ
,
and is again uniformly distributed, because the input phase is uniformly distributed
and the channel noise is Gaussian. Thus,
f
Θ
(ϕ) =
_
1

0 ≤ ϕ < 2π
0 else
and
h(Θ) = log
2
(2π).
3.4.1.5. Approaches to Capacity Calculation
Given the DFT-based signal and watermarking model, two watermark capacities can
be calculated, one based on the channel input/output phase angles Φ and Θ, and
one based on the channel input/output complex symbols
˜
X and
˜
Y. Even though
information is only embedded in the phase angle Φ of
˜
X, the two capacities are
not identical, since in the later case the watermark detector can also respond to the
amplitude of the received signal.
The capacity is defined as the mutual information between the channel input and
the channel output, maximized over all possible input distributions, and
C =
1
2
sup
f
˜
X
( ˜ x)
I(
˜
X,
˜
Y) or C =
1
2
sup
f
Φ
(ϕ)
I(Φ, Θ). (3.9)
The additional factor
1
2
compared to the usual definition results from only half of the
complex DFT coefficients
˜
X being independent. We calculate the capacity based on:
29
Chapter 3. Watermark Capacity in Speech
1. the sphere packing analogy of Section 3.2 (used in Section 3.4.3);
2. the difference in differential entropy between
˜
Y and
˜
N (used in Section 3.4.2);
3. the formal definition of I based on the input’s PDF and the conditional PDF of
the output given the input (used in Section 3.4.4, 3.4.5 and 3.4.6).
3.4.1.6. Colored Host Signal Spectrum
While most of the following derivations assume a white host signal spectrum, the
extension to a colored host signal is straightforward and shown for the two high SNR
approximations of Section 3.4.2 and 3.4.3. The capacity can be calculated individually
for each DFT channel, and the overall capacity is the sum of the subchannel capacities.
We introduce band power weights p
i
, with
L
/2

i=1
p
i
=
L
2
(3.10)
, which effectively scale the SNR of the i’th subchannel from
σ
2
H
σ
2
N
to p
i
σ
2
H
σ
2
N
. Using a white
host signal, ∀i : p
i
= 1. The discrete band power weights p
i
could also be considered as
the samples of a scaled power spectral density p( f ) of the host signal. Substituting L
by f
s
and considering (3.10) as Riemann sum results equivalently in
ˆ
f
s
/2
f =0
p( f )df =
f
s
2
. (3.11)
For a white host signal, p( f ) ≡ 1 as in the discrete case.
3.4.2. High SNR Capacity Approximation Using Mutual Information
We first consider the complex DFT coefficient
˜
X of a white host signal with power σ
2
H
.
The Gaussian channel noise
˜
N with power σ
2
N
changes both the amplitude and the
phase of the received signal. The ‘displacement’ N of
˜
X in angular direction caused
by
˜
N corresponds to a phase change tan(ϕ) =
N
σ
H
. In the case of high SNR, the noise
component of
˜
N in angular direction is a scalar Gaussian random variable N with
power
σ
2
N
2
, and we can also assume ϕ ≈ tan(ϕ). The phase noise Ψ induced by the
channel is then also Gaussian, with
Ψ ∼ ^
_
0,
σ
2
N/2
σ
2
H
_
.
Given a transmitted phase Φ, the observed phase Θ is then Θ = Φ+Ψ, wrapped into
0 . . . 2π.
The channel capacity C
Phase, MI-angle
is the maximum mutual information I(Φ; Θ) over
all distributions of Φ. The mutual information is maximized when Φ (and consequently
30
3.4. Watermarking Based on Phase Modulation
also Θ since Ψ is Gaussian) is uniformly distributed between 0 and 2π as defined
above. The maximum mutual information is then given by
I(Φ; Θ) = h(Θ) −h(Θ[Φ)
= h(Θ) −h(Θ−Φ[Φ) = h(Θ) −h(Ψ[Φ) = h(Θ) −h(Ψ)
since Φ and Ψ are independent. The mutual information I(Φ, Θ) between the received
phase Θ and the transmitted phase Φ then evaluates to
I(Φ, Θ) = h(Θ) −h(Ψ) = log
2
(2π) −
1
2
log
2
(2πe
σ
2
N

2
H
)
=
1
2
log
2
_
σ
2
H
σ
2
N
_
+
1
2
log
2
_

e
_
,
and with only half of the complex coefficients
˜
X being independent we obtain
C
Phase, MI-angle, high-SNR
=
1
2
I(Φ, Θ) =
1
4
log
2
_
σ
2
H
σ
2
N
_
+
1
4
log
2
_

e
_
, (3.12)
with
1
4
log
2
_

e
_
≈ 0.55 bit. The subscript
Phase, MI-angle, high-SNR
denotes a high-SNR
approximation of the phase modulation watermark capacity derived using the mutual
information (MI) between the input and output phase angles. The same result is
obtained elsewhere for phase shift keying (PSK) in AWGN, but requiring a longer
proof [64].
For a non-white host signal with band power weights p
i
, the capacity results in
C
Phase, MI-angle, high-SNR, colored
=
1
2L
log
2
_
L/2

i=1
p
i
σ
2
H
σ
2
N
_
+
1
4
log
2
_

e
_
=
=
1
4
log
2
_
σ
2
H
σ
2
N
_
+
1
4
log
2
_

e
_
+
1
2L
L/2

i=1
log
2
(p
i
) .
The factor
1
2L
results from the factor
1
4
of (3.12) and a factor
1
L/2
given by the number of
product or summation terms. Using the scaled host signal PSD p( f ) of Section 3.4.1.6
instead of the discrete band power weights p
i
results in
C
Phase, MI-angle, high-SNR, colored
=
=
1
4
log
2
_
σ
2
H
σ
2
N
_
+
1
4
log
2
_

e
_
+
1
2f
s
ˆ
f
s
/2
f =0
log
2
(p( f )) df .
3.4.3. High SNR Capacity Approximation Using Sphere Packing
An estimate of the watermark capacity can also be obtained using the same line of
arguments as in Section 3.2. Instead of the watermarked signal ˆ s being constrained
to a hypersphere volume around s, it is now constrained to lie on an
L
2
-dimensional
31
Chapter 3. Watermark Capacity in Speech
manifold which contains all points in the L-dimensional space that have the same
power spectrum as s.
In the DFT domain, the power spectrum constraint corresponds to a fixed amplitude
of the complex DFT coefficients
˜
X. In terms of the real and imaginary coefficients R
rl
and R
im
this means that for every 2-dimensional subspace of corresponding (R
rl
; R
im
)
the watermarked signal must lie on a circle with radius r
H
=
_
p
i
σ
2
H
, with p
i
as defined
above (for a white signal p
i
≡ 1). The watermarked signal ˆ s must thus lie on the
L
2
-dimensional manifold that is defined by these circles. The surface of this manifold is
A
H
=
L/2

i=1
2πr
H
. (3.13)
The transmitted signal is subject to AWGN with variance σ
2
N
. We need to determine
the constellation of the observed signal and the number of distinguishable watermark
messages that fit onto the manifold.
2
Averaged over all dimensions and given the
circular symmetry of the distributions, the noise can be split up into two components
of equal power
σ
2
N
2
, one being orthogonal to the above manifold, and one being parallel
to the manifold. Separating the noise into a parallel and an orthogonal component
requires that the manifold is locally flat with respect to the size of the noise sphere,
which is equivalent to a high SNR assumption. The orthogonal component effectively
increases the radii of the circles from r
H
to r
Y
=
_
p
i
σ
2
H
+
σ
2
N
2
.
3
The noise component
parallel to the manifold (with power
σ
2
N
2
) determines the number of distinguishable
watermark messages. The ‘size’ of a watermark cell on the
L
2
dimensional manifold is
the volume V
N
of the
L
2
-dimensional hypersphere with radius r
N
=
_
L
2
σ
2
N
2
, and
V
N
=

L
2
σ
2
N
2
)
L/4
Γ(
L
4
+1)
. (3.14)
The capacity is
C
Phase, SP, high-SNR
=
1
L
log
2
_
A
Y
V
N
_
(3.15)
which evaluates for a white host signal (p
i
≡ 1) to
C
Phase, SP, high-SNR
=
1
4
log
2
_
1
2
+
σ
2
H
σ
2
N
_
+
1
4
log
2
(16π) +
1
L
log
2
_
Γ(
L
4
+1)
L
L/4
_
. (3.16)
The subscript
SP
denotes a watermark capacity derived using the sphere packing
approach. For large L and using Sterling’s approximation, Γ(
L
4
+1) can be expressed
2
The projection of the L-dimensional noise (with spherical normal PDF) onto the
L
2
-dimensional subspace
would result in the volume of the
L
2
-dimensional hypersphere with radius r
N
=
_
L
2
σ
2
N
. A projection
of the noise onto a
L
2
dimensional subspace (line, plane, hyperplane, . . . ) is something else than a
projection onto a curved manifold.
3
This also reflects the fact that the channel noise changes the power spectrum of the received signal.
32
3.4. Watermarking Based on Phase Modulation
as
Γ(
L
4
+1) ≈
_
πL
2
_
L
4e
_L
4
and (3.16) evaluates to
C
Phase, SP, high-SNR
=
1
4
log
2
_
1
2
+
σ
2
H
σ
2
N
_
+
1
4
log
2
_

e
_
+
1
2L
[log
2
(πL) +1] . (3.17)
Sterling’s approximation is asymptotically accurate, and numerical simulations show
that for L > 70 the error in C induced by the Sterling approximation is below 10
−4
.
Again for large L, the last term on the right-hand side in (3.17) converges to zero
and (3.16) simplifies to
C
Phase, SP, high-SNR
=
1
4
log
2
_
1
2
+
σ
2
H
σ
2
N
_
+
1
4
log
2
_

e
_
. (3.18)
The difference between (3.17) and (3.18) is smaller than 10
−2
bit for L > 480. Given
σ
2
H
σ
2
N
¸ 1, the augend
1
2
in (3.18) is negligible and the result is identical to (3.12).
For a non-white host signal, that means ∃i : p
i
,= 1, substituting (3.13) and (3.14) into
(3.15) results in
C
Phase, SP, high-SNR, colored
=
1
L
log
2
_
¸
¸
_

L/2
i=1

_
p
i
σ
2
H
+
σ
2
N
2
_
π
L
2
σ
2
N
2
_
L
4
Γ
_
L
4
+1
_
_
¸
¸
_
=
=
1
2L
log
2
_
L/2

i=1
1
2
+ p
i
σ
2
H
σ
2
N
_
+
1
4
log
2
(16π) +
1
L
log
2
_
Γ(
L
4
+1)
L
L/4
_
(3.19)
which for large L again using Sterling’s approximation simplifies to
C
Phase, SP, high-SNR, colored
=
1
2L
L/2

i=1
log
2
_
1
2
+ p
i
σ
2
H
σ
2
N
_
+
1
4
log
2
_

e
_
. (3.20)
Using the scaled host signal PSD p( f ) of Section 3.4.1.6 results in
C
Phase, SP, high-SNR, colored
=
=
1
2f
s
ˆ
f
s
/2
f =0
log
2
_
1
2
+ p( f )
σ
2
H
σ
2
N
_
df +
1
4
log
2
_

e
_
.
3.4.4. Exact Capacity Using Input/Output Phase Angles
Assuming a deterministic input signal
˜
X = r
H
e
j0
, the joint PDF of
˜
Y is [65, p. 413]
f
˜
Y
p
( ˜ x)[
˜
X=r
H
e
j0 = f
Y
r
,Y
ϕ
(r, ϕ)[
˜
X=r
H
e
j0 =
r
πσ
2
N
exp
_

r
2
+ r
2
H
−2rr
H
cos(ϕ)
σ
2
N
_
33
Chapter 3. Watermark Capacity in Speech
and the marginal PDF f
Y
ϕ
(ϕ)[
˜
X=r
H
e
j0 is [66]
f
Y
ϕ
(ϕ)[
˜
X=r
H
e
j0 =
ˆ

r=0
f
Y
r
,Y
ϕ
(r, ϕ)dr =
=
1

exp
_

r
2
H
σ
2
N
_
+
r
H
cos(ϕ)
_
4πσ
2
N
exp
_

r
2
H
σ
2
N
sin
2
(ϕ)
_
erfc
_

r
H
σ
N
cos(ϕ)
_
.
Provided the circular symmetry of the arrangement, changing the assumption of a
deterministic
˜
X = r
H
e
j0
to a deterministic
˜
X = r
H
e

/
corresponds to replacing (ϕ) by
(ϕ − ϕ
/
) in the above two equations and represents a simple rotation. We also note
that f
Y
ϕ
(which is the same as f
Θ
) given a deterministic input ϕ
/
is the conditional PDF
f
Θ[Φ
(ϕ[ϕ
/
) and write
f
Θ[Φ
(ϕ[ϕ
/
) =
1

exp
_

r
2
H
σ
2
N
_
+
+
r
H
cos(ϕ − ϕ
/
)
_
4πσ
2
N
exp
_

r
2
H
σ
2
N
sin
2
(ϕ − ϕ
/
)
_
erfc
_

r
H
σ
N
cos(ϕ − ϕ
/
)
_
.
With f
Θ[Φ
(ϕ[ϕ
/
) f
Φ

/
) = f
Φ,Θ
(ϕ, ϕ
/
) we also obtain the joint PDF
f
Φ,Θ
(ϕ, ϕ
/
) =
_
1

f
Θ[Φ
(ϕ[ϕ
/
) if 0 ≤ ϕ, ϕ
/
< 2π
0 else.
The mutual information I(Φ, Θ) can then be calculated using [64, p. 244]
I(Φ, Θ) =
ˆ

ϕ=0
ˆ

ϕ
/
=0
f
Φ,Θ

/
, ϕ) log
2
f
Φ,Θ

/
, ϕ)
f
Φ

/
) f
Θ
(ϕ)

/
dϕ.
Using (3.9) with only half of the DFT coefficients being independent, the capacity
evaluates to
C
Phase, MI-angle
=
1
2
I(Φ, Θ). (3.21)
The required integrations can be carried out numerically. A similar derivation leading
to the same result can be found in [66].
3.4.5. Exact Capacity Using Discrete Input/Output Symbols
Assume the input phase Φ consists of M discrete and uniformly spaced values. This
corresponds to M-ary phase shift keying (PSK) using a constellation with M symbols,
and the capacity is derived in [67, 68]. We briefly restate this derivation since it is the
basis for the capacity derivation in Section 3.4.6.
We assume M discrete complex channel inputs ˜ x
m
. The complex output sample ˜ y
has the conditional PDF
f
˜
Y[
˜
X
( ˜ y[ ˜ x
m
) =
1
2πσ
2
exp
_

[ ˜ y − ˜ x
m
[
2

2
_
34
3.4. Watermarking Based on Phase Modulation
with σ
2
=
σ
2
N
2
and ˜ x
m
= r
H
exp
_
j
2πm
M
_
. The capacity of the discrete-input continuous-
output memoryless channel is
C
Phase, MI-c.sym., discrete
=
1
2
sup
f
˜
X
( ˜ x)
M−1

m=0
f
˜
X
( ˜ x
m
)

¨
˜ y=−∞
f
˜
Y[
˜
X
( ˜ y[ ˜ x
m
)
log
2
_
f
˜
Y[
˜
X
( ˜ y[ ˜ x
m
)

M−1
n=0
f
˜
X
( ˜ x
n
) f
˜
Y[
˜
X
( ˜ y[ ˜ x
n
)
_
d˜ y,
again with only half of the DFT coefficients
˜
X being independent and where
˜

˜ y=−∞
. . . d˜ y
denotes an improper double integral over the two dimensions of the two-dimensional
variable ˜ y, i.e.,
˜

˜ y=−∞
. . . d˜ y =
´

y
a
=−∞
´

y
b
=−∞
. . . dy
b
dy
a
. The subscript
Phase, MI-c.sym., discrete
denotes a phase modulation watermark capacity derived using the mutual information
(MI) between a set of discrete complex input symbols and the complex output symbols.
Given the desired signal constellation, the capacity is maximum if the channel inputs
˜ x
m
are equiprobable with probability f
˜
X
( ˜ x
m
) =
1
M
and the supremum in the above term
can be omitted. We further substitute
˜
Y :=
˜
W +
˜
X
m
with
˜
W ∼ (^(0, σ
2
N
), resulting in
f
˜
Y[
˜
X
( ˜ y[ ˜ x
m
) = f
˜
W
( ˜ w) =
1
πσ
2
N
exp
_

[ ˜ w[
2
σ
2
N
_
and
C
Phase, MI-c.sym., discrete
=
1
2
log
2
(M) −
1
2M
M−1

m=0

¨
˜ w=−∞
log
2
_
M−1

n=0
exp
_

[ ˜ w + ˜ x
m
− ˜ x
n
[
2
−[ ˜ w[
2
σ
2
N
_
_
f
˜
W
( ˜ w) d ˜ w
where the integration can be replaced by the expected value with respect to
˜
W and
C
Phase, MI-c.sym., discrete
=
1
2
log
2
(M) −
1
2M
M−1

m=0
E
˜
W
_
log
2
M−1

n=0
exp
_

[ ˜ w + ˜ x
m
− ˜ x
n
[
2
−[ ˜ w[
2
σ
2
N
_
_
. (3.22)
The capacity can be calculated numerically using Monte Carlo simulations by gen-
erating a large number of realizations for
˜
W and calculating and averaging C. The
variables ˜ x
m
and ˜ x
n
are discrete and uniformly distributed, and one can sum over all
its M values.
3.4.6. Exact Capacity Using Input/Output Complex Symbols
We can generalize the capacity derivation of Section 3.4.5 from a discrete set of input
symbols to a continuous input distribution with
˜
X = r
H
e

as defined in Section 3.4.1
35
Chapter 3. Watermark Capacity in Speech
with a complex channel input ˜ x and
f
˜
X
( ˜ x
n
) = f
X
a
,X
b
(a, b) =
1
2πr
H
δ
_
_
a
2
+ b
2
−r
H
_
as derived in (3.7). The complex output sample ˜ y has the conditional PDF
f
˜
Y[
˜
X
( ˜ y[ ˜ x) =
1
2πσ
2
exp
_

[ ˜ y − ˜ x[
2

2
_
with σ
2
=
σ
2
N
2
and ˜ x = r
H
e

. The capacity of the continuous-input continuous-output
memoryless channel is [64, p. 252]
C
Phase, MI-c.sym.
=
1
2
sup
f
˜
X
( ˜ x)
I(
˜
X,
˜
Y) =
1
2
sup
f
˜
X
( ˜ x)

¨
˜ x
m
=−∞
f
˜
X
( ˜ x
m
)

¨
˜ y=−∞
f
˜
Y[
˜
X
( ˜ y[ ˜ x
m
) log
2
_
f
˜
Y[
˜
X
( ˜ y[ ˜ x
m
)
˜

˜ x
n
=−∞
f
˜
X
( ˜ x
n
) f
˜
Y[
˜
X
( ˜ y[ ˜ x
n
) d ˜ x
n
_
d˜ y d ˜ x
m
.
Given the desired signal constellation, the capacity is maximum if the channel inputs
˜ x are uniformly distributed on the circle with radius r
H
(with a density f
˜
X
( ˜ x) as
described above), and the supremum in the above term can be omitted. We further
substitute
˜
Y :=
˜
W +
˜
X
m
with
˜
W ∼ (^(0, σ
2
N
), resulting in
f
˜
Y[
˜
X
( ˜ y[ ˜ x
m
) = f
˜
W
( ˜ w) =
1
πσ
2
N
exp
_

[ ˜ w[
2
σ
2
N
_
and
C
Phase, MI-c.sym.
= −
1
2

¨
˜ x
m
=−∞
f
˜
X
( ˜ x
m
)

¨
˜ w=−∞
log
2
_
_

¨
˜ x
n
=−∞
f
˜
X
( ˜ x
n
) exp
_

[ ˜ w + ˜ x
m
− ˜ x
n
[
2
−[ ˜ w[
2
σ
2
N
_
d ˜ x
n
_
_
f
˜
W
( ˜ w) d ˜ wd ˜ x
m
where the integration over ˜ w can be replaced by the expected value with respect to
˜
W
and
C
Phase, MI-c.sym.
= −
1
2

¨
˜ x
m
=−∞
f
˜
X
( ˜ x
m
)
E
˜
W
_
_
_
log
2
_
_

¨
˜ x
n
=−∞
f
˜
X
( ˜ x
n
) exp
_

[ ˜ w + ˜ x
m
− ˜ x
n
[
2
−[ ˜ w[
2
σ
2
N
_
d ˜ x
n
_
_
_
_
_
d ˜ x
m
36
3.4. Watermarking Based on Phase Modulation
and the same for ˜ x
n
C
Phase, MI-c.sym.
= −
1
2

¨
˜ x
m
=−∞
f
˜
X
( ˜ x
m
)
E
˜
W
_
log
2
_
E
˜ x
n
_
exp
_

[ ˜ w + ˜ x
m
− ˜ x
n
[
2
−[ ˜ w[
2
σ
2
N
____
d ˜ x
m
and the same for ˜ x
m
C
Phase, MI-c.sym.
= −
1
2
E
˜ x
m
_
E
˜
W
_
log
2
_
E
˜ x
n
_
exp
_

[ ˜ w + ˜ x
m
− ˜ x
n
[
2
−[ ˜ w[
2
σ
2
N
_____
.
(3.23)
The capacity C can be calculated numerically using Monte Carlo simulations by
generating a large number of realizations for ˜ x
m
, ˜ w and ˜ x
n
, and calculating and
averaging C, or by numerical integration. To simplify the numerical integration, given
that f
˜
X
( ˜ x) is zero for all ˜ x in the complex plane except on a circle, we can replace the
two double integrals over ˜ x by single integrals along the circles, resulting in
C
Phase, MI-c.sym.
= −
1
4πr
H

ˆ
ϕ
m
=0

¨
˜ w=−∞
f
˜
W
( ˜ w)
log
2
_
¸
_

ˆ
ϕ
n
=0
1
2πr
H
exp
_

[ ˜ w + r
H
e

m
−r
H
e

n
[
2
−[ ˜ w[
2
σ
2
N
_

n
_
¸
_ d ˜ wdϕ
m
.
For numerical integration it is further possible to replace the integrals over ϕ by
summations, i.e.

ˆ
ϕ=0
f (ϕ) dϕ ≈
S−1

i=0
f (iϕ

) ϕ

with ϕ

=

S
and, e.g., S = 256.
3.4.7. Comparison of Derived Phase Modulation Capacities
The derived phase modulation capacities of Section 3.4.2 to Section 3.4.6 are shown
for different channel SNR
σ
2
H
σ
2
N
in Figure 3.6. The differences of the capacities relative to
C
Phase, MI-c.sym.
of (3.23) are shown in Figure 3.7. The reason for C
Phase, MI-c.sym.
being
slightly larger than C
Phase, MI-angle
of (3.21) is that in the former case the detector has
access to both the amplitude and the phase angle of the received symbol, whereas in the
later case only the phase angle is known. The figures also include the general capacity
of the power-constrained AWGN channel given by the Shannon–Hartley theorem [64]
C
Shannon
=
1
2
log
2
_
1 +
σ
2
H
σ
2
N
_
.
It is the upper bound on the information rate when ignoring any watermarking-induced
constraint.
37
Chapter 3. Watermark Capacity in Speech
−20 −10 0 10 20 30 40
0
0.5
1
1.5
2
2.5
3
3.5
4
Channel SNR (in dB)
W
a
t
e
r
m
a
r
k

c
a
p
a
c
i
t
y

C

(
i
n

b
i
t
s
/
s
a
m
p
l
e
)


C
Shannon
C
Phase, MI−angle, high−SNR
C
Phase, SP, high−SNR, L=200
C
Phase, SP, high−SNR, Sterling
C
Phase, MI−angle
C
Phase, MI−c.sym., discrete (M=8)
C
Phase, MI−c.sym.
Figure 3.6.: Comparison of the derived phase modulation watermark capacity expres-
sions at different channel SNR
σ
2
H
σ
2
N
(see also Figure 3.7).
−20 −10 0 10 20 30 40
−0.1
−0.05
0
0.05
0.1
0.15
0.2
0.25
0.3
Channel SNR (in dB)
W
a
t
e
r
m
a
r
k

c
a
p
a
c
i
t
y

C
(
d
i
f
f
e
r
e
n
c
e

t
o

C
P
h
a
s
e
,

M
I

c
.
s
y
m
.

i
n

b
i
t
s
/
s
a
m
p
l
e
)


C
Shannon
C
Phase, MI−angle, high−SNR
C
Phase, SP, high−SNR, L=200
C
Phase, SP, high−SNR, Sterling
C
Phase, MI−angle
C
Phase, MI−c.sym., discrete (M=8)
C
Phase, MI−c.sym.
Figure 3.7.: Watermark capacity difference between the derived expressions and
C
Phase, MI-c.sym.
.
38
3.4. Watermarking Based on Phase Modulation
3.4.8. Phase Watermarking in Voiced Speech
Given the large capacity of phase modulation based watermarking in unvoiced speech,
it would be attractive to extend this approach also to voiced speech. In voiced speech,
however, perceptually transparent phase manipulations are much more difficult to
achieve, since the signal’s phase is indeed perceivable [49, 50, 69, 70]. It is important to
maintain 1) the phase continuity of the harmonic components across the analysis/syn-
thesis frames, and 2) the spectral envelope of the signal, including the spectral peaks
of the harmonics and the deep valleys in-between them, independent of the position of
the analysis window.
We explored possibilities to manipulate or randomize the signal phase in speech
while maintaining the two aforementioned perceptual requirements for voiced speech.
It follows a brief outline of the performed experiments, which will show that even under
idealistic assumptions it is not possible to fully randomize the phase of voiced speech
without severely degrading perceptual quality or resorting to perceptual masking.
3.4.8.1. Phase Modulation with the Short-Time Fourier Transform (STFT)
In this experiment, a fully periodic synthetic speech signal with constant pitch of
146 Hz was transformed into DFT domain using a pitch-synchronous short-time Fourier
transform (STFT) with a ‘square root Hann’ analysis window with 50 % overlap and
a window length that is an even multiple of the pitch cycle length. Using the same
window as synthesis window, the sum of the inverse DFT transforms of the individual
frames (‘overlap/add synthesis’) results in perfect reconstruction of the original signal
[71]. To evaluate the suitability of the signal’s phase for watermarking, we randomized
the phase angle of the complex DFT/STFT coefficients and examined the perceptual
degradation as a function of the window length.
Using a short analysis/synthesis window with a window length of 137 ms or 20
pitch cycles, the phase randomization leads to significant perceptual distortion and
makes the speech signal sound like produced in a ‘rain barrel’. Figure 3.8 shows the
long-term high-frequency-resolution DFT spectrum (using a measurement window
length of 1 s) of the original signal and the phase-randomized signal created using the
137 ms window. It is apparent that the phase randomization leads to a widening of
the spectral peaks due to the limited frequency resolution of the analysis/synthesis
window. For example, a pure sinusoidal signal is typically represented by several DFT
bins due to the spectral shape of the analysis window. If the phase relation between
the different DFT bins is not maintained, the inverse transform is not a pure sinusoid
anymore but a mixture of several sinusoids with adjacent frequencies and arbitrary
phase relations.
The perceptual degradation decreases when using much longer analysis/synthesis
windows in the order of magnitude of 200 pitch cycles. Using a long window corre-
sponds to changing the phase less frequently in time, and also results in an increased
frequency resolution. However, the window length determines the time/frequency
resolution trade-off of the STFT. For real speech signals such long windows lead to
unacceptable smearing in the time domain, and also the pitch would not be constant
39
Chapter 3. Watermark Capacity in Speech
0 100 200 300 400 500 600 700 800 900 1000
0
10
20
30
40
50
60
70
Frequency (in Hz)
L
o
n
g

t
e
r
m

D
F
T

m
a
g
n
i
t
u
d
e

(
i
n

d
B
)


Original signal
Signal with randomized phase
Figure 3.8.: Long-term high-frequency-resolution DFT spectrum with a measurement
window length of 1000 ms of 1) a constant pitch periodic speech signal and
2) the same signal with randomized STFT-domain phase coefficients (using
an analysis/synthesis window length of 137 ms).
for such long periods.
The ‘rain barrel effect’ can be eliminated by modulating only those coefficients that
are perceptually less important. To verify this, we calculated for every frame of the
STFT representation (using a window length of 27 ms or approximately four pitch
cycles, and 50 % overlap) the frequency masking curve using the van-de-Par masking
model [54] (see also Section 3.3). Then, we randomized only the phase of those DFT
coefficients whose magnitude is a certain threshold below the masking curve. We used
as test signal a male speech sustained vowel ‘a’ at a sampling frequency of 16 kHz
and a listening level of 82.5 dB
SPL
. A threshold of 0 dB (randomizing all coefficients
below the masking curve) resulted in 83 % of the coefficients being randomized and
a certain roughness in the sound, which again results from a widening of spectral
peaks (measured over a longer window), namely those peaks where a part of the
initial peak is masked. With a threshold of 6 dB (randomizing all coefficients 6 dB
below the masking curve, shown in Figure 3.9) 64 % of the phase coefficients were
randomized, and the previous roughness in the sound was barely noticeable anymore.
However, if a DFT coefficient is masked anyhow, then there is no need to maintain
the magnitude of the coefficient and one can proceed with the general masking-based
approach discussed in Section 3.3.
Independent of window length and masking, with using a 50% overlap to avoid
discontinuities at the frame boundaries, the speech signal is oversampled by a factor
of two, and the STFT representation contains twice as many coefficients as the time
domain signal. As a consequence for watermarking, the frequency coefficients are not
40
3.4. Watermarking Based on Phase Modulation
0 1000 2000 3000 4000 5000 6000 7000 8000
0
10
20
30
40
50
60
70
80
Frequency (in Hz)
S
h
o
r
t

t
e
r
m

D
F
T

m
a
g
n
i
t
u
d
e

(
i
n

d
B
S
P
L
)


Signal spectrum
Masking curve
Masking curve − 6dB
Randomized coefficients
Figure 3.9.: Randomization of the phase of those coefficients DFT that are masked and
have a magnitude of less than 6 dB below the masking curve.
independent, and it is not possible to independently randomize all phase coefficients
and re-obtain the same coefficients after resynthesis and analysis. An approach to
solve this problem is the use of a critically sampled orthogonal transform instead of
the STFT. This is discussed in the following subsection.
3.4.8.2. Phase Modulation with Lapped Orthogonal Transforms
The use of a critically sampled orthogonal transform for analysis and synthesis allows
to obtain a perfect reconstruction of the embedded watermark data. In the following,
we discuss the use of extended and modulated lapped transforms (ELT and MLT,
[72, 73, 74, 75]) for watermarking by phase modulation.
Extended Lapped Transform (ELT) The ELT is a critically sampled orthogonal
lapped transform with an arbitrary window length L [73, 74]. Just as the more
popular MLT, the ELT is a perfect reconstruction cosine-modulated filterbank defined
by the same basis functions, but generalized from a fixed window length L = 2M
to a variable window length that is typically larger than 2M, where M denotes the
transform size or number of filter channels.
The idea for watermarking is to use an ELT with few coefficients (filter channels)
and long windows, and to modify again the signal’s phase in the frequency domain.
As discussed in Section 3.4.1.1, we denote with modifying the MLT or ELT phase
the replacement of the original transform domain subband signals by watermark
subband signals with identical short-term power envelopes, since this preserves the
spectral envelope of the host signal. The orthogonality of the ELT and MLT assures
41
Chapter 3. Watermark Capacity in Speech
that the embedded information is recoverable. A low number of coefficients (which
equals the update interval or the hop size) and long windows implicate that at any
given time there is a large number of overlapping windows. The hypothesis for this
experiment is that if each window has the desired magnitude spectrum, then also the
sum of the overlapping windows has the desired magnitude spectrum, no matter of
the position the spectrum being measured at. As such, it should be possible to recreate
the magnitude spectrum of voiced speech while continuously manipulating the phase
spectrum. We evaluate this approach with a small experiment.
We transformed a synthetic stationary constant pitch speech signal into the frequency
domain using an ELT as described in [74] with M filter channels and with a prototype
filter (window shape) with stopband attenuation A
s
, designed using the Kaiser window
approach of [76]. The window length L is a result of M and A
s
. Since there is no direct
notion of phase in the frequency domain of the ELT—it is a real-valued transform—we
measured the variance in each filter channel and replaced the original signal in each
frequency channel by Gaussian noise with equal variance. The modified signal was
then transformed back to time domain.
There are fundamentally contradicting requirements on the parameter M. On the
one hand, the number of filter channels determines the frequency resolution of the filter
bank, and in order to accurately maintain the sharp spectral peaks in voiced speech,
M must be large. For example, to obtain approximately ten subbands in-between
two harmonics of a speech signal with a pitch of 150 Hz, one must set M = 256 at
a sample rate of f
s
= 8 kHz (resulting in a subband width of
f
s
2M
≈ 15.6 Hz). With
A
s
= 50 dB this results in a window length of 768 ms (L = 6144). But even with a
frequency resolution as high as this, and even though there is an overlap of 24 different
windows at any given time, there is still a significant widening of the spectral peaks
(see Figure 3.10). This widening shows itself with the same ‘rain barrel effect’ as
when using the STFT. Additionally, such a long window length results in unacceptable
smearing in the time domain. Decreasing M, for example to M = 64 and a window
length of 112 ms (A
s
= 30 dB), makes the speech signal more similar to colored noise.
Just as in the STFT case, the ‘rain barrel effect’ at large M can be eliminated by not
randomizing the perceptually most important subchannels. This was confirmed with a
small experiment that kept 10 % of the M = 256 coefficients unmodified, namely those
with the largest power.
Modulated Lapped Transform (MLT) Reducing the window length L of the ELT to
twice the transform size M, L = 2M, and using a sine window results in the modified
lapped transform (MLT) [74, 75]. Applying the same phase randomization as with the
ELT, the results are very similar. For identical M, the randomized MLT sounds slightly
more noisy than the randomized ELT. For large M, the perceptual results become
almost identical for equal effective window lengths (considering at the ELT window
only its time-domain ‘main lobe’).
At a sampling frequency of 8 kHz, an MLT with M = 64 has a window length of
16 ms, which is a commonly used window length for speech signal coding. Using
this window length and again randomizing the coefficients results in a very noise-like
42
3.4. Watermarking Based on Phase Modulation
0 100 200 300 400 500 600 700
0
10
20
30
40
50
60
70
80
Frequency (in Hz)
L
o
n
g

t
e
r
m

D
F
T

m
a
g
n
i
t
u
d
e

(
i
n

d
B
)


Original signal
Signal with randomized ELT subbands
Figure 3.10.: Constant pitch speech signal with original and randomized ELT coeffi-
cients using 256 filter channels and a 768 ms-window. The randomization
leads to a widening of the spectral peaks. (The plotted DFT spectrum is
measured using a 1000 ms-window.)
speech signal. This effect can again be mitigated by randomizing only the perceptually
less important coefficients. However, even when keeping 50 % of the coefficients
unmodified (again those with the largest power), a perceptual distortion is still audible.
Because of the short window length, the distortion is audible as noise. It is expected
that the use of a masking model instead of the power criterion to select the relevant
coefficients would decrease the audibility of the distortion.
3.4.8.3. Conclusion
We conclude that in voiced speech, in contrast to unvoiced speech, it is not possible to
fully randomize the signal’s phase without 1) severely degrading perceptual quality
or 2) resorting to perceptual masking. Time-frequency transforms allow direct access
to the signal’s phase, but in order to avoid discontinuities at frame boundaries, block
transforms can only be used with overlapping windows. While the STFT has the most
intuitive notion of phase, it is an oversampled representation whose phase coefficients
are not independent and, thus, difficult to embed and re-extract. This problem can be
solved with ELTs and MLTs, but they are subject to the same fundamental trade-off be-
tween time and frequency resolution as the STFT. Perceptual artifacts can be eliminated
by restricting the phase modulations to masked signal components. However, this is
not a useful approach for watermarking, since masked components can be replaced
altogether instead of only their phase being modulated.
Note that phase modifications are possible if performed at a much lower rate than
43
Chapter 3. Watermark Capacity in Speech
presented herein, for example by randomizing only the initial phase of each voiced
segment. Several such methods were previously proposed but are limited in the
achievable data rate [14, 47, 48, 45, 46].
3.5. Experimental Comparison
In this section, we compare the derived watermark capacities of the ideal Costa scheme
(ICS), watermarking based on frequency masking, and watermarking based on phase
modulation.
3.5.1. Experimental Settings
The experimental settings for each method were chosen as follows.
Ideal Costa Scheme The capacity in (3.1) was evaluated using two different scenar-
ios, 1) assuming a fixed mean squared error distortion,
C
ICS, MSE=-25dB
:
σ
2
W
σ
2
H
¸
¸
¸
¸
dB
= −25 dB,
and 2) assuming a fixed watermark to channel noise ratio (WCNR),
C
ICS, WCNR=-3dB
:
σ
2
W
σ
2
N
¸
¸
¸
¸
dB
= −3 dB.
While the first case corresponds to a fixed perceptual distortion of the clean water-
marked signal (before channel), the second case corresponds to a fixed perceptual
distortion of the noise-corrupted watermarked signal (after channel, assuming that the
watermark signal is masked by the channel noise).
Frequency Masking For watermarking based on frequency masking, the watermark
capacity C
Mask
of (3.3) is used, with the watermark power σ
2
W,mask
being κ = −12 dB
below the masking threshold of the masking model.
Phase Modulation In principle, the capacity of watermarking by phase modulation
is given by (3.23) and can be expressed as a function C
Phase

2
H
, σ
2
N
), with σ
2
H
being
equivalent to r
2
H
. However, two additional factors need to be taken into consideration
when dealing with real-world speech signals: First, as we showed in Section 3.4.8, the
phase modulation is possible in unvoiced speech only, and second, the energy in a
speech signal is unevenly distributed between voiced and unvoiced regions. For the
experiment, it was assumed that a fraction γ of the speech signal is unvoiced speech,
the average power (or variance) of which is a multiplicative factor ρ above or below
the average speech signal power σ
2
H
. The watermark capacity then evaluates to
C
Phase, speech
= γ C
Phase
(ρσ
2
H
, σ
2
N
). (3.24)
44
3.5. Experimental Comparison
−20 −10 0 10 20 30 40
0
0.5
1
1.5
2
2.5
3
Channel SNR (in dB)
W
a
t
e
r
m
a
r
k

c
a
p
a
c
i
t
y

C

(
i
n

b
i
t
s
/
s
a
m
p
l
e
)


C
Shannon
C
ICS, MSE=−25dB
C
ICS, WCNR=−3dB
C
Mask
C
Phase, ATCOSIM
C
Phase, TIMIT
Figure 3.11.: Watermark capacity for quantization, masking and phase modulation
based speech watermarking. The abbreviations are explained in Sec-
tion 3.5.1.
We determined the parameters γ and ρ with a small experiment, again using ten
randomly selected utterances (one per speaker) of the ATCOSIM corpus (see Chapter 7).
After stripping silent regions (defined by a 43ms-Kaiser-window-filtered intensity curve
being more than 30 dB below its maximum for longer than 100ms), each utterance
was individually normalized to unit variance, i.e., σ
2
H
= 1. Then, a voiced/unvoiced
segmentation was performed using PRAAT [77], and the total length and average
power measured separately for all voiced and all unvoiced segments. The result is a
fraction of γ = 47 % being unvoiced, and the average power in unvoiced being −4.6 dB
below the overall signal power (ρ = 0.35). As a consequence, the unvoiced segments
contain 16 % of the overall signal energy. Performing the same analysis on the 192
sentences of the TIMIT Core Test Set [78] results in γ = 38 %, ρ = 0.16 (−8 dB), and
only 6 % of the total energy located in unvoiced segments. This difference can be
attributed to the slower speaking style in TIMIT compared to the ATCOSIM corpus.
The resulting watermark capacities based on the two databases and using (3.24) are
denoted C
Phase,ATCOSIM
and C
Phase,TIMIT
.
3.5.2. Results
The watermark capacities C
ICS
, C
Mask
and C
Phase
for quantization, masking, and phase
modulation based speech watermarking are shown in Figure 3.11. As in Figure 3.6, the
plot includes for comparison the Shannon capacity C
Shannon
of the AWGN channel.
45
Chapter 3. Watermark Capacity in Speech
3.6. Conclusions
Watermarking in perceptually relevant speech signal components, as it is required
in certain applications that need robustness against lossy coding, is best performed
using variants of the ideal Costa scheme, and its capacity C
ICS
is shown in Figure 3.11.
In contrast, in the application of interest in this work—speech watermarking for
analog legacy system enhancement—watermarking in perceptually irrelevant speech
signal components is possible. Then, perceptual masking and the auditory system’s
insensitivity to phase in unvoiced speech can be exploited to increase the watermark
capacity (see C
Mask
and C
Phase
in Figure 3.11).
Using auditory masking leads to a significant capacity gain compared to the ideal
Costa scheme. For high channel SNRs the masking-based capacity is larger than all
other watermark capacities in consideration.
The watermark capacity in the phase of unvoiced speech can be derived using
the sphere packing analogy or using the related concept of mutual information. In
unvoiced speech and for high channel SNRs, the capacity is roughly half of the Shannon
capacity plus half a bit per independent sample. The overall watermark capacity is
signal-dependent. For fast-paced air traffic control speech at low and medium channel
SNR, which is the application of interest in this work, the phase modulation approach
outperforms the masking-based approach in terms of watermark capacity.
The remainder of this thesis focuses on the phase modulation approach. Some
form of auditory masking is used in many, if not even most, state-of-the art audio
watermarking algorithms, and is a well-explored topic [13]. In contrast, a complete
randomization of the speech signal’s phase has, to our best knowledge, not been
previously proposed in the context of watermarking. It offers an opportunity for
a novel contribution to the area of speech watermarking and opens a window to
implementations that could possibly outperform current state-of-the-art methods.
Note that the phase modulation approach and the conventional masking-based
approach do not contradict each other in any significant way. In fact, it is expected
that a combination of the two approaches will lead to a further increase in watermark
capacity.
46
Chapter 4
Watermarking Non-Voiced Speech
We present a blind speech watermarking algorithm that embeds the water-
mark data in the phase of non-voiced speech by replacing the excitation
signal of an autoregressive speech signal representation. The watermark
signal is embedded in a frequency subband, which facilitates robustness
against bandpass filtering channels. We derive several sets of pulse shapes
that prevent intersymbol interference and that allow the creation of the
passband watermark signal by simple filtering. A marker-based synchro-
nization scheme robustly detects the location of the embedded watermark
data without the occurrence of insertions or deletions.
In light of the potential application to analog aeronautical voice radio com-
munication, we present experimental results for embedding a watermark
in narrowband speech at a bit rate of 450 bit/s. The adaptive equalization-
based watermark detector not only compensates for the vocal tract filtering,
but also recovers the watermark data in the presence of non-linear phase
and bandpass filtering, amplitude modulation and additive noise, making
the watermarking scheme highly robust.
Parts of this chapter have been published in K. Hofbauer, G. Kubin, and W. B. Kleijn, “Speech
watermarking for analog flat-fading bandpass channels,” IEEE Transactions on Audio, Speech, and
Language Processing, 2009, revised and resubmitted.
47
Chapter 4. Watermarking Non-Voiced Speech
As shown in Chapter 3, the combination of established watermarking theory with
a well-known principle of speech perception leads to a substantial improvement in
theoretical capacity. Based on this finding, the following chapter develops a practical
watermarking scheme that is aimed at analog legacy system enhancement. Although
the method is of general value, many of our design choices as well as the choice of
attacks that we consider are motivated by the application of watermarking to the
aeronautical voice radio communication between aircraft pilots and air traffic control
(ATC) operators as described in detail in Chapter 6.
In the remainder of this chapter, Section 4.1 proposes the watermarking scheme, also
addressing many practical issues such as synchronization and channel equalization.
After a discussion of certain implementation aspects in Section 4.2, experimental results
are presented in Section 4.3. Finally, we discuss our results and draw conclusions in
Section 4.4 and 4.5.
4.1. Theory
In Chapter 3 we have shown the large theoretical capacity of watermarking by replacing
the phase of the host signal. Motivated by this finding, this section presents a practical
speech watermarking scheme that is based on this principle. The new method also
addresses the two major difficulties with the theoretical approach: Firstly, the human
ear is only partially insensitive to phase modifications in speech, and secondly, the
transmission channel depends on the application and is in most cases not only a simple
AWGN channel. We assume in the following all signals to be discrete-time signals with
a sample rate f
s
.
4.1.1. Speech Signal Model
Our method is based on an autoregressive (AR) speech signal model. This is a common-
place speech signal representation, which models the resonances of the vocal tract
and is widely used in speech coding, speech synthesis, and speech recognition (cf.
[58, 60, 79]). Consequently, we consider certain temporal sections of a speech signal
s(n) as an outcome of an order P autoregressive signal model with time-varying
predictor coefficients c(n) = [c
1
(n), . . . , c
P
(n)]
T
and with a white Gaussian excitation
e(n) with time-variant gain g(n), such that
s(n) =
P

m=1
c
m
(n)s(n −m) + g(n)e(n).
4.1.2. Transmission Channel Model
We focus on analog legacy transmission channels such as the aeronautical radio or the
PSTN telephony channel. As an approximation, we assume the channel to be an analog
AWGN filtering channel with a limited passband width. The transmission channel
attacks are listed in the left column of Table 4.1.
48
4.1. Theory
Watermark
Signal Gen.
H
c H
c
-1
Gain
Extraction
LP
Analysis
Non-Voiced
Detection
s
FB
S
NV
S
NV
r
w
g
m
WM
s
PB
s
SB
ˆs
F
s
PB
c
ˆs s
FB
Figure 4.1.: Watermark embedding: Non-voiced speech (or a subband thereof) is ex-
changed by a watermark data signal.
4.1.3. Watermark Embedding Concept
The presented speech signal model facilitates the embedding of a data signal by
replacing the AR model phase of the host speech signal (see Section 3.4.1.1). An
important property of the signal model is that there is no perceptual difference between
different realizations of the Gaussian excitation signal e(n) for non-voiced speech [42].
It is possible to exchange the white Gaussian excitation signal e(n) by a white Gaussian
data signal ˆ e(n) that carries the watermark information. The signal thus forms a hidden
data channel within the speech signal. We denote with ‘non-voiced speech’ all speech
that is not voiced, comprising unvoiced speech and pauses.
The hidden data channel is restricted both in the temporal and in the spectral domain.
In the temporal domain, the replacement of e(n) by ˆ e(n) can take place only when the
speech signal is not voiced, since the model as defined in Section 4.1.1 is accurate only
for non-voiced speech. The speech parts that are voiced and should remain unmodified
are detected based on their acoustic periodicity using an autocorrelation method [80].
In the spectral domain, only a subband bandpass component e
PB
(n) of e(n) can be
replaced by the data signal ˆ e(n). This accommodates the bandpass characteristic of the
transmission channel, and the embedding band width and position must be selected
such that they fully lie within the passband of the transmission channel. The stopband
component e
SB
(n) = e(n) −e
PB
(n) must be kept unchanged, since in many practical
applications the exact channel bandwidth is not known a priori. This is realized by
adding an unmodified secondary path for the stopband component s
SB
(n).
Figure 4.1 shows an overview of the embedding scheme. First, the voiced and the
non-voiced time segments of the fullband input speech signal s
FB
(n) are identified and
indicated by a status signal S
NV
. Then, the predictor coefficients c(n) are calculated
and regularly updated using linear prediction (LP) analysis. The signal s
FB
(n) is
decomposed into two subbands (without downsampling), and a LP error filter (H
−1
c
)
49
Chapter 4. Watermarking Non-Voiced Speech
Table 4.1.: Transmission Channel and Hidden Data Channel Model
Transmission Channel Attacks
DA/AD conversion (resampling)
Desynchronization
AWGN
Bandpass filtering
Magnitude and phase distortion
System-Inherent Attacks
Random channel availability S
NV
Time-variant gain g
Time-variant all-pole filtering H
c
Additive stopband signal s
SB
with predictor coefficients c(n) is applied on the passband component s
PB
(n), resulting
in the passband error signal r(n) = g(n)e(n). The LP analysis is based on s
FB
(n), since
otherwise the LP error filter would try to compensate the bandpass characteristic of
s
PB
(n). In voiced speech segments, the passband signal is resynthesized using the
original error signal r(n) and the LP synthesis filter H
c
. In contrast, the non-voiced
speech segments (including unvoiced speech and inactive speech) are resynthesized
from a spectrally shaped watermark data signal w(n), on which the original gain g(n)
of the error signal and the LP synthesis filter H
c
is applied. The unmodified stopband
component s
SB
(n) is added to the watermarked passband signal ˆ s
PB
(n), resulting in
the watermarked speech ˆ s
FB
(n).
4.1.4. Hidden Data Channel Model
The watermark detector is ultimately interested in the payload watermark m
WM
.
However, the watermark channel carrying the hidden data signal is subject to a number
of channel attacks, which are listed in Table 4.1. They are either inherent to the
embedding procedure of Section 4.1.3, or induced by the transmission channel. While
the system-inherent attacks are deterministic and known to the watermark embedder,
they are randomly varying quantities and unknown to the watermark detector.
The following sections further refine the embedding concept of Section 4.1.3 to
counteract the channel attacks defined in Table 4.1. In principle, the payload watermark
m
WM
must be transformed into a watermark signal w(n) using a modulation scheme
that is either robust against the channel attack or enables the detector to estimate and
revert the attack.
4.1.5. Watermark Signal Generation
In the following, we describe the composition of the real-valued watermark signal
w(n). It is subject to conditions that result from the perceptual transparency, data
transmission rate and robustness requirements.
50
4.1. Theory
L
D
Inactive Frame Active Frame
Non-Voiced Voiced Voiced
Tr Tr Tr WM WM Ac ID
a
b,p
a
b-1,p
a
b+1,p
N M
m
Sh
m
Sh
m
Sh
W
M
Figure 4.2.: Frame and block structure including watermark (WM), training (Tr), active
frame (Ac) and frame identification (ID) regions, which carry the data
symbols a
b,p
interleaved with the spectral shaping symbols m
Sh
.
4.1.5.1. Bit Stream Generation
The encoded payload data is interleaved with additional auxiliary symbols in order
to facilitate the subsequent processing steps. As will be discussed in Section 5.1.3,
we use a fixed frame grid with frames of equal length L
D
. Each frame that lies
entirely within a non-voiced segment of the input signal is considered as active frame
and consists of L
WM
watermark symbols m
WM
representing encoded payload data,
L
Tr
training symbols m
Tr
for signal equalization in the receiver (Section 4.1.7), L
Ac
preamble symbols m
Ac
for marking active frames (Chapter 5), L
ID
preamble symbols
m
ID
for sequentially numbering active frames with a consecutive counter (Chapter 5),
and L
Sh
spectral shaping symbols m
Sh
for pulse shaping and fulfilling the bandpass
constraint (Section 4.1.5.2). The symbols are real-valued, and interleaved with each
other as shown in Figure 4.2. While m
WM
, m
Tr
, m
Ac
and m
ID
could in principle be
multi-level and multi-dimensional symbols, trading data rate against robustness, we
use in the implementation of Section 4.2 single-level bipolar one-dimensional symbols,
and denote with A the sequence of these interleaved data symbols and with a
b,p
the
individual terms of A as defined in the next section. The spectral shaping symbols m
Sh
are multi-level real scalars and not part of the data sequence A.
When using the signal model presented in Section 4.1.1, the set of pseudo-random
data, training and preamble symbols must consist of independent symbols, and have
an amplitude probability distribution that is zero mean, unit variance, and Gaussian.
However, we previously showed that violating the requirement of Gaussianity does
not severely degrade the perceptual quality [3].
4.1.5.2. Pulse Shaping
The aim of pulse shaping is to transform the data sequence A into a sequence of
samples w(n) that
51
Chapter 4. Watermarking Non-Voiced Speech
1. has the same predefined and uniform sampling frequency f
s
as the processed
speech signal,
2. has no intersymbol interference (ISI) between the data symbols A,
3. has a defined boxcar (bandpass) power spectral density with lower and upper
cut-off frequencies f
L
and f
H
that lie within the passband of the transmission
channel and with f
H

f
s
2
, and
4. carries as many data symbols as possible per time interval (i.e., the symbol
embedding rate f
a
is maximum).
With a simple one-to-one mapping from A to w(n) the signal w(n) does not have the
desired spectral shape and the data is not reliably detectable since the Nyquist rate,
which is twice the available transmission channel bandwidth, is exceeded. To solve
this problem, we interleave the signal w(n) with additional samples m
Sh
to reduce
the rate f
a
at which the data symbols A are transmitted to below the Nyquist rate.
Additionally, the values of the samples m
Sh
are set such that the resulting signal w(n)
has the desired spectral shape.
Finding the correct quantities, distributions and values for the spectral shaping
samples m
Sh
given specified cut-off and sampling frequencies is a non-trivial task.
However, transforming a data sequence into a band-limited signal is in many aspects
the inverse problem to representing a continuous-time band-limited signal by a se-
quence of independent and possibly non-uniformly spaced samples, and the theories of
non-uniform sampling and sampling and interpolation of band-limited signals can be
applied to this problem (cf. [81, 82, 83]). We consider the desired watermark signal w(t)
as the band-limited signal, the given data symbol sequence A with an average symbol
rate f
a
as the non-uniformly spaced samples of this signal, and the required pulse
shape y(t) as the interpolation function to reconstruct the band-limited signal from its
samples. The wanted symbols m
Sh
are obtained by sampling the reconstructed signal
w(t) at defined time instants. The above four requirements on w(n) then translate to
1. constraining the sampling instants onto a fixed grid with spacing 1/f
s
or a subset
thereof,
2. requiring the samples of the signal to be algebraically independent, i.e., any set
of samples A defines a unique signal w(t) with the desired properties,
3. requiring the signal to be band-limited to f
L
and f
H
, and
4. requiring the symbol rate f
a
to be close to the Nyquist rate of 2( f
H
− f
L
).
It is known from theory that these requirements can be fulfilled only for certain
combinations of f
s
, f
L
and f
H
(e.g., [83]). The permission of non-uniform sampling
loosens the constraints on the permissible frequency constellations.
A common approach for a practical implementation is to choose the parameters
such that in a block of N samples on a sampling grid with uniform spacing T
s
= 1/f
s
52
4.1. Theory
the first M samples are considered as independent samples and fully describe the
band-limited signal (e.g., [84]). The remaining N − M samples are determined by the
bandwidth-constraint and can be reconstructed with suitable interpolation functions.
Analogously, the first M samples are the information-carrying data samples, and
the remaining N − M samples are the spectral shaping samples, determined by the
interpolation function.
To formalize the problem, let A be the sequence of interleaved watermark, training
and preamble symbols as defined in Section 4.1.5.1 and a denote the terms of this
sequence. A is divided into non-overlapping and consecutive blocks of M terms, and
a
b,p
denotes the p’th term (p = 1 . . . M) in the b’th block. The terms a
b,p
of A are
considered as the samples of the band-limited signal w(t), which is non-uniformly
sampled at the time instants τ
b,p
= t
p
+ bNT
s
with t
p
= (1 − p)T
s
and T
s
= 1/f
s
. The
desired signal w(t) can then be constructed with [85]
w(t) =


b=−∞
M

p=1
a
b,p
y
b,p
(t), (4.1)
where y
b,p
(t) is the reconstruction pulse shape that is time-shifted to the b’th block and
used for the p’th sample in each block, and with subsequent resampling of w(t) at the
time instants τ = nT
s
.
To have all previously stated requirements fulfilled, N, M, f
s
, f
L
, f
H
and y
p
must be
chosen accordingly. We provide suitable parameter sets and pulse shapes for different
applications scenarios in Section 4.2.2.
4.1.6. Synchronization
We deal with the different aspects of synchronization between the watermark embedder
and the watermark detector extensively in Chapter 5. In summary, sampling timing
and bit synchronization as well as signal synthesis and analysis synchronization are
inherently taken care of by the embedded training sequences and the adaptive equalizer
in the watermark detector. Data frame synchronization is achieved by using a fixed
frame grid and the embedding of preambles.
4.1.7. Watermark Detection
Figure 4.3 gives an overview on the components of the watermark detector, which
follows a classical front-end–equalization–detection design [86].
4.1.7.1. Detector Front End
In the first step the received analog watermarked speech signal is sampled. In the
presence of an RLS adaptive equalizer as proposed in the next subsection, the sampling
operation does not need to be synchronized to the watermark embedder. The channel
delay is estimated and the signal is realigned with the frame grid. These different
aspects of synchronization are discussed in Chapter 5. In the remainder of this chapter
we assume for the received signal path that perfect synchronization has been achieved.
53
Chapter 4. Watermarking Non-Voiced Speech
Sampling
Act. Frame
Synchroniz.
Frame Grid
Synchroniz.
Adaptive
Filtering
Watermark
Detection
Timing+Bit
Synchroniz.
e' s' m'
WM
s'
PB
s'
FB
z
-∆
Figure 4.3.: Watermark detector including synchronization, equalization and water-
mark detection.
From the discrete-time speech signal s
/
FB
(n) the out-of-band components s
/
SB
(n) are
again removed by an identical bandpass filter as in the embedder, resulting in the
passband signal s
/
PB
(n).
4.1.7.2. Channel Equalization
The received signal is adaptively filtered to compensate for the all-pole filtering in
the embedder as well as for the unknown transmission channel filtering. We make
use of the training symbols m
Tr
embedded in the watermark signal w(n) in order to
apply a recursive least-squares (RLS) equalization scheme. The RLS method is chosen
over other equalization techniques such as LMS because of its rate of convergence and
tracking behavior, and the availability of efficient recursive implementations that do
not require matrix inversion [87].
We use an exponentially weighted and non-uniformly updated RLS adaptive filter
in an equalization configuration, which tracks the joint time-variation of the vocal tract
filter and the transmission channel [87]: Let K be the number of taps and λ (0 < λ ≤ 1)
be the exponential weighting factor or forgetting factor of the adaptive filter and α be a
fixed constant that depends on the expected signal-to-noise ratio in the input signal.
Also let u(n) = [s
/
PB
(n + K −1), . . . , s
/
PB
(n)]
T
be the K-dimensional tap input vector
and w(n) the K-dimensional tap-weight vector at time n, with w(0) = 0. We initialize
the inverse correlation matrix with P(0) = δ
−1
I, using the regularization parameter
δ = σ
2
s
(1 −λ)
α
with σ
2
s
being the variance of the input signal s
/
PB
(n). For each time
instant n = 0, 1, 2, . . . , ∞ we iteratively calculate
q(n) = P(n)u(n)
k(n) =
q(n)
λ +u
H
(n)q(n)
˜
P(n) = λ
−1
(P(n) −k(n)q
H
(n))
P(n +1) =
1
2
(
˜
P(n) +
˜
P
H
(n))
where

denotes complex conjugation and
H
denotes Hermitian transposition. We
introduced the last equation to assure that P(n) remains Hermitian for numerical
stability. The same effect can be achieved computationally more efficient by calculating
only the upper or lower triangular part of
˜
P(n) and filling in the rest such that
54
4.2. Implementation
Hermitian symmetry is preserved. If no training symbol is available, the tap-weights
are not updated, and w(n +1) = w(n). If a training symbol is available for the current
time instant, the error signal e
R
and the tap weights are updated with
e
R
(n) = m
Tr
(n) −w
H
(n)u(n)
w(n +1) = w(n) +ke

R
(n).
The output e
/
(n) of the RLS adaptive filter is
e
/
(n) = w
H
(n)u(n).
After performing the active frame detection (see Section 5.1.3) an equalizer retraining
is performed. The active frame markers m
Ac
can be used as additional training symbols,
which improves the detection of the subsequent frame ID symbols m
ID
(see Figure 4.2).
4.1.7.3. Detection
An equalization scheme as presented in the previous subsection avoids the necessity of
an expensive signal detection scheme, effectively shifting complexity from the detection
stage to the equalization stage [86]. Consequently, the embedded watermark data is
detected after synchronization and equalization using a simple minimum Euclidean
distance metric.
4.2. Implementation
This section presents details of our signal analysis and pulse shaping implementation,
which are required to reproduce the experimental results presented in Section 4.3. In
this section a number of equations are acausal to simplify notation. For implementation,
the relationships can be made causal by the addition of an input signal buffer and
appropriate delays.
4.2.1. Signal Analysis
We first address the decomposition of the input signal into the model components. The
perfect-reconstruction subband decomposition of the discrete-time speech signal s
FB
(n)
with sample rate f
s
into the embedding-band and out-of-band speech components
s
PB
(n) and s
SB
(n) is obtained with a Hamming window design based linear phase
finite impulse response (FIR) bandpass filter of order N
BP
(even) with filter coefficients
h = [h
0
, . . . , h
N
BP
]
T
, resulting in
s
PB
(n) =
N
BP

m=0
h
m
s
FB
(n −m +
N
BP
2
)
and s
SB
(n) = s
FB
(n) −s
PB
(n). The filter bandwidth W
PB
= f
H
− f
L
must be chosen
in accordance to the transmission channel and the realizable spectral shaping config-
uration as described in Section 4.1.5.2. Complete parameter sets for three different
transmission channel types are provided in Section 4.2.2.
55
Chapter 4. Watermarking Non-Voiced Speech
The linear prediction coefficient vector c(n) is obtained from solving the Yule-Walker
equations using the Levinson-Durbin algorithm, with
c(n) = R
−1
ss
(n)r
ss
(n), (4.2)
where R
ss
(n) is the autocorrelation matrix and r
ss
(n) is the autocorrelation vector of
the past input samples s
FB
(n −m) using the autocorrelation method and an analysis
window length of 20 ms, which is common practice in speech processing [79]. We
update c(n) every 2 ms in order to obtain a smooth time-variation of the adaptive filter.
If computational complexity is of concern, the update interval can be increased and
intermediate values can be obtained using line spectral frequency (LSF) interpolation
[58].
The LP error signal r(n) is given by the LP error filter H
−1
c
with
r(n) = s
PB
(n) −
P

m=1
c
m
(n)s
PB
(n −m) = g(n)e(n), (4.3)
where we define the gain factor g(n) as the root mean square (RMS) value of r(n)
within a window of length L
g
samples, and
g(n) =
_
1
L
g
L
g
−1

m=0
r
2
_
n −m +
L
g
−1
2
_
_
1
2
.
We use a short window duration of 2 ms (corresponding to L
g
samples), which main-
tains the perceptually important short-term temporal waveform envelope.
To determine the segmentation of the speech signal into voiced and non-voiced
components, we use the PRAAT implementation of an autocorrelation-based pitch
tracking algorithm, which detects the pitch pulses in the speech signal with a local
cross-correlation value maximization [77].
4.2.2. Watermark Signal Spectral Shaping
The pulse shaping requirements described in Section 4.1.5.2 and the constraints on N,
M, f
L
, f
H
and y
p
given a certain transmission channel passband and a certain sampling
frequency f
s
are theoretically demanding and difficult to meet in practice. Even though
the topic is well explored in literature (e.g., [81, 82, 83]), there is at present no automatic
procedure to obtain ‘good’ parameter sets which fulfill all stated requirements such
as being bandwidth-efficient and free of inter-symbol interference. Thus, we now
provide suitable parameter sets and interpolation filters for three application scenarios
of practical relevance, namely a lowpass channel, a narrowband telephony channel,
and an aeronautical radio channel scenario. We do so for a sampling frequency
f
s
= 8000 Hz.
The derived reconstruction pulse shapes y
b,p
(t) are continuous-time and of infinite
length. To implement the pulse shapes as digital filters, y
b,p
(t) is sampled at the time
instants t = nT
s
, n ∈ Z, and truncated or windowed.
56
4.2. Implementation
4.2.2.1. Lowpass Channel
For a lowpass channel with bandwidth W
C
and choosing M, N and f
s
according to
W
C
=
M
N
f
s
2
, one can obtain the desired signal w(t) that meets all requirements with the
interpolation function
y
b,p
(t) =
(−1)
bM
πf
s
N
(t −t
p
−bNT
s
)


M
q=1
sin
_
πf
s
N
(t −t
q
)
_

M
q=1,q,=p
sin
_
πf
s
N
(t
p
−t
q
)
_,
which is a general formula for bandwidth-limited lowpass signals and achieves the
Nyquist rate [85]. In contrast, using for example a simple sinc function for y
b,p
(t)
would result in either ISI or the sample rate not being equal to f
s
.
4.2.2.2. Narrowband Telephony Channel
For a narrowband telephony channel with a passband from 300 Hz to 3400 Hz we select
M = 4 and N = 6 and define the reconstruction pulse shape
y
b,p
(t) = y
p
(t −bNT
s
) = y
p
(t
(b)
) =
sinc
_

f
s
12
_
t
(b)
−(p −1)T
s
_
_
sin
_

f
s
12
_
t
(b)

_
p −1 +2
_
p
2
__
T
s
_
_
cos
_

f
s
4
_
t
(b)
−(p −3)T
s
_
_
,
t
(b)
:= t −bNT
s
, sinc(x) =
sin(x)
x
,
(4.4)
to obtain a watermark signal w with a bandwidth from 666 Hz to 3333 Hz. Figure 4.4
shows this by illustration. This passband width W
C
= 2666 Hz is as small as the
Nyquist bandwidth that is required for the chosen symbol rate
M
N
f
s
, which is only
achievable for distinct parameter combinations. Figure 4.5 shows the M pulse shapes
and demonstrates how each pulse creates ISI only in the N − M non-information
carrying samples. This is achieved by carefully selecting the frequencies and phases
of the terms in (4.4) such that at each information-carrying sampling position a zero-
crossing of one of the terms occurs.
Ayanoglu et al. [84] define similar parameter and signal sets for channels band-
limited to 0 Hz–3500 Hz (M = 7, N = 8) and 500 Hz–3500 Hz (M = 6, N = 8).
4.2.2.3. Aeronautical Radio Channel
Given the particular passband width and position of the aeronautical radio channel
(300 Hz to 2500 Hz), using the aforementioned approach did not lead to a configuration
that would achieve the Nyquist rate.
57
Chapter 4. Watermarking Non-Voiced Speech
cos cos sin sin
1 2 3 -1 -2 -3 0 -4
(a) |X(f)|
f in kHz
sinc
1 2 3 -1 -2 -3 0 -4
(b) |X(f)|
f in kHz
sinc·sin sinc·sin
1 2 3 -1 -2 -3 0 -4
(c) |X(f)|
f in kHz
sinc·sin·cos sinc·sin·cos
Figure 4.4.: Generation of a passband watermark signal for a telephony channel using
(4.4). Frequency domain representation (a) of the individual sinc, sin, and
cos terms, (b) of the product of the first two terms, and (c) of the product
of all three terms in (4.4), resulting in the desired bandwidth and position.
A
m
p
l
i
t
u
d
e
−6 −5 −4 −3 0 1 2 3 6 7 8 9 12 13 14 15
−1
0
1


a
0,p
w(t)
w(n)


y
0,1
y
0,2
y
0,3
y
0,4
Time t (in multiples of T
s
) and sample index n
Figure 4.5.: Pulse shapes for zero ISI between the data-carrying samples a
b,p
. The
example data a
0,p=1...4
= 1, −
1
2
, 1, −1 is located at t
p
= (p − 1)T
s
. By
design, the ISI induced by the band-limitation falls into the spectral shaping
samples at t = b 6T
s
+4T
s
and t = b 6T
s
+5T
s
, with b ∈ Z.
58
4.3. Experiments
However, an efficient reconstruction pulse set with a watermark bandwidth from
500 Hz to 2500 Hz can be created using Kohlenberg’s second-order sampling [88]. We
omit the exact description of the method and only show its practical application: Using
the notation of [88], we define W
0
= 500 Hz and W = 2000 Hz, which results in the
parameter r = 1 and fulfills
2W
0
W
< r <
2W
0
W
+1 [81]. We select a = a
1
= a
2
=
1
W
, which
represents every fourth sample in terms of the sample rate f
s
= 8000 Hz = 4W. The
phase shift for the second-order sampling is chosen to be k =
1
4W
, which corresponds
to a shift of one sample in terms of f
s
. This results in M = 2 consecutive information
carrying samples (indexed by p = 1, 2) in each block of N = 4 samples, and the symbol
rate achieves the Nyquist rate. The desired passband signal w is then given similar to
(4.1) and (4.4) by
w(t) =


b=−∞
_
a
b,1
s(t
(b)
) + a
b,2
s(k −t
(b)
)
_
(4.5)
with t
(b)
= t −bNT
s
and
s(t) =
cos [2π(W
0
+W)t −(r +1)πWk] −cos [2π(rW−W
0
)t −(r +1)πWk]
2πWt sin [(r +1)πWk]
+
cos [2π(rW−W
0
)t −rπWk] −cos [2πW
0
t −rπWk]
2πWt sin [rπWk]
as defined in [88, Eq. 31, with k ≡ K].
4.3. Experiments
This section presents experimental results demonstrating the feasibility, capacity, and
robustness of the proposed method.
4.3.1. Experimental Settings
Motivated by legacy telephony and aeronautical voice radio applications, the sys-
tem was evaluated using a narrowband speech configuration and various simulated
transmission channels.
4.3.1.1. Watermark Embedding
As input signal we used ten randomly selected speech utterances from the ATCOSIM
corpus [5]. They were chosen by a script such that there was one utterance per speaker
(six male and four female) and such that each utterance had a length between 5 and 7
seconds, resulting in a total of 57 s of speech. The signal was resampled to f
s
= 8000 Hz,
analyzed with an LP order P = 10, and the watermark embedded in a frequency band
from 666 Hz to 3333 Hz using M = 4 and N = 6 (Section 4.2.2.2). In each active frame
of length L
D
= 180 samples, L
WM
= 34 symbols were allocated for watermark data,
L
Tr
= 72 for training symbols, L
Ac
= 7 for active frames markers, L
ID
= 7 for frame
59
Chapter 4. Watermarking Non-Voiced Speech
indices, and L
Sh
= 60 symbols for spectral shaping. For all but the spectral shaping
symbols we used unit length binary symbols with alphabet ¦−1; 1¦.
There is a large number of possibilities for the above parameter choices, which form a
trade-off between data rate and robustness. The parameters were picked manually such
that the corresponding subsystems (equalization, frame synchronization, detection,
. . . ) showed satisfactory performance given the channel model in Table 4.1. The
experimental settings are as such not carefully optimized and serve as illustrative
example, only.
4.3.1.2. Transmission Channels
The watermarked speech signal ˆ s
FB
(n) = ˆ s
PB
(n) + s
SB
(n) was subjected to various
channel attacks. Motivated by the application to telephony and aeronautical voice
radio communication, we simulated the following transmission channels:
1. Ideal channel (no signal alteration).
2. Filtering with a linear phase FIR digital bandpass filter of order N = 200 and a
passband from 300 Hz to 3400 Hz.
3. Sinusoidal amplitude modulation (flat fading) with a modulation frequency of
f
AM
= 3 Hz and a modulation index (depth) of h
AM
= 0.5.
4. Additive white Gaussian noise (AWGN) with a constant segmental SNR of 30 dB,
using a window length of 20 ms.
5. Filtering with an FIR linear phase Intermediate Reference System (IRS) trans-
mission weighting filter of order N = 150 as specified in ITU-T P.48 [89] and
implemented in ITU-T STL G.191 [90].
6. Filtering with a measured aeronautical voice radio channel response from the
TUG-EEC-Channels database, an FIR filter with non-linear and non-minimum
phase [4].
7. Filtering with an IIR allpass filter with non-linear and non-minimum phase and
z-transform
H(z) =
1 −2z
−1
1 −0.5z
−1
.
8. A combination of AWGN, sinusoidal amplitude modulation and aeronautical voice
radio channel filtering, each as described above.
The magnitude responses of the bandpass, IRS and aeronautical channel filters are
shown in Figure 4.6 with respect to the watermark embedding band. While for the
aeronautical channel the applied embedding bandwidth is too large and should instead
be chosen according to the specified channel bandwidth as in Section 4.2.2.3, the shown
configuration is a worst-case scenario and tests if the watermark detector can handle the
high attenuation within the embedding band. The simulated transmission channels are
60
4.3. Experiments
0 500 1000 1500 2000 2500 3000 3500 4000
−25
−20
−15
−10
−5
0
5
Frequency (in Hz)
M
a
g
n
i
t
u
d
e

r
e
s
p
o
n
s
e

(
i
n

d
B
)


Embedding band
Bandpass filter
IRS filter
Aeron. channel
Figure 4.6.: Magnitude responses of the bandpass, IRS and aeronautical transmission
channel filters, in comparison with the watermark embedding band.
in the digital domain. Robustness against desynchronization and D/A-A/D conversion
is evaluated independently in Chapter 5.
We set the SNR of the AWGN channel to a level that results in an acceptable overall
BER, and the SNR is as such related to the embedding parameters of Section 4.3.1.1.
It is likely that a real-world aeronautical channel has at times a worse SNR, and
different parameter settings, for example a higher data symbol dimensionality, might
be preferable in the aeronautical application.
4.3.1.3. Watermark Detection
We used an adaptive RLS equalizer filter with K = 11 taps, λ = 0.7 and α = 0.5. The
number of filter taps is a trade-off between the channel’s memory and the adaptive
filter’s tracking ability in light of the fast time-variation of the vocal tract filter and
the radio channel. In the noise-less and time-invariant case, an (RLS) FIR filter with
K = 11 taps can constitute the perfect inverse of a LP all-pole filter of order P = 10. In
practice, the LP filter is time-variant and the observation distorted and noise-corrupted
by the radio transmission channel, which results in imperfect inversion. In contrast to
the description in Section 4.1.7.2, in the experiments presented herein we updated the
inverse correlation matrix P(n) only when a training symbol was available.
4.3.2. Overall System Evaluation
4.3.2.1. Robustness
We evaluated the overall watermarking system robustness (including frame synchro-
nization) in the presence of the channel attacks listed in the previous subsection and
using the input signal defined therein. We measured the robustness in terms of raw bit
61
Chapter 4. Watermarking Non-Voiced Speech
3
,
5
%
3
,
5
%
3
,
6
%
4
,
8
%
4
,
9
%
5
,
2
%
7
,
7
%
6
,
6
%
[540] [537] [499] [494] [486] [420] [449] [540]
0,00
0,01
0,02
0,03
0,04
0,05
0,06
0,07
0,08
0,09
I
d
e
a
l
c
h
a
n
n
e
l
B
a
n
d
p
a
s
s
F
l
a
t

f
a
d
i
n
g
3
H
z
A
W
G
N
3
0
d
B
s
e
g
I
R
S

f
i
l
t
e
r
A
e
r
o
n
a
u
t
.
c
h
a
n
n
e
l
A
l
l
p
a
s
s
f
i
l
t
e
r
A
e
r
o
+
N
+
F
F
B
i
t

e
r
r
o
r

r
a
t
i
o

(
B
E
R
)
Figure 4.7.: Overall system robustness in the presence of various transmission channel
attacks at an average uncoded bit rate of 690 bit/s. The numbers in the top
row denote the corresponding BSC capacity C
BSC
in bit/s.
error ratio (BER) in the detected watermark data, without any forward error correction
coding being applied. The results are shown in Figure 4.7.
4.3.2.2. Data Rate
Out of the 57 s of speech a total of 31 s or γ = 53 % were classified as non-voiced,
resulting with the bit allocation of Section 4.3.1.1 in a measured average payload
data rate of R ≈ 690 bit/s in terms of uncoded binary symbols m
WM
. The rate R is
approximately
f
s
γL
WM
L
D
times a factor that compensates for the unused partial frames in
non-voiced segments as shown in Figure 4.2.
Given the transmission rate as well as the BERs shown in Figure 4.7, it is possible to
express the achievable watermark capacity at which error-free transmission is possible
by considering the payload data channel as a memoryless binary symmetric channel
(BSC). The channel capacity C
BSC
as a function of the given rate R and BER p
e
is
(e.g. [91])
C
BSC
(R, p
e
) = R(1 + p
e
log
2
(p
e
) + (1 − p
e
) log
2
(1 − p
e
)) (4.6)
in bit/s and asymptotically achievable with appropriate channel coding.
4.3.2.3. Listening Quality
The listening quality of the watermarked speech signal ˆ s
FB
(n) was evaluated objectively
using ITU-T P.862 (PESQ) [92], resulting in a mean opinion score MOS-LQO of 4.05 (on
a scale from 1 to 4.5). Audio files with the original and the watermarked speech signals
62
4.4. Discussion
s
FB
(n) and ˆ s
FB
(n) are available online [93]. Informal listening tests showed that the
watermark is hardly perceptible in the presence of an ideal transmission channel, and
not perceptible in the presence of the simulated radio channel with AWGN, amplitude
modulation and channel filtering as defined above.
4.3.3. Evaluation of Alternative Detection Schemes
To gain insight into the proposed RLS channel equalization, we compared its perfor-
mance to two alternative adaptive filtering strategies by evaluating their robustness in
line with Section 4.3.2.1.
4.3.3.1. Detection Using LP Analysis
We previously proposed to determine the adaptive filter coefficients in the detector
using LP analysis, like in the embedding [3]. To achieve sufficiently similar coefficients
in the embedder and the detector, in the presence of a bandpass channel the LP analysis
in the embedder must then be based on s
PB
(n) instead of s
FB
(n).
1
In comparison to
Figure 4.7 the BER approximately triples for the ideal, bandpass, noise and flat fading
channels, and for the IRS and the non-linear phase channels the BER is 50 % and
watermark detection fails altogether.
4.3.3.2. Detection Using RLS Equalization
Using the proposed RLS equalization scheme, the watermark data can be detected for
all channels. The BERs are shown in Figure 4.7.
4.3.3.3. Detection Using Embedder’s Predictor Coefficients
Using the embedder’s predictor coefficients c(n) for adaptive filtering in the detector
is a purely hypothetical experiment, since in most applications the coefficients of the
embedder are not available in the detector. However, it serves as an upper bound
on how well the RLS adaptive filter would theoretically do when neglecting the
transmission channel and perfectly inverting the time-variant all-pole filtering H
c
. The
obtained BERs are lower by approximately one magnitude compared to the RLS results
in Figure 4.7, but only so if the transmission channel is linear phase. In the case of the
non-linear phase channels the BER is 50 %, and watermark detection fails.
4.4. Discussion
This section presents a qualitative summary and interpretation of the experimental
findings.
1
Given that s
/
PB
(n) is a passband signal, it might seem beneficial to use selective linear prediction (SLP)
[94] to estimate the coefficients. SLP is, however, only a spectral estimation technique that does not
yield the actual filter coefficients required in this application.
63
Chapter 4. Watermarking Non-Voiced Speech
4.4.1. Comparison with Other Schemes
First and foremost, no other schemes are known at present that are complete (in the
sense of most practical issues such as synchronization being addressed), are robust with
respect to the transmission channel model of Section 4.1.2, and have an embedding rate
that is comparable to the one shown in Section 4.3.2.2. It is the combination of these
three aspects that differentiates our proposal from current state-of-the-art methods and
our own previous work.
For a numerical comparison of our results with the reported performance of other
methods we use as measure
C
BSC/W, based on the number of embedded watermark bits
R (in terms of C
BSC
using the reported BER and (4.6)) and per kHz of embedding or
channel bandwidth W. Table 4.2 indicates that our method outperforms most of the
current state-of-the-art speech watermarking methods. It is, however, difficult to obtain
an objective ranking, because almost all methods are geared towards different channel
attacks and applications, and in general a large number of factors contribute to the
‘goodness’ of each system. While the method presented in [32] appears to perform
similar to our method, it assumes a time-invariant transmission channel and requires
before every transmission a dedicated non-hidden 4 s long equalizer training signal.
This makes the method impractical for the considered aeronautical application.
Table 4.2 also includes our previous work and shows that there are multiple options
for adapting our approach to different channel conditions. We have previously applied
various measures, such as the use of a different data symbol dimensionality, the
addition of a watermark floor (that is, a fixed minimum watermark signal gain g), or a
pre-processing of the host signal, to adjust the trade-off between perceptual fidelity,
capacity and robustness [2].
4.4.2. Comparison with Channel Capacity
In the following we compare the achieved embedding rate with the theoretical capacity
derived in Chapter 3. In our practical scheme, we embed 690 bit/s at 4.79 % BER, which
corresponds to a BSC capacity of 500 bit/s. With a maximum symbol transmission
rate or Nyquist rate 2W
C
= 5333 symbols/s and C
W, Phase
≈ 3 bit/symbol (using (2.2)
with an SNR of 30 dB as applied in Section 4.3.2.1), the theoretical capacity evaluates to
C
W
= 16000 bit/s, or C
W
= 8500 bit/s considering embedding in non-voiced speech,
only.
In non-voiced speech (comprising unvoiced speech and pauses), we achieve the
Nyquist rate 2W
C
with all methods presented in Section 4.2.2. However, we embed
only one bit per symbol instead of three, because our current method does not account
for the time-variation and the spectral non-uniformity of the SNR that result from the
spectral characteristics of the host signal. Table 4.3 summarizes these and other factors
that contribute to the capacity gap. The given rate losses are multiplicative factors
derived from the experimental settings of Section 4.3.1.1 and the observed system
performance. The table shows where there is room for improvement, but also shows
that (2.2) is an over-estimation since it does not account for the perceptually required
temporally constrained embedding, the time-variant and colored spectrum of the host
64
4.4. Discussion
Table 4.2.: Comparison With Reported Performance of Other Schemes
Method Ref. R BER C
BSC
W
C
BSC/W SNR Simultan.
(bit/s) (bit/s) (kHz) (bit/(s kHz)) ch. attacks
L
P
o
r
B
P
N
o
n
l
i
n
.
p
h
s
.
F
a
d
i
n
g
c
h
.
D
A
-
A
D
c
n
v
.
Proposed Method
Phase Watermarking 4.3. 690 4,8% 499 2,7 187 30 dB
sg
X X
Phase Watermarking 4.3. 690 6,6% 449 2,7 168 30 dB
sg
X X X
Ph. WM (vector sym.) [2] 300 9,0% 169 4,0 42 10 dB X
Ph. WM (vector sym.) [2] 300 2,0% 258 4,0 64 15 dB X
Ph. WM (scalar sym.) [2] 2000 11,0% 1000 4,0 250 20 dB X
Ph. WM (scalar sym.) [2] 2000 2,5% 1663 4,0 416 30 dB X
Ph. WM (vector sym.) [3] 130 0,0% 130 4,0 32 ∞
Ph. WM (scalar sym.) [3] 2000 0,8% 1866 4,0 466 ∞
Alternative Speech Methods
Spread spectrum [28] 24 0,0% 24 2,8 9 20 dB X X
Spread spectrum [27] 800 28,4% 111 4,0 28 ∞ X X
QIM of AR coeff. [29] 4 3,0% 3 6,0 1 ∞
QIM of AR coeff. [37] 24 n/a 24 8,0 3 n/a X
QIM of AR residual [33] 300 n/a 300 3,1 97 n/a
QIM of pitch [31] 3 1,5% 3 4,0 1 n/a
QIM of DHT coeff. [32] 600 0,0% 600 3,0 200 35 dB
QIM of DHT coeff. [32] 600 0,1% 600 3,0 200 V.56
bis
X X
Mod. of partial traj. [35] 200 1,0% 184 10,0 18 ∞
Repl. of maskees [39] 348 0,1% 344 4,0 86 25 dB
Alternative Audio Methods
QIM of DCT of DWT [26] 420 0,0% 418 2,0 209 ∞ X
QIM of DCT of DWT [26] 420 0,0% 418 22,5 19 10 dB
Allpass phase mod. [45] 243 10,0% 129 3,0 43 ∞ X
Allpass phase mod. [45] 243 2,0% 209 7,0 30 5 dB
65
Chapter 4. Watermarking Non-Voiced Speech
Table 4.3.: Causes for Capacity Gap and Attributed Rate Losses
Cause Loss in Rate
No embedding in voiced speech 47 %
Embedding of training symbols 60 %
Embedding of frame headers 29 %
Use of fixed frame position grid 14 %
Embedding in non-white host signal 67 %
Imperfect tracking of time-variation 28 %
signal, and the filtering characteristic of the transmission channel.
4.4.3. Equalization and Detection Performance
Three different methods to obtain the filter coefficients for the adaptive filtering in the
detector were evaluated. All methods are invariant against the bandpass filtering and
the flat fading transmission channel. Additive WGN at a moderate SNR does have
an influence on the bit error ratio but does not disrupt the over-all functioning of the
system.
The hypothetical experiment of using the embedder coefficients in the detector shows
that approximating the embedder coefficients is a desirable goal only if the channel
is linear phase. Using linear prediction in the detector results in a bit error ratio of
approximately 10 % even in the case of an ideal channel. Compared to our previous
results [2] this is a degradation, which is caused by the band-limited embedding
introduced herein.
The proposed RLS adaptive filtering based detection is the only method robust
against all tested transmission channels. It can compensate for a non-flat passband and
for non-linear and non-minimum phase filtering due to the embedded training symbols.
Over a channel that combines AWGN, flat fading, and a measured aeronautical
radio channel response, the method transmits the watermark with a data rate of
approximately 690 bit/s and a BER of 6.5 %, which corresponds with (4.6) to a BSC
capacity of 450 bit/s. The same RLS method is applied in [32] for a time-invariant
channel using a 4 s long initial training phase with a clearly audible training signal.
In contrast, we embed the training symbols as inaudible watermarks and perform
on-the-fly equalizer training and continuous tracking of the time-variant channel.
A further optimization seems possible by developing an equalization scheme that
is not solely based on the embedded training symbols, but that also incorporates
available knowledge about the source signal and the channel model, such as the
spectral characteristics.
66
4.5. Conclusions
4.5. Conclusions
The experimental results show that it is possible to embed a watermark in the phase
of non-voiced speech. Compared to quantization-based speech watermarking, the
principal advantage of our approach is that the theoretical capacity does not depend
on an embedding distortion to channel noise ratio, but depends only on the host signal
to channel noise ratio. We presented a proof-of-concept implementation that considers
many practical transmission channel attacks. While not claiming any optimality, it
embeds 690 bit/s with a BER of 6.6 % in a speech signal that is transmitted over a
time-variant non-linear phase bandpass channel with a segmental SNR of 30 dB and a
bandwidth of approximately 2.7 kHz.
The large gap between the presented implementation and the theoretical capacity
shows that there is a number of areas where the performance can be further increased.
For example, a decision feedback scheme for the RLS equalization, or a joint synchro-
nization, equalization, detection and decoding scheme could increase the reliability of
the watermark detector. On a larger scale, one could account for the non-white and
time-variant speech signal spectrum with an adaptive embedding scheme, and apply
the same embedding principle on all time-segments of speech for example using a
continuous noise-like versus tonal speech component decomposition [95].
67
Chapter 4. Watermarking Non-Voiced Speech
68
Chapter 5
Watermark Synchronization
This chapter discusses different aspects of the synchronization between wa-
termark embedder and detector. We examine the issues of timing recovery
and bit synchronization, the synchronization between the synthesis and
the analysis systems, as well as the data frame synchronization. Bit syn-
chronization and synthesis/analysis synchronization are not an issue when
using the adaptive equalization-based watermark detector of Chapter 4. For
the simpler linear prediction-based detector we present a timing recovery
mechanism based on the spectral line method which achieves near-optimal
performance.
Using a fixed frame grid and the embedding of preambles, the information-
carrying frames are detected in the presence of preamble bit errors with
a ratio of up to 10 %. Evaluated with the full watermarking system, the
active frame detection performs near-optimal with the overall bit error ratio
increasing by less than 0.5 %-points compared to ideal synchronization.
Parts of this chapter have been published in K. Hofbauer and H. Hering, “Noise robust speech
watermarking with bit synchronisation for the aeronautical radio,” in Information Hiding, ser. Lecture
Notes in Computer Science. Springer, 2007, vol. 4567/2008, pp. 252–266.
K. Hofbauer, G. Kubin, and W. B. Kleijn, “Speech watermarking for analog flat-fading bandpass
channels,” IEEE Transactions on Audio, Speech, and Language Processing, 2009, revised and resubmitted.
69
Chapter 5. Watermark Synchronization
“Synchronization is the process of aligning the time scales between two or more
periodic processes that are occurring at spatially separated points”[96]. It is a mul-
tilayered problem, and certainly so for an analog radio channel. We discuss in this
chapter the different aspects of synchronization between the watermark embedder and
the watermark detector, and present a practical synchronization scheme. Section 5.1
presents related work and our approach to the different levels of synchronization from
the shortest to the longest units of time. In Section 5.2 we describe an implementation
of the synchronization scheme, which is tailored to the watermarking algorithm of
Chapter 4. Experimental results are shown and discussed in Section 5.3.
5.1. Theory
Due to the analog nature of the transmission channel and the signal-dependent em-
bedding of the data in the speech signal, a watermark detector as shown in Figure 4.3
needs to estimate 1) the correct sampling time instants, 2) the transformations applied
to the data, and 3) the position of the data-carrying symbols.
5.1.1. Timing Recovery and Bit Synchronization
When transmitting digital data over an analog channel, the issue of bit or symbol
synchronization in-between digital systems arises. The digital sampling clocks in the
embedder and detector are in general slowly time-variant, have a slightly different
frequency, and have a different timing phase (which is the choice of sampling instant
within the symbol interval). Therefore, it is necessary that the detector clock synchro-
nizes itself to the incoming data sequence or compensates for the sampling phase
shift. Although bit synchronization is a well explored topic in communications, it
is still a major challenge in most modern digital communication systems, and even
more so in watermarking due to very different conditions. Bit synchronization can be
achieved with data-aided or non-data-aided methods, and we will use one or the other
depending on the used watermark detector structure.
5.1.1.1. Data-Aided Bit Synchronization
In data-aided bit synchronization, the transmitter allocates a part of the available
channel capacity for synchronization. It is then possible to either transmit a clock
signal, or to interleave the information data with a known synchronization sequence
[65]. This enables simple synchronizer designs, but also reduces the channel capacity
available for information transmission.
Given the RLS equalization based watermark detection presented in Section 4.1.7.2
(which is necessary for a channel with linear transmission distortion), there is already
a known sequence used for equalizer training embedded in the signal. While this
training sequence could also be used to drive a data-aided bit synchronization method,
this is in fact not necessary for the RLS based watermark detector. The RLS equalizer
inherently compensates for a timing or sampling phase error, as long as the error is
smaller than the RLS filter length.
70
5.1. Theory
In the aeronautical application discussed in Chapter 6, the radio-internal frequency
references are specified to be accurate within ±5 ppm (parts per million) for up-to-date
airborne transceivers and ±1 ppm for up-to-date ground transceivers [97]. Assuming a
sampling frequency of 8 kHz and a worst-case frequency offset of 6 ppm, this leads to
a timing phase error of 0.048 samples per second, which accumulates over time but is
less than one sample (and well below the RLS filter length) given the short utterance
durations in air traffic control.
5.1.1.2. Non-Data-Aided Bit Synchronization
We discussed in Section 4.3.3 the use of linear prediction (LP) instead of the RLS
equalizer as adaptive filter in the watermark detector, in case the transmission channel
has no linear filtering distortion. In this case, no equalizer training sequence is
embedded, and payload data can be transmitted, instead. Since the LP error filter
does not compensate for timing phase errors, a sampling timing recovery method is
required, which is ideally non-data-aided in order to obtain a high payload data rate.
In non-data-aided bit synchronization the detector achieves self-synchronization
by extracting a clock-signal from the received signal. A wide range of methods
exist, including early-late gate synchronizers, minimum mean-square-error methods,
maximum likelihood methods and spectral line methods [98, 65]. For our watermarking
system, we use the highly popular nonlinear spectral line method because of its simple
structure and low complexity [65, 86]. It belongs to the family of deductive methods.
Based on higher-order statistics of the received signal, it derives from the received
signal a timing tone whose frequency equals the symbol rate and whose zero crossings
indicate the desired sampling instants.
For non-voiced speech our watermarking scheme is essentially a pulse amplitude
modulation (PAM) system with a binary alphabet. To demonstrate the basic idea of
the spectral line method, we assume that the received analogue watermark signal is a
baseband PAM signal R(t) based on white data symbols with variance σ
2
a
. Then R(t)
has zero mean but non-zero higher moments that are periodic with the symbol rate.
With p(t) being the band-limited interpolation pulse shape, the second moment of R(t)
is [86]
E
_
[R(t)[
2
_
= σ
2
a


m=−∞
[p(t −mT)[
2
,
which is periodic with the symbol period T. A narrow bandpass filter at f
s
=
1
T
is used
to extract the fundamental of the squared PAM signal R
2
(t), which results in a timing
tone with a frequency corresponding to the symbol period and a phase such that the
negative zero crossings are shifted by
T
4
in time compared to the desired sampling
instants [65]. We compensate this phase shift with a Hilbert filter.
Due to the noisy transmission channel, and even though most of the noise is filtered
out by the narrow bandpass filter at f
s
=
1
T
, the timing tone is noise-corrupted. This
leads to a timing jitter, which we reduce using a phase-locked loop that provides a
stable single-frequency output without following the timing jitter.
71
Chapter 5. Watermark Synchronization
5.1.2. Synthesis and Analysis System Synchronization
In principle, the signal analysis in the watermark detector needs to synchronize to the
signal synthesis in the watermark embedder. In order to perfectly detect the watermark
signal w(n) (see Figure 4.1), the LP analysis, the voiced/unvoiced segmentation, as
well as the gain extraction in the detector would have to be perfectly in sync with
the same processes in the watermark embedder. This is difficult to achieve because
these processes are signal-dependent and the underlying signal is modified by a)
the watermarking process and b) the transmission channel. In the context of our
watermarking system it is only a matter of definition if one considers these issues as a
synchronization problem, or if they are considered as part of the hidden data channel
as shown in 4.1. We address these issues with different approaches.
LP Frame Synchronization
When using an LP-based watermark detector, one might intuitively assume that the
block boundaries for the linear prediction analysis in the embedder and detector have
to be identical. However, using LP parameters as given in Chapter 4 and real speech
signals, the predictor coefficients do not change rapidly in between the update interval
of 2 ms. As a consequence, a synchronization offset in the LP block boundaries is not an
issue. In [2] we also showed experimentally that the bit error rate is not affected when
the LP block boundaries in the detector are offset by an integer number of samples
compared to the embedder.
When using the RLS-based watermark detector, LP frame synchronization does not
exist since the RLS equalizer operates on a per sample basis.
Adaptive Filtering and Gain Modulation Mismatch
The mismatch between the LP synthesis filter in the embedder and the adaptive
filtering in the detector, as well as the the gain estimation mismatch between watermark
embedder and detector are mostly a result of the transmission channel attacks. We
address this mismatch in two ways. First, we reduce the mismatch using the embedded
training sequences and the RLS equalizer in the receiver. Second, we accept that there
is a residual mismatch such that we are unable to exactly recuperate the watermark
signal w(n). Therefore, we only use a binary alphabet for w(n) to achieve sufficient
robustness.
Voiced/Unvoiced Segmentation
Due to the transmission channel attacks outlined in 4.1 it appears difficult to obtain
an identical sample-accurate voiced/unvoiced segmentation in the embedder and the
detector. While the embedder knows about the availability of the hidden watermark
channel (i.e., the host signal being voiced or non-voiced) and embeds accordingly,
the watermark channel appears to the detector as a channel with random insertions,
deletions and substitutions. We deal with this issue in combination with the frame
synchronization described in the next subsection.
72
5.2. Implementation
5.1.3. Data Frame Synchronization
In this subsection we address the localization of the information-carrying symbols
in the watermark detector. Due to the host signal dependent watermark channel
availability, the hidden data channel appears to the detector as a channel with in-
sertions, deletions and substitutions (IDS). Various schemes have been proposed for
watermark synchronization, including embedding in an invariant domain, embed-
ding of preambles or pilot sequences, host signal feature based methods, the use of
IDS-capable channel codes, the use of convolution codes with multiple interconnected
Viterbi decoders, and exhaustive search (cf. [99, 100, 101, 102]).
We use a fixed frame grid and an embedding of preambles, since this allows the
exploitation of the watermark embedder’s knowledge of the watermark channel
availability, and does not require a synchronized voicing decision in the detector. As
outlined in Section 4.1.5.1, a preamble is embedded in frames that carry a watermark,
so-called active frames, and consists of 1) a fixed marker sequence m
Ac
that identifies
active frames and 2) a consecutive frame counter m
ID
used for addressing or indexing
among the active frames.
To achieve synchronization at the watermark detector, a threshold-based correlation
detection of the active frame markers m
Ac
over a span of multiple frames detects the
frame grid position and as such the overall signal delay. After the adaptive equalization
of the received signal as described in Section 4.1.7, there remain errors in the detection
of the preambles due to channel attacks. Therefore, the active frame detection is based
on a dynamic programming approach that detects IDS errors and minimizes an error
distance that incorporates the active frame markers m
Ac
, the frame indices m
ID
, and
the training symbols m
Tr
.
Frame synchronization is considered as a secondary topic in the scope of this thesis
and does not significantly influence the overall performance. Alternative schemes,
which perform similarly, are possible. An introduction to frame synchronization is
provided in [103].
5.2. Implementation
5.2.1. Bit Synchronization
For the RLS-based watermark detector data-aided bit synchronization is inherently
achieved by the RLS equalizer. Thus, we focus in the following on the non-data-
aided bit synchronization for the LP-based watermark detector using the spectral line
method.
Spectral Line Synchronization Figure 5.1 shows the block diagram of the proposed
spectral line synchronization system.
1
The received watermarked analog signal s(t)
is oversampled by a factor k compared to the original sampling frequency f
s
. In the
1
Whether the proposed structure could be implemented even in analog circuitry could be subject of
further study.
73
Chapter 5. Watermark Synchronization
Over-
sampl.
LP
Anal.
Square
Law
BP
Filt.
DPLL
(opt.)
Hilbert
Filter
r r
H
e Zero
Cross.
Point
Smooth.
(opt.)
Correct.
Polyn.
Sam-
pling
r
D
s(t)
s(t)
s'
Figure 5.1.: Synchronization system based on spectral line method.
later simulations a factor k = 32 is used. Values down to k = 8 are possible at the cost
of accuracy. The linear prediction residual e(n) of the speech signal s(t) is computed
using LP analysis of the same order P = 10 as in the embedder and with intervals of
equal length in time. This results in a window length of k 160 samples and an update
interval of k 15 samples after interpolation.
2
We exploit the fact that the oversampled residual shows some periodicity with the
embedding period T =
1
f
s
due to the data embedding at these instances. We extract the
periodic component r(n) at f
s
from the squared residual (e(n))
2
with an FIR bandpass
filter with a bandwidth of b=480 Hz centered at f
s
. The output r(n) of the bandpass
filter is a sinusoid with period T, and is phase-shifted by
π
2
with an FIR Hilbert filter
resulting in the signal r
H
(n). The Hilbert filter can be designed with a large transition
region given that r(n) is a bandpass signal.
The zero-crossings of r
H
(n) are determined using linear interpolation between the
sample values adjacent to the zero-crossings. The positions of the zero-crossings on
the rising slope are a first estimate of the positions of the ideal sampling points of
the analog signal s(t). It was found that the LP framework used in the simulations
introduces a small but systematic fractional delay which depends on the oversampling
factor k and results in a timing offset. We found that this timing offset can be corrected
using a third-order polynomial t

= a
0
+ a
1
k
−1
+ a
2
k
−2
+ a
3
k
−3
. The coefficients a
i
have been experimentally determined to be a
0
= 0, a
1
= 1.5, a
2
= −7 and a
3
= 16 .
Since the estimated sampling points contain gaps and spurious points, all points
whose distance to a neighbor is smaller than 0.75 UI (unit interval, fraction of a
sampling interval T =
1
f
s
) are removed in a first filtering step. In a second step all gaps
larger than 1.5 UI are filled with new estimation points which are based on the position
of previous points and the observed mean distance between the previous points. The
received analog signal s(t) is again sampled, but instead of with a fixed sampling grid
it is now sampled at the estimated positions. The output is a discrete-time signal with
rate f
s
, which is synchronized to the watermark embedder and which serves as input
to the watermark detector.
Digital Phase-Locked Loop The bit synchronization can be further improved by the
use of a digital phase-locked loop (DPLL). The DPLL still provides a stable output in
the case of the synchronization signal r
H
(n) being temporarily corrupt or unavailable.
In addition, the use of a DPLL renders the previous point filtering and gap filling steps
unnecessary.
There is a vast literature on the design and performance of both analog and digital
2
To improve numerical stability and complexity one could also perform the LP analysis at a lower
sample rate and then upsample the resulting error signal.
74
5.2. Implementation
phase-locked loops [104, 105]. We start with a regular second-order all digital phase-
locked loop [106]. Inspired by dual-loop gear-shifting DPLLs [107] we extended the
DPLL to a dual-loop structure to achieve fast locking of the loop. We also dynamically
adapt the bandwidth of the second loop in order to increase its robustness against
spurious input signals.
D
c
1
c
2
D
c
3

in

out

err

f

out
Figure 5.2.: Second-order all digital phase-locked loop.
The structure of the two loops is identical and is shown in Figure 5.2. The input
signal θ
in
(n) of the DPLL is the phase angle of the synchronization signal r(n), which is
given by the complex argument of the analytic signal of r(n). The phase output θ
out
(n)
is converted back to a sinusoidal signal r
d
(n) = −cos(θ
out
(n)). The loop is computed
and periodically updated with the following set of equations:
θ
in
(n) = arg(r(n) + jr
H
(n)) θ
err
(n) = θ
in
(n) −θ
out
(n)
θ
f
(n) = c
1
θ
err
(n) + θ
f
(n −1)
θ
out
(n) = c
3
+ θ
f
(n −1) + c
2
θ
err
(n −1) + θ
out
(n −1)
The input and output phases θ
in
(n) and θ
out
(n) are always taken modulo-2π. The
parameter c
3
=
2πf
s
k f
s
specifies the initial frequency of the loop, with k being the
oversampling factor of the synchronizer. The parameters c
1
and c
2
are related to the
natural frequency f
n
=
ω
n

of the loop, which is a measure of the response time, and
the damping factor η =
1

2
, which is a measure of the overshoot and ringing, by
c
1
=
_
2πf
n
k f
s
_
2
and c
2
=
4πη f
n
k f
s
.
The loop is stable if [106]
c
1
> 0, c
2
> c
1
and c
1
> 2c
2
−4.
In the first loop we use a natural frequency f
n,1
= 5000 Hz, which enables a fast
locking of the loop. However, the output θ
out
(n) then closely resembles the input
θ
in
(n) of the loop, so that when the input is corrupted, also the output becomes
corrupted. We consider the first loop as locked when the sample variance σ
2
1
(n) of the
phase error θ
err
(n) in a local window of ten samples is below a given threshold for a
defined amount of time. We then gradually decrease the loop natural frequency f
n,2
of the second loop (which runs in parallel with the first loop) from f
n,2
= 5000 Hz to
f
n,2
= 50 Hz. This increases the loop’s robustness against noise and jitter. When the
75
Chapter 5. Watermark Synchronization
input signal is corrupt, the DPLL should continue running at the same frequency and
not be affected by the missing or corrupt input. Therefore, when the variance of the
phase error in the first loop increases, we reduce the natural frequency of the second
loop to f
n,2
(n) = 10
−σ
2
1
(n)
50 Hz.
5.2.2. Data Frame Synchronization
This subsection describes the implementation and experimental validation of the frame
synchronization method outlined in Section 5.1.3. To simplify the description, and
because all experiments in Section 4.3 are based on a binary symbol alphabet, we use
binary symbols herein, too.
5.2.2.1. Frame Grid Detection
To estimate the delay of the transmission channel and localize the frame grid position
in the received signal, the received and sampled signal s
/
(n) is correlated with the
embedded training sequence m
Tr
. The delay ∆ between the embedder and the receiver
can be estimated by the lag between the two signals where the absolute value of their
cross-correlation is maximum, or
∆ = argmax
k

s
/
m
Tr
(k)[ = argmax
k
¸
¸
¸
¸
¸


n=−∞
s
/
(n)m
Tr
(n −k)
¸
¸
¸
¸
¸
. (5.1)
The range of k in the maximization is application-dependent. It can be short if a defined
utterance start point exists, which is the case in the aeronautical radio application. The
sign of the cross-correlation at its absolute maximum also indicates if a phase inversion
occurred during transmission. The delay is compensated and, in case a phase inversion
occurred, the polarity adjusted, and
s
/
FB
(n) = sgn[ρ
s
/
m
Tr
(∆)]s
/
(n +∆).
5.2.2.2. Active Frame Detection
The frame synchronization scheme must be able to cope with detection errors in
the frame preambles. A simple minimum Euclidean distance detector for the binary
symbols in the equalized signal e
/
(n) is the sgn function, and the detected binary
symbol sequence m
/
d
of the d’th frame is
m
/
d
(k) = sgn
_
e
/
(dL
D
+ k)
¸
. (5.2)
There are bit errors in m
/
d
due to the noise and the time-variant filtering of the channel.
To determine the active frames in the presence of bit errors, we denote with n
Ac
(d)
the number of samples in frame m
/
d
at the corresponding position that are equal to the
marker sequence m
Ac
. Equivalently, we denote with n
Tr
(d) the number of samples that
equal the training sequence m
Tr
. We then consider those frames d
//
a
as potentially active
frames where
n
Ac
(d)
L
Ac
n
Tr
(d)
L
Tr
≥ τ
S
(5.3)
76
5.2. Implementation
with τ
S
being a decision threshold (with 0 ≤ τ
S
≤ 1). In the case of zero bit errors,
the product in (5.3) would evaluate to unity in active frames, and to 0.25 on average
in non-active frames. We use a decision threshold τ
S
= 0.6, which was determined
experimentally.
Insertions, deletions and substitutions (IDS) occur in the detection of the active
frame candidates d
//
a
. To accommodate for this, we incorporate the frame indices m
ID
,
which sequentially number the active frames and are encoded so as to maximize the
Hamming distance between neighboring frame indices. To improve the detection
accuracy of the frame indices m
ID
, the signals r
/
(n) and m
/
d
are recomputed using the
active frame markers m
Ac
in the frame candidates d
//
a
as additional training symbols
for equalization.
The sequence of decoded frame indices Υ
/
detected in the frames d
//
a
is not identical
to the original sequence of indices Υ = 1, 2, 3, . . ., because of the IDS errors in d
//
a
and
the bit errors in m
/
d
. We use a dynamic programming algorithm for computing the
Levenshtein distance between the sequences Υ and Υ
/
, in order to determine where the
sequences match, and where insertions, deletions and substitutions occurred [108]. We
process the result in the following manner:
1. Matching frame indices are immediately assumed to be correct active frames.
2. In the case of an insertion in Υ
/
, either the insertion or an adjacent substitution in
Υ
/
must be deleted. The one active frame candidate is removed from the sequence
that minimizes as error measure the sum of the number of bit errors in the frame
indices of the remaining frame candidates.
3. In the case of a deletion in Υ
/
, first all substitutions surrounding the deletion
are removed, too. Then, the resulting ‘hole’ of missing active frames must
be filled with a set of new active frames. Out of all possible frames that lie
between the adjacent matching frames, a fixed number of frames (given by the
size of the hole) needs to be selected as active frames. Out of all possible frame
combinations the one set is selected that minimizes as error measure the sum of
the number of bit errors in the frame identification, the active frame marker and
the training sequence, over all frames in the candidate set. To give higher weight
to consecutive frames, we reduce the error distance by one if a frame candidate
has a direct neighbor candidate or is adjacent to a previously matched active
frame.
Using the above detection scheme, and assuming the total number of frames embedded
in the received signal to be known, there are by definition no more insertion or deletion
errors, but only substitution errors in the set of detected active frames d
/
a
. This then
also applies for the embedded watermark message, which considerably simplifies the
error correction coding of the payload data since a standard block or convolution code
can be applied.
77
Chapter 5. Watermark Synchronization
5.3. Experimental Results and Discussion
5.3.1. Timing Recovery and Bit Synchronization
5.3.1.1. Non-Data-Aided Bit Synchronization
Experimental Setup The spectral line based timing recovery algorithm presented in
this chapter is necessary only when using an LP-based watermark detector. LP-based
watermark detection is only possible when the transmission channel has no filtering
distortion. Since in this case many components of the watermark embedding scheme
presented in Chapter 4 can be omitted, we evaluate the timing recovery using the
simpler watermark embedding system that we presented in [2]. The system omits the
bandpass embedding and pulse shaping parts, assumes ideal frame synchronization,
and performs LP-based watermark detection.
An unsynchronized resampling of the received signal leads to a timing phase error,
which is a phase shift of a fractional sample in the sampling instants of the watermark
embedder and detector.
We use a piecewise cubic Hermite interpolating polynomial (PCHIP) to simulate
a reconstruction of a continuous-time signal and resample the resulting piecewise
polynomial structure at equidistant sampling points at intervals of
1
f
s
or
1
k f
s
respectively.
In this reconstruction process, each sample of the watermarked speech signal ˆ s(n)
serves as a data point for the interpolation. The nodes of these data points would
ideally reside on an evenly spaced grid with intervals of
1
f
s
. In order to simulate an
unsynchronized system we move the nodes of the data points to different positions
according to three parameters:
Timing phase offset: All nodes are shifted by a fraction of the regular grid interval
1
f
s
(unit interval, UI).
Sampling frequency offset: The distance between all nodes is changed from one unit
interval to a slightly different value.
Jitter: The position of each single node is shifted randomly following a Gaussian
distribution with variance σ
2
J
.
The experiments presented in this subsection with the LP-based watermark detector
use as input signal a short sequence of noisy air traffic control radio speech with a
length of 6 s and a sampling frequency of f
s
= 8000 Hz.
Results Without synchronization, the watermarking system is inherently vulnerable
to an offset of the sampling phase and frequency between the embedder and the detec-
tor. Figure 5.3 shows the adverse effect of a timing phase offset if no synchronization
system is used. We measure this timing phase error in ’unit intervals’ (UI), which is
the fraction of a sampling interval T =
1
f
s
.
The proposed synchronization system aims to estimate the original position of the
sample nodes, which is the optimal position for resampling the continuous-time signal
78
5.3. Experimental Results and Discussion
−0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 0.4
0
5
10
15
20
25
30
Timing phase offset (in UI)
B
i
t

e
r
r
o
r

r
a
t
i
o

(
i
n

%
)
Figure 5.3.: Robustness of the LP-based watermark detector without synchronization
with respect to a timing phase error.
at the detector. The phase estimation error is the distance between the original and the
estimated node position in unit intervals. Its root-mean-square value across the entire
signal is shown in Figure 5.4 for the above three types of synchronization errors. The
figure also shows the bit error ratio for different sampling frequency offsets. The bit
error ratio as a function of the uncorrected timing phase offset is shown in Figure 5.3.
Figure 5.5 shows the raw bit error ratio of the overall watermarking system (in two
different configurations and including a watermark floor and residual emphasis as
described in [2]) at different channel SNR and compares the proposed synchronizer to
artificial error-free synchronization. Compared to the case where ideal synchronization
is assumed, the raw BER increases by less than two percentage points across all SNR
values.
5.3.1.2. Data-Aided Bit Synchronization
We experimentally evaluate the capability of the RLS equalization based watermarking
system (Chapter 4) to compensate a timing synchronization offset. In contrast to the
previous experiments, we use the system implementation and experimental settings of
Section 4.3, again using ten randomly selected utterances of the ATCOSIM corpus as
input speech. Figure 5.6(a) shows the resulting BER when introducing a timing phase
offset of a fractional sample in the simulated channel as described above.
The error ratio significantly increases for timing phase offsets around −0.4 UI (equiv-
alent to an offset of 0.6 UI). This results from the fact that the RLS equalizer is a causal
system and cannot shift the received signal forward in time. The increased error rate
can be mitigated by delaying the training sequence that is fed to the equalizer by one
sample in time and adding one equalizer filter tap (K = 12). As shown in Figure 5.6(b),
the BER is then below 8 % for any timing phase offset.
The remaining increase in BER for non-zero timing phase offsets is a result of aliasing.
To overcome this limitation, a fractionally spaced equalizer should be used instead
of the sample-spaced equalizer of Chapter 4 [109, 110]. The discontinuity in the BER
curve at 0.6 UI results from the shift of the detected frame grid towards the next sample
(cf. Section 5.2.2.1).
79
Chapter 5. Watermark Synchronization
−0.4 −0.2 0 0.2 0.4
0
0.005
0.01
0.015
0.02
0.025
0.03
Timing phase offset (in UI)
P
h
a
s
e

e
s
t
i
m
a
t
i
o
n

e
r
r
o
r

(
i
n

U
I

R
M
S
)


No synchronization
Spectral line (SL)
SL with DPLL
−5 0 5
0
0.005
0.01
0.015
0.02
0.025
0.03
Sampling frequency offset (in Hz)
P
h
a
s
e

e
s
t
i
m
a
t
i
o
n

e
r
r
o
r

(
i
n

U
I

R
M
S
)


No synchronization
Spectral line (SL)
SL with DPLL
0 0.02 0.04 0.06 0.08
0
0.02
0.04
0.06
0.08
0.1
Jitter (in UI RMS)
P
h
a
s
e

e
s
t
i
m
a
t
i
o
n

e
r
r
o
r

(
i
n

U
I

R
M
S
)


No synchronization
Spectral line (SL)
SL with DPLL
−5 0 5
0
0.5
1
1.5
2
2.5
3
Sampling frequency offset (in Hz)
B
i
t

e
r
r
o
r

r
a
t
i
o

(
i
n

%
)


No synchronization
Spectral line (SL)
SL with DPLL
Figure 5.4.: Synchronization system performance for LP-based watermark detection:
Phase estimation error and bit error ratio for various types of node offsets.
80
5.3. Experimental Results and Discussion
−10 0 10 20 30 40 50 60
0
5
10
15
20
25
30
35
40
45
50
Channel SNR (in dB)
B
i
t

e
r
r
o
r

r
a
t
i
o

(
i
n

%
)


Synchronizer with DPLL (1 bit/sample)
Ideal Synchronization (1 bit/sample)
Synchronizer with DPLL (0.14 bit/sample)
Ideal Synchronization (0.14 bit/sample)
Figure 5.5.: Comparison of overall system performance (using the LP-based watermark
detector) with ideal (assumed) synchronization and with the proposed
synchronizer at two different embedding rates (in watermark bits per host
signal sample).
5.3.2. Data Frame Synchronization
We evaluate the data frame synchronization in the context of the RLS equalization
based watermarking system discussed in Chapter 4.
5.3.2.1. Frame Grid Detection
In certain applications a near-real-time availability of the watermark data is required. To
evaluate the time needed until frame grid synchronization is achieved, in the following
experiment we used only the first T
FG
= 0 s . . . 4 s of each utterance to detect the frame
grid. Figure 5.7 shows the fraction of utterances with correct frame grid detection for
a given signal segment length T
FG
, evaluated over 200 randomly selected utterances
of the ATCOSIM corpus and a clean transmission channel. The mean fraction of
non-voiced samples in the corresponding segments (also shown in Figure 5.7) explains
the decrease of the detection ratio for T
FG
> 0.2 s.
The experiment shows that a correct frame grid detection can be achieved for
correlation windows as short as T
FG
= 50 ms (Figure 5.7). The detection ratio locally
decreases for longer correlation windows because most utterances in the used database
start with a non-voiced segment. Because m
Tr
is embedded in the non-voiced segments
only, correlation is low when the window is still relatively short and the fraction
of non-voiced samples within the window is low. This effect could be avoided by
considering only the non-voiced segments in the correlation sum (5.1), which, however,
requires a voiced/non-voiced segmentation in the detector.
81
Chapter 5. Watermark Synchronization
(a) Initial System
−0.6 −0.4 −0.2 0 0.2 0.4 0.6
0
5
10
15
20
25
Timing phase offset (in UI)
B
i
t

e
r
r
o
r

r
a
t
i
o

(
i
n

%
)
(b) Enhanced System (additional delay and equalizer tap)
−0.6 −0.4 −0.2 0 0.2 0.4 0.6
0
5
10
15
20
25
Timing phase offset (in UI)
B
i
t

e
r
r
o
r

r
a
t
i
o

(
i
n

%
)
Figure 5.6.: Overall system robustness in the presence of a timing phase offset at an
average uncoded bit rate of 690 bit/s.
0 0.5 1 1.5 2 2.5 3 3.5 4
0
0.2
0.4
0.6
0.8
1
T
FG
(in s − signal segment length used for frame grid detection)
F
r
a
c
t
i
o
n


Fract. of correctly synchronized utterances
Fract. of non−voiced frames in segment
Figure 5.7.: Fraction of successfully synchronized utterances as a function of the signal
segment length used for the frame grid detection.
82
5.4. Conclusions
0 0.05 0.1 0.15 0.2 0.25
0.5
0.6
0.7
0.8
0.9
1
Bit error ratio
F
r
a
m
e

I
D

d
e
t
e
c
t
i
o
n

r
a
t
i
o
Figure 5.8.: Robustness of active frame detection against bit errors in the received
binary sequence m
/
d
.
5.3.2.2. Active Frame Detection
In order to evaluate the performance of the active frame detection subsystem, we
introduced artificial bit errors into an error-free version of the detected binary sequence
m
/
d
of (5.2), resulting from ten randomly selected utterances of the ATCOSIM corpus.
Figure 5.8 shows the ratio of correctly detected frames as a function of the bit error
ratio (BER) in m
/
d
. The fraction of spurious frames is one minus the given ratio.
The experiment shows that the active frame detection is highly reliable up to a bit
error ratio of 10 % in the detected binary sequence. It was further observed that active
frame detection errors only contribute to a small extent to the overall system bit error
ratio: Using ideal frame synchronization instead of the proposed scheme, the change
in the bit error ratios of Figure 4.7 is <0.5 %-points for all channels.
5.4. Conclusions
We conclude that timing recovery and bit synchronization can be achieved without
additional effort using the RLS equalization based watermark detector of Chapter 4.
In addition, the RLS equalizer resolves many problems occurring otherwise in the
synthesis/analysis system synchronization.
We demonstrated the application of classical spectral line synchronization in the
context of watermarking for a simpler but less robust linear prediction based watermark
detector. In combination with a digital PLL, the synchronization method provides
highly accurate timing recovery.
Frame grid synchronization can be achieved with windows as short as 50 ms, but
leaves room for further improvements for example by incorporating a voiced/non-
voiced detection in the detector. On the contrary, the active frame detection works
reliably up to a bit error ratio of 10 % and does not significantly contribute to the
overall watermark detection errors.
83
Chapter 5. Watermark Synchronization
84
Part II.
Air Traffic Control
85
Chapter 6
Speech Watermarking for Air Traffic
Control
The aim of civil air traffic control (ATC) is to maintain a safe separation between all
aircraft in the air in order to avoid collisions, and to maximize the number of aircraft
that can fly at the same time. Besides a set of fixed flight rules and a number of
navigational systems, air traffic control relies on human air traffic control operators
(ATCO, or controllers). The controller monitors air traffic within a so-called sector (a
geographic region or airspace volume) based on previously submitted flight plans and
continuously updated radar pictures, and gives flight instructions to the aircraft pilots
in order to maintain a safe separation between the aircraft.
Although digital data communication links between controllers and aircraft are
slowly emerging, most of the communication between controllers and pilots is verbal
and by means of analog voice radios. Air traffic control (ATC) has relied on the
voice radio for communication between aircraft pilots and air traffic control operators
since its beginning. The amplitude-modulation (AM) radio, which is in operation
worldwide, has basically remained unchanged for decades. The AM radio is based on
the double-sideband amplitude modulation (DSB-AM) of a sinusoidal carrier. For the
continental air-ground communication, the carrier frequency is within a range from
118 MHz to 137 MHz, the ‘very high frequency’ (VHF) band, with a channel spacing
of 8.33 kHz or 25 kHz [97]. Given the aeronautical life cycle constraints, it is expected
that the analog radio will remain in use for ATC voice in Europe well beyond 2020
[111, 112].
Parts of this chapter have been published in K. Hofbauer and H. Hering, “Digital signatures for the
analogue radio,” in Proceedings of the NASA Integrated Communications Navigation and Surveillance
Conference (ICNS), Fairfax, VA, USA, 2005.
87
Chapter 6. Speech Watermarking for Air Traffic Control
6.1. Problem Statement
The avionic radio is the main tool of the controller for giving flight instructions and
clearances to the aircraft pilot. It is crucial for a secure and safe air traffic operation to
have a reliable and fail-safe radio communication network to guarantee the possibility
of communication at any given time. From a technical point of view, high effort is
put into the systems in order to provide permanent availability through robust and
redundant design.
Once this ‘technical’ link between ground and aircraft is established (which we
assume further-on), the verbal communication between pilot and controller can start.
In order to avoid misunderstandings and to guarantee a common terminology, the two
parties use a restricted, very simple, common language. Although most of the words
are borrowed from the English language, the terms and structure of this language are
clearly defined in the corresponding standards [113]. Every voice communication on
the aeronautical channel is supposed to take place according to this ICAO terminology.
The radio communication occurs on a party-line channel, which means that all
aircraft within a sector as well as the responsible controller transmit and listen to
all messages on the corresponding radio channel frequency. In order to establish a
meaningful communication in this environment, it has to be clear who is talking (the
addresser, sender, originator) and to whom the current word is addressed to (the
addressee, recipient, acceptor). For the ATC air-ground communication, certain rules
are in place to establish this identification. In the standard situations, the air traffic
controller is the only person on the channel that does not identify himself as addresser
on the beginning of the message. Instead, the controller starts the message with the
call-sign of the aircraft, the addressee of the message. In case not otherwise explicitly
specified, every voice message of the aircraft pilot is inherently addressed to the air
traffic controller. Therefore the identification of the addressee (the controller in this
case) is omitted, and the pilot starts the message with the flight’s call-sign to identify
the addresser of the message.
The correct identification of addresser and addressee is crucial for a safe communi-
cation, and there are a number of problems associated with this identification process:
1. Channel monitoring workload. The aircrew needs to carefully monitor the radio
channel to identify messages which are addressed to their aircraft. This creates
significant workload.
2. Aircraft identification workload. The controller needs to identify the aircraft first
in the radio call, then on the radar screen. There is no convergence between
the radio call and the radar display system, which creates additional controller
workload.
3. Wrongly understood identification. An aircraft call-sign can be wrongly understood
by the controller or the pilot. This highly compromises safety, and is most likely
to happen when aircraft with similar call-sign are present in the same sector. This
potential risk is usually referred to as call-sign confusion.
88
6.2. High-Level System Requirements
4. Missed identification. If the identification is not understood by the addressee,
the entire message is declared void and has to be repeated. This is additional
workload for both the aircrew and the controller.
5. Forged identification. There is currently no technical barrier preventing a third
party to transmit voice messages with a forged identification. This security threat
is currently only addressed through operational procedures and the party-line
principle of the radio channel.
An automatic identification and authentication of the addresser and the addressee
could solve these issues and, thus, improve the safety and security in air traffic control.
6.2. High-Level System Requirements
Developing an application for the aviation domain implies some special peculiarities
and puts several constraints on the intended system. From the above scenario we can
derive a list of requirements which such an authentication system has to fulfill. The
user-driven requirements specify the functions and features that are necessary for the
users of the system. At the same time various aspects have to be considered to enable
the deployment of the system within the current framework of avionic communications.
6.2.1. Deployment-Driven Requirements
Rapid Deployment First and foremost the development should keep a rapid im-
plementation and a short time to operational use in mind. The basic approach is to
improve the current communication system, as call-sign confusion and ambiguity is
currently an unresolved issue and will become even more problematic as air traffic
increases. Even though a long-term solution for avionic air-ground communication
will probably lie outside the scope of analog voice radio, analog radio will continue to
be used worldwide for many years to come.
Legacy System Compliance The system should be backward compatible to the
legacy avionic radio system currently in use. Changing to a completely new radio
technology would have many benefits. However, it is very difficult to obtain a seamless
transition to a new system. Neither is it possible to change all aircraft and ground
equipment from one day to another, nor is it trivial to enforce deployment of new
technologies among all participants. Co-existence among old and new systems is
necessary and they should ideally complement each other.
Bandwidth Efficiency Ideally the system should not create any additional demands
on frequency bandwidth. Especially in Europe there is a severe lack of frequency
bandwidth in the preferable VHF band, with the future demand only increasing. This
led to the reduction of the frequency spacing between two adjacent VHF channels from
25 kHz to 8.33 kHz, but this being a temporary solution. That is, a system that operates
within the existing communication channels would be highly preferable.
89
Chapter 6. Speech Watermarking for Air Traffic Control
Minimal Aircraft Modifications Changes to the certified aircraft equipment should
be avoided as much as possible. Every new or modified piece of equipment, might
it be a different radio transceiver or a connection to the aircraft data bus, entails a
long and costly certification process. Through minimizing the changes to on-board
equipment and maintaining low complexity, the certification process can be highly
simplified.
Cost Efficiency Finally, the total costs of necessary on-board equipment should be
kept as low as possible. Not only that there are much higher numbers of airborne than
ground systems, but as well most of the ground radio stations are operated by the air
traffic authorities, which are generally less cost-sensitive than airlines. Therefore the
development process should, wherever possible, shift costs from on-board to ground
systems.
6.2.2. User-Driven Requirements
Perceptual Quality Several potential identification systems affect the perceptual
quality of the speech transmission. A too severe degradation of the sound quality
would not be accepted by both the certification authorities and the intended users, and
for example become annoying to the air traffic controllers. Ideally the participants of
the communication should not notice the presence of the system. For this reason the
perceptual distortion needs to be kept at a minimum.
Real-Time Availability The identification of the addresser has to be detected im-
mediately at the beginning of the voice message. If the identification would display
only after the voice message, this would be of little use to the air traffic controller, as
he is then already occupied with another task. The automatic identification should
be completed before the call-sign is completely enunciated, ideally in less than one
second.
Data Rate In order to serve the purpose of providing identification of the addresser,
it is necessary to transmit a certain amount of data. This might be the aircraft’s tail
number or the 27 bit Data Link Service Address for aircraft and ground stations used
in the context of Controller Pilot Data Link Communication (CPDLC). A GPS position
report, which would be advantageous in various scenarios, requires approximately
50 bit of payload data. Altogether a data rate of 100 bit/s is desired, leaving some room
for potential extensions. For a secure authentication, most likely a much higher data
rate is required.
Error Rate Robustness is a basic requirement. An unreliable system would not
be accepted by the users. Two types of errors are possible. On the one hand, the
system can fail to announce the addresser altogether, and although this not safety
critical, too frequent occurrence would lead to user’s frustration. On the other hand
90
6.3. Background and Related Work
Security
Special
displays
Innovative
functions
in the ODS
Figure 6.1.: Identification of the transmitting aircraft through an embedded watermark.
the announcement of a wrong addresser can compromise safety. Therefore, it is
indispensable to assure robust transmission and to verify the data’s validity.
Maintaining Established Procedures Due to the strong focus on safety, the organi-
zation and work-flow in commercial aviation and air traffic control is highly based
on procedures, of which some are in place for decades. As a consequence, whenever
changes to the procedures are proposed, there is a thorough and time-demanding
review and evaluation process. For a rapid deployment it therefore seems beneficial
not to change or replace these procedures, but to provide supplementary services, only.
No User Interaction The system should be autonomous and transparent to the
user. For the controller, the system should support his work by providing additional
information, and should not add another task. For the pilots, a need for additional
training and additional workload is unwanted. Therefore, the system should work
without any human intervention.
6.3. Background and Related Work
6.3.1. Inband Data Transmission Using Watermarking
In order to fulfill the above requirements for an automatic identification system, the
use of speech watermarking was previously proposed [114, 28]. Figure 6.1 shows the
overall idea.
The general components of a speech watermarking system for the aeronautical
application are shown in Figure 6.2. The voice signal is the host medium that carries
the watermark and is produced from the speaker’s microphone or headset. The
watermark, the data that is embedded into the voice signal, could for example consist
of the 24 bit aircraft identifier and—depending on the available data rate—auxiliary
data such as the aircraft’s position or an authentication. The watermark embedder
could be fitted into a small adapter box between the headset and the existing VHF
91
Chapter 6. Speech Watermarking for Air Traffic Control
Data
(Aircraft ID, ...)
Voice Signal
(pilot or ATCO)
Watermark
Embedder
VHF
Channel
(antennas,
propagation)
Received Voice Signal
(received with legacy radio)
Watermark
Detector
Extracted
Data
R
a
d
i
o
R
a
d
i
o
Figure 6.2.: General structure of a speech watermarking system for the aeronautical
voice radio.
radio [115]. It converts the analog speech signal to the digital domain, embeds the
watermark data, and converts the watermarked digital signal back to an analog speech
signal for transmission with the standard VHF radio.
The transmission channel consists of the airborne and ground-based radio transceivers,
corresponding wiring, antenna systems, etc., and the VHF radio propagation channel.
The channel has crucial influence on the performance of the system and is therefore
the subject of investigation in subsequent chapters. Although the transmitted signal
contains a watermark, it is technically and perceptually very similar to the original
speech signal and can therefore be received and listened to with every standard VHF
radio receiver without any modifications. This allows a stepwise deployment and
parallel use with the current legacy system. The watermark detector extracts the data
from the received signal, assures the validity of the data and displays it to the user.
The information could also be integrated into the ATC systems, e.g., by highlighting
the radar screen label of the aircraft that is currently transmitting. An industrial study
produced a detailed overview on operational concepts, potential benefits, applications
and packages, equipment and implementation scenarios, as well as a cost-benefit
analysis [116].
6.3.2. Related Work
This thesis focuses on the algorithms for the watermark embedding and detection, and
the aeronautical radio channel. Figure 6.3 shows how the area of research narrows
down by considering the constraints as outlined in Section 6.2.
Transmitting the identification tag is first and foremost a data communications prob-
lem. As the transmission should occur simultaneously within the legacy voice channel,
a speech watermarking system has to be applied. As shown in Section 2.3, the body of
research concerning watermarking tailored to speech signals is small. While most of
the existing watermarking schemes are tailored to be robust against lossy perceptual
coding, this requirement does not apply to watermarking for the aeronautical radio
communication. This thesis ultimately focuses on speech watermarking for the noisy
and time-variant analog aeronautical radio channel, and we give in the following a
short summary on the few existing systems tailored to the aeronautical application.
92
6.3. Background and Related Work
Data Communications .
Watermarking Systems
Audio Watermarking
Speech Watermarking
Speech WM for
Analog Channels
Speech WM for
Aeron. Channels
Legacy Channel
Sound
Speech
VHF Radio
Aeronautic
Image
Watermarking
Video
Watermarking
Music
Watermarking
Broadcast Radio
Watermarking
Speech WM for
Digital Ch.
Speech WM for
Compressed Ch.
Speech WM for
Time-Invariant Ch.
Digital Comm.
Systems
Analog Comm.
Systems
Speech WM for
Noiseless Channels
...
...
...
...
...
Figure 6.3.: Research focus within the field of data communications.
6.3.2.1. Multitone Sequence Transmission
The most elementary technique to transmit the sender’s digital identification is a
multitone data sequence at the beginning of the transmission [117]. Standard analog
modulation and demodulation schemes can be applied to create a short high power
data package which is transmitted before each message. This is a very simple and field-
proven technology, which provides high robustness and data rates. The transmission is
clearly audible to the listener as a noise burst at the beginning of the message. The
noise burst, which would hardly be accepted by the user base, can be removed by an
adaptive filter. However, the fact that all of the receivers would have to be equipped
with this technology renders the entire system undesirable.
6.3.2.2. Data-in-Voice (DiV) Modem
The ‘Data-in-Voice’ system presented in [118] tries to decrease the audibility of the
multitone sequence by reducing its power and spectral bandwidth. Using a band-
reject filter, the system removes some parts of the voice spectrum around 2 kHz as
to operate an in-band modem. A standard 240 bit/s MSK (Minimum Shift Keying)
modem technique is used, which occupies a frequency bandwidth of approximately
300 Hz. In order to fully suppress the audibility of the data sequence, the system also
requires a filter on the side of the voice receivers.
93
Chapter 6. Speech Watermarking for Air Traffic Control
6.3.2.3. Spread Spectrum Watermarking
The ‘Aircraft Identification Tag’ (AIT) system presented in [28, 119] is based on direct
sequence spread spectrum watermarking and embeds the watermark as additive
pseudo-random white noise. The embedder first adds redundancy to the data by an
error control coding scheme. The coded data is spread over the available frequency
bandwidth by a well-defined pseudo-noise sequence. The watermark signal is then
spectrally shaped with a linear predictive coding filter and additively embedded into
the digitized speech signal, thus exploiting frequency masking properties of human
perception. The detector relies on a whitening filter to compensate for the spectral
shaping of the watermark signal produced by the embedder. A maximum-length
pseudo random sequence (ML-PRS), detected with a matched filter, is used to ensure
synchronization between embedder and detector. The signal is then de-spread and the
watermark data extracted. The voice signal acts as additive interference (additive noise)
that deteriorates the detector’s ability to estimate the watermark. Therefore, even in
the case of an ideal noiseless transmission channel, the data rate is inherently limited.
6.3.3. Proposed Solution
For the embedding of an identification or other data in the ATC radio speech, we
propose to use the speech watermarking algorithm presented in Chapter 4, or a variant
thereof. To substantiate the practical feasibility of our approach, a thorough evaluation
and a large number of further tests is required, and we focus in the following chapters
on two important aspects.
First, the embedding capacity of the proposed algorithm is host speech signal
dependent. In order to obtain realistic simulation results given the particular ATC
phraseology and speaking style, we produced and present in Chapter 7 a speech
database of ATC operator speech. The resulting ATCOSIM corpus is not only used
for evaluating the watermarking method, but is also of general value for ATC-related
spoken language technologies.
Second, the aeronautical transmission channel has a crucial influence on the data
capacity and robustness of the overall system. We therefore study the aeronautical
voice channel properties both in terms of stochastic channel models as well as using
empirical end-to-end measurements and derive a simulation model based on the
measurements.
Given the speech corpus and the results of the channel analysis, Chapter 10 will
validate the suitability of the proposed watermarking scheme for the aeronautical
application.
94
Chapter 7
ATC Simulation Speech Corpus
The ATCOSIM Air Traffic Control Simulation Speech corpus is a speech
database of air traffic control (ATC) operator speech. ATCOSIM is a con-
tribution to ATC-related speech corpora: It consists of ten hours of speech
data, which were recorded during ATC real-time simulations. The database
includes orthographic transcriptions and additional information on speak-
ers and recording sessions. The corpus is publicly available and provided
free of charge. Possible applications of the corpus are, among others, ATC
language studies, speech recognition and speaker identification, the design
of ATC speech transmission systems, as well as listening tests within the
ATC domain.
Parts of this chapter have been published in K. Hofbauer, S. Petrik, and H. Hering, “The ATCOSIM
corpus of non-prompted clean air traffic control speech,” in Proceedings of the International Conference on
Language Resources and Evaluation (LREC), Marrakech, Morocco, May 2008. More detailed information
is available in [120].
95
Chapter 7. ATC Simulation Speech Corpus
7.1. Introduction
Until today spoken language technologies such as automatic speech recognition are
close to non-existing in operational air traffic control (ATC). This is in parts due to
the high reliability requirements that are naturally present in air traffic control. The
constant progress in the development of spoken language technologies opens a door
to the use of such techniques for certain applications in the air traffic control domain.
This is particularly the case for the controller speech on ground, considering the good
signal quality (close-talk microphone, low background noise, known speaker) and
the restricted vocabulary and grammar in use. In contrast, doing for example speech
recognition for the incoming noisy and narrowband radio speech is still a quite difficult
task.
In the development of practical systems the need for appropriate corpora comes into
place. The quality of air traffic control speech is quite particular and falls in-between
the classical categories: It is neither spontaneous speech due to the given constraints,
nor is it read, nor is it a pure command and control speech (in the sense of controlling a
device). Due to this and also due to the particular pronunciation and vocabulary in air
traffic control, there is a need for speech corpora that are specific to air traffic control.
This is even more the case considering the high accuracy and robustness requirements
in most air traffic control applications.
The international standard language for ATC communication is English. The use of
the French, Spanish or Russian language is also permitted if it is the native language
of both pilot and controller involved in the communication. The phraseology that is
used for this communication is strictly formalized by the International Civil Aviation
Organization [113]. It mandates the use of certain keywords and expressions for certain
types of instructions, gives clear rules on how to form digit sequences, and even defines
non-standard pronunciations for certain words in order to account for the band-limited
transmission channel. In practice however, both controllers and pilots deviate from
this standard phraseology.
After a review of the existing ATC related corpora known to the authors, the
subsequent sections present the new ATCOSIM Air Traffic Control Simulation Speech
corpus, which fills a gap that is left by the existing corpora. Section 7.2 outlines the
difficulty of obtaining realistic air traffic control speech recordings and shows the
path chosen for the ATCOSIM corpus. The transcription and production process is
described in Section 7.3 and 7.4. We conclude with a proposal for a specific automatic
speech recognition (ASR) application in air traffic control.
ATC Related Corpora
Despite the large number of existing speech corpora, only a few corpora are in the air
traffic control domain.
The NIST Air Traffic Control Complete Corpus [121] consists of recordings of 70 hours
of approach control radio transmissions at three airports in the United States. The
recordings are narrowband and of typical AM radio quality. The corpus contains an
orthographic transcription and for each transmission the corresponding flight number
96
7.2. ATCOSIM Recording and Processing
is listed. The corpus was produced in 1994 and is commercially available.
The HIWIRE database [122] is a collection of read or prompted words and sentences
taken from the area of military air traffic control. The recordings were made in a studio
setting, and cockpit noise was artificially added afterwards. The database contains
8,100 English utterances pronounced by non-native speakers without air traffic control
experience. The corpus is available on request.
The non-native Military Air Traffic Control (nnMATC) database [123] is a collection of
24 hours of military ATC radio speech. The recordings were made in a military air
traffic control center, wire-tapping the actual radio communication during military
exercises. The recordings are narrowband and of varying quality depending on the
speaker location (control room or aircraft). The database was published in 2007, but its
use is restricted to the NATO/RTO/IST-031 working group and its affiliates.
The VOCALISE project [124, 125] recorded and analyzed 150 hours of operational
ATC voice radio communication in France, including en route, approach and tower
control. The database is not available for the public and its use is restricted to research
groups affiliated with the French ‘Centre d’Études de la Navigation Aérienne’ (CENA)—
now part of the ’Direction des Services de la Navigation Aérienne’ (DSNA).
Another corpus that is similar to ours, but outside the domain of ATC, is the NATO
Native and Non Native (N4) database [126]. It consists of ten hours of military naval radio
communication speech recorded during training sessions.
The aforementioned corpora vary significantly among each other with respect to
e.g. scope, technical conditions and public availability (Table 7.1). The aim of the
ATCOSIM corpus is to fill the gap that is left by the above corpora: ATCOSIM provides
50 hours of publicly available direct-microphone recordings of operational air traffic
controller speech in a realistic civil en-route control situation. The corpus includes an
utterance segmentation and an orthographic transcription. ATCOSIM is meant to be
versatile and is as such not tailored to any specific application.
7.2. ATCOSIM Recording and Processing
The aim of the ATCOSIM corpus production was to provide wideband ATC speech
which should be as realistic as possible in terms of speaking style, language use,
background noise, stress levels, etc.
In most air traffic control centers the controller pilot radio communication is recorded
and archived for legal reasons. However, these legal recordings are problematic for a
corpus production for a multitude of reasons. First, most recordings are based on a
received radio signal and thus not wideband. Second, it is in general difficult to get
access to these recordings. And third, even if one would obtain the recordings, their
public distribution would be legally problematic at least in many European countries.
The logical resort is the conduction of simulations in order to generate the speech
samples. However, the required effort to set up realistic ATC simulations (including
facilities, hard- and software, trained personal, . . . ) is large and would far exceed the
budget of a typical corpora production. However, such simulations are performed for
the sake of evaluating air traffic control and air traffic management concepts, also on a
97
Chapter 7. ATC Simulation Speech Corpus
T
a
b
l
e
7
.
1
.
:
F
e
a
t
u
r
e
C
o
m
p
a
r
i
s
o
n
o
f
E
x
i
s
t
i
n
g
A
T
C
-
R
e
l
a
t
e
d
S
p
e
e
c
h
C
o
r
p
o
r
a
a
n
d
t
h
e
A
T
C
O
S
I
M
C
o
r
p
u
s
N
I
S
T
H
I
W
I
R
E
n
n
M
A
T
C
V
O
C
A
L
I
S
E
A
T
C
O
S
I
M
R
e
c
o
r
d
i
n
g
S
i
t
u
a
t
i
o
n
-
R
e
c
o
r
d
i
n
g
c
o
n
t
e
n
t
c
i
v
i
l
A
T
C
O
&
P
L
T
N
/
A
m
i
l
i
t
a
r
y
A
T
C
O
&
P
L
T
c
i
v
i
l
A
T
C
O
&
P
L
T
c
i
v
i
l
A
T
C
O
-
C
o
n
t
r
o
l
p
o
s
i
t
i
o
n
a
p
p
r
o
a
c
h
N
/
A
m
i
l
i
t
a
r
y
m
i
x
e
d
e
n
-
r
o
u
t
e
-
G
e
o
g
r
a
p
h
i
c
r
e
g
i
o
n
U
S
A
N
/
A
E
u
r
o
p
e
(
B
E
)
E
u
r
o
p
e
(
F
R
)
E
u
r
o
p
e
(
D
E
/
C
H
/
F
R
)
-
S
p
e
a
k
i
n
g
s
t
y
l
e
(
c
o
n
t
e
x
t
)
o
p
e
r
a
t
i
o
n
a
l
p
r
o
m
p
t
e
d
t
e
x
t
o
p
e
r
a
t
i
o
n
a
l
o
p
e
r
a
t
i
o
n
a
l
o
p
e
r
a
t
i
o
n
a
l
(
1
)
R
e
c
o
r
d
i
n
g
S
e
t
u
p
-
S
p
e
e
c
h
b
a
n
d
w
i
d
t
h
n
a
r
r
o
w
b
a
n
d
w
i
d
e
b
a
n
d
m
o
s
t
l
y
n
a
r
r
o
w
b
a
n
d
u
n
k
n
o
w
n
w
i
d
e
b
a
n
d
-
T
r
a
n
s
m
i
s
s
i
o
n
c
h
a
n
n
e
l
r
a
d
i
o
n
o
n
e
n
o
n
e
/
r
a
d
i
o
n
o
n
e
/
r
a
d
i
o
n
o
n
e
-
R
a
d
i
o
t
r
a
n
s
m
i
s
s
i
o
n
n
o
i
s
e
h
i
g
h
n
o
n
e
m
i
x
e
d
m
i
x
e
d
n
o
n
e
-
A
c
o
u
s
t
i
c
a
l
n
o
i
s
e
C
O
&
C
R
C
O
(
a
r
t
i

c
i
a
l
)
C
O
&
C
R
C
O
&
C
R
C
R
-
S
i
g
n
a
l
s
o
u
r
c
e
V
H
F
r
a
d
i
o
d
i
r
e
c
t
m
i
c
r
o
p
h
o
n
e
m
i
x
e
d
u
n
k
n
o
w
n
d
i
r
e
c
t
m
i
c
r
o
p
h
o
n
e
-
D
u
r
a
t
i
o
n
[
a
c
t
i
v
e
s
p
e
e
c
h
]
7
0
h
[
?
]
(
8
1
0
0
u
t
t
e
r
a
n
c
e
s
)
7
0
0
h
[
2
0
h
]
1
5
0
h
[
4
5
h
]
5
1
.
4
h
[
1
0
.
7
h
]
S
p
e
a
k
e
r
P
r
o
p
e
r
t
i
e
s
-
L
e
v
e
l
o
f
E
n
g
l
i
s
h
m
o
s
t
l
y
n
a
t
i
v
e
n
o
n
-
n
a
t
i
v
e
m
o
s
t
l
y
n
o
n
-
n
a
t
i
v
e
m
i
x
e
d
n
o
n
-
n
a
t
i
v
e
-
G
e
n
d
e
r
m
i
x
e
d
m
i
x
e
d
m
o
s
t
l
y
m
a
l
e
m
i
x
e
d
m
i
x
e
d
-
O
p
e
r
a
t
i
o
n
a
l
y
e
s
n
o
(
!
)
y
e
s
y
e
s
y
e
s
-
F
i
e
l
d
o
f
p
r
o
f
.
o
p
e
r
a
t
i
o
n
c
i
v
i
l
N
/
A
m
i
l
i
t
a
r
y
c
i
v
i
l
c
i
v
i
l
-
N
u
m
b
e
r
o
f
s
p
e
a
k
e
r
s
u
n
k
n
o
w
n
(
l
a
r
g
e
)
8
1
u
n
k
n
o
w
n
(
l
a
r
g
e
)
u
n
k
n
o
w
n
(
l
a
r
g
e
)
1
0
P
u
b
l
i
c
l
y
A
v
a
i
l
a
b
l
e
y
e
s
y
e
s
(
?
)
n
o
n
o
y
e
s
(
1
)
L
a
r
g
e
-
s
c
a
l
e
r
e
a
l
-
t
i
m
e
s
i
m
u
l
a
t
i
o
n

C
O
:
C
o
c
k
p
i
t

C
R
:
C
o
n
t
r
o
l
R
o
o
m

A
T
C
O
:
C
o
n
t
r
o
l
l
e
r

P
L
T
:
P
i
l
o
t
98
7.2. ATCOSIM Recording and Processing
large scale involving tens of controllers. The ATCOSIM speech recordings were made
at such a facility during an ongoing simulation.
7.2.1. Recording Situation
The voice recordings were made in the air traffic control room of the EUROCONTROL
Experimental Centre (EEC) in Brétigny-sur-Orge, France (Figure 7.1).
1
The room and
its controller working positions closely resemble an operational control center room.
The simulations aim to provide realistic air traffic scenarios and working conditions
for the air traffic controller. Several controllers operate at the same time, in order to
simulate also the inter-dependencies between different control sectors. The controller
communicates via a headset with pseudo-pilots which are located in a different room
and control the simulated aircraft. During the simulations only the controllers’ voice,
but not the pilots’, was recorded, because the working environment of the pseudo-
pilots, and as such the speaking style, did not to any extent resemble reality.
7.2.2. Speakers
The participating controllers were all actively employed air traffic controllers and
possessed professional experience in the simulated sectors. The six male and four
female controllers were of either German or Swiss nationality and had German, Swiss
German or Swiss French native language. The controllers had agreed to the recording of
their voice for the purpose of language analysis as well as for research and development
in speech technologies, and were asked to show their normal working behavior.
7.2.3. Recording Setup
The controller’s speech was picked up by the microphone of a Sennheiser HME 45-KA
headset. The microphone signal and a push-to-talk (PTT) switch status signal were
recorded onto digital audio tape (DAT) with a sampling frequency of 32 kHz and a
resolution of 12 bit. The push-to-talk switch is the push-button that the controller has
to press and hold in order to transmit the voice signal on the real-world radio. The
speech signal was automatically muted when the push-button was not pressed. This
results in a truncation of the speech signal if the push-button was pressed too late or
released too early. Figure 7.2 shows an example of the recorded signals. The recorded
PTT signal is a high-frequency tone that is active when the PTT switch is not pressed.
After the digital transfer of the DAT tapes onto a personal computer and an automatic
format conversion to a resolution of 16 bit by setting the less significant bits to zero, the
status signal of the push-to-talk button could after some basic processing be used to
reliably perform an automatic segmentation of the recorded voice signal into separate
controller utterances.
1
The raw recordings were collected by Horst Hering (EUROCONTROL) for a different purpose and
prior to the work presented herein.
99
Chapter 7. ATC Simulation Speech Corpus
Figure 7.1.: Control room and controller working position at the EUROCONTROL
Experimental Centre (recording site).
100
7.2. ATCOSIM Recording and Processing
Figure 7.2.: A short speech segment (transwede one zero seven rhein identified)
with push-to-talk (PTT) signal. Time domain signal and spectrogram of
the PTT signal (top two) and time-domain signal and spectrogram of the
speech signal (bottom two).
101
Chapter 7. ATC Simulation Speech Corpus
7.3. Orthographic Transcription
The speech corpus includes an orthographic transcription of the controller utterances.
The orthographic transcriptions are aligned with each utterance.
7.3.1. Transcription Environment
The open-source tool TableTrans was chosen for the transcription of the corpus [127].
TableTrans was selected for its table-based input structure as well as for its capability to
readily import the automatic segmentation. The transcriptionist fills out a table in the
upper half of the window where each row represents one utterance. In the lower half
of the window the waveform of the utterance that is currently selected or edited in the
table is automatically displayed (Figure 7.3). The transcriptionist can play, pause and
replay the currently active utterance by a single key stroke or as well select and play a
certain segment in the waveform display. A small number of minor modifications to
the TableTrans applications were made in order to lock certain user interface elements
and to extend its replay capabilities.
A number of keyboard shortcuts were provided to the transcriptionist using the open-
source tool AutoHotKey [128]. These were used for conveniently accessing alternative
time-stretched sound files
2
and for entering frequent character or word sequences such
as predefined keywords, ICAO alphabet spellings and frequent commands, both for
convenience and in order to avoid typing mistakes.
7.3.2. Transcription Format
The orthographic transcription follows a strict set of rules which is given in Appendix A.
In general, all utterances are transcribed word-for-word in standard British English. All
standard text is written in lower-case. Punctuation marks including periods, commas
and hyphens are omitted. Apostrophes are used only for possessives (e.g. pilot’s
radio)
3
and for standard English contractions (e.g. it’s, don’t). Numbers, letters,
navigational aids and radio call signs are transcribed following a given definition
based on several aeronautical standards and references. Regular letters and words
are preceded or followed by special characters to mark truncations (=), individually
pronounced letters (~) or unconfirmed airline names (@).
Stand-alone technical mark-up and meta tags are written in upper case letters with
enclosing squared brackets. They denote human noises such as coughing, laughing and
sighs ([HNOISE]), fragments of words ([FRAGMENT]), empty utterances ([EMPTY]), non-
sensical words ([NONSENSE]), and unknown words ([UNKNOWN]). Groups of words are
embraced by opening and closing XML-style tags to mark off-talk (<OT> ... </OT> ),
which is also transcribed, and foreign language (<FL> </FL>), for which a transcrip-
tion could be added at a later stage. Table 7.2 gives several examples of transcribed
2
The duration of each utterance was stretched by a factor of 1.7 using the PRAAT implementation of the
PSOLA method [77]. These time-stretched sound files were used only when dealing with utterances
that were difficult to understand.
3
Corpus transcription excerpts are written in a mono-spaced typewriter font.
102
7.3. Orthographic Transcription
F
i
g
u
r
e
7
.
3
.
:
S
c
r
e
e
n
-
s
h
o
t
o
f
t
h
e
t
r
a
n
s
c
r
i
p
t
i
o
n
t
o
o
l
T
a
b
l
e
T
r
a
n
s
.
103
Chapter 7. ATC Simulation Speech Corpus
utterances.
Silent pauses both between and within words are not transcribed. For consistent
results this would require an objective measure and criterion and is thus easier to
integrate in combination with a potential future word segmentation of the corpus.
Also technical noises as well as speech and noises in the background—produced by
speakers other than the one recorded—are not transcribed, as they are virtually always
present and are part of the typical acoustical situation in an air traffic control room.
Table 7.2.: Examples of Controller Utterance Transcriptions in the ATCOSIM Corpus
Human noises are labeled with [HNOISE], word fragments with [FRAGMENT] and unintelli-
gible words with [UNKNOWN]. Truncations are marked with an equals sign (=), individually
pronounced letters with a tilde (~), and beginning and end of off-talk is labeled with <OT>
and </OT>.
aero lloyd five one seven proceed direct to frankfurt
speedway three three five two contact milan one three four five two bye
ah lufthansa five five zero four turn right ten degrees report new heading
hapag lloyd six five three in radar contact climb to level two nine zero
scandinavian six one seven proceed to fribourg fox romeo india
ind= israeli air force six eight six resume on navigation to tango
good afternoon belgian airforce forty four non ~r ~v ~s ~m identified
[HNOISE] alitalia six four seven four report your heading
[FRAGMENT] hapag lloyd one one two identified
sata nine six zero one is identified <OT> oh it’s over now </OT>
aero lloyd [UNKNOWN] charlie papa alfa guten tag radar contact
7.3.3. Transcription Process and Quality Assurance
The entire corpus was transcribed by a single person, which promises high consistency
of the transcription across the entire database. The native English speaker was intro-
duced to the basic ATC phraseology [113] and given lists covering country-specific
toponyms and radio call signs (e.g., [129]). Clear transcription guidelines were estab-
lished and new cases that were not yet covered by the transcription format immediately
discussed.
Roughly three percent of all utterances were randomly selected across all speakers
and used for training of the transcriptionist. This training transcription was also used
to validate the applicability of the transcription format definition and minor changes
were made. The transcriptions collected during the training phase were discarded and
the material re-transcribed in the course of the final transcription.
After the transcription was finished, the transcriptionist once again reviewed all
utterances, verified the transcriptions and applied corrections where necessary. Re-
maining unclear cases were shown to an operational air traffic controller and most of
104
7.4. ATCOSIM Structure, Validation and Distribution
them resolved.
Due to the frequent occurrence of special location names and radio call signs an
automatic spell check was not performed. Instead of this, a lexicon of all occurring
words was created, which includes a count of occurrence and examples of the context
in which the word occurs. Due to the limited vocabulary used in ATC, this list consists
of less than one thousand entries including location names, call signs, truncated words,
and special mark-up codes. Every item of the list was manually checked and thus
typing errors eliminated.
7.4. ATCOSIM Structure, Validation and Distribution
The entire corpus including the recordings and all meta data has a size of approximately
2.5 gigabyte and is available in digital form on a single DVD or an electronic ISO disk
image. Some statistics of the corpus are given in Table 7.3.
7.4.1. Corpus Structure and Format
The ATCOSIM corpus data is composed of four directories.
The ‘WAVdata’ directory contains the recorded speech signal data as single-channel
Microsoft WAVE files with a sample rate of 32 kHz and a resolution of 16 bits per
sample. Each file corresponds to one controller utterance. The 10,078 files are located
in a sub-directory structure with a separate directory for each of the ten speakers and
sub-directories thereof for each simulation session of the speaker.
The ‘TXTdata’ directory contains single text files with the orthographic transcription
for each utterance. They are organized in the same way as the audio files. The directory
also contains an alphabetically sorted lexicon and a comma-separated-value file which
includes not only the transcription of all utterances but also all meta data such as
speaker, utterance and session IDs and transcriptionist comments.
The files in the ‘HTMLdata’ directory are HTML files which present the transcriptions
and the meta data in a table-like format. They enable immediate sorting, reviewing
and replaying of utterances and transcriptions from within a standard HTML web
browser.
Last, the ‘DOC’ directory contains all documentation related to the corpus.
7.4.2. Validation
The validation of the database was carried out by the Signal Processing and Speech
Communication Laboratory (SPSC) of Graz University of Technology, Austria. The
examiner and author of the validation report has not been involved in the production
of the ATCOSIM corpus, but only carried out an informal pre-validation and the formal
final validation of the corpus.
The validation procedure followed the guidelines of the Bavarian Archive for Speech
Signals (BAS) [130]. It included a number of automatic tests concerning complete-
ness, readability, and parsability of data, which were successfully performed without
105
Chapter 7. ATC Simulation Speech Corpus
Table 7.3.: Key Figures of the ATCOSIM Corpus
Duration total (thereof speech) 51.4 h (10.7 h)
Data size 2.4 GB
Speakers (thereof female/male) 10 (4/6)
Sessions total 50
Sessions per speaker 7, 9, 5, 6, 1, 2, 2, 8, 7, 3
Utterances total 10078
Utterances per speaker
1167, 1848, 808, 1162, 238,
384, 378, 1716, 1739, 638
Utterance duration
(mean, std. deviation, min, max)
3.8 s, 1.6 s, 0.04 s, 38.9 s
Utterances, containing
- <FL> </FL> 182
- <OT> </OT> 84
- [EMPTY] 319
- [FRAGMENT] 35
- [HNOISE] 62
- [NONSENSE] 11
- [UNKNOWN] 11
Words 108883
Characters (thereof without space) 626425 (517542)
Lexicon entries
- Total 858
- Meta tags 9
- Truncations 106
- Compounds 13
- Unique words 730
106
7.5. Conclusions
revealing errors. Furthermore, manual inspections of documentation, meta data, tran-
scriptions, and the lexicon were done, which showed minor shortcomings that were
fixed before the public release of the corpus. Finally, a re-transcription of 1% of the
corpus data was made by the examiner, showing a transcription accuracy on word
level of 99.4%, proving the transcriptions to be accurate. The ATCOSIM corpus was
therefore declared to be in a usable state for speech technology applications.
7.4.3. License and Distribution
The ATCOSIM corpus is available online at http://www.spsc.tugraz.at/ATCOSIM and
provided free of charge. It can be freely used for research and development, also in
a commercial environment. The corpus is also foreseen to be distributed on DVD
through the European Language Resources Association (ELRA).
7.5. Conclusions
The ATCOSIM corpus is a valuable contribution to application-specific language re-
sources. To our best knowledge currently no other speech corpus is publicly available
that contains non-prompted air traffic control speech with direct microphone record-
ings, as it is difficult to produce such recordings for public distribution. The large-scale
real-time ATC simulations exploited for this corpus production provide an opportunity
to record ATC speech which is very similar to operational speech, while avoiding the
legal hassle of recording operational ATC speech.
Applications
The application possibilities for spoken language technologies in air traffic control
are manifold, and the corpus can be utilized within different fields of speech-related
research and development.
ATC Language Study Analysis can be undertaken on the language used by controllers
and on the instructions given.
Listening Tests The speech recordings can be used as a basis for ATC-related listening
tests and provide real-world ATC language samples.
Speaker Identification The corpus can be used as testing material for speaker identi-
fication and speaker segmentation applications in the context of ATC.
ATC Speech Transmission System Design The corpus can be used for the design,
development and testing of ATC speech transmission and enhancement systems.
Examples where the corpus was applied include the work presented in this thesis
and the pre-processing of ATC radio speech to improve intelligibility [131].
Speech Recognition The corpus can also be used for the development and training
of speech recognition systems. Such systems might become useful in future
107
Chapter 7. ATC Simulation Speech Corpus
ATC environments and for example be used to automatically input into the ATC
system the controller instructions given to pilots .
We expand on the speech recognition example:
The controller sees among other information the current flight level of the aircraft on
the radar screen in the text label corresponding to the aircraft. For example, a controller
issues an instruction to an aircraft to climb from its current flight level 300 to flight level
340 (e.g. “sabena nine seven zero climb to flight level three four zero”). In
certain ATC display systems, the controller now enters this information (‘climb to 340’)
into the system and it shows up in the aircraft label, as this information is relevant
later on when routing other adjacent aircraft. However, the voice radio message sent to
the pilot already contains all information required by the system, namely the aircraft
call-sign and the instruction.
Depending on the achievable robustness, an automatic speech recognition (ASR)
system that recognizes the controller’s voice radio messages could perform various
tasks: In case of extremely high accuracy the system could gather the information
directly from the voice message without any user interaction. The ASR system could
otherwise provide a small list of suggestions, to ease the process of entering the
instructions into the system. Alternatively, the system could compare in the background
the voice messages sent to the pilot and the instructions entered into the system, and
give a warning in case of apparent discrepancies.
Compared to other ASR applications, the conditions would be comparably favorable
in this scenario: The signal quality is high due to the use of a close-talk microphone and
the absence of a transmission channel. The vocabulary is limited, and additional side
information, such as the aircraft present in the sector and context-related constraints,
can be exploited. The ASR system can be speaker-dependent and pre-trained, and
continue training due do the constant feedback given by the controller during operation.
Possible Extensions
Depending on the application, further annotation layers might be useful and could be
added to the corpus.
Phonetic Transcription A phonetic transcription is beneficial for speech recognition
purposes. For accurate results it requires transcriptionists with a certain amount of
experience in phonetic transcription or at least a solid background in phonetics and
appropriate training.
Word and Phoneme Segmentation A more fine-grained segmentation is also bene-
ficial for speech recognition purposes. It can often be performed semi-automatically,
but still requires manual corrections and substantial effort.
Semantic Transcription A semantic transcription would describe in a formal way the
actual meaning of each utterance in terms of its functionality in air traffic control, such
108
7.5. Conclusions
as clearance and type of clearance, read-back, request, . . . This would support speech
recognition and language interface design tasks, as well as ATC language studies. The
production of such a transcription layer requires good background knowledge in air
traffic control. Due to lack of contextual information, such as the pilots’ utterances,
certain utterances might appear ambiguous.
Call Sign Segmentation and Transcription This transcription layer marks the signal
segment which includes the call sign and extracts the corresponding part of the
(already existing) orthographic transcription. This can be considered as a sub-part
of the semantic transcription which can be achieved with significantly less effort and
requires little ATC related expertise. Nevertheless this might be beneficial for certain
applications such as language interface development.
Extension of Corpus Size and Coverage With an effective size of ten hours of
controller speech the corpus might be too small for certain applications. There are
two reasons that would support the collection and transcription of more recordings.
The first reason is the pure amount of data that is required by the training algorithms
in modern speech recognition systems. The second reason is the need to extend the
coverage of the corpus in terms of speakers, phonological distribution and speaking
style, as well as control task and controlled area.
This chapter presented a database of ATC controller speech, which represents the
input signal of the watermarking-based enhancement application considered in this
thesis. Likewise, the following chapter presents a database of measurements of the
ATC voice radio channel, which constitutes the transmission channel of the application
at hand.
109
Chapter 7. ATC Simulation Speech Corpus
110
Chapter 8
Aeronautical Voice Radio Channel
Measurement System and Database
This chapter presents a system for measuring time-variant channel impulse
responses and a database of such measurements for the aeronautical voice
radio channel. Maximum length sequences (MLS) are transmitted over the
voice channel with a standard aeronautical radio and the received signals
are recorded. For the purpose of synchronization, the transmitted and
received signals are recorded in parallel to GPS-based timing signals. The
flight path of the aircraft is accurately tracked. A collection of recordings
of MLS transmissions is generated during flights with a general aviation
aircraft. The measurements cover a wide range of typical flight situations as
well as static back-to-back calibrations. The resulting database is available
under a public license free of charge.
Parts of this chapter have been published in K. Hofbauer, H. Hering, and G. Kubin, “A measurement
system and the TUG-EEC-Channels database for the aeronautical voice radio,” in Proceedings of the
IEEE Vehicular Technology Conference (VTC), Montreal, Canada, Sep. 2006, pp. 1–5.
111
Chapter 8. Voice Radio Channel Measurements
Audio
Player
TX
Radio
Audio
Rec.
RX
Radio
Channel
Audio Channel
Sync
Figure 8.1.: A basic audio channel measurement system.
8.1. Introduction
A large number of sophisticated radio channel models exist, based on which the
behavior of a channel can be derived and simulated [132, 133, 6]. In Appendix B,
the basic concepts in the modeling and simulation of the mobile radio channel are
reviewed. The propagation channel is time-variant and dominated by multipath
propagation, Doppler effect, path loss and additive noise. Stochastic reference models
in the equivalent complex baseband facilitate a compact mathematical description of
the channel’s input-output relationship. Three different small-scale area simulations of
the aeronautical voice radio channel are presented in Appendix B, and we demonstrate
the practical implementation based on a scenario in air/ground communication.
Unfortunately, very few measurements of the aeronautical voice radio channel that
could support and quantify the theoretical models are available to the public [134].
The aim of the experiments presented in this chapter is to obtain knowledge about
the characteristics of the channel through measurements in realistic conditions. These
measurements shall support the insight obtained through the existing theoretical
channel models discussed in Appendix B. Therefore, we measure the time-variant
impulse response of the aeronautical VHF voice radio channel between the aircraft and
the ground-based transceiver station under various conditions.
The outline of this chapter is as follows: Section 8.2 shows the design and implemen-
tation of a measurement system for the aeronautical voice channel. Section 8.3 gives
an overview of the conducted measurement flights. The collected data form the freely
available TUG-EEC-Channels database, which is presented in Section 8.4.
8.2. Measurement System for the Aeronautical Voice Radio
Channel
Conventional wideband channel sounding is virtually unfeasible in the aeronautical
VHF band. It requires a large number of frequency channels, which are reserved
for operational use and are therefore not available. Thus a narrowband method is
presented, which moreover allows a simpler and less expensive setup.
Figure 8.1 shows the basic concept of the proposed system: A known audio signal
is transmitted over the voice radio channel and the received signal is recorded. The
112
8.2. Measurement System for the Aeronautical Voice Radio Channel
time-variant impulse response of the channel can then be estimated from the known
input/output pairs.
Measuring the audio channel from baseband to baseband has certain advantages and
disadvantages. On the one hand, besides its simplicity, the measurement already takes
place in the same domain as where our target application, a speech watermarking
system, would be implemented; on the other hand, a baseband measurement does not
reveal a number of parameters that could be obtained with wideband channel sounding,
such as distinct power delay and Doppler spectra. Although these parameters cannot
be directly measured with the proposed system, their effect on the audio channel can
nevertheless be captured.
8.2.1. Synchronization between Aircraft and Ground
As already indicated in Figure 8.1, it is necessary to synchronize the measurement
equipment on the transmitter and the receiver side. This linking is necessary as the
internal clock frequency of two devices is never exactly the same and the devices
consequently drift apart. Moreover, the clock-frequency is time-variant, due in part
to its temperature-dependency. But, in a setup where one unit is on the ground
and the other one is in an aircraft, the different clocks cannot be linked and a direct
synchronization is not possible. This problem is common to all channel sounding
systems.
In high-grade wideband channel sounders a lot of effort is put into achieving
synchronization. In former days, accurately matched and temperature-compensated
oscillators were aligned before measurement. Once they were separated, the systems
still stayed in sync for a certain amount of time before they started drifting apart again.
In modern channel sounders atomic clocks provide so stable a frequency reference
that the systems stay in sync for a very long time once aligned. Such systems are
readily available on the market, but come with a considerable financial expense, and
are relatively large and complex in setup.
8.2.2. Synchronization using GPS Signals
The global positioning system (GPS) provides a time radio signal which is available
worldwide and accurate to several nanoseconds. It can therefore also serve as an
accurate frequency and clock reference [135]. Certain off-the-shelf GPS receivers output
a so-called 1PPS (one pulse per second) signal. The signal is a 20 ms pulse train with
a rate of 1 Hz. The rising edge of each pulse is synchronized with the start of each
GPS and UTC (coordinated universal time) second with an error of less than 1 µs. This
accuracy is in general more than sufficient for serving as clock reference for a baseband
measurement system.
To the authors’ knowledge no battery-powered portable audio player/recorder with
an external synchronization input was available on the market at the time of measure-
ments in 2006. As a consequence, direct synchronization between the transmitter and
the receiver-side is in our application again not possible, even though a suitable clock
reference signal would be available.
113
Chapter 8. Voice Radio Channel Measurements
We propose in the remainder of this section a measurement method and setup
with which synchronization can nevertheless be achieved by recording the clock
reference signal in parallel to the received and transmitted signals and appropriate
post-processing.
8.2.3. Hardware Setup
The basic structure of the measurement setup is shown in Figure 8.2. On the aircraft,
the signals are transmitted and received via the aircraft-integrated VHF transceiver.
The measurement audio signal is replayed by a portable CompactFlash-based audio
player/recorder [136]. The transmitted and received signals are recorded with a second
unit, in parallel with the 1 PPS synchronization signal which is provided by the GPS
module [137]. The aircraft position, altitude, speed, heading, etc., are accurately
recorded by a hand-held GPS receiver once every two seconds [138]. The setup is
portable, battery-powered, rigid, and can be easily integrated into any aircraft. The
only requirement is a connection to the aircraft radio headset connectors. Figure 8.3
shows the final hardware setup on-board the aircraft. An identical setup is used on the
ground.
All devices are interconnected in a star-shaped manner through a custom-built
interface box, which is fitted into a metal housing unit containing some passive circuitry
and standard XLR and TRS connectors (Figure 8.4). The interface box provides a push-
to-talk (PTT) switch remote control together with status indication and recording, and
power supply, status indication and configuration interface for the GPS receiver. The
box serves as the central connection point and provides potentiometers for signal level
adjustments, test points for calibration and a switch to select transmit or receive signal
routing. The design considers the necessary blocking of the microphone power supply
voltage, the appropriate shielding and grounding for electromagnetic compatibility
(EMC) as well as noise and cross-talk suppression.
8.2.4. Measurement Signal
The measurement signal consists of a continuously repeated binary maximum length
sequence (MLS) of length L = 63 samples or T = 7.875 ms at a chip rate of 8 kHz
[139]. This length promises a good trade-off between signal-to-noise ratio, frequency
resolution and time resolution. It results in an MLS frame repetition rate of 127 Hz
and an excitation of the channel with a white pseudo-random noise signal with energy
up to 4 kHz. The anticipated rate of channel variation at the given maximum aircraft
speed and the bandwidth of the channel are below these values [6].
The transmitted MLS sequence is interrupted once per minute by precomputed
dual-tone multi-frequency (DTMF) tone pairs, in order to ease rough synchronization
between transmitted and received signals. All audio files are played and recorded
at a sample rate of 48 kHz and with a resolution of 16 bit. The measurement signal
is therefore upsampled to 48 kHz by insertion of zeros after each sample in the time
domain and low-pass interpolation with a symmetric FIR filter (using Matlab’s interp-
function). The signal is continuously transmitted during the measurements. However,
114
8.2. Measurement System for the Aeronautical Voice Radio Channel
Ground system
Aircraft system
Audio Player/Recorder
Aircraft integrated
VHF radio transceiver
Aircraft integrated
antenna system
Auxiliary VHF radio
for pilot
GPS receiver
Ground-based
receiver antenna
Ground-based
VHF radio transceiver
2ch Audio Recorder
GPS data logger
GPS receiver with
time output (1PPS)
Interface
Circuit
2ch Audio Recorder
GPS receiver with
time output (1PPS)
Interface
Circuit
Audio Player/Recorder
VHF radio
channel
8.33kHz
Figure 8.2.: Overview of the proposed baseband channel measurement system, using
GPS timing signals for synchronization.
115
Chapter 8. Voice Radio Channel Measurements
Figure 8.3.: Complete voice channel measurement system on-board an aircraft, with
audio player and recorder, interface box, synchronization GPS (top left),
and tracking GPS (bottom right).
small interruptions are made every 30 seconds and every couple of minutes due to
technical and operational constraints.
8.2.5. Data Processing
The post-processing of the recorded material consists of data merging, alignment,
synchronization, processing, and annotation, and is mostly automated in Matlab. The
aim is to build up a database of annotated channel measurements, which is described
in Section 8.4. It follows an outline of the different processing steps.
8.2.5.1. Incorporation of Side Information
The original files are manually sorted into a directory structure and labeled with an
approximate time-stamp. The handwritten notes about scenarios, the transmission
directions, the corresponding files, the time-stamps and durations are transferred into
a data structure to facilitate automatic processing. Based on the data structure, the
start time of each file is estimated.
The GPS data is provided in the widely used GPX-format and is imported into Matlab
by parsing the corresponding XML structure, from which longitude, latitude, altitude
and time-stamp can be extracted [140]. The remaining parameters are computed
out of these values, also considering the GPS coordinates of the airport tower. The
speed values are smoothed over time using robust local regression smoothing with the
RLOESS method with a span of ten values or 20 s [141].
116
8.2. Measurement System for the Aeronautical Voice Radio Channel
Signal Wiring
Chassis and Signal GND
Battery V+
Battery GND
Test Wiring
GPS Config
VHF Voice Radio
GPS
Receiver
Audio Player Audio Recorder
Audio IN
Audio OUT
Audio OUT
(selected)
Audio IN
SyncOUT
Audio OUT
PTT OUT
Mic IN
+ PTT
Audio
IN/OUT
Mic IN Phones OUT
Audio OUT Audio IN 2 Audio IN 1
GPS I/O
S
e
r
i
a
l

I
/
O

G
P
S

a
n
d

T
e
s
t
Battery
1
2
3
4
5
6
7
G
8
9
T R S T R S T R S T R S
T R S T R S T S
+ -
1 2 3 4 5 6 7 G
100k
100k
100k
22µ
22n
6k8 Tx Rx
PTT
GPS
330R
PTT
Figure 8.4.: Overview and circuit diagram of measurement system and interface box.
117
Chapter 8. Voice Radio Channel Measurements
8.2.5.2. Analysis of 1PPS Signal and Sample Rate
The recorded files are converted into Matlab-readable one-channel Wave files using a
PRAAT script [77]. From the 1PPS recording, the positions of the pulses are identified
by picking the local maximum values of a correlation with a prototype shape. As
it is known that the pulses are roughly one second apart, false hits can be removed,
and missing pulses can be restored by interpolation. The time difference in samples
between two adjacent pulses defines the effective sample rate at which the file was
recorded. It was found that the sample rate is fairly constant but differs among the
different devices by several Hertz. It is therefore necessary to compensate for the
clock rate differences. For the aircraft and ground recordings, the sample rate can be
recovered by means of the 1PPS track. Based on those the sample rate of the audio
player can also be estimated.
8.2.5.3. Detection of DTMF Tones and Offset
For detecting the DTMF tones, the non-uniform discrete Fourier transform (NDFT) is
computed at the specific frequencies of the DTMF tones [142]. The tones are detected
in the power spectrum of the NDFT by correlation with a prototype shape, as the
duration and relative position of the tones is known from the input signal. Again after
filtering for false hits, the actual DTMF symbol is determined by the maxima of the
NDFT output values at the corresponding position. The detected DTMF sequences of
corresponding air/ground recording pairs are then aligned with approximate string
matching using the Edit distance measure [108]. Their average offset in samples is
computed and verified with the offset that is obtained from the data structure with
the manual annotations. These rough offsets are refined using the more accurate 1PPS
positions of both recordings.
8.2.5.4. Frame Segmentation and Channel Response
The recording of the received signal with a sample rate of 48 kHz is split up into
frames of 378 samples, which is the length of one MLS. The corresponding frame in the
recording of the transmitted signal is found by the previously computed time-variant
offsets between the 1PPS locations in the two recordings, and then with the local
position in-between two 1PPS pulses. Through this alignment the synchronization
between transmitted and recorded signal is reestablished. The alignment is accurate
to one 48 kHz sample, which is one sixth of a sample in terms of the original MLS
sequence.
Based on the realigned transmission input-output pairs, the channel frequency
response for every frame is estimated based on FIR system identification using the
cross power spectral density (CPSD) method with
H(ω
k
) =
X(ω
k
) Y(ω
k
)
[X(ω
k
)[
2
, (8.1)
where X, Y, and H are the discrete Fourier transforms of the channel’s input signal,
output signal and impulse response, respectively [143]. For numerical stability the
118
8.3. Conducted Measurements
(a) Aircraft during measurements. (b) Aircraft transceiver.
Figure 8.5.: General aviation aircraft and onboard transceiver used for measurements.
signals are downsampled to a sample rate of 8 kHz before system identification, as the
channel was excited only up to a frequency of 4 kHz.
All frames are annotated with the flight parameters and the contextual information
of the meta data structure using the previously computed time-stamps and offsets.
A more accurate channel model and estimation scheme based on the measurement
results is presented in Chapter 9.
8.3. Conducted Measurements
A number of ground-based and in-flight measurements were undertaken with a general
aviation aircraft at and around the Punitz airfield in Austria (airport code LOGG).
After initial trials at the end of March 2006, measurements were undertaken for two
days in May 2006.
The aircraft used was a 4-seat Cessna F172K with a maximum speed of approximately
70 m/s and a Bendix/King KX 125 TSO communications transceiver (Figure 8.5). The
initial trials were undertaken with a different aircraft.
On ground, the communications transceiver Becker AR 2808/25 was used (Figure 8.6).
It is permanently installed in the airport tower in a fixed frequency configuration.
For reference and comparison, a Rohde&Schwarz EM-550 monitoring receiver which
provides I/Q demodulation of the received AM signal was applied. The I/Q data with a
bandwidth of 12 kHz was digitally recorded onto a laptop computer, however without
a 1PPS synchronization signal. Synchronization is to a certain extent nevertheless
possible using the DTMF markers in the transmitted signal. For the monitoring
receiver a
λ
4
-ground plane antenna with a magnetic base was mounted onto the roof of
a van.
The measurements occurred on the AM voice radio channel of Punitz airport at
a carrier frequency of 123.20 MHz. Measurements were taken both in uplink and
downlink direction, with the tower or the aircraft radio transmitting, respectively. The
I/Q monitoring receiver continuously recorded every transmission.
119
Chapter 8. Voice Radio Channel Measurements
(a) I/Q measurement receiver.
(b) Airport tower transceiver.
Figure 8.6.: Ground-based transceivers used for measurements.
120
8.4. The TUG-EEC-Channels Database
A series of measurement scenarios was set up which covered numerous situations.
The ground measurements consisted of static back-to-back measurements with the
aircraft engine on and off. Static measurements were undertaken with the aircraft
at several positions along a straight line towards the tower, spaced out by 30 cm.
Additional measurements were taken with a vehicle parked next to the aircraft. The
ground based measurements were concluded with the rolling of the aircraft on the
paved runway at different speeds. A variety of flight maneuvers was undertaken.
These covered the following situations and parameter ranges:
• Low and high altitudes (0m to 1200m above ground)
• Low and high speeds (up to 250 km/h)
• Ascends and descends
• Headings perpendicular and parallel to line-of-sight
• Overflights and circular flights around airport
• Take-offs, landings, approaches and overshoots
• Flights across, along and behind a mountain
• Flights in areas with poor radio reception (no line-of-sight)
Thorough organizational planning, coordination and cooperation between all parties
involved proved to be crucial for a successful accomplishment of measurements in the
aeronautical environment.
8.4. The TUG-EEC-Channels Database
Based on the above measurements we present an aeronautical voice radio channel
database, the TUG-EEC-Channels database. It is available to the research community
under a public license free of charge online at the following address:
http://www.spsc.TUGraz.at/TUG-EEC-Channels/
The TUG-EEC-Channels database covers the channel measurements described in Sec-
tion 8.3 with three extensive flights and several ground measurements with a total
duration of more than five hours or two million MLS frames. For each frame of
8 ms (corresponding to 378 samples at 48 kHz) the following data and annotation is
provided:
• Time-stamp
• Transmission recording segment
• Reception recording segment
• Type of transmitted signal
121
Chapter 8. Voice Radio Channel Measurements
0 1000 2000 3000 4000
−30
−25
−20
−15
−10
−5
0
5
Frequency in Hz
M
a
g
n
i
t
u
d
e

i
n

d
B
Downlink Channel
0 1000 2000 3000 4000
−30
−25
−20
−15
−10
−5
0
5
Frequency in Hz
M
a
g
n
i
t
u
d
e

i
n

d
B
Uplink Channel
Figure 8.7.: Normalized frequency response of the static back-to-back voice radio
channel in up- and downlink direction.
• I/Q recording segment
• Link to original sequence
• Uplink/downlink direction
• Estimated channel impulse response
• Flight parameters elevation, latitude, longitude, distance to tower, speed, and
azimuth relative to line-of-sight
• Experimental situation (engine on/off, aircraft on taxiway or flying, . . . )
• Plain-text comments
An exact description of the data format of the TUG-EEC-Channels and additional
information can be found on the website. Upon request, the raw MLS recordings as well
as a number of voice signal transmission recordings and voice and background noise
recordings made in the cockpit using conventional and avionic headset microphones
can also be provided.
The following figures present several excerpts of the database. Figure 8.7 shows
the magnitude response of the transmission chain in both directions, which is mostly
determined by the radio transceivers. In Figure 8.8, the magnitudes of selected
frequency bins of the downlink channel are plotted over time. The basic shape of
the frequency response is maintained, but the magnitude varies over time. This time
variation results from a synchronization error which is compensated in the more
accurate model of Chapter 9. Figure 8.9 illustrates a GPS track.
8.5. Conclusions
We presented a system for channel measurements and the TUG-EEC-Channels database
of annotated channel impulse and frequency responses for the aeronautical VHF voice
122
8.5. Conclusions
500
1000
1500
2000
2500
3000
0
0.5
1
−15
−10
−5
0
Time in s
Frequency in Hz
Downlink Channel
M
a
g
n
i
t
u
d
e

i
n

d
B
Figure 8.8.: Evolution of the frequency bins at 635, 1016, 1397, 1778, 2159, 2540 and
2921 Hz over time. Aircraft speed is 44.7
m
s
at a position far behind the
mountain and an altitude of 1156 m.
Figure 8.9.: Visualization of the aircraft track based on the recorded GPS data.
123
Chapter 8. Voice Radio Channel Measurements
channel. The current data can be directly applied to simulate a transmission channel
for system test and evaluation. We hope that the aeronautics community will make
extensive use of our data for future developments, which would also help to verify the
validity of our measurement scenarios.
Beyond that, it would be interesting to see what conclusions with respect to parame-
ters of the aeronautical radio channel models can be drawn from the data. In Chapter 9,
we present a further analysis and abstraction of the data and propose a model of the
aeronautical VHF voice channel based on the collected data.
Acknowledgments
We are grateful to FH JOANNEUM (Graz, Austria), and Dr. Holger Flühr and Gernot
Knoll in particular, for taking charge of the local organization of the measurements.
We also want to thank the Union Sportfliegerclub Punitz for the kind permission to use
their facilities, Flyers GmbH and Steirische Motorflugunion for providing the aircraft,
and Rohde & Schwarz for contributing the monitoring receiver. Finally, we want to
recognize all the other parties that helped to make the measurements possible. The
campaign was funded by EUROCONTROL Experimental Centre (France).
124
Chapter 9
Data Model and Parameter Estimation
In this chapter, we propose a data model to describe the data in the TUG-
EEC-Channels database, and a corresponding estimation method. The model
is derived from various effects that can be observed in the database, such
as different filter responses, a time-variant gain, a sampling frequency
offset, a DC offset and additive noise. To estimate the model parameters,
we compare six well-established FIR filter identification techniques and
conclude that best results are obtained using the method of Least Squares.
We also provide simple methods to estimate and compensate the sampling
frequency offset and the time-variant gain.
The data model achieves a fit with the measured data down to an error
of -40 dB, with the modeling error being smaller than the channel’s noise.
Applying the model to select parts of the database, we conclude that the
measured channel is frequency-nonselective. The data contains a small
amount of gain modulation (flat fading). Its source could not be conclusively
established, but several factors indicate that it is not a result of radio channel
fading. The observed noise levels are in a range from 40 dB to 23 dB in
terms of SNR.
This chapter presents recent results that have not yet been published.
125
Chapter 9. Data Model and Parameter Estimation
9.1. Introduction
The TUG-EEC-Channels database contains an estimated channel impulse response for
every measurement frame based on the cross power spectral density of the realigned
channel input and output signal of the corresponding frame. This set of impulse
responses can itself form a channel description based on the assumption of a time-
variant filter channel model. We will apply this channel description, among others, in
the evaluation of the watermarking system in Chapter 10.
A manual inspection of the recorded signals and the estimated channel responses re-
vealed that the simple time-variant filter channel model underlying the CPSD estimates
of (8.1) may not be an adequate representation for the input/output relationship of the
data and may lead to wrong conclusions. We found the following signal properties
that are not represented in the basic model.
Additive Noise The recorded signals are corrupted by additive noise, which fully
affects the CPSD channel estimates.
DC Offset There are significant time-variant DC offsets in the recorded signals and
the estimated impulse responses.
Time Variation There is a systematic time-variation of the impulse response estimates
of (8.1), even in the case of the aircraft not moving. Figure 9.1 shows this for a
short segment of a static back-to-back measurement with the aircraft on ground
(frame number 2012056 to 2013055). We denote the n’th sample of the estimated
impulse response of the k’th frame with h(n, k). The DC offset of each impulse
response has been removed a-priori by subtracting the corresponding mean value.
Sampling Frequency Offset Albeit small, there is a sampling frequency offset be-
tween the recordings of the transmitted and received signals.
Gain Modulation Measured in a sliding rectangular window with a length of one
frame (i.e., 63 samples or 378 samples at a sample rate of 8 kHz or 48 kHz, respec-
tively), the transmitted signal is due to its periodicity of exactly constant power.
Measuring the recordings of the received signal with the same sliding window
reveals an amplitude (gain) modulation, which appears to be approximately
sinusoidal with a frequency of 40–90 Hz.
In the remainder of this chapter, we propose an improved channel model and a corre-
sponding estimation method that provides stable and accurate estimations with low
errors, addresses the irregularities in the channel estimates contained in the database,
and provides a mean to estimate the noise in the measured channel by incorporating
neighboring frames. We apply the method to select parts of the measurement database
to demonstrate the validity of the model and to estimate its parameters.
126
9.1. Introduction
(a) Mean and standard deviation of the coefficients over all frames.
0 10 20 30 40 50 60
−0.4
−0.3
−0.2
−0.1
0
0.1
0.2
0.3
0.4
Sample index n
I
m
p
l
u
s
e

r
e
s
p
o
n
s
e


h
(
n
,
k
)


h(n,k) (µ±σ over all frames k)
(b) Time variation of selected coefficients over all frames.
0 200 400 600 800 1000
−0.4
−0.3
−0.2
−0.1
0
0.1
0.2
0.3
0.4
Frame index k
I
m
p
l
u
s
e

r
e
s
p
o
n
s
e

s
a
m
p
l
e


h
(
n
,
k
)


h(3,k)
h(7,k)
h(5,k)
Figure 9.1.: Time-variation of the estimated impulse responses h(n, k) in the TUG-EEC-
Channels database for a static back-to-back measurement over 1000 frames
(8 s).
127
Chapter 9. Data Model and Parameter Estimation
Measurement Errors
(Transmitter-Side)
Voice Radio
Channel
Measurement Errors
(Receiver-Side)
TX RX
Figure 9.2.: Separation between voice radio channel model and measurement errors.
9.2. Proposed Data and Channel Model
We propose an improved channel model that better represents the measured data by
incorporating the aforementioned effects.
9.2.1. Assumptions
It is important to take into consideration that part of the observed effects might be
measurement errors. This is represented in Figure 9.2 by separating the measurement
error and voice radio channel models. It is often difficult to conclusively establish if
an observed effect is caused by the measurement system or the radio channel, and we
make the following assumptions.
By specification, the voice radio channel includes the radio transmitter, the radio
receiver, and the VHF wave propagation channel. As such, it is a bandpass channel and
does not transmit DC signal components. We attribute the DC offset to deficiencies in
the measurement hardware, and to the interface box circuits in particular, and consider
the time-variant DC offset (measured within a window of one measurement frame) as
a measurement error.
We assume that the filtering characteristics of the system can be mostly attributed to
the voice radio channel, since in laboratory measurements the frequency response of
the measurement system was verified to be flat compared to the observed filtering. It
appears that the filtering is dominated by the radio transmitter and receiver filters. One
can observe two particular filter characteristics in the database that solely depend on
the transmission direction (air/ground or ground/air) and, as such, on the transmitter
and receiver used.
The sampling frequency offset can be clearly attributed to the measurement system,
since the sampling clocks of the audio devices were not synchronized and the signals
replayed and recorded with slightly offset sampling frequencies.
We conjecture—and later results will confirm—that the systematic time-variation
of the TUG-EEC-Channels estimates shown in Figure 9.1 results from an insufficient
synchronization between the recordings of the transmitted and the received signals.
The database aims to synchronize all signal segments, no matter their content, using
the GPS timing signals. It successfully does so with an accuracy of one sixth of a
sample in terms of the original MLS chip rate. However, the results will show that
this accuracy is not sufficient for obtaining low estimation errors. In fact, the constant
slope in the lower plot of Figure 9.1 results from a slight drift between the channel
input/output frame pairs. When the offset between a pair is larger than half a sample
128
9.3. Parameter Estimation Implementation
H Resampling TX RX
x y
H
y
Hw
y
Hwg
y
Hwgc
y
HwgcR
w(n)
g(n) c(n)
Voice Radio Channel
Measurement Errors
Figure 9.3.: Proposed data and channel model with a linear and block-wise time-
invariant filter H, additive noise w, time-variant gain g and time-variant
DC offset c. The subscripts of the signal y denote the model components
that the signal has already passed. For example, y
Hw
denotes the input
signal after having passed the filter H and the addition of the noise w.
(at a sample rate of 48 kHz), the offset is automatically compensated by shifting the
input by one sample, which leads to the discontinuities in h(n, k) visible in Figure 9.1.
The observed gain modulation could be a measurement effect or a channel property,
and we will discuss this issue in more detail at the end of this chapter.
We expect that most of the additive noise results from the voice radio channel.
Nevertheless, the measurement system also contributes to the overall noise level. For
example, there is small cross-talk between the recorded MLS and GPS signals due to
deficiencies in the measurement hardware, and there is electromagnetic interference
from the aircraft avionics.
9.2.2. Proposed Model
Based on the above assumptions, we propose a data model as shown in Figure 9.3.
We merge the transmitter- and receiver-side measurement errors into one block. The
overlap between the voice radio channel model and the measurement error model
represents the uncertainty in the attribution of the components in the overlapping
segment to one or the other category. Also, there is an uncertainty in the order of the
model elements, and the depicted arrangement represents one possible choice.
9.3. Parameter Estimation Implementation
Before Section 9.3.3 shows the implementation of an estimation method for the channel
model shown in Figure 9.3, we first address two sub-problems. We present a simple
sampling clock synchronization method using resampling (Section 9.3.1), and compare
different methods to estimate a linear and time-invariant channel filter given a known
channel input and a noise-corrupted channel output (Section 9.3.2).
129
Chapter 9. Data Model and Parameter Estimation
9.3.1. Self-Synchronization of the Received Signal by Resampling
In the following, we focus on signal segments in the database containing MLS signals,
and present a simple resynchronization method in which we leave aside the transmitted
signal recordings, and accurately resample the received signal recordings to the chip
rate of the original MLS. While we do not claim any optimality herein, the presented
method is simple and provides sufficient performance given the task at hand.
A manual inspection of the recordings showed that the sampling frequencies of the
different devices are different among each other, but stable within segments of several
minutes. In the following we assume that the sampling frequencies are constant within
one block. A block denotes a recording fragment that contains a continuous MLS
signal, and typically has a length of 20 s to 30 s.
The original MLS signal used in the measurements has a length of 63 chips at a
chip-rate of 8000 chips/s. Since the signal was upsampled to a sampling frequency
f
0
= 48 kHz, the MLS signal is periodic with a signal period N
0
= 6 63 = 378 samples
or T
0
=
378
48000
s ≈ 7.875 ms. The signal was replayed at an unknown sampling frequency
close to 48 kHz, transmitted over an analog channel, and recorded at a sampling
frequency that is again close to 48 kHz but unknown.
In order to compensate for the two unknown sampling frequencies, we transform
the recorded received signal y
1
(n), denoted RX in the database and corresponding to
y
HwgcR
in Figure 9.3, into a cubic spline representation and resample the spline with a
sampling frequency f
2
such that the signal period of the resulting y
2
(n) is again exactly
N
0
(or T
0
, respectively). We determine the optimal f
2
or, equivalently, the resampling
factor γ =
f
2
f
0
using two autocorrelation-based periodicity measures and a three-step
local maximization procedure.
Let the vector
x
k
= [y
2
(kN
0
), y
2
(kN
0
+1), y
2
(kN
0
+2), . . . , y
2
(kN
0
+ N
0
−2), y
2
(kN
0
+ N
0
−1)]
denote the k’th frame of the resampled signal (with k in the range from 0 . . . M-1,
assuming M frames within one block of y
1
(n), and M even). We define two scalar
error functions
g
1
(γ) =
_
x
0
x
2
. . . x
M−4
x
M−2
¸ _
x
1
x
3
. . . x
M−3
x
M−1
¸
T
g
2
(γ) =
_
x
0
x
1
. . . xM
/2−2
xM
/2−1
¸ _
xM
/2
xM
/2+1
. . . x
M−2
x
M−1
¸
T
.
The functions g
1
and g
2
are similar to an autocorrelation of y
2
at lags N
0
and
N
0
M
2
, and
are plotted in Figure 9.4 for an exemplary block of the database (frame-ID 2011556
to 2013057). While g
1
has a smooth and parabolic shape, which is advantageous for
numeric maximization, its maximum peak is very wide and thus prone to the influence
of noise. In contrast, g
2
has a very narrow and distinct peak, but exhibits many local
maxima.
We maximize g
1
and g
2
as a function of the resampling factor γ =
f
2
f
0
in the following
way:
130
9.3. Parameter Estimation Implementation
(a) Periodicity Measure g
1
0.98 0.985 0.99 0.995 1 1.005 1.01 1.015 1.02
1
2
3
4
5
x 10
12
Frequency scaling f
2
: f
0
P
e
r
i
o
d
i
c
i
t
y

m
e
a
s
u
r
e


g
1


g
1
f
2, opt
(b) Periodicity Measure g
2
0.998 0.9985 0.999 0.9995 1 1.0005 1.001 1.0015 1.002
−2
−1
0
1
2
3
4
x 10
12
Frequency scaling f
2
: f
0
P
e
r
i
o
d
i
c
i
t
y

m
e
a
s
u
r
e


g
2


g
2
f
2, opt
Figure 9.4.: Two periodicity measures as a function of the resampling factor
f
2
f
0
.
131
Chapter 9. Data Model and Parameter Estimation
H
x y
H
y
Hw
w
Figure 9.5.: LTI filter plus noise channel model.
1. Maximization of g
1
over γ in the interval γ = [0.98; 1.02] using golden section
search and parabolic interpolation (Matlab function ‘fminbnd’, [144]), resulting
in the estimate γ
1
.
2. Maximization of g
2
over γ in the interval γ = [γ
1
−2δ; γ
2
+2δ] using a manually
tuned δ = 5 10
−5
+
0.06
M
and a grid search with a search interval of 0.1δ, resulting
in the estimate γ
2
.
3. Maximization of g
2
over γ in the interval γ = [γ
2
−δ; γ
2
+δ], again using golden
section search and parabolic interpolation, and resulting in the final estimates
γ
opt
and f
2,opt
at which the spline is resampled to form y
2
(n).
The frequency offset f
2
− f
0
for different measurement signal segments is included in
Table 9.2.
9.3.2. Comparison of Filter Estimation Methods
We compare six well-established methods to estimate the channel impulse response
of the filter component H of the model in Figure 9.3, in terms of accuracy and noise
robustness.
1
Filter Plus Noise Channel Model
For the comparison of the different estimators, we assume a linear and time-invariant
(LTI) filter plus noise model to characterize the relation between input and output. The
model is characterized by the filter’s impulse response h(n) or frequency response
H(e

), the input signal x(n), the output signal y
Hw
(n) and additive white Gaussian
noise (AWGN) w(n) (see Figure 9.5). We denote with P the order of the filter, with N
the number of input and output samples used for the channel estimation, and with the
circumflex or hat accent ( ˆ ) estimated variables.
Estimation Methods
We compare the estimation accuracy of the following methods.
1
Parts of this subsection are based on M. Gruber and K. Hofbauer, “A comparison of estimation methods
for the VHF voice radio channel,” in Proceedings of the CEAS European Air and Space Conference (Deutscher
Luft- und Raumfahrtkongress), Berlin, Germany, Sep. 2007. The experiments were performed by Mario
Gruber in the course of a Master’s thesis [145].
132
9.3. Parameter Estimation Implementation
Deconvolution The filter’s impulse response
ˆ
h(n) is estimated recursively using
[146]
ˆ
h(p) =
1
x(0)
_
y
Hw
(p) −
p

m=1
x(m)
ˆ
h(p −m)
_
.
Spectral Division The impulse response
ˆ
h(n) is estimated by an inverse discrete
Fourier transformation (DFT) of the frequency response
ˆ
H(k), which is obtained using
[146]
Time domain Frequency domain
y
Hw
(n) = x(n) ∗
ˆ
h(n)
c s
Y(k) = X(k)
ˆ
H(k)
.
Power Spectral Densities The frequency response
ˆ
H(k) can also be obtained using
[146]
Time domain Frequency domain
R
xy
(n) = R
xx
(n) ∗
ˆ
h(n)
c s
G
xy
(k) = G
xx
(k)
ˆ
H(k),
where R
xx
(n) = ∑
N−1
m=0
x(m)x(m + n) is the auto-correlation of the input x(n) and
R
xy
(n) = ∑
N−1
m=0
x(m)y
Hw
(m + n) the cross-correlation between input x(n) and the
output y
Hw
(n), and G
xx
(k) and G
xy
(k) the corresponding power spectral density (PSD)
and cross-PSD.
Method of Least Squares The impulse response vector
ˆ
h is given by the solution of
the normal equation [147]
ˆ
h = (X
T
X)
−1
X
T
y. (9.1)
Using the covariance windowing method, the Toeplitz structured convolution matrix X
is
X =
_
¸
¸
¸
_
x(P −1) x(1) x(0)
x(P) x(2) x(1)
.
.
.
.
.
.
.
.
.
.
.
.
x(N −1) x(N −P +1) x(N −P)
_
¸
¸
¸
_
and the output vector y is
y =
_
¸
¸
¸
_
y
Hw
(
P−1
2
)
y
Hw
(1 +
P−1
2
)
.
.
.
y
Hw
(N −1 −
P−1
2
)
_
¸
¸
¸
_
.
Maximum Length Sequence If the input signal x(n) is a maximum length sequence
(MLS), then R
xx
(n) ≈ δ(n) and the cross-correlation R
xy
(n) between input x(n) and
output y
Hw
(n) approximates the channel’s impulse response, or [148]
ˆ
h(n) = δ(n) ∗
ˆ
h(n) ≈ R
xx
(n) ∗
ˆ
h(n) = R
xy
(n).
133
Chapter 9. Data Model and Parameter Estimation
The approximation R
xx
(n) ≈ δ(n) is only valid for long MLS. The approximation error
can be fully compensated with [148]
ˆ
h(n) =
1
L +1
_
R
xy
(n) +
P−1

m=0
ˆ
h(m)
_
.
Estimation Performance Comparison
Experimental Settings We estimate P = 63 filter coefficients using as input N = 2
14
samples of an audio signal with a sample rate f
s
= 22050 Hz or a repeated MLS signal.
The signal x(n) is filtered using an FIR lowpass filter of order 60 (P
LP
= 61) and a
cutoff-frequency of 4.4 kHz, and AWGN w(n) added at an SNR of 10 dB, with
SNR = 20 log
10
_
(x(n) ∗ h(n))
RMS
w
RMS
_
dB.
The index RMS indicates the root mean square value of the corresponding signal.
Experimental Results We use as error measure the error in the filtered signal ob-
tained with the original and the estimated filter. It is given by
e
y
(n) = y
H
(n) − ˆ y
H
(n) = x(n) ∗ h(n) −x(n) ∗
ˆ
h(n)
and expressed as the ratio
E
y
= 20 log
10
_
e
y, RMS
y
H,RMS
_
dB.
The estimation error for all methods and for the two input signals with and without
noise w(n) is shown in Table 9.1. In the presence of noise the method of Least Squares
outperforms all other methods. This result is consistent with estimation theory, as the
method of Least Squares is an efficient estimator (i.e., it attains the Cramer-Rao lower
bound) given the linear model and additive white Gaussian observation noise [147].
Estimation Parameters and Noise Estimation The number of estimated filter coef-
ficients, so far set to P = 63, influences the observed estimation error. Figure 9.6 shows
that in the noise-free case the estimation is error-free as soon as P > P
LP
(using the
same experimental settings as above, and Least Squares estimation). In the presence of
noise, the estimation error has a minimum at P ≈ P
LP
, since for large P the estimator
begins to suffer more from the channel noise due to the random fluctuations in the
redundant filter coefficient estimates.
The number of input samples, so far set to N = 2
14
samples, affects the observed
estimation error. Assuming a time-invariant channel and using the same experimental
settings as before, the estimation accuracy increases with increasing N (Figure 9.7). In
the case of large N (i.e., N ¸ P) and using the method of Least Squares, it is possible
134
9.3. Parameter Estimation Implementation
Table 9.1.: Filter Estimation Results
Estimation error to output ratio E
y
for different estimation methods,
input signals, and with and without observation noise w.
Estimation Method Audio Audio+Noise MLS MLS+Noise
Deconvolution < -150 dB > +100 dB < -150 dB > +100 dB
Spectral Division < -150 dB -16 dB -68 dB > +100 dB
PSD < -150 dB -21 dB < -150 dB +8 dB
Least Squares < -150 dB -34 dB < -150 dB -34 dB
MLS n/a n/a -27 dB -9.7 dB
MLS (compensated) n/a n/a < -150 dB -10 dB
(a) Noise-Free Channel
0 20 40 60 80 100 120
−350
−300
−250
−200
−150
−100
−50
0
P (number of estimated coefficients)
E
y

i
n

d
B


E
y
P
LP
(b) Noise-Corrupted Channel
0 20 40 60 80 100 120
−40
−30
−20
−10
0
P (number of estimated coefficients)
E
y

i
n

d
B


E
y
P
LP
Figure 9.6.: Estimation error to output ratio E
y
as a function of the number of estimated
coefficients. The original filter has P
LP
= 61 coefficients.
135
Chapter 9. Data Model and Parameter Estimation
0 1000 2000 3000 4000 5000 6000 7000
−35
−30
−25
−20
−15
−10
−5
0
N in samples
E


i
n

d
B


E
w
E
y
Figure 9.7.: Signal and noise estimation errors E
y
and E
w
at an SNR of 10 dB as a
function of the number N of input and output samples ( f
s
= 22050 Hz)
used for estimation.
to estimate not only the filter coefficients h(n), but also the channel noise w(n). The
noise estimate
ˆ w(n) = y
Hw
(n) − ˆ y
H
(n) = y
Hw
(n) −x(n) ∗
ˆ
h(n) (9.2)
has an estimation error
e
w
(n) = w(n) − ˆ w(n),
which is again expressed as an estimation error ratio
E
w
= 20 log
10
_
e
w, RMS
w
RMS
_
dB
and shown in Figure 9.7 as a function of the number of samples used for estimation.
We conclude that, in comparison to the frame-based PSD estimator used for the
database, the Least Squares method is a more suitable estimator. In combination with
using long signal segments it also allows to estimate the channel noise, which is needed
in the channel estimation method presented hereafter.
9.3.3. Channel Analysis using Resampling, Pre-Filtering, Gain
Normalization and Least Squares Estimation
Based on the results of the previous two subsections, we have developed an estimation
method for the channel model of Figure 9.3. We focus only on measurement data
blocks containing MLS signals.
As depicted in Figure 9.8, the estimation consists of the following steps.
1. The recorded received signal y
HwgcR
(n), denoted RX in the database, is resampled
with the resampling factor γ
opt
to 48000 Hz, using the method described in
Section 9.3.1.
2. The resampled signal is decimated (including anti-alias filtering) by a factor of
six to the chip-rate of the original MLS sequence.
136
9.4. Experimental Results and Discussion
RX
x
MLS
y
Hw
y
Hwg
y
Hwgc
y
HwgcR
↓6
Resampling
(48.0 kHz)
Gain
Normalization
Filter
Estimation
MLS
y
H
w
h
ˆ
ˆ
ˆ
ˆ ˆ ˆ ˆ
Figure 9.8.: Estimation of the channel’s impulse response, time-variant gain, and addi-
tive noise based on measured data and with compensation for measurement
errors.
g
-1
y
Hw
y
Hwg
g ˆ
|...| (...)
-1
ˆ
ˆ ˆ
Figure 9.9.: Implementation of the gain normalization.
3. The DC offset c(n) is estimated using a lowpass filter of order 1000 and a cutoff
frequency of 40 Hz.
4. To remove the DC offset, and low and high frequency noise outside the frequency
range of interest, ˆ y
Hwgc
(n) is bandpass filtered using a bandpass filter with a
passband from 100 Hz to 3800 Hz.
5. The time-variant gain g(n) is estimated with a conventional envelope follower
(shown in Figure 9.9, and using a lowpass filter with a cutoff frequency of 100 Hz).
The signal ˆ y
Hwg
(n) is then normalized to unit gain using the gain estimate ˆ g.
6. The channel filter’s impulse response h(n) is estimated using the original binary-
valued MLS signal x
MLS
(n) of the database as input, ˆ y
Hw
(n) as noise-corrupted
output, and the method of Least Squares for estimation, as discussed in Sec-
tion 9.3.2 and computed with Matlab’s backslash operator.
2
7. The channel noise w(n) is estimated using (9.2) and
ˆ w(n) = ˆ y
Hw
(n) −x
MLS
(n) ∗
ˆ
h(n) = ˆ y
Hw
(n) − ˆ y
H
(n).
9.4. Experimental Results and Discussion
To show the validity of the analysis system and to obtain insight in the channel
properties, we apply the analysis system presented in the previous section to select
parts of the database that cover a wide range of experimental conditions.
2
It is assumed that the filter is time-invariant within one block. Section 9.4 will show that the assumption
of a time-variant filter in combination with a tracking estimation method leads to overall inferior
results.
137
Chapter 9. Data Model and Parameter Estimation
9.4.1. Filter and Noise Estimation
The noise estimate ˆ w(n), which is at the same time the error signal of the Least Squares
estimation, is a strong measure for the suitability of the entire channel estimation
scheme. Assuming N ¸ P, the error signal ˆ w(n) is expected to contain no MLS signal
components, but only noise. Thus, we use as error measure the overall power of the
error signal ˆ w(n), as well as its spectral and acoustical quality. If, for example, there is
a sampling frequency drift or gain modulation that is not appropriately compensated,
there is no good overall Least Squares fit, and a significant fraction of the (modulated)
MLS signal resides in ˆ w(n). In fact, parts of the proposed channel model were inspired
by residuals observed in ˆ w(n). We define the estimated signal-to-noise ratio (and error
measure) as
SNR
est
= 20 log
10
_
ˆ w
RMS
ˆ y
H, RMS
_
dB.
Applying the proposed estimation method to the database using P = 63 and always
the full data block for input and output, results in the estimated SNRs as summarized
in Table 9.2 and the estimated frequency responses as shown in Figure 9.10. A manual
inspection of the estimated responses showed that the frequency response mainly
depends on the transmission direction (air/ground or ground/air), and as such on the
transmitting and receiving voice radios. Apart from this, very little other variation of
the frequency response among the different blocks was observed.
Table 9.2 also provides estimation results for two alternative estimation methods. In
the ‘Least Squares’ (LS) method described so far, the full data block is used for the
estimation of h(n), which assumes that h(n) is time-invariant. To accommodate for
potential slow time-variations of h(n, t), the ‘Windowed LS’ method performs a Least
Squares estimation within a window of five frames (315 samples), and calculates an
estimate
ˆ
h(n, t) for every sample of the data block. The table also shows the estimation
error when tracking h(n, t) with an exponentially weighted recursive-least squares
(RLS) adaptive filter [87] with a forgetting factor λ = 0.998.
The results show that the assumption of a time-invariant h(n) leads to the lowest
estimation error. This means that the performance improvement obtained by using as
many samples as possible for the estimation (cf. Figure 9.7) outweighs the performance
degradation induced by the time-invariance assumption. In other words, the time-
variation of the impulse response is smaller than what can be measured given the
channel noise. Also, a manual inspection of the error signals did not reveal any
significant temporal changes within a data block, which is another justification for the
assumption of a time-invariant channel filter h(n).
At low noise levels (e.g., data block number 1), the error signal contains short
repetitive noise bursts with a constant frequency of approximately 6 Hz. The origin of
these noise bursts is unknown. However, since the same noise bursts are audible in
recordings of signal pauses of the database when no MLS signal was transmitted, we
conclude that the bursts are not an estimation error but are present in the signal. Due
to their periodic occurrence it appears likely that the bursts result from electromagnetic
interference within the measurement system itself or between the measurement system
and other components of the aircraft or ground systems. At medium noise levels
138
9.4. Experimental Results and Discussion
(a) Air/Ground Transmission
0 500 1000 1500 2000 2500 3000 3500 4000
−30
−20
−10
0
10
Frequency (in Hz)
M
a
g
n
i
t
u
d
e

R
e
s
p
o
n
s
e

(
i
n

d
B
)


Max. (all blocks)
Response (data block nr. 1)
Min. (all blocks)
0 500 1000 1500 2000 2500 3000 3500 4000
−6pi
−4pi
−2pi
0pi
Frequency (in Hz)
P
h
a
s
e

R
e
s
p
o
n
s
e

(
i
n

r
a
d
i
a
n
s
)
(b) Ground/Air Transmission
0 500 1000 1500 2000 2500 3000 3500 4000
−30
−20
−10
0
10
Frequency (in Hz)
M
a
g
n
i
t
u
d
e

R
e
s
p
o
n
s
e

(
i
n

d
B
)


Response (data block nr. 3)
0 500 1000 1500 2000 2500 3000 3500 4000
−32pi
−24pi
−16pi
−8pi
0pi
Frequency (in Hz)
P
h
a
s
e

R
e
s
p
o
n
s
e

(
i
n

r
a
d
i
a
n
s
)
Figure 9.10.: Estimated frequency responses.
139
Chapter 9. Data Model and Parameter Estimation
T
a
b
l
e
9
.
2
.
:
F
i
l
t
e
r
a
n
d
N
o
i
s
e
E
s
t
i
m
a
t
i
o
n
R
e
s
u
l
t
s
D
a
t
a
B
l
o
c
k
E
x
p
e
r
i
m
e
n
t
a
l
C
o
n
d
i
t
i
o
n
s
f
2

f
0
S
N
R
e
s
t
(
i
n
d
B
)
C
o
m
m
e
n
t
Number
Frame-ID (Start)
Frame-ID (End)
Transm. Direction
Engine On/Off
Distance (in km)
Speed (in m/s)
Altitude (in m)
Rel. Azimuth (in

)
Difference (in Hz)
Least Squares
Windowed LS
Recursive LS
1
2
0
1
2
0
5
6
2
0
1
3
5
6
0
A
/
G
O
f
f
0
.
1
0
.
0
2
9
7
n
/
a
0
.
9
6
3
9
.
5
3
8
.
5
3
7
.
0
B
a
c
k
-
t
o
-
B
a
c
k
2
2
0
6
5
3
8
8
2
0
6
6
8
9
2
A
/
G
O
n
0
.
1
2
.
2
2
9
6
n
/
a
0
.
9
4
3
6
.
7
3
5
.
8
3
6
.
0
B
a
c
k
-
t
o
-
B
a
c
k
3
1
0
0
8
0
0
0
1
0
0
9
5
0
4
G
/
A
O
f
f
0
.
1
0
.
0
3
0
0
n
/
a
-
0
.
0
9
3
7
.
6
3
6
.
7
3
6
.
6
B
a
c
k
-
t
o
-
B
a
c
k
4
4
0
1
6
6
9
8
4
0
1
8
2
0
2
G
/
A
O
n
0
.
1
0
.
3
2
9
5
n
/
a
-
0
.
1
0
3
5
.
3
3
4
.
1
3
4
.
6
B
a
c
k
-
t
o
-
B
a
c
k
5
2
3
9
6
2
3
6
2
3
9
7
7
4
0
A
/
G
O
n
5
.
9
6
1
.
7
1
4
6
7
1
6
6
0
.
9
2
2
8
.
8
2
8
.
1
2
8
.
1
P
o
s
.
D
o
p
p
l
e
r
S
h
i
f
t
6
2
4
0
0
0
7
5
2
4
0
1
5
7
9
A
/
G
O
n
4
.
0
6
4
.
2
1
3
4
2
1
7
3
0
.
9
2
2
7
.
4
2
6
.
6
2
7
.
0
P
o
s
.
D
o
p
p
l
e
r
S
h
i
f
t
7
2
3
5
4
3
6
3
2
3
5
5
8
6
7
A
/
G
O
n
5
.
8
5
8
.
2
1
3
6
3
1
9
7
0
.
9
2
2
7
.
5
2
6
.
7
2
7
.
0
P
o
s
.
D
o
p
p
l
e
r
S
h
i
f
t
8
2
4
1
2
0
6
4
2
4
1
3
5
6
8
A
/
G
O
n
2
.
2
5
1
.
9
1
2
2
4
2
6
0
.
9
4
2
2
.
9
2
2
.
3
2
2
.
8
N
e
g
.
D
o
p
p
l
e
r
S
h
i
f
t
9
2
3
7
7
5
7
1
2
3
7
9
0
7
5
A
/
G
O
n
3
.
9
4
1
.
5
1
3
4
3
3
3
1
0
.
9
3
2
9
.
3
2
8
.
3
2
8
.
9
N
e
g
.
D
o
p
p
l
e
r
S
h
i
f
t
1
0
2
1
1
6
1
9
1
2
1
1
7
6
9
5
A
/
G
O
n
5
.
7
4
0
.
6
1
0
3
2
3
0
.
8
9
3
0
.
8
2
9
.
7
3
0
.
3
N
e
g
.
D
o
p
p
l
e
r
S
h
i
f
t
1
1
2
2
5
1
9
1
1
2
2
5
3
4
1
5
A
/
G
O
n
7
.
3
3
9
.
2
1
5
9
4
2
6
8
0
.
9
8
2
2
.
7
2
2
.
3
2
2
.
4
Z
e
r
o
D
o
p
p
l
e
r
S
h
i
f
t
1
2
4
2
6
0
5
0
6
4
2
6
2
0
1
0
G
/
A
O
n
1
9
.
1
5
8
.
0
9
9
2
1
8
3
-
0
.
1
0
3
0
.
5
2
9
.
9
2
9
.
9
M
e
d
.
D
i
s
t
a
n
c
e
1
3
3
2
9
2
9
9
5
3
2
9
4
4
9
9
A
/
G
O
n
4
3
.
5
5
0
.
1
1
0
7
5
9
2
0
.
9
9
2
3
.
0
2
2
.
1
2
2
.
8
M
a
x
.
D
i
s
t
a
n
c
e
140
9.4. Experimental Results and Discussion
(a) Data Block Nr. 1 (A/G, engine off, back-to-back)
0 500 1000 1500 2000 2500 3000 3500 4000
−90
−80
−70
−60
−50
−40
Frequency (in Hz)
M
a
g
n
i
t
u
d
e

(
i
n

d
B
)
(b) Data Block Nr. 7 (A/G, engine on, flight)
0 500 1000 1500 2000 2500 3000 3500 4000
−90
−80
−70
−60
−50
−40
Frequency (in Hz)
M
a
g
n
i
t
u
d
e

(
i
n

d
B
)
(c) Data Block Nr. 13 (A/G, engine on, flight, max. distance)
0 500 1000 1500 2000 2500 3000 3500 4000
−90
−80
−70
−60
−50
−40
Frequency (in Hz)
M
a
g
n
i
t
u
d
e

(
i
n

d
B
)
Figure 9.11.: DFT magnitude spectrum of the noise estimate and estimation error
signal ˆ w for different data blocks.
(e.g., data block number 7) the error signal sounds similar to colored noise, and at
the highest noise levels (e.g., data block number 13), the error signal is very similar to
white noise. Figure 9.11 shows the corresponding power spectra of the error signals.
9.4.2. DC Offset
Figure 9.12 shows the time-variant DC offset c(n) in the recordings of the received
signal, obtained by filtering data block number 1 with a lowpass filter of order 1000 and
a cutoff frequency of 40 Hz. The clearly visible pulses with a frequency of 1 Hz result
from interference between the voice recordings and the GPS timing signal within the
measurement system, and are as such a measurement error. Even though the amplitude
of this DC offset is small compared to the overall signal amplitude, the offset, if not
141
Chapter 9. Data Model and Parameter Estimation
0 1 2 3 4 5 6
−0.01
−0.005
0
0.005
0.01
Time (in s)
A
m
p
l
i
t
u
d
e
Figure 9.12.: DC component of the received signal recording. The pulses result from
interference with the GPS timing signal.
removed a-priori, would severely degrade the overall estimation performance.
9.4.3. Gain
In some data blocks we observed a sinusoidal amplitude modulation of the estimated
gain ˆ g of the received signal. The DFT magnitude spectrum of the time-variant gain
ˆ g(n) for three different data blocks is shown in Figure 9.13. In two of these cases
the spectrum exhibits a distinct peak. For every data block, Table 9.3 provides the
frequency location f
sin
of the most dominant spectral peak, as well as its amplitude
A
sin
relative to the DC gain C, expressed in
L
A
sin
/C
= 20 log
10
_
A
sin
C
_
,
modeling the gain modulation with g(n) = C + A
sin
sin
_
2πf
sin
f
s
n
_
. Expressing the same
quantity in terms of RMS values results in
L
A
RMS
/C
RMS
= 20 log
10
_
_
_
A
sin
sin
_
2πf
sin
f
s
__
RMS
C
RMS
_
_
≈ L
A
sin
/C
−3 dB.
Table 9.3 also shows the overall power of the gain modulations relative to the DC gain,
expressed in
Lg−C
/C
= 20 log
10
_
[ ˆ g(n) −C]
RMS
C
RMS
_
.
The source of the sinusoidal gain modulations (GM) is not immediately clear. Given
Figure 9.13 and Table 9.3, we observe the following facts:
1. GM is present if engine is running, only.
2. GM is present even if aircraft is not moving.
3. Frequency of GM is independent of the relative heading of the aircraft.
142
9.5. Conclusions
4. Frequency of GM depends on an unknown variable that has some connection
with the aircraft speed.
5. Frequency of GM is significantly larger than the frequency predicted in Ap-
pendix B for gain modulations caused by multipath propagation and Doppler
shift.
We conclude from these facts that the observed gain modulation is not caused by the
physical radio transmission channel, especially since it as also present in situations
where the aircraft did not move.
We believe that the gain modulation results from interference between the aircraft
power system and the voice radio or the measurement system. In particular, the gain
modulation might result from a ripple voltage on the direct current (DC) power supply
of the aircraft radio. Such a ripple voltage is a commonly observed phenomenon
and results from insufficient filtering or voltage regulation of the rectified alternating
current (AC) output of the aircraft’s alternator or power generator [149]. The frequency
of the ripple voltage is, as such, a linear function of the engine’s rotational speed, which
is measured in rotations per minute (rpm). We conjecture that the unknown variable
that the gain modulation frequency depends on, is the engine rpm. This would align
well with the results of Table 9.3 with no gain modulation when the engine is off and
a significant frequency difference between engine idle and engine full throttle. The
magnitude of the ripple voltage as well as its impact on the communication system
is expected to be system- and aircraft-specific and to be less of an issue in modern
systems and large jet aircraft with an independent auxiliary power unit (APU).
An alternative explanation for the gain modulation could be a periodic variation
of the physical radio transmission channel induced by the rotation of the aircraft’s
propeller. The rotational speed of the propeller is linked to the engine rpm either
directly or via a fixed gear transmission ratio.
9.5. Conclusions
In this section, we presented an improved model to describe the data in the TUG-EEC-
Channels database presented in Chapter 8 and a corresponding estimation method.
This method allows to characterize the measured channel in terms of its linear filter
response, time-variant gain, sampling offset, DC offset and additive noise, and also
helps to identify measurement errors. In contrast, the channel estimates included in
the database combine all effects into a single variable, namely a channel response
estimated frame by frame.
Given the low level of the estimation error signal and the spectral characteristics
of the noise estimate, we conclude that the proposed estimation method works as
expected, and that the proposed data model is a reasonable description for the mea-
sured data. As expected by theory, the analysis of the measurements confirms that the
channel is not frequency-selective. The source of the observed flat fading could not
be conclusively established. However, there are several factors that strongly suggest
that the observed gain modulation does not result from radio channel fading through
143
Chapter 9. Data Model and Parameter Estimation
(a) Data Block Nr. 1 (A/G, engine off, back-to-back)
0 20 40 60 80 100 120
−20
0
20
40
60
80
Frequency (in Hz)
M
a
g
n
i
t
u
d
e

(
i
n

d
B
)
(b) Data Block Nr. 4 (G/A, engine on, back-to-back)
0 20 40 60 80 100 120
−20
0
20
40
60
80
Frequency (in Hz)
M
a
g
n
i
t
u
d
e

(
i
n

d
B
)
(c) Data Block Nr. 7 (A/G, engine on, flight)
0 20 40 60 80 100 120
−20
0
20
40
60
80
Frequency (in Hz)
M
a
g
n
i
t
u
d
e

(
i
n

d
B
)
Figure 9.13.: DFT magnitude spectrum of the time-variant gain ˆ g of different data
blocks.
144
9.5. Conclusions
Table 9.3.: Dominant Frequency and Amplitude Level of the Gain Modulation
Data Block Nr. Frequency L
A
sin
/C
L
A
RMS
/C
RMS
L
g−C
/C
1 79.9 Hz -69.7 dB -72.8 dB -51.2 dB
2 43.2 Hz -53.2 dB -56.3 dB -42.4 dB
3 27.1 Hz -64.8 dB -67.9 dB -51.1 dB
4 45.4 Hz -41.4 dB -44.5 dB -39.3 dB
5 86.8 Hz -28.3 dB -31.3 dB -23.3 dB
6 86.5 Hz -27.6 dB -30.6 dB -21.9 dB
7 86.6 Hz -26.8 dB -29.8 dB -22.9 dB
8 83.7 Hz -27.3 dB -30.3 dB -25.1 dB
9 81.8 Hz -36.7 dB -39.8 dB -34.8 dB
10 80.2 Hz -34.8 dB -37.8 dB -36.0 dB
11 73.0 Hz -24.2 dB -27.2 dB -21.8 dB
12 86.5 Hz -33.1 dB -36.1 dB -31.6 dB
13 65.1 Hz -36.6 dB -39.6 dB -31.7 dB
multipath propagation but from interference between the aircraft engine and the voice
radio system. The observed noise levels are in a range from 40 dB to 23 dB in terms of
SNR, which are worst-case estimations since all estimation errors accumulate in the
noise estimate, and the real channel noise level might in fact be smaller.
145
Chapter 9. Data Model and Parameter Estimation
146
Chapter 10
Experimental Watermark Robustness
Evaluation
In this chapter, we make use of the channel model derived in Chapter 9 to
evaluate the robustness of the proposed watermarking method in the aero-
nautical application. We experimentally demonstrate the robustness of the
method against filtering, desynchronization, gain modulation and additive
noise. Furthermore we show that pre-processing of the speech signal with
a dynamic range controller can improve the watermark robustness as well
as the intelligibility of the received speech.
This chapter presents recent results that have not yet been published.
147
Chapter 10. Experimental Watermark Robustness Evaluation
This chapter experimentally evaluates the robustness of the watermark system pre-
sented in Chapter 4 in light of the application to the air traffic control voice radio.
The evaluation is an explicit continuation of the experimental results of Section 4.3,
now incorporating the obtained knowledge about the aeronautical voice radio chan-
nel. Consequently, we use the same experimental setup and settings as described in
Section 4.3.
10.1. Filtering Robustness
Section 4.3 demonstrated the robustness of the watermarking method against linear
and time-invariant filtering using a bandpass filter, an IRS filter, a randomly chosen
estimated aeronautical channel response filter and an allpass filter. These results are
repeated and extended in Figure 10.1.
10.1.1. Estimated Static Prototype Filters
We evaluated the filtering robustness using the measured and estimated aeronautical
radio channel filters of Chapter 9. In particular, we used the two distinct channel
response filters for air/ground and ground/air transmissions as depicted in Figure 9.10.
The resulting BERs are shown in Figure 10.1 and indicate that the method is robust
against both filters (denoted ‘Ch. 9, air/gnd’ and ‘Ch. 9, gnd/air’). The increased BER
in the ground/air transmission direction likely results from a mismatch between the
used embedding band position (666 Hz–3333 Hz) and the frequency response of the
ground/air channel. As shown in Figure 9.10, the ground/air channel has a non-flat
frequency response and significant attenuation above 2.5 kHz. A more suitable choice
for the watermark embedding would be the configuration described in Section 4.2.2.3
with an embedding band from 0.5 kHz to 2.5 kHz.
10.1.2. Time-Variant TUG-EEC-Channels Filter
We evaluated the robustness against the time-variant CPSD-based channel estimates
of Section 8.2.5.4. This is a worst-case scenario, since, as discussed in Chapter 9, the
time-variation of these channel response estimates originates from an estimation error
due to insufficient compensation of the sampling clock drift.
We used the estimates of two different measurement scenarios, one with the aircraft
being static on ground (‘Ch. 8, ground’, Frame-ID 2011556 to 2013968 in the TUG-
EEC-Channels database) and one with aircraft flying at high speed (‘Ch. 8, flight’,
Frame-ID 2396236 to 2399927), both with transmission in air/ground direction. The
filter taps obtained from the CPSD-based channel estimates (available every 7.875 ms)
were interpolated over time and continuously updated while filtering the watermarked
speech signal. The results in Figure 10.1 confirm the robustness of the watermarking
method against this time-variant filtering attack.
148
10.2. Gain Modulation Robustness
0
0.02
0.04
0.06
0.08
I
d
e
a
l

c
h
a
n
n
e
l

B
a
n
d
p
a
s
s

f
i
l
t
e
r

I
R
S

f
i
l
t
e
r

A
e
r
o
n
.

c
h
a
n
.

(
C
h
.

4
)

T
i
m
e

v
a
r
.

c
h
a
n
.

(
C
h
.

8
,

g
r
o
u
n
d
)

T
i
m
e

v
a
r
.

c
h
a
n
.

(
C
h
.

8
,

f
l
i
g
h
t
)

A
e
r
o
n
.

c
h
a
n
.

(
C
h
.

9
,

a
i
r
/
g
n
d
)

A
e
r
o
n
.

c
h
a
n
.

(
C
h
.

9
,

g
n
d
/
a
i
r
)

B
i
t

e
r
r
o
r

r
a
t
i
o

(
B
E
R
)
Figure 10.1.: Overall system robustness in the presence of various transmission channel
filters at an average uncoded bit rate of 690 bit/s.
10.2. Gain Modulation Robustness
In this section, the robustness of the watermarking method against sinusoidal gain
modulations is evaluated, both systematically and in the measured aeronautical channel
conditions.
In contrast to the previous experiments, in this and all following experiments of this
chapter the input speech is normalized to unit variance on a per utterance level to
establish a coherent signal level across utterances.
10.2.1. Sinusoidal Gain Modulation Robustness
Figure 10.2 shows the robustness of the principle embedding method against sinusoidal
gain modulations of the form
ˆ s
FB
(n) =
_
1 + h
GM
sin
_
2πn
f
GM
f
s
__
s
/
FB
(n)
with a modulation frequency f
GM
and a modulation depth h
GM
. The experimental
conditions are the same as in Section 4.3, except that ideal frame synchronization is
assumed since the implemented frame detection scheme works reliably only up to a
BER of approximately 15 % (see Figure 5.8).
Using the measured worst-case modulation frequency and depth of Table 9.3 (Data
Block Nr. 11, f
GM
= 73 Hz, h
GM
= −24.2 dB = 0.062) and real frame detection, the
resulting BER is shown in comparison to an ideal channel with constant gain in
Figure 10.3.
149
Chapter 10. Experimental Watermark Robustness Evaluation
0
0.2
0.4
0.6
0.8
1
0
100
200
300
0
0.1
0.2
0.3
0.4
Modulation depth
Mod. frequency (in Hz)
B
i
t

e
r
r
o
r

r
a
t
i
o

(
B
E
R
)
Figure 10.2.: Watermark embedding scheme robustness against sinusoidal gain modu-
lation at an average uncoded bit rate of 690 bit/s (assuming ideal frame
synchronization).
0
0.01
0.02
0.03
0.04
I
d
e
a
l

c
h
a
n
n
e
l

G
a
i
n

m
o
d
.

(
C
h
.

9
,

B
k
.

1
1
)

A
u
t
o
m
a
t
i
c

g
a
i
n

c
o
n
t
r
o
l

B
i
t

e
r
r
o
r

r
a
t
i
o

(
B
E
R
)
Figure 10.3.: Overall system robustness in the presence of the measured gain modu-
lation and a simulated automatic gain control at an average uncoded bit
rate of 690 bit/s (and using the actual frame synchronization).
150
10.3. Desynchronization Robustness
10.2.2. Simulated Automatic Gain Control
Aeronautical radio transceivers typically contain an automatic gain control (AGC)
both in the radio frequency (RF) and the audio domain [97, 150]. To evaluate the
robustness against adaptive gain modulation, we simulated a channel with an AGC
using the SoX implementation [151] of a dynamic range controller (or ‘compressor’,
[152]) with an attack time of 20 ms, a release time of 500 ms, a look-ahead of 20 ms, and
a compression ratio of 1:4 above a threshold level of -24 dB [150]. The resulting BER
shown in Figure 10.3 demonstrates that the AGC does not decrease the watermark
detection performance.
10.3. Desynchronization Robustness
An experimental evaluation of the timing, bit and frame synchronization is provided
in Section 5.3. This topic is not further treated herein.
10.4. Noise Robustness
The robustness of the watermarking method against AWGN was evaluated in Sec-
tion 4.3 with a segmental SNR of 30 dB. The noise robustness for a wide range of
non-segmental SNRs is shown in Figure 10.4 (‘Original signal’). To show the behavior
of the principle method, perfect frame detection is assumed, since otherwise the BER
would level off at around 15 % due to failing frame synchronization.
The BER curve indicates that the method would sometimes not be robust against
the measured aeronautical channel conditions, since the worst-case measured SNR
(including estimation errors that contribute to the noise level) was found to be 22.7 dB
(see Table 9.2). However, the robustness of the watermarking scheme against AWGN
is difficult to specify in absolute terms in the aeronautical application. Due to the
presence of gain controls in the transceivers, the transmission channel is non-linear,
and care must be taken when defining or comparing signal-to-noise ratios. While
SNR is a measure of the mean signal and noise power, the aeronautical channel is in
fact peak power constrained, with the peak power level being given by the maximum
allowed output power of the transmitter (integrated over a short time window). The
BER curves corresponding to Figure 10.4 are shown in Figure C.1 of Appendix C for
given peak signal-to-noise ratios (PSNR). We assume the peak power to be the signal’s
maximum average power measured in windows of 20 ms.
The SNRs of Table 9.2 were measured with an input signal of constant instantaneous
power. As a consequence, the automatic gain controls were in a steady state and
the output signal power at some nominal level. Due to the reaction time of the gain
controls, a peak of a real speech signal may in practice far exceed this nominal level,
and the actual peak power limit lies above this measured steady-state level.
In the remainder of this section, we briefly discuss and experimentally evaluate
additional measures to increase the noise robustness of the watermarking method.
151
Chapter 10. Experimental Watermark Robustness Evaluation
0 10 20 30 40 50 60
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
SNR (in dB, wrt. original signal)
B
i
t

e
r
r
o
r

r
a
t
i
o


Original signal
Original signal with silence removed [sr]
sr+multiband compr. and limiting [mbcl]
sr+mbcl + unvoiced gain (+3dB) [ug]
sr+mbcl + unvoiced gain (+6dB)
sr+mbcl + waterm. floor (at −SNR−3dB) [wmfl]
sr+mbcl + wmfl (at −SNR) + ug (+3dB)
sr+mbcl + wmfl (at −SNR+3dB)
sr+mbcl + wmfl (at −SNR+3dB) + ug (+3dB)
Figure 10.4.: Watermark embedding scheme robustness against AWGN at average
uncoded bit rates of 690 bit/s (‘original signal’), 470 bit/s (‘sr’) and
485 bit/s (‘sr+mbcl’).
10.4.1. Selective Embedding
The watermarking method as described in Chapter 4 embeds the watermark in all
time segments that are not voiced, which also includes pauses and silent segments.
However, when assuming a fixed channel noise level, the local SNR in silent regions
is very low and watermark detection fails, leading to high BERs. A simple measure
to increase the noise robustness of the watermarking method is to not embed during
pauses and constrain the embedding to regions where the speech signal has significant
power. Inevitably, this decreases the uncoded watermark bit rate.
We simulate the non-embedding in pauses by removing all silent regions of the
input signal, which are defined by a 43ms-Kaiser-window-filtered intensity curve being
more than 30 dB below its maximum for longer than 100ms. Given the ten normalized
utterances of the ATCOSIM corpus used throughout the tests of this chapter, 14 % of
the speech signal are detected as silence. This reduces the average watermark bit rate
from 690 bit/s to 470 bit/s, both calculated on the basis of the duration of the original
signal. The resulting BER curve shown in Figure 10.4 (‘silence removal’) demonstrates a
significant improvement in terms of noise robustness, however, at the cost of a reduced
152
10.4. Noise Robustness
Dynamic Range
Compression
Watermark
Embedding
ATC
Channel
En h a n c i n g
Sy s t e m
A T C
Ch a n n e l
En h a n c i n g
Sy s t e m
A T C
Ch a n n e l
Figure 10.5.: Improving watermark robustness and speech intelligibility by dynamic
range compression.
bit rate.
10.4.2. Speech Enhancement and Watermarking
A Master’s thesis [131] that was supervised by the author of this work studied and
compared different algorithms to pre-process the speech signal before the transmission
over the aeronautical radio channel in order to improve the intelligibility of the cor-
rupted received speech. The compared systems are fully backward-compatible and
operate only at the transmitting side. The methods were compared subjectively using
the modified rhyme test (MRT) and significantly improved the intelligibility of the
received speech at high channel noise levels.
The study showed the high effectiveness of a dynamic range controller (or ‘compres-
sor’, [152]) to improve the intelligibility. The isolated word recognition rate increased
by 12.6 percentage points on average. Given the peak power constrained transmission
channel and, thus, a fixed peak signal to noise level ratio, the compressed speech signal
has much higher overall energy compared to the original signal and, consequently, a
larger perceptual distance from the constant-power channel noise.
The compressor increases the SNR of the transmitted signal without increasing the
peak signal-to-noise ratio (PSNR). As a consequence, the dynamic range compression
not only increases the intelligibility, but also the noise robustness of the watermark
system. Since the speech signal has overall higher energy, also the watermark signal
that replaces the speech signal is higher in energy relative to the channel noise if the
compression precedes the watermark embedding as shown in Figure 10.5.
We simulate a dynamic rage controller using Audacity [153] with the Apple Core-
Audio implementation [154] of a multiband compressor followed by the LASDPA
implementation [155] of a limiter [152]. The settings used for both units were chosen
manually and are shown in Figure C.2 of Appendix C.
The input speech signal without silent regions as used in the previous subsection
was pre-processed by the compressor before the embedding of the watermark as shown
in Figure 10.5. Given various channel noise levels, the resulting BER curve is shown in
Figure 10.4 and exhibits a significant improvement.
Due to the presence of the compression system, it is of relevance if the signal to
channel noise ratio is defined as SNR or PSNR. Further on, the SNR is different
depending on whether the signal power is measured before the enhancement system
(as the channel is peak-power constrained and both processed and unprocessed signal
have the same peak power) or if it is measured at the channel input. Figure C.1 shows
the BER curves corresponding to Figure 10.4 but using alternative signal-to-noise ratio
153
Chapter 10. Experimental Watermark Robustness Evaluation
measures.
We cannot overstate the fact that within a single system that combines watermark
embedding and compression it is possible to robustly embed a watermark and, at the
same time, increase the intelligibility of the transmitted ATC speech. The system could
be fully backward-compatible. At the receiving end all receivers (including legacy
installations) would benefit from the intelligibility improvement. An equipped receiver
is required in order to benefit also from the embedded watermark.
10.4.3. Watermark Amplification
In principle, any measure that increases the power of the embedded watermark signal
relative to the channel noise increases the likelihood that the embedded message will
be successfully detected. Two simple measures are introduced and their effectiveness
evaluated.
A primitive way to increase the watermark power is to add the watermark data
signal w(n) (using the notation of Section 4.1) to the watermarked signal ˆ s
PB
(n). The
watermark signal is added as a watermark floor at a fixed power level relative to the
channel noise power. In non-voiced regions, a comfort noise of equivalent power is
added in order to maintain a constant background noise level. The resulting BER
curves shown in Figure 10.4 (‘waterm. floor’) demonstrate the effectiveness of the
watermark floor. Given a watermark floor level -3 dB below the channel noise, the
watermark floor is imperceptible since it is masked by the channel noise. Nevertheless
it significantly improves the noise robustness of the watermark. At higher levels, the
watermark floor becomes audible, and the additional gain in robustness comes at the
cost of perceptual quality. If the channel noise level is not known a-priori, a certain
value has to assumed based on operational requirements.
Another simple measure to increase the watermark power is to increase the gain
g(n) of the non-voiced speech components (cf. Section 4.1). Figure 10.4 (‘unvoiced
gain’) shows the positive effect on the BER curves given a gain g(n) that is increased by
a factor of +3 dB and +6 dB relative to its initial measured value. Since the non-voiced
components are often low in power compared to voiced components, the additional
gain does in most cases not increase the peak power of the watermarked signal. The
increased noise robustness comes, however, at the cost of perceptual quality.
10.5. Conclusions
We conclude that the proposed watermarking method is robust against any filtering
that is expected to occur on the aeronautical radio channel. The method allows
the embedding of the watermark in a frequency band that matches the passband of
the transmission channel. The equalizer in the watermark detector compensates for
phase distortions and tracks eventual time-variations of the channel filters. The high
robustness against filtering attacks comes as no surprise since the watermark detector
also has to cope with the rapidly time-varying LP synthesis (vocal tract) filter.
The watermarking method is also highly robust against sinusoidal gain modulations
154
10.5. Conclusions
over a wide range of modulation frequencies and modulation depths, considering
that the measured worst case modulation depth is well below -20 dB. Non-sinusoidal
gain modulations in the form of an AGC do not degrade the detection performance
either. The high gain modulation robustness is not surprising because the method
must inherently be robust against the rapidly time-varying gain of the unvoiced speech
excitation signal.
The timing synchronization scheme based on the detector’s equalizer shows satis-
factory performance, but there is room for further improvement by the application of
fractionally spaced instead of sample-spaced equalizers. A working implementation
for frame synchronization was presented, but many alternative schemes exist and
further optimization is possible.
We evaluated the noise robustness over a wide range of signal-to-noise ratios. It
is difficult to specify the SNR ranges of real world ATC channels. The implementa-
tion presented herein far exceeds the operationally required watermark bit rate of
approximately 100 bit/s [116]. Given the presented BER curves, for an operational ATC
implementation it might be beneficial to reduce the bit rate and, therefore, increase the
noise robustness of the watermark.
A number of further measures are available to improve the noise robustness. The
most remarkable is a pre-processing of the speech signal by multiband dynamic range
compression and limiting. This processing not only boosts the noise robustness of the
watermark, but also increases the intelligibility of the received speech signal.
155
Chapter 10. Experimental Watermark Robustness Evaluation
156
Chapter 11
Conclusions
Air traffic control (ATC) radio communication between pilots and controllers is subject
to a number of limitations given by the legacy system currently in use. To overcome
these limitations, and ultimately increase safety, security and efficiency in ATC, this
thesis investigates the embedding of digital side information into ATC radio speech.
It presents a number of contributions towards the ATC domain as well as the area
of robust speech watermarking. Figure 11.1 illustrates the connections among the
different contributions in the context of the ATC application at hand. The structure of
this chapter mostly mirrors the signal flow shown in the figure.
A review of related work reveals that most of the existing watermarking theory
and most practical schemes are based on a non-removability requirement, where the
watermark should be robust against tampering by an informed and hostile attacker.
While this is a key requirement in, e.g., transaction tracking applications, it poses
an unnecessary constraint in the legacy enhancement application considered in this
thesis. Theoretical capacity derivations show that dropping this constraint allows for
significantly larger watermark capacities, because watermarking can be performed in
perceptually irrelevant instead of perceptually relevant host signal components. In
contrast to perceptually relevant components, perceptually irrelevant signal compo-
nents are not limited to small modifications but can be replaced by arbitrary values.
ATC
Speech
Watermark
Embedding
ATC
Channel
Synchro-
nization
Watermark
Detection
[Ch. 7] [Ch. 4, Ch. 10] [Ch. 5] [Ch. 8, Ch. 9] [Ch. 4]
Figure 11.1.: Relationship among the different chapters of this thesis.
157
Chapter 11. Conclusions
We derived the watermark capacity in speech based on perceptual masking and the
auditory system’s insensitivity to the phase of unvoiced speech, and showed that this
capacity far exceeds the capacity of the conventional ideal Costa scheme.
To underline the validity of our theoretical results, we designed and implemented
a practical speech watermarking method that replaces the imperceptible phase of
non-voiced speech by a watermark data signal. Even though the practical scheme
still operates far below the theoretical capacity limit, it outperforms most current
state-of-the-art methods and shows high robustness against a wide range of signal
processing attacks.
To validate the robustness of our scheme in the aeronautical application, we cre-
ated two evaluation resources. At first, good knowledge about the characteristics of
the aeronautical voice radio channel is required. Since little information is publicly
available, we designed and implemented a ground/airborne based audio channel
measurement system and performed extensive in-flight channel measurements. The
collected data is now publicly available and can be used as a basis for further devel-
opments related to the air/ground voice radio channel. We proposed a model for the
measured channel data, and the extracted model parameters are used in the validation
of the watermarking scheme to simulate different effects of the aeronautical channel.
In addition, due to the particular nature of ATC speech and the limited availability
of ATC language resources, we also created an ATC speech corpus. The corpus is
useful in the validation of our watermarking scheme, but also constitutes a valuable
resource for language research and spoken language technologies in ATC.
1
Synchronization between watermark embedder and detector is necessary to detect
the watermark in the received signal. Many proposed watermarking schemes assume
perfect synchronization and do not treat this (often difficult) problem any further.
We carefully addressed this issue, transferred digital communication synchronization
methods to the watermarking domain, and presented a full implementation that covers
the various layers of synchronization.
The watermark detector is based on hidden training sequences and adaptive equal-
ization. The overall method is inherently robust against time-variant filtering and gain
modulation, which is a distinct advantage compared to quantization-based watermark-
ing. Moreover, the proposed scheme outperforms current state-of-the-art methods on a
scale of embedded bit rate per Hz of host signal bandwidth, considering the type of
channel attacks that are relevant in the aeronautical application.
Using the data obtained in the channel measurements, we evaluated the robustness
of the proposed method in light of the aeronautical application. The method is robust
against static and time-variant measured channel filter responses, the measured channel
gain modulation, and a simulated automatic gain control. Due to the non-linearity of
the ATC channel it is difficult to specify absolute channel noise levels. We evaluated
the noise robustness over a large range of channel SNRs and proposed a number
of measures that, if necessary, can increase the noise robustness of the proposed
1
Within one year since its public release, the speech corpus DVD with a download size of 2.4 GB
was requested by mail from several institutions, and was downloaded in full from our website by
approximately 40 people (conservative estimate based on manually filtered server log-files).
158
method. We recommend a pre-processing of the input speech signal with multiband
dynamic range compression and limiting for the aeronautical application. This not
only significantly increases the noise robustness of the watermark but also improves
the subjective intelligibility of the received speech. An increased intelligibility in
air/ground voice communication would constitute a further substantial contribution
to the safety in air traffic control.
Naturally, the work presented in this thesis is subject to limitations, and a number of
options for further research arise.
Since we proposed an entirely novel watermarking scheme, our implementation can
be considered as proof-of-concept implementation, only. The large gap to the theoretical
capacity shows that the performance can be further increased and numerous options
for further optimizations exist. For example, it is expected that an optimization of
the existing parameters, an SNR-adaptive watermark embedding, fractionally-spaced
equalization, or a joint synchronization, equalization and detection scheme could
increase both robustness and data rate. Furthermore, we have not evaluated the
real-time capability of our implementation. While the method operates in principle
on a per frame basis and is computationally not prohibitively expensive, a real-time
implementation was outside the scope of this thesis and is yet to be carried out. The
evaluation of the method’s robustness in the aeronautical application was limited by
the relatively small scope of the available channel data. To improve the validity of the
evaluation, the aeronautical channel measurements should be carried out on a large
scale with a wide range of aircraft types, transceiver combinations and flight situations.
We expect that a continuation of the theoretical work laid out in this thesis could
lead to a deeper understanding of speech watermarking and, ultimately, to better
performing system designs. Our theoretical capacity estimation for phase modulation
based watermarking is an upper bound, only, due to the idealistic assumptions made
about the host signal and the transmission channel. An evaluation of the capacity under
tighter but more realistic assumptions would provide additional insight and likely
result in a significantly lower watermark capacity. Complementing our experimental
robustness evaluation, a theoretical analysis of the robustness of the phase modulation
method against different transmission channel attacks could provide additional insight.
Incorporating the particular characteristics of the host signal and the aeronautical
channel, such an analysis would likely lead to a better system design. Last but not least,
the capacity derivations for masking based watermarking show that in our system
a lot of additional watermark capacity is still unused because perceptual masking is
not incorporated. Further research in speech watermarking should aim to exploit this
additional capacity by combining masking-based principles with the phase replacement
approach introduced in this thesis.
159
Chapter 11. Conclusions
160
Appendix A
ATCOSIM Transcription Format
Specification
This appendix provides the formal specification of the transcription format of the
ATCOSIM speech corpus presented in Chapter 7. While Section A.1 describes the basic
rules of the transcription format, Section A.2 contains content-specific amendments
such as lists of non-standard vocabulary.
A.1. Transcription Format
The orthographic transcription follows a strict set of rules which is presented hereafter.
In general, all utterances are transcribed word-for-word in standard British English. All
standard text is written in lower-case. Punctuation marks including periods, commas
and hyphens are omitted. Apostrophes are used only for possessives (e.g. pilot’s
radio)
1
and for standard English contractions (e.g. it’s, don’t).
Technical noises as well as speech and noises in the background—produced by
speakers other than the one recorded—are not transcribed. Silent pauses both between
and within words are not transcribed either. Numbers, letters, navigational aids and
radio call signs are transcribed as follows.
In terms of notation, stand-alone technical mark-up tags are written in upper case
letters with enclosing squared brackets (e.g. [HNOISE]). Regular lower-case letters
and words are preceded or followed by special characters to mark truncations (=),
individually pronounced letters (~) or unconfirmed airline names (@). Groups of
words are embraced by opening and closing XML-style tags to mark off-talk (<OT>
1
The mono-spaced typewriter type represents words or items that are part of the corpus transcription.
161
Appendix A. ATCOSIM Transcription Format Specification
... </OT> ), which is also transcribed, and foreign language (<FL> </FL>), for which
currently no transcription is given.
A.1.1. ICAO Phonetic Spelling
Definition: Letter sequences that are spelled phonetically.
Notation: As defined in the reference.
Example: india sierra alfa report your heading
Word List: alfa bravo charlie delta echo foxtrot golf hotel india juliett kilo
lima mike november oscar papa quebec romeo sierra tango uniform victor
whiskey xray yankee zulu
Supplement: fox (short form for foxtrot)
Reference: [113]
A.1.2. Acronyms
Definition: Acronyms that are pronounced as a sequence of separate letters in standard
English.
Notation: Individual lower-case letters with each letter being preceded by a tilde (~).
The tilde itself is preceded by a space.
Example: ~k ~l ~m
Exception: Standard acronyms which are pronounced as a single word. These are
transcribed without any special markup.
Exception Example: NATO, OPEC, ICAO are transcribed as nato, opec and icao
respectively.
A.1.3. Numbers
Definition: All digits, connected digits, and the keywords ‘hundred’, ‘thousand’ and
‘decimal’.
Notation: Standard dictionary spelling without hyphens. It should be noted that
controllers are supposed to use non-standard pronunciations for certain digits,
such as ‘tree’ instead of ‘three’, ‘niner’ instead of ‘nine’, or ‘tousand’ instead
of ‘thousand’ [113]. This is however applied inconsistently, and in any case
transcribed with the standard dictionary spelling of the digit.
Example: three hundred, one forty four, four seven eight, one oh nine
Word List: zero oh one two three four five six seven eight nine ten hundred
thousand decimal
162
A.1. Transcription Format
A.1.4. Airline Telephony Designators
Definition: The official airline radio call sign.
Notation: Spelling exactly as given in the references.
Examples: air berlin, britannia, hapag lloyd
Exceptions: Airline designators given letter-by-letter using ICAO phonetic spelling
as well as airline designators articulated as acronyms.
Exceptions Examples: foxtrot sierra india, ~k ~l ~m
References: [156, 157, 129]
A.1.5. Navigational Aids and Airports
Definition: Airports and navigational aids (navaids) corresponding to geographic
locations.
Notation: Geographical locations (navaids) are transcribed as given in the references
using lower-case letters. The words used can be names of real places (ex.
hochwald) or artificial five-letter navaid or waypoint designators (ex. corna,
gotil).
2
Airports and control centers are transcribed directly as said and in
lower-case spelling.
Examples: contact rhein on one two seven
alitalia two nine two turn left to gotil
alitalia two nine two proceed direct to corna charlie oscar romeo november
alfa
References: [158, Annex A: Maps of Simulation Airspace], [159, 160]
A.1.6. Human Noises
Definition: Human noises such as coughing, laughing and sighs produced by the
speaker. Also breathing noises that were considered by the transcriptionist as
exceptionally loud were marked using this tag.
Notation: [HNOISE] (in upper-case letters)
Example: sabena [HNOISE] four one report your heading
A.1.7. Non-verbal Articulations
Definition: Non-verbal articulations such as confirmatory, surprise or hesitation
sounds.
Notation: Limited set of expressions written in lower-case letters.
Example: malaysian ah four is identified
Word List: ah hm ahm yeah aha nah ohh
3
2
In some occasions less popular five-letter designators are also spelled out using ICAO phonetic spelling.
3
In contrast to ohh as an expression of surprise, the notation oh is used for the meaning ‘zero’, as in one
oh one.
163
Appendix A. ATCOSIM Transcription Format Specification
A.1.8. Truncated Words
Definition: Words which are cut off either at the beginning or the end of the word
due to stutters, full stops, or where the controller pressed the push-to-talk (PTT)
button too late. This also applies to words that are interrupted by human noises
([HNOISE]). This notation is used when the word-part is understandable. Empty
pauses within words are not marked.
Notation: The missing part of the word is replaced by an equals sign (=).
Examples: good mor= good afternoon (correction), luf= lufthansa three two five
(stutter), =bena four one (PTT pressed too late), sa= [HNOISE] =bena (inter-
ruption by cough)
Exception: Words which are cut off either at the beginning or the end of the word
due to fast speech or sloppy pronunciation are recorded according to standard
spelling and not marked. If the word-part is too short to be identified, another
notation is used (see below).
Exception Example: “goo’day” is transcribed as good day.
A.1.9. Word Fragments
Definition: Fragments of words that are too short so that no clear spelling of the
fragment can be determined.
Notation: [FRAGMENT]
Example: [FRAGMENT]
A.1.10. Empty Utterances
Definition: Instances where the controller pressed the PTT button, said nothing at all,
and released the button again.
Notation: [EMPTY]
Exception: If the utterance contains human noises produced by the speaker, the
[HNOISE] tag is used.
A.1.11. Off-Talk
Definition: Speech that is neither addressed to the pilot nor part of the air traffic
control communication.
Notation: Off-talk speech is transcribed and marked with opening and closing XML-
style tags: <OT> ... </OT>
Example: speedbird five nine zero <OT> ohh we are finished now </OT>
164
A.2. Amendments to the Transcription Format
A.1.12. Nonsensical Words
Definition: Clearly articulated word or word part that is not part of a dictionary and
that also does not make any sense. This is usually a slip of the tongue and the
speaker corrects the mistake.
Notation: [NONSENSE]
Example: [NONSENSE] futura nine three three identified
A.1.13. Foreign Language
Definition: Complete utterances, or parts thereof, given in a foreign language.
Notation: The foreign language part is not transcribed but is in its entirety replaced
by adjacent XML-style tags: <FL> </FL>
Example: <FL> </FL> break alitalia three seven zero report mach number
Exception: Certain foreign language terms, such as greetings, are transcribed accord-
ing to the spelling of that language, and are not tagged in any special way. A full
list is given below.
Exception Examples bonjour, tag, ciao
Exception Word List: See Section A.2.4.
A.1.14. Unknown Words
Definition: Word or group of words that could not be understood or identified.
Notation: [UNKNOWN]
Example: [UNKNOWN] five zero one bonjour cleared st prex
A.2. Amendments to the Transcription Format
The actual language use in the recordings required the following additions to the above
transcription format definitions.
A.2.1. Airline Telephony Designators
The following airline telephony designators cannot be found in the references cited
above, but are nonetheless clearly identified.
A.2.1.1. Military Radio Call Signs
There was no special list for military aircraft call signs available. The following call
signs were confirmed by an operational controller:
• ~i ~f ~o
• mission
165
Appendix A. ATCOSIM Transcription Format Specification
• nato
• spar
• steel
A.2.1.2. General Aviation Call Signs
In certain cases general aviation aircraft are addressed using the aircraft manufacturer
and type number (e.g. fokker twenty eight). The following manufacturer names
occurred:
• fokker
• ~b ~a (Short form for British Aerospace.)
A.2.1.3. Deviated Call Signs
In certain cases the controller uses a deviated or truncated version of the official call
sign. The following uses occurred:
• bafair (Short form for belgian airforce.)
• netherlands (Short form for netherlands air force.)
• netherlands air (Short form for netherlands air force.)
• german air (Short form for german air force.)
• french air force (The official radio call sign is france air force.)
• israeli (Short form for israeli air force.)
• israeli air (Short form for israeli air force.)
• turkish (Short form for turkish airforce.)
• hapag (Short form for hapag lloyd.)
• french line (Short form for french lines.)
• british midland (This is the airline name. The radio call sign is midland.)
• berlin (Short form for air berlin.)
• algerie (Short form for air algerie.)
• hansa (Short form for lufthansa.)
• lufty (Short form for lufthansa.)
• luha (Short form for lufthansa.)
• france (Short form for airfrans.)
• meridiana (This is the airline name, which also used to be the radio call sign.
The official call sign was changed to merair at some point in the past.)
• tunis air (This is the airline name. The radio call sign is tunair.)
• malta (Short form for air malta.)
• lauda (Short form for lauda air.)
A.2.1.4. Additional Verified Call Signs
The following call sign occurred and is also verified:
• london airtours (This call sign is listed only in the simulation manual [161].)
166
A.2. Amendments to the Transcription Format
A.2.1.5. Additional Unverified Radio Call Signs
The following airline telephony designators could not be verified through any of the
available resources. They are transcribed as understood by the transcriptionist on a
best-guess basis and preceded by an at symbol (@).
@aerovic
@alpha
@aviva
@bama
@cheena
@cheeseburger
@color
@devec
@foxy
@hanseli
@indialook
@ingishire
@jose
@metavec
@nafamens
@period
@roystar
@sunwing
@taitian
@tele
A.2.2. Navigational Aids
A.2.2.1. Deviated Navaids
In certain cases the controller uses a deviated version of the official navaid name. The
following uses occurred:
• milano (Local Italian version for milan.)
• trasa (Short form for trasadingen.)
A.2.2.2. Additional Verified Navaids
The following additional navaids occurred and are verified as they were occasionally
spelled out by the controllers:
• corna
• gotil
A.2.3. Special Vocabulary and Abbreviations
The following ATC specific vocabulary and abbreviations occurred. This listing is most
likely incomplete.
• masp (Minimum Aviation System Performance standards, pronounced as one
word)
• ~r ~v ~s ~m (Reduced Vertical Separation Minimum)
• ~c ~v ~s ~m (Conventional Vertical Separation Minimum)
• ~i ~f runway (Initial Fix runway)
• sec (sector)
• freq (frequency)
A.2.4. Foreign Language Greetings and Words
Due to their frequent occurrence the following foreign language greetings and words
were transcribed, using a simplified spelling which avoids special characters:
167
Appendix A. ATCOSIM Transcription Format Specification
• hallo (German for ‘hello’)
• auf wiederhoren (German for ‘goodbye’)
• gruss gott (German for ‘hello’)
• servus (German for ‘hi’)
• guten morgen (German for ‘good morning’)
• guten tag (German for ‘hello’)
• adieu (German for ‘goodbye’)
• tschuss (German for ‘goodbye’)
• tschu (German for ‘goodbye’)
• danke (German for ‘thank you’)
• bonjour (French for ‘hello’)
• au revoir (French for ‘goodbye’)
• merci (French for ‘thank you’)
• hoi (Dutch for ‘hello’)
• dag (Dutch for ‘goodbye’)
• buongiorno (Italian for ‘hello’)
• arrivederci (Italian for ‘goodbye’)
• hejda (Swedish for ‘goodbye’)
• adios (Spanish for ‘goodbye’)
168
Appendix B
Aeronautical Radio Channel Modeling and
Simulation
In this appendix, the basic concepts in the modeling and simulation of
the mobile radio channel are reviewed. The propagation channel is time-
variant and dominated by multipath propagation, Doppler effect, path loss
and additive noise. Stochastic reference models in the equivalent complex
baseband facilitate a compact mathematical description of the channel’s
input-output relationship. The realization of these reference models as
filtered Gaussian processes leads to practical implementations of frequency
selective and frequency nonselective channel models. Three different small-
scale area simulations of the aeronautical voice radio channel are presented
and we demonstrate the practical implementation of a frequency flat fading
channel. Based on a scenario in air/ground communication the parameters
for readily available simulators are derived. The resulting outputs give
insight into the characteristics of the channel and can serve as a basis for the
design of digital transmission and measurement techniques. We conclude
that the aeronautical voice radio channel is a frequency nonselective flat
fading channel and that in most situations the frequency of the amplitude
fading is band-limited to the maximum Doppler frequency.
Parts of this chapter have been published in K. Hofbauer and G. Kubin, “Aeronautical voice radio
channel modelling and simulation—a tutorial review,” in Proceedings of the International Conference on
Research in Air Transportation (ICRAT), Belgrade, Serbia, Jul. 2006.
169
Appendix B. Aeronautical Radio Channel Modeling and Simulation
B.1. Introduction
Radio channel modeling has a long history and is still a very active area of research.
This is especially the case with respect to terrestrial mobile radio communications
and wideband data communications due to commercial interest. However, the results
are not always directly transferable to the aeronautical domain. A comprehensive
up-to-date literature review on channel modeling and simulation with the aeronautical
radio in mind is provided in [134]. It is highly recommended as a pointer for further
reading and its content is not repeated herein. In this appendix, we review the general
concepts of radio channel modeling and demonstrate the application of three readily
available simulators to the aeronautical voice channel.
B.2. Basic Concepts
This and the following section are based on the work of Pätzold [132] and provide a
summary of the basic characteristics, the modeling, and the simulation of the mobile
radio channel. Another comprehensive treatment on this extensive topic is given in
[133].
B.2.1. Amplitude Modulation and Complex Baseband
The aeronautical voice radio is based on the double-sideband amplitude modulation
(DSB-AM, A3E or simply AM) of a sinusoidal, unsuppressed carrier [65]. An analog
baseband voice signal x(t) which is band-limited to a bandwidth f
m
modulates the
amplitude of a sinusoidal carrier with amplitude A
0
, carrier frequency f
c
and initial
phase ϕ
0
. The modulated high frequency (HF) signal x
AM
(t) is defined as
x
AM
(t) = (A
0
+ kx(t)) cos(2πf
c
t + ϕ
0
)
with the modulation depth
m =
[kx(t)[
max
A
0
≤ 1
The real-valued HF signal can be equivalently written using complex notation and
ω
c
= 2πf
c
as
x
AM
(t) = Re
_
(A
0
+ kx(t))e

c
t
e

0
_
(B.1)
Under the assumption that f
c
¸ f
m
the HF signal can be demodulated and the
input signal x(t) reconstructed by detecting the envelope of the modulated sine wave.
The absolute value is low-pass filtered and the original amplitude of the carrier is
subtracted.
x(t) =
1
k
([[x
AM
(t)[]
LP
− A
0
)
Figure B.1 shows the spectra of the baseband signal and the corresponding HF
signal. Since the baseband signal is, by definition, low-pass filtered, the HF signal is a
170
B.2. Basic Concepts
Figure B.1.: Signal spectra of (a) the baseband signal x(t) and (b) the HF signal x
AM
(t)
with a carrier at f
c
and symmetric upper and lower sidebands (Source:
[139], modified).
bandpass signal and contains energy only around the carrier frequency and the lower
and upper sidebands LSB and USB.
In general, any real bandpass signal s(t) can be represented as the real part of a
modulated complex signal,
s(t) = Re
_
g(t)e

c
t
_
(B.2)
where g(t) is called the equivalent complex baseband or complex envelope of s(t) [139].
The complex envelope g(t) is obtained by downconversion of the real passband signal
s(t), namely
g(t) = (s(t) + j ˆ s(t))e
−jω
C
t
with ˆ s(t) being the Hilbert transform of s(t). The Hilbert transform removes the
negative frequency component of s(t) before downconversion [86]. A comparison of
(B.1) and (B.2) reveals that the complex envelope g
AM
(t) of the amplitude modulated
HF signal x
AM
(t) simplifies to
g
AM
(t) = (A
0
+ kx(t))e

0
The signal x(t) is reconstructed from the equivalent complex baseband of the HF signal
by demodulation with
x(t) =
1
k
([g
AM
(t)[ − A
0
) (B.3)
The complex baseband signal can be pictured as a time-varying phasor or vector in a
rotating complex plane. The rotating plane can be seen as a coordinate system for the
vector, which rotates with the angular velocity ω
c
.
171
Appendix B. Aeronautical Radio Channel Modeling and Simulation
Figure B.2.: Multipath propagation in an aeronautical radio scenario (Source: [162]).
In order to represent the HF signal as a discrete-time signal, it must be sampled with
a frequency of more than twice the carrier frequency. This leads to a large number
of samples and thus makes numerical simulation difficult even for very short signal
durations. Together with the carrier frequency ω
c
the complex envelope g
AM
(t) fully
describes the HF signal x
AM
(t). The complex envelope g
AM
(t) has the same bandwidth
[−f
m
; f
m
] as the baseband signal x(t). As a consequence it can be sampled with a much
lower sampling frequency, which facilitates efficient numerical simulation without
loss of generality. Most of the existing channel simulations are based on the complex
baseband signal representation.
B.2.2. Mobile Radio Propagation Channel
Proakis [65] defines the communication channel as “. . . the physical medium that is
used to send the signal from the transmitter to the receiver.” Radio channel modeling
usually also includes the transmitting and receiving antennas in the channel model.
B.2.2.1. Multipath Propagation
The transmitting medium in radio communications is the atmosphere or free space,
into which the signal is coupled as electromagnetic energy by an antenna. The received
electromagnetic signal can be a superposition of a line-of-sight path signal and multiple
waves coming from different directions. This effect is known as multipath propagation.
Depending on the geometric dimensions and the properties of the objects in a scene,
an electromagnetic wave can be reflected, scattered, diffracted or absorbed on its way
to the receiver.
From hereon we assume, without loss of generality, that the ground station transmits
and the aircraft receives the radio signal. The effects treated in this paper are identical
for both directions. As illustrated in Figure B.2, reflected waves have to travel a longer
distance to the aircraft and therefore arrive with a time-delay compared to the line-of-
sight signal. The received signal is spread in time and the channel is said to be time
dispersive. The time delays correspond to phase shifts in between the superimposed
172
B.2. Basic Concepts
waves and lead to constructive or destructive interference depending on the position
of the aircraft. As both the position and the phase shifts change constantly due to
the movement of the aircraft, the signal undergoes strong pseudo-random amplitude
fluctuations and the channel becomes a fading channel.
The multipath spread T
m
is the time delay between the arrival of the line-of-sight
component and the arrival of the latest scattered component. Its inverse B
CB
=
1
T
M
is the coherence bandwidth of the channel. If the frequency bandwidth W of the
transmitted signal is larger than the coherence bandwidth (W > B
CB
), the channel
is said to be frequency selective. Otherwise, if W < B
CB
, the channel is frequency
nonselective or flat fading. This means that all the frequency components of the
received signal are affected by the channel in the same way [65].
B.2.2.2. Doppler Effect
The so-called Doppler effect shifts the frequency content of the received signal due
to the movement of the aircraft relative to the transmitter. The Doppler frequency
f
D
, which is the difference between the transmitted and the received frequency, is
dependent on the angle of arrival α of the electromagnetic wave relative to the heading
of the aircraft.
f
D
= f
D,max
cos(α)
The maximum Doppler frequency f
D,max
, which is the largest possible Doppler shift, is
given by
f
D,max
=
v
c
f
c
(B.4)
where v is the aircraft speed, f
c
the carrier frequency and c = 3 10
8 m
s
the speed of
light.
The reflected waves arrive not only with different time-delays compared to the
line-of-sight signal, but as well from different directions relative to the aircraft heading
(Figure B.2). As a consequence, they undergo different Doppler shifts. This results in a
continuous distribution of frequencies in the spectrum of the signal and leads to the
so-called Doppler power spectral density or simply Doppler spectrum.
B.2.2.3. Channel Attenuation
The signal undergoes significant attenuation during transmission. The path loss is
dependent on the distance d and the obstacles between transmitter and receiver. It
is proportional to
1
d
p
, with the pathloss exponent p in the range of 2 ≤ p < 4. In the
optimal case of line-of-sight free space propagation p = 2.
B.2.2.4. Additive Noise
During transmission additive noise is imposed onto the signal. The noise results,
among others, from thermal noise in electronic components, from atmospheric noise
or radio channel interference, or from man-made noise such as engine ignition noise.
173
Appendix B. Aeronautical Radio Channel Modeling and Simulation
(a) Rayleigh PDF
40 RAYLEIGH AND RICE PROCESSES AS REFERENCE MODELS
cR :=
ρ
2

2
0
. (3.18)
From the limit ρ → 0, i.e., cR → 0, the Rice process ξ(t) results in the Rayleigh process
ζ(t), whose statistical amplitude variations are described by the Rayleigh distribution
[Pap91]
pζ(x) =



x
2
0
e

x
2
σ2
0 , x ≥ 0 ,
0 , x < 0 .
(3.19)
The probability density functions pξ(x) and pζ(x) according to (3.17) and (3.19) are
shown in the Figures 3.3(a) and 3.3(b), respectively.
(a) (b)
0 1 2 3 4
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
x
p
ζ
(
x
)
σ
o
2
=1/2
σ
o
2
= 1
σ
o
2
= 2
Figure 3.3: The probability density function of (a) Rice and (b) Rayleigh processes.
As mentioned before, the exact shape of the Doppler power spectral density Sµµ(f)
has no effect on the probability density of the absolute value of the complex Gaussian
random process, i.e., ξ(t) = |µρ(t)|. Analogously, this statement is also valid for
the probability density function of the phase ϑ(t) = arg{µρ(t)}, where ϑ(t) can be
expressed with (3.1), (3.2), and (3.4) as follows
ϑ(t) = arctan

µ2(t) + ρ sin (2πfρt + θρ)
µ1(t) + ρ cos (2πfρt + θρ)

. (3.20)
In order to confirm this statement, we study the probability density function pϑ(θ; t)
of the phase ϑ(t) given by the following relation [Pae98d]
(b) Rice PDF
40 RAYLEIGH AND RICE PROCESSES AS REFERENCE MODELS
cR :=
ρ
2

2
0
. (3.18)
From the limit ρ → 0, i.e., cR → 0, the Rice process ξ(t) results in the Rayleigh process
ζ(t), whose statistical amplitude variations are described by the Rayleigh distribution
[Pap91]
pζ(x) =



x
2
0
e

x
2
σ2
0 , x ≥ 0 ,
0 , x < 0 .
(3.19)
The probability density functions pξ(x) and pζ(x) according to (3.17) and (3.19) are
shown in the Figures 3.3(a) and 3.3(b), respectively.
(a)
0 1 2 3 4
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
x
p
ξ
(
x
)
ρ= 0 (Rayleigh)
ρ= 1/2
ρ= 1 σ
o
2
=1
(b)
Figure 3.3: The probability density function of (a) Rice and (b) Rayleigh processes.
As mentioned before, the exact shape of the Doppler power spectral density Sµµ(f)
has no effect on the probability density of the absolute value of the complex Gaussian
random process, i.e., ξ(t) = |µρ(t)|. Analogously, this statement is also valid for
the probability density function of the phase ϑ(t) = arg{µρ(t)}, where ϑ(t) can be
expressed with (3.1), (3.2), and (3.4) as follows
ϑ(t) = arctan

µ2(t) + ρ sin (2πfρt + θρ)
µ1(t) + ρ cos (2πfρt + θρ)

. (3.20)
In order to confirm this statement, we study the probability density function pϑ(θ; t)
of the phase ϑ(t) given by the following relation [Pae98d]
(c) Jakes PSD
ELEMENTARY PROPERTIES OF RICE AND RAYLEIGH PROCESSES 37
-100 -50 0 50 100
0
0.005
0.01
0.015
0.02
0.025
0.03
0.035
0.04
f (Hz)
S
µ
i
µ
i
(
f
)
(b)
τ
Figure 3.1: (a) Jakes power spectral density Sµiµi
(f) and (b) the corresponding
autocorrelation function rµiµi
(τ) (fmax = 91 Hz, σ
2
0 = 1).
will play an important role in the following, where fc denotes the 3-dB-cut-off
frequency.
Theoretical investigations in [Bel73] have shown that the Doppler power spectral
density of aeronautical channels has a Gaussian shape. Further information on the
measurements concerning the propagation characteristics of aeronautical satellite
channels can be found, for example, in [Neu87]. Although no absolute correspondence
to the obtained measurements could be proved, (3.11) can in most cases very well be
used as a sufficiently good approximation [Neu89]. For signal bandwidths up to some
10 kHz, the aeronautical satellite channel belongs to the class of frequency-nonselective
mobile radio channels [Neu89].
Especially for frequency-selective mobile radio channels, it has been shown [Cox73]
that the Doppler power spectral density of the far echoes deviates strongly from
the shape of the Jakes power spectral density. Hence, the Doppler power spectral
density is approximately Gaussian shaped and is generally shifted from the origin of
the frequency plane, because the far echoes mostly dominate from a certain direction
of preference. Specifications for frequency-shifted Gaussian power spectral densities
for the pan-European, terrestrial, cellular GSM system can be found in [COS86].
The inverse Fourier transform results for the Gaussian power spectral density (3.11)
in the autocorrelation function
rµiµi
(τ) = σ
2
0
e

π
fc √
ln 2
τ

2
. (3.12)
In Figure 3.2, the Gaussian power spectral density (3.11) is illustrated with the
corresponding autocorrelation function (3.12).
Characteristic quantities for the Doppler power spectral density Sµiµi
(f) are the
average Doppler shift B
(1)
µiµi
and the Doppler spread B
(2)
µiµi
[Bel63]. The average Doppler
shift (Doppler spread) describes the average frequency shift (frequency spread) that
(d) Gaussian PSD
38 RAYLEIGH AND RICE PROCESSES AS REFERENCE MODELS
-200 -100 0 100 200
0
1
2
3
4
5
6
7
x 10
-3
f (Hz)
S
µ
i
µ
i
(
f
)
(b)
Figure 3.2: (a) Gaussian power spectral density Sµiµi
(f) and (b) corresponding
autocorrelation function rµiµi
(τ) (fc =

ln 2fmax, fmax = 91 Hz, σ
2
0 = 1).
a carrier signal experiences during transmission. The average Doppler shift B
(1)
µiµi
is
the first moment of Sµiµi
(f) and the Doppler spread B
(2)
µiµi
is the square root of the
second central moment of Sµiµi
(f). Consequently, B
(1)
µiµi
and B
(2)
µiµi
are defined by
B
(1)
µiµi
:=


−∞
fSµiµi
(f)df


−∞
Sµiµi
(f)df
(3.13a)
and
B
(2)
µiµi
:=


−∞
(f −B
(1)
µiµi
)
2
Sµiµi
(f)df


−∞
Sµiµi
(f)df
, (3.13b)
for i = 1, 2, respectively. Equivalent — but often easier to calculate — expressions for
(3.13a) and (3.13b) can be obtained by using the autocorrelation function rµiµi
(τ) as
well as its first and second time derivative at the origin, i.e.,
B
(1)
µiµi
:=
1
2πj
·
˙ rµiµi
(0)
rµiµi
(0)
and B
(2)
µiµi
=
1

˙ rµiµi
(0)
rµiµi
(0)

2

¨ rµiµi
(0)
rµiµi
(0)
,(3.14a, b)
for i = 1, 2, respectively.
For the important special case where the Doppler power spectral densities Sµ1µ1
(f)
and Sµ2µ2
(f) are identical and symmetrical, ˙ rµiµi
(0) = 0 (i = 1, 2) holds. Hence, by
using (3.7), we obtain the following expressions for the corresponding characteristic
quantities of the Doppler power spectral density Sµµ(f)
B
(1)
µµ
= B
(1)
µiµi
= 0 and B
(2)
µµ
= B
(2)
µiµi
=

β
2πσ0
, (3.15a, b)
Figure B.3.: Probability density functions (PDF) and power spectral densities (PSD,
f
D,max
= 91 Hz, σ
2
0
= 1) for Rayleigh and Rice channels (Source: [132]).
B.2.2.5. Time Dependency
Most of the parameters described in this section vary over time due to the movement
of the aircraft. As a consequence the response of the channel to a transmitted signal
also varies, and the channel is said to be time-variant.
B.2.3. Stochastic Terms and Definitions
The following section recapitulates some basic stochastic terms in order to clarify the
nomenclature and notation used herein. The reader is encouraged to refer to [132] for
exact definitions.
Let the event A be a collection of a number of possible outcomes s of a random
experiment, with the real number P(A) being its probability measure. A random
variable µ is a mapping that assigns a real number µ(s) to every outcome s. The
cumulative distribution function
F
µ
(x) = P(µ ≤ x) = P(¦s[µ(s) ≤ x¦)
is the probability that the random variable µ is less or equal to x. The probability
density function (PDF, or simply density) p
µ
(x) is the derivative of the cumulative
distribution function,
p
µ
(x) =
dF
µ
(x)
dx
The most common probability density functions are the uniform distribution, where
the density is constant over a certain interval and is zero outside, and the Gaussian dis-
tribution or normal distribution N(m
µ
, σ
2
µ
), which is determined by the two parameters
expected value m
µ
and variance σ
2
µ
.
With µ
1
and µ
2
being two statistically independent normally distributed random
variables with identical variance σ
2
0
, the new random variable ζ =
_
µ
2
1
+ µ
2
2
represents
a Rayleigh distributed random variable (Figure B.3(a)). Given an additional real
parameter ρ, the new random variable ξ =
_

1
+ ρ)
2
+ µ
2
2
is Rice or Rician distributed
(Figure B.3(b)). A random variable λ = e
µ
is said to be lognormally distributed. A
multiplication of a Rayleigh and a lognormally distributed random variable η = ζλ
leads to the so-called Suzuki distribution.
174
B.3. Radio Channel Modeling
A stochastic process µ(t, s) is a collection of random variables, which is indexed by
a time index t. At a fixed time instant t = t
0
, the value of a random process, µ(t
0
, s),
is a random variable. On the other hand, in the case of a fixed outcome s = s
0
of
a random experiment, the value of the stochastic process µ(t, s
0
) is a time function,
or signal, that corresponds to the outcome s
0
. As is common practice, the variable
s is dropped in the notation for a stochastic process and µ(t) written instead. With
µ
1
(t) and µ
2
(t) being two real-valued stochastic processes, a complex-valued stochastic
process is defined by µ(t) = µ
1
(t) + jµ
2
(t). A stochastic process is called stationary if
its statistical properties are invariant to a shift in time. The Fourier transform of the
autocorrelation function of such a stationary process defines the power spectral density
or power density spectrum of the stochastic process.
B.3. Radio Channel Modeling
Section B.2.2.1 illustrated multipath propagation from a geometrical point of view.
However, geometrical modeling of the multipath propagation is possible only to a very
limited extent. It requires detailed knowledge of the geometry of all objects in the scene
and their electromagnetic properties. The resulting simulations are time consuming
to set up and computationally expensive, and a number of simplifications have to be
made. Furthermore the results are valid for the specific situation only and cannot
always be generalized. As a consequence, a stochastic description of the channel and
its properties is widely used. It focuses on the distribution of parameters over time
instead of trying to predict single values. This class of stochastic channel models is the
subject of the following investigations.
In large-scale areas with dimensions larger than tens of wavelengths of the carrier
frequency f
c
, the local mean of the signal envelope fluctuates mainly due to shadow-
ing and is found to be approximately lognormally distributed. This slow fading is
important for channel availability, handover, and mobile radio network planning.
More important for the design of a digital transmission technique is the fast signal
fluctuation, the fast fading, which occurs within small areas. As a consequence, we
focus on models that are valid for small-scale areas, where we can assume the path
loss and the local mean of the signal envelope due to shading, etc. , to be constant.
Furthermore we assume for the moment a frequency nonselective channel and, for
mathematical simplicity, the transmission of an unmodulated carrier. A more extensive
treatment can be found in [132], on which this section is based on.
B.3.1. Stochastic Mutlipath Reference Models
The sum µ(t) of all scattered components of the received signals can be assumed to be
normally distributed. If we let µ
1
(t) and µ
2
(t) be zero-mean statistically independent
Gaussian processes with variance σ
2
0
, then the sum of the scattered components is
given in complex baseband representation as a zero-mean complex Gaussian process
µ(t) and is defined by
Scatter: µ(t) = µ
1
(t) + jµ
2
(t) (B.5)
175
Appendix B. Aeronautical Radio Channel Modeling and Simulation
The line-of-sight (LOS) signal component m(t) is given by
LOS: m(t) = A
0
e
j(2πf
D

0
)
, (B.6)
again in complex baseband representation. The superposition µ
m
(t) of both signals is
LOS+Scatter: µ
m
(t) = m(t) + µ(t) (B.7)
Depending on the surroundings of the transmitter and the receiver, the received
signal consists of either the scatter components only or a superposition of LOS and
scatter components. In the first case (i.e. (B.5)) the magnitude of the complex baseband
signal [µ(t)[ is Rayleigh distributed. Its phase ∠(µ(t)) is uniformly distributed over
the interval [−π; π). This type of a Rayleigh fading channel is predominant in regions
where the LOS component is blocked by obstacles, such as in urban areas with high
buildings, etc.
In the second case where a LOS component and scatter components are present
(i.e. (B.7)), the magnitude of the complex baseband signal [µ(t) + m(t)[ is Rice dis-
tributed. The Rice factor k is determined by the ratio of the power of the LOS and
the scatter components, where k =
A
2
0

2
0
. This Rice fading channel dominates the
aeronautical radio channel.
One can derive the probability density of amplitude and phase of the received signal
based on the Rice or Rayleigh distributions. As a further step, it is possible to compute
the level crossing rate and the average duration of fades, which are important measures
required for the optimization of coding systems in order to address burst errors. The
exact formulas can be found in [132] and are not reproduced herein.
The power spectral density of the complex Gaussian random process in (B.7) cor-
responds to the Doppler power spectral density when considering the power of all
components, their angle of arrival and the directivity of the receiving antenna. Assum-
ing a Rayleigh channel with no LOS component, propagation in a two-dimensional
plane and uniformly distributed angles of arrival, one obtains the so-called Jakes power
spectral density as the resulting Doppler spectrum. Its shape is shown in Figure B.3(c).
However, both theoretical investigations and measurements have shown that the as-
sumption that the angle of arrival of the scattered components is uniformly distributed
does in practice not hold for aeronautical channels. This results in a Doppler spectrum
which is significantly different from the Jakes spectrum [163]. The Doppler power
spectral density is therefore better approximated by a Gaussian power spectral density,
which is plotted in Figure B.3(d). For nonuniformly distributed angles of arrival, as
with explicit directional echos, the Gaussian Doppler PSD is unsymmetrical and shifted
away from the origin. The characteristic parameters describing this spectrum are the
average Doppler shift (the statistic mean) and the Doppler spread (the square root of
the second central moment) of the Doppler PSD .
B.3.2. Realization of the Reference Models
The above reference models are based on colored Gaussian random processes. The
realization of these processes is not trivial and leads to the theory of deterministic
176
B.3. Radio Channel Modeling
processes. Mostly two fundamental methods are applied in the literature in order to
generate colored Gaussian processes. In the filter method, white Gaussian noise is
filtered by an ideal linear time-invariant filter with the desired power spectrum. In the
Rice method an infinite number of weighted sinusoids with equidistant frequencies
and random phases are superimposed. In practice both methods can only approxi-
mate the colored Gaussian process. Neither an ideal filter nor an infinite number of
sinusoids can be realized. A large number of algorithms used to determine the actual
parameters of the sinusoids in the Rice method exist. The methods approximate the
Gaussian processes with a sum of a limited number of sinusoids, thus considering the
computational expense [132]. For the filter method on the other hand, the problem
boils down to filter design with its well-understood limitations.
B.3.3. Frequency Nonselective Channel Models
In frequency nonselective flat fading channels, all frequency components of the received
signal are affected by the channel in the same way. The channel is modeled by a
multiplication of the transmitted signal with a suitable stochastic model process. The
Rice and Rayleigh processes described in Section B.3.1 can serve as statistical model
processes.
However it has been shown that the Rice and Rayleigh processes often do not provide
enough flexibility to adapt to the statistics of real world channels. This has led to the
development of more versatile stochastic model processes such as the Suzuki process
and its variations (a product of a lognormal distribution for the slow fading and a
Rayleigh distribution for the fast fading), the Loo Model with its variations, and the
generalized Rice process.
B.3.4. Frequency Selective Channel Models
Where channel bandwidth and data rate increase, the propagation delays can no longer
be ignored as compared to the symbol interval. The channel is then said to be frequency
selective and (over time) the different frequency components of a signal are affected
differently by the channel.
B.3.4.1. Tapped Delay Line Structure
For the modeling of a frequency selective channel, a tapped delay line structure is
typically applied as reference model (Figure B.4). The ellipses model of Parsons and
Bajwa [133] shows that all reflections and scatterings from objects located on an ellipse,
with the transmitter and receiver in the focal points, undergo the same time delay. This
leads to a complex Gaussian distribution of the received signal components for a given
time delay, assuming a large number of objects with different reflection properties in
the scene and applying the central limit theorem. As a consequence, the tap weights
c
i
(t) of the single paths are assumed to be uncorrelated complex Gaussian processes.
It is shown in Section B.2.3 that the amplitudes of the complex tap weights are then
177
Appendix B. Aeronautical Radio Channel Modeling and Simulation
Figure B.4.: Tapped delay line structure as a frequency selective and time variant
channel model (Source: [132]). W is the bandwidth of the transmitted
signal, c
i
(t) are uncorrelated complex Gaussian processes.
either Rayleigh or Rice distributed, depending on the mean of the Gaussian processes.
An analytic expression for the phases of the tap weights can be found in [132].
B.3.4.2. Linear Time-Variant System Description
The radio channel can be modeled as a linear time-variant system, with input and
output signals in the complex baseband representation. The system can be fully
described by its time-variant impulse response h(t, τ). In order to establish a statistical
description of the input/output relation of the above system, the channel is further
considered as a stochastic system, with h(t, τ) as its stochastic system function.
These input/output relations of the stochastic channel can be significantly simplified
assuming that the impulse response h(t, τ) is wide sense stationary
1
(WSS) in its
temporal dependence on t, and assuming that scattering components with different
propagation delays τ are statistically uncorrelated (uncorrelated scattering (US)). Based
on these two assumptions, Bello proposed in 1963 the class of WSSUS models. They
are nowadays widely used and are of great importance in channel modeling. They are
based on the tapped delay line structure and allow the computation of all correlation
functions, power spectral densities and properties such as Doppler and delay spread,
etc., from a given scattering function. The scattering function may be obtained by the
measurement of real channels, by specification, or both. For example, the European
working group ‘COST 207’ published scattering functions in terms of delay power spec-
tral densities and Doppler power spectral densities for four propagation environments
which are claimed to be typical for mobile cellular communication.
1
Measurements have shown that this assumption is valid for areas smaller than tens of wavelengths of
the carrier frequency f
c
.
178
B.4. Aeronautical Voice Channel Simulation
B.3.5. AWGN Channel Model
The noise that is added to the transmitted signal during transmission is typically
represented as an additive white Gaussian noise (AWGN) process. The main parameter
of the model is the variance σ
2
0
of the Gaussian process, which together with the signal
power defines the signal-to-noise ratio (SNR) of the output signal [65]. The AWGN
channel is usually included as an additional block after the channel models described
earlier.
B.4. Aeronautical Voice Channel Simulation
This section aims to present three different simulators which implement the above
radio channel models. As mentioned earlier, the models are based on a small-scale
area assumption where path loss and shadowing are assumed to be constant. We first
define a simulation scenario based on which we show the simulators’ input parameters
and the resulting channel output. We use as example the aeronautical VHF voice radio
channel between a fixed ground station and a general aviation aircraft which is flying
at its maximum speed.
The input and output signals of all three simulators are equivalent complex baseband
signals, and the same Matlab-based pre- and post-processing of the signals is used for
all simulators. The processing consists of bandpass pre- and post-filtering, conversion
to and from complex baseband, and amplitude modulation and demodulation.
B.4.1. Simulation Scenario and Parameters
For air-ground voice communication in civil air traffic control, the carrier frequency f
c
is within a range from 118 MHz to 137 MHz, the ‘very high frequency’ (VHF) band.
The 760 channels are spaced 25 kHz apart. The channel spacing is reduced to 8.33 kHz
in specific regions of Europe in order to increase the number of available channels to a
theoretical maximum of 2280. According to specification, the frequency response of
the transmitter is required to be flat between 0.3 kHz to 2.5 kHz with a sharp cut-off
below and above this frequency range [97], resulting in an analog channel bandwidth
W = 2.2 kHz.
For the simulations, we assume a carrier with amplitude A
0
= 1, frequency f
c
=
120 MHz and initial phase ϕ
0
=
π
4
, a channel spacing of 8.33 kHz, a modulation depth
m = 0.8 and an input signal which is band-limited to f
l
= 300 Hz to f
m
= 2.5 kHz. For
the illustrations we use a purely sinusoidal baseband input signal x(t) = sin(2πf
a
t)
with f
a
= 500 Hz, which is sampled with a frequency of f
sa
= 8000 Hz and bandpass
filtered according to the above specification. Figure B.5 shows all values that the
amplitude modulated signal x
AM
(t) takes on during the observation interval in the
equivalent complex baseband representation g
AM
(t). The white circle represents the
unmodulated carrier signal, which is a single point in the equivalent complex baseband.
A short segment of the magnitude of the signal, [g
AM
(t)[, is also shown in the figure.
In the propagation model a general aviation aircraft with a speed of v = 60
m
s
is
assumed. Using (B.4), this results in a maximum Doppler frequency of f
D,max
= 24 Hz.
179
Appendix B. Aeronautical Radio Channel Modeling and Simulation
0 0.7 1
0
0.7
1
In−Phase: Re[g(t)]
Q
u
a
d
r
a
t
u
r
e
:

I
m
[
g
(
t
)
]
1 1.005 1.01
0
0.5
1
1.5
2
Time in s
M
a
g
n
i
t
u
d
e
:

|
g
(
t
)
|
Figure B.5.: Sinusoidal AM signal in equivalent complex baseband. Left: In-phase and
quadrature components. The white circle indicates an unmodulated carrier.
Right: The magnitude of g
AM
(t).
Given the carrier frequency f
c
, the wavelength is λ =
c
f
c
= 2.5 m. This distance λ is
covered in t
λ
= 0.0417 s. Furthermore, we assume that the aircraft flies at a height
of h
2
= 3000 m and at a distance of d = 10 km from the ground station. The ground
antenna is considered to be mounted at a height of h
1
= 20 m. The geometric path
length difference ∆l between the line-of-sight path and the dominant reflection along
the vertical plane on the horizontal flat ground evaluates to
∆l =
¸
h
2
1
+
_
h
1
d
h
1
+ h
2
_
2
+
¸
h
2
2
+
_
d −
h
1
d
h
1
+ h
2
_
2

_
d
2
+ (h
2
−h
1
)
2
= 11.5 m
which corresponds to a path delay of ∆τ = 38.3 ns. In a worst case scenario with a
multipath spread of T
m
= 10∆τ, the coherence bandwidth is still B
CB
= 2.6 MHz. With
B
CB
¸ W, according to Section B.2.2.1 the channel is surely frequency nonselective.
Worst-case multipath spreads of T
m
= 200 µs as reported in [163] cannot be explained
with a reflection in the vertical plane, but only with a reflection on far-away steep
slopes. In these rare cases, the resulting coherence bandwidth is in the same order of
magnitude as the channel bandwidth.
We cannot confirm the rule of thumb given in [163] where ∆l ≈ h
2
given d ¸
h
2
. For example, a typical case for commercial aviation where h
1
= 30 m, h
2
=
10000 m and d = 100 km results in a path difference of ∆l = 6.0 m. In the special case
of a non-elevated ground antenna with h
1
≈ 0 the path delay vanishes. In contrast,
large path delays only occur in situations with large h
1
and h
2
and small d such as in
air-to-air communication, which is not considered herein.
The Rician factor k is assumed to be k = 12 dB, which corresponds to a fairly strong
line-of-sight signal [163].
B.4.2. Mathworks Communications Toolbox Implementation
The Mathworks Communications Toolbox for Matlab [164] implements a multipath
fading channel model. The simulator supports multiple fading paths, of which the first
is Rice or Rayleigh distributed and the subsequent paths are Rayleigh distributed. The
180
B.4. Aeronautical Voice Channel Simulation
Doppler spectrum is approximated by the Jakes spectrum. As shown in Section B.3.1,
the Jakes Doppler spectrum is not suitable for the aeronautical channel. The preferable
Gaussian Doppler spectrum is unfortunately not supported by the simulator. The
toolbox provides a convenient tool for the visualization of the impulse and frequency
response, the gain and phasor of the multipath components, and the evolution of these
quantities over time.
In terms of implementation, the toolbox models the channel as a time-variant linear
FIR filter. Its tap-weights g(m) are given by a sampled and truncated sum of shifted
sinc functions. They are shifted by the path delays τ
k
of the k
th
path, weighted by
the average power gain p
k
of the corresponding path, and weighted by a random
process h
k
(n). The uncorrelated random processes h
k
(n) are filtered Gaussian random
processes with a Jakes power spectral density. This results in
g(m) =

k
sinc
_
τ
k
1/f
sa
−m
_
h
k
(n)p
k
.
The equation shows once again that when all path delays are small as compared to the
sample period, the sinc terms coincide. This results in a filter with only one tap and
consequently in a frequency nonselective channel.
In our scenario the channel is frequency-flat, and a model according to Section B.3.3
with one Rician path is appropriate. The only necessary input parameters for the
channel model are f
sa
, f
D,max
and k.
The output of the channel for the sinusoidal input signal as defined above is shown
in Figure B.6(a). The demodulated signal (using (B.3) and shown in Figure B.6(b))
reveals the amplitude fading of the channel due to the Rician distribution of the signal
amplitude. It is worthwhile noticing that the distance between two maxima is roughly
one wavelength λ. This fast fading results from the superposition of the line-of-sight
component and the multitude of scattered components with Gaussian distribution. As
shown in Figure B.6(c), bandpass filtering the received signal according to the specified
channel bandwidth does not remove the amplitude fading.
The toolbox also allows a model structure with several discrete paths similar to
Figure B.4. One can specify the delay and the average power of each path. A scenario
similar to the first one with two distinct paths is shown for comparison. We define one
Rician path with a Rician factor of k = 200. This means that it contains only the line of
sight signal and no scattering. We furthermore define one Rayleigh path with a relative
power of -12 dB and a time delay of ∆τ = 38.3 ns, both relative to the LOS path.
Due to the small bandwidth of our channel, the results are equivalent to the first
scenario. Figure B.7 shows the time-variation of the power of the two components,
with the total power being normalized to 0 dB.
B.4.3. The Generic Channel Simulator Implementation
The Generic Channel Simulator (GCS) is a radio channel simulator which was de-
veloped between 1994 and 1998 under contract of the American Federal Aviation
Administration (FAA). Its source code and documentation is provided in [165]. The
181
Appendix B. Aeronautical Radio Channel Modeling and Simulation
(a) Channel Outpout
0 1 2
−0.5
0
0.5
1
1.5
2
In−Phase: Re[g(t)]
Q
u
a
d
r
a
t
u
r
e
:

I
m
[
g
(
t
)
]
(b) Demodulation
0 1 2 3 4 5 6 7 8
−1.5
−1
−0.5
0
0.5
1
1.5
Time in units of t
λ
=0.042s
A
m
p
l
i
t
u
d
e
(c) Bandpass Filter
0 0.1 0.2 0.3
−1.5
−1
−0.5
0
0.5
1
1.5
Time in s
A
m
p
l
i
t
u
d
e
Figure B.6.: Received signal at different processing stages. Received signal (channel
output of Mathworks Communication Toolbox and an observation interval
of 2 s) (a) in equivalent complex baseband, (b) after demodulation, and
(c) after bandpass filtering.
0 0.2 0.4 0.6
−30
−25
−20
−15
−10
−5
0
Time in s
C
o
m
p
o
n
e
n
t
s

P
o
w
e
r

i
n

d
B
Figure B.7.: Power of the line-of-sight component (top) and the Rayleigh distributed
scattered components (bottom).
182
B.4. Aeronautical Voice Channel Simulation
(a) Channel Outpout (b) Demodulation
0 100 200 300
−2
−1
0
1
2
Time in s
A
m
p
l
i
t
u
d
e
Figure B.8.: Generic Channel Simulator: Received signal in an observation interval of
300 s (channel output) (a) in equivalent complex baseband and (b) after
demodulation.
GCS written in ANSI C and provides a graphical MOTIF-based interface and a com-
mand line interface to enter the model parameters. Data input and output files are in a
binary floating point format and contain the signal in equivalent complex baseband
representation. The last publicly available version of the software dates back to 1998.
This version requires a fairly complex installation procedure and a number of adjust-
ments in order to enable compiling of the source code on current operating systems.
We provide some advice on how to install the software on the Mac OS X operating
system and how to interface the simulator with Matlab [166].
The GCS allows the simulation of various types of mobile radio channels, the
VHF air/ground channel among others. Similar to the Mathworks toolbox, the GCS
simulates the radio channel by a time-variant IIR filter. The scatter path delay power
spectrum shape is approximated by a decaying exponential multiplying a zeroth-order
modified Bessel function, the Doppler power spectrum is assumed to have a Gaussian
shape.
In the following example we use a similar setup as in the second scenario in
Section B.4.2, a discrete line-of-sight signal and a randomly distributed scatter path
with a power of -12 dB. With the same geometric configuration as above, the GCS
software confirms our computed time delay of ∆τ = 38.3 ns.
Using the same parameters for speed, geometry, frequency, etc., as in the scenarios
described above, we obtain a channel output as shown in Figure B.8.
The time axis in the plot of the demodulated signal is very different as compared
to the one in Figure B.6(c). The amplitude fading of the signal has a similar shape as
before but is in the order of three magnitudes slower than observed in Section B.4.2.
This contradicting result cannot be explained by the differing model assumptions, nor
does it coincide with first informal channel measurements that we pursued.
We believe that the out-dated source code of the GCS has legacy issues which lead
to problems with current operating system and compiler versions.
183
Appendix B. Aeronautical Radio Channel Modeling and Simulation
(a) Channel Outpout
0 1 2
−0.5
0
0.5
1
1.5
2
In−Phase: Re[g(t)]
Q
u
a
d
r
a
t
u
r
e
:

I
m
[
g
(
t
)
]
(b) Demodulation and Filtering
0 0.1 0.2 0.3
−1.5
−1
−0.5
0
0.5
1
1.5
Time in s
A
m
p
l
i
t
u
d
e
Figure B.9.: Reference channel model with Jakes PSD: Received signal (channel out-
put) (a) in equivalent complex baseband and (b) after demodulation and
bandpass filtering.
B.4.4. Direct Reference Model Implementation
The third channel simulation is based on a direct implementation of the concept
described in Section B.3.3 with the Matlab routines provided in [132]. We model
the channel by multiplying the input signal with the complex random process µ
m
(t)
as given in (B.7). The random processes are generated with the sum of sinusoids
approach as discussed in Section B.3.2. Figure B.9 shows the channel output using a
Jakes Doppler PSD and a Rician reference model. The result is very similar to the one
obtained with the Mathworks toolbox.
In a second example, a Gaussian distribution is used for the Doppler power spectral
density instead of the Jakes model. The additional parameter f
D,co
describes the width
of the Gaussian PSD by its 3 dB cut-off frequency. The value is arbitrarily set to
f
D,co
= 0.3f
D,max
. This corresponds to a fairly small Doppler spread of B = 6.11 Hz,
compared to the Jakes PSD with B = 17 Hz. Figure B.10 shows the resulting discrete
Gaussian PSD. The channel output in Figure B.11 confirms the smaller Doppler spread
by a narrower lobe in the scatter plot. The amplitude fading is by a factor of two slower
than with the Jakes PSD.
B.5. Discussion
We further examine the path gain of the frequency nonselective flat fading channel
simulated in Section B.4.2. This path gain is plotted in Figure B.12(a) and corresponds
to the amplitude fading of the received signal shown in Figure B.6(b). Considering
the time-variant path gain as a time series, its magnitude spectrum (shown in Fig-
ure B.12(b)) reveals that the maximum frequency with which the path gain changes
approximately coincides with the maximum Doppler frequency f
D,max
=
v
c
f
c
of the
channel. The amplitude fading is band-limited to the maximum Doppler frequency.
This fact can be explained with the concept of beat as known in acoustics [167].
184
B.5. Discussion
−20 −10 0 10 20
0
0.5
1
x 10
−3
Doppler frequency in Hz
D
o
p
p
l
e
r

P
S
D
Figure B.10.: Discrete Gaussian Doppler PSD with f
D,max
= 24 Hz and a 3-dB-cut-off
frequency of f
D,co
= 7.2 Hz.
(a) Channel Outpout
0 1 2
−0.5
0
0.5
1
1.5
2
In−Phase: Re[g(t)]
Q
u
a
d
r
a
t
u
r
e
:

I
m
[
g
(
t
)
]
(b) Demodulation and Filtering
0 0.2 0.4 0.6
−1.5
−1
−0.5
0
0.5
1
1.5
Time in s
A
m
p
l
i
t
u
d
e
Figure B.11.: Reference channel model with Gaussian PSD: Received signal (channel
output) (a) in equivalent complex baseband and (b) after demodulation
and bandpass filtering.
(a) Path Gain
0 0.2 0.4 0.6
−15
−10
−5
0
5
Time in s
P
a
t
h

G
a
i
n

i
n

d
B
(b) Path Gain Spectrum
0 10 24 40 60
−60
−50
−40
−30
−20
−10
0
Frequency in Hz
D
F
T

M
a
g
n
i
t
u
d
e

i
n

d
B
Figure B.12.: Time-variant path gain and its magnitude spectrum for a flat fading
channel with f
D,max
= 24 Hz.
185
Appendix B. Aeronautical Radio Channel Modeling and Simulation
The superposition of two sinusoidal waves with slightly different frequencies f
1
and
f
2
leads to a special type of interference, where the envelope of the resulting wave
modulates with a frequency of f
1
− f
2
. The maximum frequency difference f
1
− f
2
between a scatter component and the carrier frequency f
c
with which we demodulate
in the simulation is given by the maximum Doppler frequency. This explains the
band-limitation of the amplitude fading.
In a real world system a potentially coherent receiver may demodulate the HF signal
with a reconstructed carrier frequency
ˆ
f
c
which is already Doppler shifted. In this
case, the maximum frequency difference between Doppler shifted carrier and Doppler
shifted scatter component is 2f
D,max
. This maximum occurs when the aircraft points
towards the ground station so that the LOS signal arrives from in front of the aircraft,
and when at the same time a scatter component arrives from the back of the aircraft
[163]. We can thereof conclude that the amplitude fading of the frequency nonselective
aeronautical radio channel is band-limited to twice the maximum Doppler frequency
f
D,max
.
For a channel measurement system as presented in Chapter 8 the maximum fre-
quency of the amplitude fading entails that the fading of the channel has to be
sampled with at least double the frequency, that means with a sampling frequency of
f
ms
≥ 4f
D,max
, to avoid aliasing. With the parameters for a general aviation aircraft used
throughout this appendix, this means that the amplitude scaling has to be sampled
with a frequency of f
ms
> 96 Hz or, in terms of an audio sample rate f
sa
= 8 kHz,
at least every 83 samples. For a measurement system based on maximum length
sequences (MLS, see Chapter 8) this means that the MLS length should be no longer
than L = 2
n
−1 = 63 samples.
A factor that is not considered in the presented channel models is the likely presence
of automatic gain controls (AGC) in the radio transmitter and receiver. Since the
presented models are propagation channel models, transceiver components are by
definition not part of these models. Transceiver modeling is in fact a discipline by
itself and is outside the scope of this work. However, it should be noted that an AGC
in the receiver might partially compensate for the amplitude modulation induced
by the fading channel. Figure B.13(a) and B.13(b) show the demodulated signal of
Section B.4.2 after the bandpass filtering and after the application of a simulated
automatic gain control with a cut-off frequency of 300 Hz. In theory, carrier wave
amplitude modulations with a frequency of less than f
l
= 300 Hz should not be caused
by the band-limited input signal but by the fading of the channel. An automatic gain
control could therefore detect all low-frequency modulations and compensate for all
fading with a frequency up to f
l
= 300 Hz. The actual implementations for automatic
gain controls and their parameters are however highly manufacturer-specific and very
little information is publicly available.
We considered throughout this appendix the case of a small general aviation aircraft
as used in the channel measurements presented in Chapter 8. It is also of interest to
briefly discuss the case of large commercial aviation passenger aircraft. The relevant
parameters are a typical cruising speed of around 250 m/s and a cruising altitude
of around 10000 m. In combination with the parameters of Section B.4.1 this results
in a maximum Doppler frequency of f
D,max
= 100 Hz. The upper band limit of the
186
B.6. Conclusions
(a) Bandpass Filter
0 0.1 0.2 0.3
−1.5
−1
−0.5
0
0.5
1
1.5
Time in s
A
m
p
l
i
t
u
d
e
(b) Automatic Gain Control
0 0.1 0.2 0.3
−1.5
−1
−0.5
0
0.5
1
1.5
Time in s
A
m
p
l
i
t
u
d
e
Figure B.13.: Received signal (a) before and (b) after an automatic gain control.
amplitude fading, equaling 2f
D,max
= 200 Hz, approaches the lower band limit of
the audio signal. The path delay of ∆l = 6.0 m or ∆τ = 20 ns results in a coherence
bandwidth of B
CB
= 5 MHz, and the channel is thus again frequency nonselective and
flat fading.
In this appendix, we dealt with the Doppler shift of the modulated carrier wave,
which leads in combination with the multipath propagation to signal fading. It is
worthwhile to note that also the modulating signal, that means the audio signal that is
represented in the amplitude of the carrier wave, undergoes a Doppler shift. However,
this Doppler shift is so small that it can be neglected in practice. For example, for a
sinusoidal audio signal with a frequency of 1000 Hz and an aircraft speed of 250 m/s
the maximum Doppler frequency using (B.4) is 8.3 10
−4
Hz.
B.6. Conclusions
We reviewed the most common radio propagation channel models and performed
simulations for the aeronautical voice radio channel. We conclude that due to its narrow
bandwidth the aeronautical voice radio channel is a frequency nonselective flat fading
channel. In most situations the frequency of the amplitude fading is band-limited to
the maximum Doppler frequency, which is a simple function of the aircraft speed and
the carrier frequency. The amplitude fading of the channel might in parts be mitigated
by an automatic gain control in the radio receiver.
187
Appendix B. Aeronautical Radio Channel Modeling and Simulation
188
Appendix C
Complementary Figures
This appendix presents complementary figures for Section 10.4, which are provided
for reference.
189
Appendix C. Complementary Figures
20 30 40 50 60 70
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
PSNR (in dB, wrt. processed signal, measured in 20ms window)
B
i
t

e
r
r
o
r

r
a
t
i
o


Original signal
Original signal with silence removed [sr]
sr+multiband compr. and limiting [mbcl]
sr+mbcl + unvoiced gain (+3dB) [ug]
sr+mbcl + unvoiced gain (+6dB)
sr+mbcl + waterm. floor (at −SNR−3dB) [wmfl]
sr+mbcl + wmfl (at −SNR) + ug (+3dB)
sr+mbcl + wmfl (at −SNR+3dB)
sr+mbcl + wmfl (at −SNR+3dB) + ug (+3dB)
0 10 20 30 40 50 60
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
SNR (in dB, wrt. processed signal)
B
i
t

e
r
r
o
r

r
a
t
i
o


Original signal
Original signal with silence removed [sr]
sr+multiband compr. and limiting [mbcl]
sr+mbcl + unvoiced gain (+3dB) [ug]
sr+mbcl + unvoiced gain (+6dB)
sr+mbcl + waterm. floor (at −SNR−3dB) [wmfl]
sr+mbcl + wmfl (at −SNR) + ug (+3dB)
sr+mbcl + wmfl (at −SNR+3dB)
sr+mbcl + wmfl (at −SNR+3dB) + ug (+3dB)
Figure C.1.: Noise robustness curves identical to Figure 10.4, but plotted against differ-
ent SNR measures.
190
Figure C.2.: Multiband compressor and limiter settings for transmitter-side speech
enhancement.
191
Appendix C. Complementary Figures
192
Bibliography
[1] K. Hofbauer, G. Kubin, and W. B. Kleijn, “Speech watermarking for analog
flat-fading bandpass channels,” IEEE Transactions on Audio, Speech, and Language
Processing, 2009, revised and resubmitted.
[2] K. Hofbauer and H. Hering, “Noise robust speech watermarking with bit syn-
chronisation for the aeronautical radio,” in Information Hiding, ser. Lecture Notes
in Computer Science. Springer, 2007, vol. 4567/2008, pp. 252–266.
[3] K. Hofbauer and G. Kubin, “High-rate data embedding in unvoiced speech,” in
Proceedings of the International Conference on Spoken Language Processing (INTER-
SPEECH), Pittsburgh, PY, USA, Sep. 2006, pp. 241–244.
[4] K. Hofbauer, H. Hering, and G. Kubin, “A measurement system and the TUG-
EEC-Channels database for the aeronautical voice radio,” in Proceedings of the
IEEE Vehicular Technology Conference (VTC), Montreal, Canada, Sep. 2006, pp. 1–5.
[5] K. Hofbauer, S. Petrik, and H. Hering, “The ATCOSIM corpus of non-prompted
clean air traffic control speech,” in Proceedings of the International Conference on
Language Resources and Evaluation (LREC), Marrakech, Morocco, May 2008.
[6] K. Hofbauer and G. Kubin, “Aeronautical voice radio channel modelling and
simulation—a tutorial review,” in Proceedings of the International Conference on
Research in Air Transportation (ICRAT), Belgrade, Serbia, Jul. 2006.
[7] K. Hofbauer, H. Hering, and G. Kubin, “Speech watermarking for the VHF radio
channel,” in Proceedings of the EUROCONTROL Innovative Research Workshop
(INO), Brétigny-sur-Orge, France, Dec. 2005, pp. 215–220.
[8] K. Hofbauer and H. Hering, “Digital signatures for the analogue radio,” in
Proceedings of the NASA Integrated Communications Navigation and Surveillance
Conference (ICNS), Fairfax, VA, USA, 2005.
193
Bibliography
[9] K. Hofbauer, “Advanced speech watermarking for secure aircraft identification,”
in Proceedings of the EUROCONTROL Innovative Research Workshop (INO), Brétigny-
sur-Orge, France, Dec. 2004.
[10] M. Gruber and K. Hofbauer, “A comparison of estimation methods for the VHF
voice radio channel,” in Proceedings of the CEAS European Air and Space Conference
(Deutscher Luft- und Raumfahrtkongress), Berlin, Germany, Sep. 2007.
[11] H. Hering and K. Hofbauer, “From analogue broadcast radio towards end-to-end
communication,” in Proceedings of the AIAA Aviation Technology, Integration, and
Operations Conference (ATIO), Anchorage, Alaska, USA, Sep. 2008.
[12] H. Hering and K. Hofbauer, “Towards selective addressing of aircraft with voice
radio watermarks,” in Proceedings of the AIAA Aviation Technology, Integration, and
Operations Conference (ATIO), Belfast, Northern Ireland, Sep. 2007.
[13] I. Cox, M. Miller, J. Bloom, J. Fridrich, and T. Kalker, Digital Watermarking and
Steganography, 2nd ed. Morgan Kaufmann, 2007.
[14] W. Bender, D. Gruhl, N. Morimoto, and A. Lu, “Techniques for data hiding,”
IBM Systems Journal, vol. 35, no. 3/4, pp. 313–336, 1996.
[15] N. Cvejic and T. Seppanen, Digital Audio Watermarking Techniques and Technologies:
Applications and Benchmarks. IGI Global, 2007.
[16] I. J. Cox, J. Kilian, F. T. Leighton, and T. Shamoon, “Secure spread spectrum
watermarking for multimedia,” IEEE Transactions on Image Processing, vol. 6,
no. 12, pp. 1673–1687, Dec. 1997.
[17] D. Kirovski and H. S. Malvar, “Spread-spectrum watermarking of audio signals,”
IEEE Transactions on Signal Processing, vol. 51, no. 4, pp. 1020–1033, Apr. 2003.
[18] N. Lazic and P. Aarabi, “Communication over an acoustic channel using data
hiding techniques,” IEEE Transactions on Multimedia, vol. 8, no. 5, pp. 918–924,
Oct. 2006.
[19] X. He and M. S. Scordilis, “Efficiently synchronized spread-spectrum audio
watermarking with improved psychoacoustic model,” Research Letters in Signal
Processing, vol. 8, no. 1, pp. 1–5, Jan. 2008.
[20] M. H. M. Costa, “Writing on dirty paper,” IEEE Transactions on Information Theory,
vol. 29, no. 3, pp. 439–441, May 1983.
[21] I. J. Cox, M. L. Miller, and A. L. McKellips, “Watermarking as communications
with side information,” Proceedings of the IEEE, vol. 87, no. 7, pp. 1127–1141, Jul.
1999.
[22] B. Chen and G. W. Wornell, “Quantization index modulation: A class of provably
good methods for digital watermarking and information embedding,” IEEE
Transactions on Information Theory, vol. 47, no. 4, pp. 1423–1443, May 2001.
194
Bibliography
[23] J. J. Eggers, R. Bäuml, R. Tzschoppe, and B. Girod, “Scalar Costa scheme for
information embedding,” IEEE Transactions on Signal Processing, vol. 51, no. 4, pp.
1003–1019, Apr. 2003.
[24] F. Perez-Gonzalez, C. Mosquera, M. Barni, and A. Abrardo, “Rational dither
modulation: A high-rate data-hiding method invariant to gain attacks,” IEEE
Transactions on Signal Processing, vol. 53, no. 10, pp. 3960–3975, Oct. 2005.
[25] I. D. Shterev, “Quantization-based watermarking: Methods for amplitude scale
estimation, security, and linear filtering invariance,” Ph.D. dissertation, Delft
University of Technology, 2007.
[26] X. Wang, W. Qi, and P. Niu, “A new adaptive digital audio watermarking based
on support vector regression,” IEEE Transactions on Audio, Speech, and Language
Processing, vol. 15, no. 8, pp. 2270–2277, Nov. 2007.
[27] Q. Cheng and J. Sorenson, “Spread spectrum signaling for speech watermarking,”
in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), vol. 3, Salt Lake City, UT, USA, May 2001, pp. 1337–1340.
[28] M. Hagmüller, H. Hering, A. Kröpfl, and G. Kubin, “Speech watermarking
for air traffic control,” in Proceedings of the European Signal Processing Conference
(EUSIPCO), Vienna, Austria, Sep. 2004, pp. 1653–1656.
[29] M. Hatada, T. Sakai, N. Komatsu, and Y. Yamazaki, “Digital watermarking based
on process of speech production,” in Multimedia Systems and Applications V, ser.
Proceedings of SPIE, 2002, vol. 4861, pp. 258–267.
[30] S. Chen and H. Leung, “Speech bandwidth extension by data hiding and phonetic
classification,” in Proceedings of the IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), vol. 4, Honolulu, Hawaii, USA, Apr. 2007,
pp. 593–596.
[31] M. Celik, G. Sharma, and A. M. Tekalp, “Pitch and duration modification for
speech watermarking,” in Proceedings of the IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), vol. 2, Philadelphia, PA, USA,
Mar. 2005, pp. 17–20.
[32] A. Sagi and D. Malah, “Bandwidth extension of telephone speech aided by data
embedding,” EURASIP Journal on Advances in Signal Processing, vol. 2007, no.
64921, 2007.
[33] B. Geiser, P. Jax, and P. Vary, “Artificial bandwidth extension of speech supported
by watermark-transmitted side information,” in Proceedings of the European Con-
ference on Speech Communication and Technology (EUROSPEECH), Lisbon, Portugal,
Sep. 2005, pp. 1497–1500.
[34] S. Sakaguchi, T. Arai, and Y. Murahara, “The effect of polarity inversion of speech
on human perception and data hiding as an application,” in Proceedings of the
195
Bibliography
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
vol. 2, Istanbul, Turkey, Jun. 2000, pp. 917–920.
[35] L. Girin and S. Marchand, “Watermarking of speech signals using the sinusoidal
model and frequency modulation of the partials,” in Proceedings of the IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1,
Montreal, Canada, May 2004, pp. 633–636.
[36] Y.-W. Liu and J. O. Smith, “Audio watermarking through deterministic plus
stochastic signal decomposition,” EURASIP Journal on Information Security, vol.
2007, no. 1, pp. 1–12, 2007.
[37] A. Gurijala and J. Deller, “On the robustness of parametric watermarking of
speech,” in Multimedia Content Analysis and Mining, ser. Lecture Notes in Com-
puter Science. Springer, 2007, vol. 4577/2007, pp. 501–510.
[38] R. C. F. Tucker and P. S. J. Brittan, “Method for watermarking data,” U.S. Patent
US 2003/0 028 381 A1, Feb. 6, 2003.
[39] S. Chen and H. Leung, “Concurrent data transmission through PSTN by CDMA,”
in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS),
Island of Kos, Greece, May 2006, pp. 3001–3004.
[40] S. Chen, H. Leung, and H. Ding, “Telephony speech enhancement by data hiding,”
IEEE Transactions on Instrumentation and Measurement, vol. 56, no. 1, pp. 63–74,
Feb. 2007.
[41] R. Gray, A. Buzo, A. Gray, Jr., and Y. Matsuyama, “Distortion measures for speech
processing,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28,
no. 4, pp. 367–376, Aug. 1980.
[42] G. Kubin, B. S. Atal, and W. B. Kleijn, “Performance of noise excitation for
unvoiced speech,” in Proceedings of the IEEE Workshop on Speech Coding for Telecom-
munications, Saint-Adele, Canada, Oct. 1993, pp. 35–36.
[43] D. Schulz, “Improving audio codecs by noise substitution,” Journal of the Audio
Engineering Society, vol. 44, no. 7/8, pp. 593–598, Jul. 1996.
[44] A. Takahashi, R. Nishimura, and Y. Suzuki, “Multiple watermarks for stereo
audio signals using phase-modulation techniques,” IEEE Transactions on Signal
Processing, vol. 53, no. 2, pp. 806–815, Feb. 2005.
[45] H. M. A. Malik, R. Ansari, and A. A. Khokhar, “Robust data hiding in audio
using allpass filters,” IEEE Transactions on Audio, Speech, and Language Processing,
vol. 15, no. 4, pp. 1296–1304, May 2007.
[46] H. Matsuoka, Y. Nakashima, T. Yoshimura, and T. Kawahara, “Acoustic OFDM:
Embedding high bit-rate data in audio,” in Advances in Multimedia Modeling, ser.
Lecture Notes in Computer Science. Springer, 2008, vol. 0302-9743, pp. 498–507.
196
Bibliography
[47] X. Dong, M. Bocko, and Z. Ignjatovic, “Data hiding via phase manipulation of
audio signals,” in Proceedings of the IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), vol. 5, Montreal, Canada, May 2004, pp.
377–380.
[48] P. Y. Liew and M. A. Armand, “Inaudible watermarking via phase manipulation
of random frequencies,” Multimedia Tools and Applications, vol. 35, no. 3, pp.
357–377, Dec. 2007.
[49] H. Pobloth and W. Kleijn, “On phase perception in speech,” in Proceedings of the
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
vol. 1, Phoenix, AZ, USA, Mar. 1999, pp. 29–32.
[50] H. Pobloth, “Perceptual and squared error aspects in speech and audio coding,”
Ph.D. dissertation, Royal Institute of Technology (KTH), Stockholm, Sweden,
Nov. 2004.
[51] S. Haykin, Communication Systems, 4th ed. Wiley, 2001.
[52] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley,
2006.
[53] H. Fastl and E. Zwicker, Psychoacoustics, 3rd ed. Springer, 2006.
[54] S. van de Par, A. Kohlrausch, R. Heusdens, J. Jensen, and S. H. Jensen, “A
perceptual model for sinusoidal audio coding based on spectral integration,”
EURASIP Journal on Applied Signal Processing, vol. 2005, no. 9, pp. 1292–1304,
2005.
[55] M. D. Swanson, B. Zhu, A. H. Tewfik, and L. Boney, “Robust audio watermarking
using perceptual masking,” Signal Processing, vol. 66, no. 3, pp. 337–355, 1998.
[56] A. Sagi and D. Malah, “Data embedding in speech signals using perceptual mask-
ing,” in Proceedings of the 12th European Signal Processing Conference (EUSIPCO’04),
Vienna, Austria, Sep. 2004.
[57] S. Chen and H. Leung, “Concurrent data transmission through analog speech
channel using data hiding,” IEEE Signal Processing Letters, vol. 12, no. 8, pp.
581–584, 2005.
[58] W. B. Kleijn and K. K. Paliwal, Eds., Speech Coding and Synthesis. Elsevier, 1995.
[59] T. Painter and A. Spanias, “Perceptual coding of digital audio,” Proceedings of the
IEEE, vol. 88, no. 4, pp. 451–515, Apr. 2000.
[60] X. Huang, A. Acero, and H.-W. Hon, Eds., Spoken Language Processing: A Guide to
Theory, Algorithm and System Development. Pearson, 2001.
[61] R. C. Beattie, A. Zentil, and D. A. Svihovec, “Effects of white noise on the most
comfortable level for speech with normal listeners,” Journal of Auditory Research,
vol. 22, no. 1, pp. 71–76, 1982.
197
Bibliography
[62] H. J. Larson and B. O. Shubert, Probabilistic models in engineering sciences. Wiley,
1979, vol. 1.
[63] N. M. Blachman., “Projection of a spherical distribution and its inversion,” IEEE
Transactions on Signal Processing, vol. 39, no. 11, pp. 2544–2547, Nov. 1991.
[64] R. E. Blahut, Principles and Practice of Information Theory. Addison-Wesley, 1987.
[65] J. G. Proakis and M. Salehi, Communication Systems Engineering, 2nd ed. Prentice-
Hall, 2002.
[66] J. Aldis and A. Burr, “The channel capacity of discrete time phase modulation in
AWGN,” IEEE Transactions on Information Theory, vol. 39, no. 1, pp. 184–185, Jan.
1993.
[67] G. Ungerboeck, “Channel coding with multilevel/phase signals,” IEEE Transac-
tions on Information Theory, vol. 28, no. 1, pp. 55–67, Jan. 1982.
[68] R. Padovani, “Signal space channel coding: Codes for multilevel/phase/fre-
quency signals,” Ph.D. dissertation, University of Massachusetts (Amherst), 1985.
[69] D.-S. Kim, “Perceptual phase quantization of speech,” IEEE Transactions on Speech
and Audio Processing, vol. 11, no. 4, pp. 355–364, Jul. 2003.
[70] K. K. Paliwal and L. Alsteris, “Usefulness of phase spectrum in human speech
perception,” in Proceedings of the International Conference on Spoken Language
Processing (INTERSPEECH), Geneva, Switzerland, Sep. 2003, pp. 2117–2119.
[71] E. Moulines and J. Laroche, “Non-parametric techniques for pitch-scale and
time-scale modification of speech,” Speech Communication, vol. 16, no. 2, pp.
175–205, February 1995.
[72] H. S. Malvar, “Lapped transforms for efficient transform/subband coding,” IEEE
Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 6, pp. 969–978,
Jun. 1990.
[73] H. S. Malvar, “Extended lapped transforms: properties, applications, and fast
algorithms,” IEEE Transactions on Signal Processing, vol. 40, no. 11, pp. 2703–2714,
Nov. 1992.
[74] H. S. Malvar, Signal Processing with Lapped Transforms. Artech House, 1992.
[75] S. Shlien, “The modulated lapped transform, its time-varying forms, and its
applications to audio coding standards,” IEEE Transactions on Speec and Audio
Processing, vol. 5, no. 4, July 1997.
[76] Y.-P. Lin and P. Vaidyanathan, “A Kaiser window approach for the design of
prototype filters of cosine modulated filterbanks,” IEEE Signal Processing Letters,
vol. 5, no. 6, pp. 132–134, Jun. 1998.
198
Bibliography
[77] P. Boersma and D. Weenink. (2008) PRAAT: doing phonetics by computer (version
5.0.20). [Online]. Available: http://www.praat.org/
[78] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren,
and V. Zue. (1993) TIMIT acoustic-phonetic continuous speech corpus. CD-ROM.
Linguistic Data Consortium. [Online]. Available: http://www.ldc.upenn.edu/
Catalog/CatalogEntry.jsp?catalogId=LDC93S1
[79] P. Vary and R. Martin, Digital Speech Transmission. Wiley, 2006.
[80] P. Boersma, “Accurate short-term analysis of the fundamental frequency and
the harmonics-to-noise ratio of a sampled sound,” Proceedings of the Institute of
Phonetic Sciences, vol. 17, pp. 97–110, 1993.
[81] A. J. Jerri, “The Shannon sampling theorem—its various extensions and applica-
tions: A tutorial review,” Proceedings of the IEEE, vol. 65, no. 11, pp. 1565–1596,
Nov. 1977.
[82] R. G. Vaughan, N. L. Scott, N. L. Scott, D. R. White, and D. R. White, “The theory
of bandpass sampling,” IEEE Transactions on Signal Processing, vol. 39, no. 9, pp.
1973–1984, Sep. 1991.
[83] Y. Wu, “A proof on the minimum and permissible sampling rates for the first-
order sampling of bandpass signals,” Digital Signal Processing, vol. 17, no. 4, pp.
848–854, Mar. 2007.
[84] E. Ayanoglu, N. R. Dagdeviren, G. D. Golden, and J. E. Mazo, “An equalizer
design technique for the PCM modem: A new modem for the digital public
switched network,” IEEE Transactions on Communications, vol. 46, no. 6, pp.
763–774, Jun. 1998.
[85] J. Yen, “On nonuniform sampling of bandwidth-limited signals,” IRE Transactions
on Circuit Theory, vol. 3, no. 4, pp. 251–257, Dec. 1956.
[86] J. R. Barry, E. A. Lee, and D. G. Messerschmitt, Digital Communication, 3rd ed.
Springer, 2004.
[87] S. Haykin, Adaptive Filter Theory, 4th ed. Prentice-Hall, 2002.
[88] A. Kohlenberg, “Exact interpolation of band-limited functions,” Journal of Applied
Physics, vol. 24, no. 12, pp. 1432–1436, Dec. 1953.
[89] ITU-T Recommendation P.48: Specification for an Intermediate Reference System, Inter-
national Telecommunication Union, Nov. 1988.
[90] ITU-T Recommendation G.191: Software Tools for Speech and Audio Coding Standard-
ization, International Telecommunication Union, Sep. 2005.
[91] S. Verdú and T. S. Han, “A general formula for channel capacity,” IEEE Transac-
tions on Information Theory, vol. 40, no. 4, pp. 1147–1157, Jul. 1994.
199
Bibliography
[92] ITU-T Recommendation P.862.x: Perceptual Evaluation of Speech Quality (PESQ),
International Telecommunication Union, Oct. 2007.
[93] K. Hofbauer. (2008) Demonstration files: Original and watermarked ATC speech.
[Online]. Available: http://www.spsc.tugraz.at/people/hofbauer/wmdemo1/
[94] J. Makhoul, “Spectral linear prediction: Properties and applications,” IEEE
Transactions on Acoustics, Speech, and Signal Processing, vol. 23, no. 3, pp. 283–296,
Jun. 1975.
[95] M. Nilsson, B. Resch, M.-Y. Kim, and W. B. Kleijn, “A canonical representation
of speech,” in Proceedings of the IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), vol. 4, Honolulu, Hawaii, USA, Apr. 2007, pp.
849–852.
[96] F. Gardner and W. Lindsey, “Guest editorial: Special issue on synchronization,”
IEEE Transactions on Communications, vol. 28, no. 8, pp. 1105–1106, Aug. 1980.
[97] Airborne VHF Communications Transceiver, Aeronautical Radio Inc. (ARINC) Char-
acteristic 716-11, Jun. 2003.
[98] L. E. Franks, “Carrier and bit synchronization in data communication–a tutorial
review,” IEEE Transactions on Communications, vol. 28, no. 8, pp. 1107–1121, Aug.
1980.
[99] P. Moulin and R. Koetter, “Data-hiding codes,” Proceedings of the IEEE, vol. 93,
no. 12, pp. 2083–2126, Dec. 2005.
[100] G. Sharma and D. J. Coumou, “Watermark synchronization: Perspectives and a
new paradigm,” in Proceedings of the IEEE Conference on Information Sciences and
Systems, Princeton, NJ, USA, Mar. 2006, pp. 1182–1187.
[101] D. J. Coumou and G. Sharma, “Watermark synchronization for feature-based
embedding: Application to speech,” in Proceedings of the IEEE International
Conference on Multimedia and Expo, Toronto, Canada, Jul. 2006, pp. 849–852.
[102] M. Schlauweg, D. Pröfrock, and E. Müller, “Soft feature-based watermark decod-
ing with insertion/deletion correction,” in Information Hiding, ser. Lecture Notes
in Computer Science. Springer, 2007, vol. 4567/2008, pp. 237–251.
[103] R. A. Scholtz, “Frame synchronization techniques,” IEEE Transactions on Commu-
nications, vol. 28, no. 8, pp. 1204–1213, Aug. 1980.
[104] H. Meyr and G. Ascheid, Synchronization in Digital Communications. Wiley, 1990,
vol. 1.
[105] F. M. Gardner, Phaselock Techniques, 3rd ed. Wiley, 2005.
[106] Y. R. Shayan and T. Le-Ngoc, “All digital phase-locked loop: concepts, design
and applications,” IEE Proceedings F: Radar and Signal Processing, vol. 136, no. 1,
pp. 53–56, Feb. 1989.
200
Bibliography
[107] B. Kim, “Dual-loop DPLL gear-shifting algorithm for fast synchronization,” IEEE
Transactions on Circuits and Systems—Part II: Analog and Digital Signal Processing,
vol. 44, no. 7, pp. 577–586, Jul. 1997.
[108] P. A. V. Hall and G. R. Dowling, “Approximate string matching,” ACM Computing
Surveys, vol. 12, no. 4, pp. 381–402, Dec. 1980.
[109] G. Ungerboeck, “Fractional tap-spacing equalizer and consequences for clock
recovery in data modems,” IEEE Transactions on Communications, vol. 24, no. 8,
pp. 856–864, Aug. 1976.
[110] D. Artman, S. Chari, and R. Gooch, “Joint equalization and timing recovery in a
fractionally-spaced equalizer,” in Proceedings of the Asilomar Conference on Signals,
Systems and Computers, vol. 1, Pacific Grove, CA, USA, Oct. 1992, pp. 25–29.
[111] D. van Roosbroek, EATMP Communications Strategy - Volume 2 - Technical Descrip-
tion, 6th ed. EUROCONTROL EATMP Infocentre, 2006.
[112] R. Kerczewski, J. Budinger, and T. Gilbert, “Technology assessment results of
the Eurocontrol/FAA future communications study,” in Proceedings of the IEEE
Aerospace Conference, Big Sky, MT, USA, Mar. 2008, pp. 1–13.
[113] Manual of Radiotelephony, International Civil Aviation Organization Doc 9432
(AN/925), Rev. 3, 2006.
[114] H. Hering, M. Hagmüller, and G. Kubin, “Safety and security increase for air
traffic management through unnoticeable watermark aircraft identification tag
transmitted with the VHF voice communication,” in Proceedings of the 22nd Digital
Avionics Systems Conference (DASC 2003), Indianapolis, USA, 2003.
[115] H. Hering and K. Hofbauer, “System architecture of the onboard aircraft identi-
fication tag (AIT) system,” Eurocontrol Experimental Centre, EEC Note 04/05,
2005.
[116] M. Celiktin and E. Petre, AIT Initial Feasibility Study (D1–D5). EUROCONTROL
EATMP Infocentre, 2006.
[117] J. B. Metzger, A. Stutz, and B. Kauffman, ARINC Voice Services Operating Procedures
Handbook. Aeronautical Radio Inc. (ARINC), Apr. 2007, rev. S.
[118] M. Sajatovic, J. Prinz, and A. Kroepfl, “Increasing the safety of the ATC voice com-
munications by using in-band messaging,” in Proceedings of the Digital Avionics
Systems Conference (DASC), vol. 1, Indianapolis, IN, USA, Oct. 2003, pp. 4.E.1–8.
[119] M. Hagmüller and G. Kubin, “Speech watermarking for air traffic control,”
Eurocontrol Experimental Centre, EEC Note 05/05, 2005.
[120] K. Hofbauer and S. Petrik, “ATCOSIM air traffic control simulation speech
corpus,” Graz University of Technology, Tech. Rep. TUG-SPSC-2007-11, May
2008. [Online]. Available: http://www.spsc.tugraz.at/ATCOSIM
201
Bibliography
[121] J. J. Godfrey. (1994) Air traffic control complete. CD-ROM. Linguistic Data Con-
sortium. [Online]. Available: http://www.ldc.upenn.edu/Catalog/CatalogEntry.
jsp?catalogId=LDC94S14A
[122] J. Segura, T. Ehrette, A. Potamianos, D. Fohr, I. Illina, P.-A. Breton, V. Clot,
R. Gemello, M. Matassoni, and P. Maragos. (2007) The HIWIRE database, a noisy
and non-native english speech corpus for cockpit communication. DVD-ROM.
[Online]. Available: http://www.hiwire.org/
[123] S. Pigeon, W. Shen, and D. van Leeuwen, “Design and characterization of the non-
native military air traffic communications database (nnMATC),” in Proceedings
of the International Conference on Spoken Language Processing (INTERSPEECH),
Antwerp, Belgium, 2007.
[124] L. Graglia, B. Favennec, and A. Arnoux, “Vocalise: Assessing the impact of data
link technology on the R/T channel,” in Proceedings of the Digital Avionics Systems
Conference (DASC), Washington D.C., USA, 2005.
[125] C. Arnoux, L. Graglia, and D. Pavet. (2005) VOCALISE - the today use of VHF as
a media for pilots/controllers communications. [Online]. Available: http://www.
cena.aviation-civile.gouv.fr/divisions/ICS/projets/vocalise/index_en.html
[126] L. Benarousse, E. Geoffrois, J. Grieco, R. Series, H. Steeneken, H. Stumpf, C. Swail,
and D. Thiel, “The NATO native and non-native (N4) speech corpus,” in Pro-
ceedings of the RTO Workshop on Multilingual Speech and Language Processing, no.
RTO-MP-066, Aalborg, Denmark, Sep. 2001, pp. 1.1–1.3.
[127] K. Maeda, S. Bird, X. Ma, and H. Lee, “Creating annotation tools with the
annotation graph toolkit,” in Proceedings of the International Conference on Language
Resources and Evaluation (LREC), Paris, France, 2002.
[128] C. Mallett. (2008) Autohotkey - free mouse and keyboard macro program with
hotkeys and autotext. Computer program. [Online]. Available: http://www.
autohotkey.com/
[129] Designators for Aircraft Operating Agencies, Aeronautical Authorities and Services,
International Civil Aviation Organization Nr. 8585/93, 1994.
[130] F. Schiel and C. Draxler, Production and Validation of Speech Corpora. Bastard
Verlag, 2003.
[131] A. Campos Domínguez, “Pre-processing of speech signals for noisy and band-
limited channels,” Master’s thesis, Royal Institute of Technology (KTH), Stock-
holm, Sweden, Mar. 2009, unpublished.
[132] M. Pätzold, Mobile Fading Channels. Modelling, Analysis and Simulation. Wiley,
2002.
[133] J. D. Parsons, The Mobile Radio Propagation Channel. Wiley, 2000.
202
Bibliography
[134] BAE Systems Operations, “Literature review on terrestrial broadband VHF
radio channel models,” B-VHF, Deliverable D-15, 2005. [Online]. Available:
http:///www.B-VHF.org
[135] D. W. Allan, N. Ashby, and C. C. Hodge, “The science of timekeeping,” Agilent
Technologies, Application Note AN 1289, 2000.
[136] MicroTrack 24/96 User Guide, M-Audio, 2005. [Online]. Available: http://www.
m-audio.com
[137] GPS 35-LVS Technical Specification, Garmin, 2000. [Online]. Available: http://
www.garmin.com
[138] eTrex Legend Owner’s Manual and Reference Guide, Garmin, 2005. [Online]. Avail-
able: http://www.garmin.com
[139] B. Sklar, Digital Communications, 2nd ed. Prentice-Hall, 2001.
[140] D. Foster. (2006) GPX: the GPS exchange format. [Online]. Available: http://
www.topografix.com/gpx.asp
[141] W. S. Cleveland and S. J. Devlin, “Locally-weighted regression: An approach to
regression analysis by local fitting,” Journal of the American Statistical Association,
vol. 83, no. 403, pp. 596–610, 1988.
[142] M. D. Felder, J. C. Mason, and B. L. Evans, “Efficient dual-tone multifrequency de-
tection using the nonuniform discrete Fourier transform,” IEEE Signal Processing
Letters, vol. 5, no. 7, 1998.
[143] J. O. Smith, Mathematics of the Discrete Fourier Transform (DFT) with Audio Applica-
tions, 2nd ed. W3K Publishing, 2007.
[144] MATLAB Version 7.4.0.287 (R2007a), The MathWorks, 2007.
[145] M. Gruber, “Channel estimation for the voice radio – Basics for a measurements-
based aeronautical voice radio channel model,” Master’s thesis, FH JOANNEUM
- University of Applied Sciences, Graz, Austria, 2007.
[146] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing,
2nd ed. Prentice Hall, 1999.
[147] M. J. Levin, “Optimum estimation of impulse response in the presence of noise,”
IRE Transactions on Circuit Theory, vol. 7, pp. 50–56, 1960.
[148] D. D. Rife and J. Vanderkooy, “Transfer-function measurement with maximum-
length sequences,” Journal of the Audio Engineering Society, vol. 37, no. 6, pp.
419–444, 1989.
[149] J. Schwaner. (1995) Alternator induced radio noise. Sacramento Sky Ranch Inc.
[Online]. Available: http://www.sacskyranch.com/altnoise.htm
203
Bibliography
[150] Series 200 Single-Channel Communication System, Rohde & Schwarz, Munich,
Germany, data sheet.
[151] C. Bagwell. (2008) SoX - Sound eXchange (version 13.0.0). [Online]. Available:
http://sox.sourceforge.net/
[152] U. Zölzer, DAFX: Digital Audio Effects. Wiley, 2002.
[153] D. Mazzoni and R. Dannenberg. (2008) Audacity: The free, cross-platform sound
editor (version1.3.6). [Online]. Available: http://audacity.sourceforge.net/
[154] Core Audio Overview, Apple, Jan. 2007. [Online]. Available: http://developer.
apple.com/documentation/MusicAudio/Conceptual/CoreAudioOverview/
[155] S. Harris. (2006) Steve Harris’ LADSPA plugins (swh-plugins-0.4.15). [Online].
Available: http://plugin.org.uk/
[156] Designators for Aircraft Operating Agencies, Aeronautical Authorities and Services,
International Civil Aviation Organization Nr. 8585/138, 2006.
[157] Designators for Aircraft Operating Agencies, Aeronautical Authorities and Services,
International Civil Aviation Organization Nr. 8585/107, 1998.
[158] R. Lane, R. Deransy, and D. Seeger, “3rd continental RVSM real-time simulation,”
Eurocontrol Experimental Centre, EEC Report 315, 1997.
[159] Digital Aeronautical Flight Information File (DAFIF), 6th ed. National Geospatial-
Intelligence Agency, Oct. 2006, no. 0610, electronic database.
[160] Location Indicators, International Civil Aviation Organization Nr. 7910/122, 2006.
[161] D. Seeger and H. O’Connor, “S08 ANT-RVSM 3rd continental real-time simula-
tion pilot handbook,” Dec. 1996, Eurocontrol internal document.
[162] E. Haas. Communications systems. [Online]. Available: http://www.kn-s.dlr.de/
People/Haas/
[163] E. Haas, “Aeronautical channel modeling,” IEEE Transactions on Vehicular Technol-
ogy, vol. 51, no. 2, pp. 254–264, 2002.
[164] MATLAB Communications Toolbox, 3rd ed., The MathWorks, 2004.
[165] K. Metzger. The generic channel simulator. [Online]. Available: http://www.eecs.
umich.edu/genchansim/
[166] K. Hofbauer. The generic channel simulator. [Online]. Available: http://www.
spsc.tugraz.at/people/hofbauer/gcs/
[167] M. Dickreiter, Handbuch der Tonstudiotechnik. KG Saur, 1997, vol. 1.
204

ii

iv

In the aeronautical application the scheme is highly robust and outperforms current state-of-the-art speech watermarking methods. a watermarking system could overcome existing limitations. This thesis investigates the embedding of digital side information. security and efficiency in ATC. Based on this finding. legacy system enhancement. Applied to the ATC voice radio. In contrast to conventional watermarking methods. The theoretical results show that the performance could be increased even further by incorporating perceptual masking. Our implementation embeds the watermark in a subband of narrowband speech at a watermark bit rate of the order of magnitude of 500 bit/s. both of which were produced specifically for this purpose and constitute contributions on their own. We show that the resulting theoretical watermark capacity far exceeds the capacity of conventional watermarking channels. speech watermarking. The adaptive equalization based watermark detector is able to recover the watermark data in the presence of channels with non-linear phase. and ultimately increase safety. so called watermarks. air traffic control. this field of application allows embedding of the data in perceptually irrelevant signal components. The system is evaluated experimentally using an ATC speech corpus and radio channel measurement results. into speech signals. watermark capacity. desynchronization and additive noise. we present a general purpose blind speech watermarking algorithm that embeds watermark data in the phase of non-voiced speech segments by replacing the excitation signal of an autoregressive signal representation.Abstract Air traffic control (ATC) voice radio communication between aircraft pilots and controllers is subject to technical and functional constraints owing to the legacy radio system currently in use worldwide. analog radio channel. digital watermarking. time-variant bandpass filtering. voice radio. v . Keywords: Information hiding. watermark synchronization. amplitude gain modulation.

vi .

Diese beiden Datensammlungen stellen jeweils einen eigenständigen Beitrag dar. Basierend auf dieser Erkenntnis wird ein universelles blindes Wasserzeichen-Verfahren vorgestellt. indem das Anregungssignal eines autoregressiven Signalmodells durch ein Wasserzeichen-Signal ersetzt wird. Die Implementierung bettet die Seiteninformationen mit einer Bitrate von circa 500 bit/s in einem Teilband des schmalbandigen Sprachsignals ein. Wasserzeichensynchronisierung. Es zeigt sich. Die vorliegenden theoretischen Ergebnisse zeigen. zeitvarianter Bandpass-Filterung. Einbettung digitaler Seiteninformationen. das die Daten in die Phase von stimmlosen Sprachlauten einbettet. nicht-linearer Phase. Der Wasserzeichen-Detektor basiert auf adaptiver Entzerrung und detektiert die eingebetteten Daten sogar bei einer Übertragung des Signals über Kanäle mit zeitvarianter Kanalverstärkung. Im Gegensatz zu herkömmlichen Wasserzeichen-Verfahren erlaubt dieser Einsatzbereich ein Einbetten der Daten in Signalanteilen. dass dies zu einer erheblichen Steigerung der Kapazität des verdeckten Datenkanals führt. Desynchronisierung und additivem Rauschen. dass durch eine Berücksichtigung von gehörbezogenen Maskierungsmodellen die Leistungsfähigkeit weiter gesteigert werden kann. die sich aus dem althergebrachten und weltweit eingesetzten analogen Sprechfunk-System ergeben. Das Verfahren ist hochrobust und übertrifft in der Flugfunk-Anwendung leistungsmäßig die modernsten existierenden Wasserzeichen-Verfahren. Sprechfunk. Diese Arbeit beschäftigt sich mit der Einbettung digitaler Seiteninformation. Stichwörter: Digitale Wasserzeichen. in Sprachsignale. so genannter Wasserzeichen. Wasserzeichenkanalkapazität. die vom menschlichen Gehör nicht wahrgenommen werden können. Angewandt auf den analogen Flugfunk könnte ein solches WasserzeichenSystem bestehende Einschränkungen aufheben und letztendlich die Effizienz sowie die aktive und passive Sicherheit in der Flugsicherung steigern. Sprachwasserzeichen. analoger Sprechfunkkanal vii .Zusammenfassung Die Sprechfunk-Kommunikation zwischen Piloten und Fluglotsen unterliegt technischen und funktionalen Einschränkungen. Flugfunk. Flugsicherung. Zur experimentellen Validierung des Systems wurden Flugfunkkanal-Messungen durchgeführt und ein für die Flugsicherung spezifischer Sprachkorpus erstellt.

viii .

Your guidance not only had a major impact on the outcome of this work. namely Tuan. as much as your hospitality. Sarwar. The names are too many to mention. Many thanks also to Prof. Mario and Eri at INO. the application of watermarking to ATC. Erhard and many more at SPSC. Daniel. and the former INO team at Eurocontrol. My deepest thanks to my parents Konrad and Annemarie who always encouraged and supported me throughout the many years of my studies. I am equally indebted to my advisors Prof. for the meaningful discussions and for creating an enjoyable working environment. my warmest thanks to my beloved girlfriend Agata for your incredible love and support. and without your initiative and persistence this entire project would have never been possible. Martin. and I had the chance to work at and experience three different laboratories. but I am happy to give my warm thanks to all my former office mates. Finally. professionalism and hard-working nature. Bastiaan Kleijn. the SIP group at KTH Stockholm. your dedication. Your keen interest and belief in my work and your continuing support and encouragement has always meant a lot to me. is a brainchild of yours. Your knowledge. flexibility.Acknowledgments Many people have contributed to my journey towards this dissertation. and Ermin and Gustav at SIP. The journey towards my degree had also been a journey throughout Europe. The basic idea of this thesis. ix . Marc Bourgois and Martina Jürgens of the former INO team at Eurocontrol for the continuing support. I am most deeply indebted to my mentor Horst Hering at Eurocontrol. Gernot Kubin and Prof. A deserved thanks is given to my colleagues of the SPSC lab at TU Graz. It is with great pleasure that I take this opportunity to acknowledge the support I have received. creativity and keen intuitive insights. but also contributed greatly to my intellectual and personal growth. patience and placidness have truly impressed and inspired me. Vu Duong.

x .

. . . Thesis Outline . . . Introduction . . . . . . 2. . . . . . . . . . . . . . . Experimental Comparison . . . . . . . . . . . . . . . . . . . . 1. . . . . List of Publications . . . . . . . . . Digital Watermarking . . . . . . . 4. . . . . . . . . . . . . . Theory . . . . . . . . . . . . . . Watermarking Based on Ideal Costa Scheme 3. . .Contents 1. . . . Watermark Capacity in Speech 3. . . . . . . . . . . . Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 4. . . 9 11 11 12 14 14 17 18 18 19 23 44 46 47 48 55 59 63 67 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Watermarking Based on Phase Modulation . . .1. . . . . . . . . . . . . . . . . . . Conclusions . . . . . . .3. . . . .5. . . . . 4. 1 1 3 5 6 I. Related Work—General Watermarking 2. . . . . . . . . . .3. . .4. . . . xi . . . . . . . . 1. . . . . . . . . . . Contributions . . . . . . . . . . . . . . Introduction 1. . . . . . . . . . . . . . . . . . Motivation . . . . . . . . . . . . . . . .6. . . . .4. . . . . . . . . . . .3. . . . . .2. . Speech Watermarking . . . . . . . . . . . . . . Watermarking Based on Auditory Masking . . . . . . . . . . 3. . . . . . . . . Discussion . . . . . . . . . . Our Approach . . . . . . . . . . . .2. . . . . . . 4. . . . . . . . . . . 3. . . . . . . . . . . Implementation . . . . . . . . . 1. . . . . . .4. . . . . . . . . . . .1. . . . . . . . . . . . Related Work—Speech Watermarking . . . . 3. . . . . . . . Watermarking Non-Voiced Speech 4. . . . . . . . . . . . . . . . 3. . . . . . . . . Conclusions . .2. . . . . . . . . 2. . . . . . . . .3. . . .2. . .5. . . . . . .4. . . . . . . . . .1. . . . . . . . 3. . . . . .1. . . . . . . . . . . . 4. . . . . . . . . . . . . Background and Related Work 2. . . . . . .

. . . . . . . . . . . . . 69 70 73 78 83 . Voice Radio Channel Measurements 8. Conclusions . Conclusions . . Gain Modulation Robustness . . . . . . . . . . . . . . . . Conclusions . . . . . . . . . . . . . . . . Problem Statement . . . . . . . . . . . . . . .4. . . . .5. . Desynchronization Robustness . . 5. . . .3. . . .2. . . . . . . . . . . . .2. . . . . . . . . . . Watermark Synchronization 5. . . . . . . . Introduction .5. Parameter Estimation Implementation 9. . . . . . . .2. . . Experimental Watermark Robustness Evaluation 10. . . 8. . . . . . . . . . Background and Related Work . . . . . . . . . 10. . . .5. . Experimental Results and Discussion 9. . . . . . . . . . . . . . . . . . . . . . . . . . . . 8. . . . . . . . . . . . . . . . . . 5. . . .1. . . . . . . . Speech Watermarking for Air Traffic Control 6. . . . . . . . . . . . . . . . . . . . . . . Experimental Results and Discussion 5. . . . .4. . . . . . . . . 6. . . . . . Air Traffic Control 6. . . . . . . . . 85 87 88 89 91 95 96 97 102 105 107 111 112 112 119 121 122 125 126 128 129 137 143 147 148 149 151 151 154 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Proposed Data and Channel Model . . . . . . . . . . . . . . . .3. . . . . . 7. . . 10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1. . . . Conclusions . . . . . Orthographic Transcription . . . . . . . .3. . . . . . . . . . . . . . . . . . . . ATCOSIM Recording and Processing . . Implementation . . . . . . . . . . . . . . . . Conducted Measurements . . . . . . . . . . . . Theory . . . . . . . . . . . . . . . . . . . . . . Conclusions 157 xii . . Introduction . . . . . . . . . . . . . . . 6. High-Level System Requirements . . . . . 9. . . . .Contents 5. . . . .4.3. . . . . . . . 8. . . . . . . . . . 11. . . . . . . .1. . .5. . . . . . The TUG-EEC-Channels Database . . . . . . . . . . . . . . .2. . . Validation and Distribution 7. . . . . . 10. . . . Filtering Robustness . . II. . . . . . . . . . . . . .1. . . . . 7. . . . . . . . . . . . . . . . . Noise Robustness . . . . . . . . . . . .2. . . . . . . . . . . . . . . . . . . . . . . ATCOSIM Structure. 7. . . . . . . . 10. . . . . . . . . . . . . . . Data Model and Parameter Estimation 9.4. . . Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8. . . . . . . . . . . . 8. . . . . . Introduction . . . . . . . 9. . . . .3. . 9. . .3. . . . . . . . . . . . . . . . . . . .1. . . . . . . . ATC Simulation Speech Corpus 7. . Measurement System for the Aeronautical Voice Radio Channel . . . . . 7. . .2. . . . . . . . . . . .4. .1. . . . . . . . . . . . . . . . . . . 10. .

. . Amendments to the Transcription Format . . Aeronautical Voice Channel Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . Introduction . .2. .2. .1. . . . . . . Complementary Figures Bibliography . . . . . . 169 170 170 175 179 184 187 189 193 xiii . . . .1. . . . . .5. Radio Channel Modeling . . . . . . . . . . . . 165 B. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B. . . Transcription Format . . Aeronautical Radio Channel Modeling and Simulation B. . . . . . . . . . . . . . . . . . . . . .Contents Appendix 161 A. . . . . . . . . . . . . . 161 A. . . C. . Basic Concepts . . . . B. . . . . . . B. . . . . . . . . . . . . . . . . . . . . . . . .3. . . . . . . . . . . . . . . . . . . . . . . B. . . . . . . . . . B. . .6. . . ATCOSIM Transcription Format Specification 161 A. . Conclusions . .4. . . . . . . . . . Discussion . . . . . . . . . . .

Contents xiv .

. . . 5. . Quantization-based watermarking as an additive process. . . . . . . . .2. . . . . 3. . . . . Synchronization system based on spectral line method.1. 4. . . Masking threshold power to signal power ratio for a single utterance.1. . . The joint density f Xa . . 4. 3.11. . . . . . . . Watermark detector including synchronization. . .6. . . . Magnitude response of the bandpass. . . 4.6. . . . .1. 3. . . .List of Figures 1. . 5. . Quantization-based watermarking as sphere packing problem. . . . 3. Pulse shapes for zero ISI between data-carrying samples. . . . Speech watermarking system for the air/ground voice radio. . . 3. 3. . . 3. . . . Comparison of derived watermark capacities in speech. . 3. . 3. . . 2.1.5. 2. . . Randomization of the phase of masked DFT coefficients. . . . . . Frame and block structure of the embedded signal. . . . . Speech signal with original and randomized ELT coefficients. .7. . . . . . . . 4. . . 5.3. 3. . System robustness in the presence of various channel attacks. . . . . Comparison of the derived phase modulation watermark capacities. . . . 4. . . .4. . . .1. . . . . . . . . . . .2. . . . b) integrated over an interval ∆a. . . . .4. . . . . . . . . . . Masking threshold for a signal consisting of three sinusoids. . . . . . . . . . . Difference between the derived phase modulation watermark capacities. . . . . . . . . Overview of different watermarking principles. . . . . . . . .7. . . . . . . . . . .8. . . equalization and watermark detection. . . . . . . 4. . . . . Timing phase synchronization sensitivity. . .2. .9. .2. 4. . .5.3. . . . . . 2 12 13 19 20 21 22 28 38 38 40 41 43 45 49 51 54 58 58 61 62 74 75 79 xv . . . . Generation of a passband watermark signal for a telephony channel. . . . .3. IRS and aeronautical channel filter. 3. . . . . . . Generic watermarking model. . . . . Second-order all digital phase-locked loop. . .Xb ( a. . . . . . . Speech signal with original and randomized STFT phase coefficients. Watermark embedding in non-voiced speech. .10. . . .

. . . . . . . . . 123 Visualization of the aircraft track based on the recorded GPS data.8. . . . . 7. Phase estimation error and bit error ratio for various types of node offsets. . 8. . . .4. . . 9. . . Estimated frequency responses. . . Overview of estimation procedure.2. . 8. 8. 117 General aviation aircraft and onboard transceiver used for measurements. .12. . 10. . . . . . 157 xvi . f 9. 9. . . . . . . .4. . . .7. . . . . . . DC component of the received signal recording. . . ˆ 9. .1. . . . . . . . . DFT magnitude spectrum of the estimation error signal. . Embedding scheme robustness against sinusoidal gain modulation. . 5. A basic audio channel measurement system. . .6. . . . . .1. . 8. . . 10. . 5. . . . . . . . . . . . . . . . . Speech watermarking system for the aeronautical voice radio. . . . . . . . . . . 11. . . . . 123 127 128 129 131 132 135 136 137 137 139 141 142 144 149 150 150 152 153 9. . . . . . . 9. . . . 103 8. .119 Ground-based transceivers used for measurements.1. . . . . . . . . 116 Overview and circuit diagram of measurement system and interface box. 112 Overview of the proposed baseband channel measurement system. . . . . .13. . 8. . . 8. Two periodicity measures as a function of the resampling factor f2 . Implementation of the gain normalization.1. . . . Relationship among the different chapters of this thesis. . . . . . . . . 9. . . . . . . . A short speech segment with push-to-talk signal. . . . . . . . . 80 81 82 82 83 91 92 93 6.List of Figures 5. . . . . . . . . . 10. . . Signal and noise estimation errors. . .6. LTI filter plus noise channel model. .8.5. . 120 Frequency response of the static back-to-back voice radio channel. . . . . . . . . .5. 101 7. . . . Estimation error to output ratio Ey . . . . . 10. . . . . . . . . . . . . . . Research focus within the field of data communications.3. . 115 Complete voice channel measurement system on-board an aircraft. .2. 8. . . .1. . . . .9. Comparison between ideal and implemented synchronization. . . . . . . 10. DFT magnitude spectrum of the time-variant gain g. . .7. . . . . . . Frame grid detection using short signal segments. Identification of the transmitting aircraft. . . .9. . Overall system robustness in the presence of a timing phase offset. . .11. . . .3. .6. . . . Robustness of active frame detection. . . 0 9. . . Watermark embedding scheme robustness against AWGN. . . .4. . 5.3. . . . 8. . . . . . . .7. . 5. . . . . . . . System robustness in the presence of various transmission channel filters. . . . 9. . .2.4. . . . . . . . . . .1. . . . Proposed data and channel model. . .2. 9. . .3. . . . . . . . Screen-shot of the transcription tool TableTrans. . .10. Separation between voice radio channel model and measurement errors. . . . . . . 9. . . . . . . System robustness in the presence of gain modulation and gain control. . . . . .5. . 100 7. 6. . 122 Evolution of different measured frequency bins over time. Time-variation of the estimated impulse responses. .5. 9.3. Enhanced robustness and intelligibility by dynamic range compression. . . . . . . . . . . 6. . . . . . . . . . . .2. . . . Control room and controller working position (recording site). . . . . . . . . . . 9. . . . . . . . . .8. . . . . . . . .

. . . .1. . . .2. . . . B. . . . . .List of Figures B. 190 C. . Received signal using reference channel model with Gaussian PSD. . . .2. . . . .1. . B. . . . Tapped delay line channel model. . . .4. Time-variant path gain of a flat fading channel. . . . . 171 172 174 178 180 182 182 183 184 185 185 185 187 C. . . . . B. . . . . . . . . . .7. . . . . . . . 191 xvii . B. . . . Noise robustness curves similar to Figure 10. . B. . . . . Received signal using reference channel model with Jakes PSD. . . . . . . . . . . . . .3. . Multiband compressor and limiter settings. . . . . .4. Probability density functions and PSDs for Rayleigh and Rice channels. . . B. . . . . B. .13. .5. . . . . . . . . . . . Discrete Gaussian Doppler PSD. . . . . . . . . . . . . .8. . . . . . . . . . B. Sinusoidal AM signal in equivalent complex baseband. . . . . .11. . .12. . . . .6. . Multipath propagation in an aeronautical radio scenario. . . . .9. . . . B. . . .10. . . . . . . . . . . . . . . . . B. Received signal at different processing stages. B. B. . Signal spectra of the baseband and the HF signal. Received signal using the Generic Channel Simulator. . . . . . . Received signal before and after an automatic gain control. . . . Power of line-of-sight and scattered components. . .

List of Figures xviii .

. .3. Causes for Capacity Gap and Attributed Rate Losses . . . . . . . . . . . . . . . . . Filter Estimation Results . . . . . . . . . . . 106 9. . . . Examples of Controller Utterance Transcriptions . . . . . . 135 9. . .1. . . . 140 9. Feature Comparison of Existing ATC-Related Speech Corpora . . .List of Tables 4. Filter and Noise Estimation Results . . . . . . . Transmission Channel and Hidden Data Channel Model . 50 65 66 7. . . . . . . . . . .1. . . 98 7.3. . . . . . .2. . .1. . . .3. 4. Gain Modulation Frequency and Amplitude . . . . . . . . . . . . . . . . . . 145 xix . . . 4. . . . . . . . . . . . . .2.2. . Comparison With Reported Performance of Other Schemes . . . . . . . . Key Figures of the ATCOSIM Corpus . . . . . . . . 104 7. . . . .

List of Tables xx .

data.1. difficult technological requirements (such as large system scales and safety-critical compatibility. It is commonly agreed that such features could improve safety. While technology has rapidly advanced and satellite communication facilitates broadband Internet access in the passenger cabin. Still today. including an enormous base of airborne and ground legacy systems worldwide. The ATC voice radio lacks basic features that are taken for granted in any modern communication system. Broadband Internet access and mobile wireless communication. long life-cycles for aeronautical systems. security and efficiency in air traffic control. international regulations. last but not least. Motivation The advances in modern information technology revolutionized the way we communicate. the air/ground voice radio for pilots and controller communication has not been changed since its introduction in the early forties of the last century. and it is not always possible to deploy or use state-of-the-art communication technologies. Since. This is due to a wide range of reasons.Chapter 1 Introduction 1. Unfortunately. despite its advantages. there is no foreseeable time frame for the 1 . or video transmission. availability and robustness requirements). be it for voice. a lack of common political willingness or determination. are ubiquitous and seemingly available anytime and anywhere. The world is full of legacies. this is not entirely correct. this communication is based on pre-World War II technology using amplitude-modulation analog radio transmissions on shared broadcast channels. The problem addressed in this thesis is the voice radio communication between aircraft and civil air traffic control (ATC) operators. such as caller identification or selective calling. The large technological gap between communication systems available in the cabin for passenger services and in the cockpit for air traffic control purposes is not likely to disappear in the foreseeable future. and.

1. the characteristics of the aeronautical radio channel. and the detection of the watermark data in the degraded signal. such as caller identification. In order to be legacy system compatible and to avoid the necessity of a separate transmitter. constitutes the core of this thesis. On the receiving end. Many potential features. 2 . or position reporting. call routing. such side information should not be noticeable also with unequipped legacy receivers. The transmission of the watermarked speech over a poor-quality aeronautical radio channel poses an additional difficulty due to narrow bandwidth. We deal with the general problem of imperceptibly embedding digital data into speech. It investigates the theoretical watermark capacity in speech signals and proposes a new speech watermarking method and a complete system implementation.1. often referred to as information (or data) hiding or digital watermarking. there is relatively little prior work in the field of speech watermarking. such side information should be transmitted in-band within the speech channel as shown in Figure 1. Solutions are provided for a wide range of sub-problems. which outperform the current state-of-the-art methods. fading and time-variance of the channel. video and audio watermarking. This thesis covers a wide range of aspects of the aforementioned problem. However. Watermarking speech signals with a high embedded data rate is a difficult problem due to the narrow bandwidth of speech. It considers the full communication chain from source to sink. it is attractive to retrofit or extend the current analog system with additional features. introduction of digital ATC voice radio communication. and one can draw from a large pool of knowledge gained in the context of speech coding.: Speech watermarking system for the air/ground voice radio.e. noise corruption. the production and perception of speech is well understood. including the particular characteristics of the input speech. authentication. selective calling.Chapter 1.e. Identification) ATC Display Figure 1. and consider the special case of embedding into ATC speech that is transmitted over an analog aeronautical radio channel. The first part of this thesis deals with speech watermarking in a general context. can conceptually be reduced to a transmission of digital side information that accompanies the transmitted analog speech signal. The imperceptible embedding of side information into another signal. the embedding of the watermark. Compared to image. Identification) Data (i. Introduction Aircraft System TA G Watermark Embedder TX ATC System RX Watermark Detector Data (i.

amplitude modulation and additive noise. A marker-based synchronization scheme robustly detects the location of the embedded watermark data without the occurrence of insertions or deletions. the synchronization between the synthesis and the analysis systems. 1. Starting from the general capacity of the ideal Costa scheme. In light of the potential application to analog aeronautical voice radio communication. Part I considers the general problem of embedding digital side information in speech signals. The adaptive equalization-based watermark detector not only compensates for the vocal tract filtering.2. but also recovers the watermark data in the presence of non-linear phase and bandpass filtering. robust and high-rate watermarking. and watermark synchronization.1. Thesis Outline The second part presents contributions to the domain of air traffic control. Different ways to derive approximate and exact expressions for the watermark capacity when modulating the signal’s DFT phase are presented and compared to an estimation of the capacity when modulating host signal frequency components that are perceptually masked. Thesis Outline This thesis is divided into two parts. After a short review of related work in Chapter 2. and an evaluation of the robustness of the proposed watermarking system in the aeronautical application. the following three chapters address the issues of watermark capacity estimation. Chapter 5 discusses different aspects of the synchronization between watermark embedder and detector. Chapter 4 presents a blind speech watermarking algorithm that embeds the watermark data in the phase of non-voiced speech by replacing the excitation signal of an autoregressive speech signal representation. it presents an empirical study of the aeronautical voice radio channel. which facilitates robustness against bandpass filtering channels.2. making the watermarking scheme highly robust. The watermark signal is embedded in a frequency subband. In particular. it shows that a large improvement in theoretical watermark capacity is possible if the application at hand allows watermarking in perceptually irrelevant signal components. Chapter 3 investigates the theoretical watermark capacity in speech given an additive white Gaussian noise transmission channel. as well as the 3 . a database of ATC operator speech. Parameters required for the experimental comparison of both methods are estimated from a database of ATC speech signals. which is presented in Chapter 7. we present experimental results for embedding a watermark in narrowband speech at a bit rate of 450 bit/s. We derive several sets of pulse shapes that prevent intersymbol interference and that allow the creation of the passband watermark signal by simple filtering. We examine the issues of timing recovery and bit synchronization.

Introduction data frame synchronization. Chapter 7 presents the ATCOSIM Air Traffic Control Simulation Speech corpus. a speech database of ATC operator speech. as well as listening tests within the ATC domain. the transmitted and received signals are recorded in parallel to GPS-based timing signals. Using a fixed frame grid and the embedding of preambles. ATC language studies. the subsequent chapters present a radio channel measurement system. For the purpose of synchronization. a DC offset and additive noise. To estimate the model parameters. ATCOSIM is a contribution to ATC-related speech corpora. Possible applications of the corpus are.5 %-points compared to ideal synchronization. The database includes orthographic transcriptions and additional information on speakers and recording sessions. Chapter 8 presents a system for measuring time-variant channel impulse responses and a database of such measurements for the aeronautical voice radio channel. A collection of recordings of MLS transmissions is generated during flights with a general aviation aircraft. For the simpler linear prediction-based detector we present a timing recovery mechanism based on the spectral line method which achieves near-optimal performance.Chapter 1. The measurements cover a wide range of typical flight situations as well as static back-to-back calibrations. The model is derived from various effects that can be observed in the database. Bit synchronization and synthesis/analysis synchronization are not an issue when using the adaptive equalization-based watermark detector of Chapter 4. we compare six well-established FIR filter identification techniques and conclude that best results are obtained using the method of Least 4 . Part II of this thesis contains contributions to the domain of air traffic control (ATC). Maximum length sequences (MLS) are transmitted over the voice channel with a standard aeronautical radio and the received signals are recorded. The corpus is publicly available and provided free of charge. and a corresponding estimation method. Evaluated with the full watermarking system. The flight path of the aircraft is accurately tracked. such as different filter responses. speech recognition and speaker identification. an ATC speech corpus. the active frame detection performs nearoptimal with the overall bit error ratio increasing by less than 0. database and model. The resulting database is available under a public license free of charge. Chapter 9 proposes a data model to describe the data in the TUG-EEC-Channels database. among others. which were recorded during ATC real-time simulations. a time-variant gain. It consists of ten hours of speech data. After a brief overview of the application of speech watermarking in ATC and related work in Chapter 6. a sampling frequency offset. the information-carrying frames are detected in the presence of preamble bit errors with a ratio of up to 10 %. the design of ATC speech transmission systems. and an evaluation of the robustness of the proposed watermarking application.

Chapter 11 concludes both parts of this thesis and suggests directions for future research. with the modeling error being smaller than the channel’s noise. The data model achieves a fit with the measured data down to an error of -40 dB. we conclude that the measured channel is frequencynonselective. unpublished • a battery-powered portable measurement system for mobile audio channels – presented in Chapter 8 – published in [4] • a database of aeronautical voice radio channel measurements 5 . gain modulation and additive noise.3. We also provide simple methods to estimate and compensate the sampling frequency offset and the time-variant gain. Furthermore we show that pre-processing of the speech signal with a dynamic range controller can improve the watermark robustness as well as the intelligibility of the received speech. desynchronization. time-variant filtering and desynchronization – presented in Chapter 4 and Chapter 10 – published in [1][2][3] • methods for watermark synchronization on the sampling.3.1. The data contains a small amount of gain modulation (flat fading). The observed noise levels are in a range from 40 dB to 23 dB in terms of SNR. Chapter 10 shows the robustness of the proposed watermarking method in the aeronautical application using the channel model derived in Chapter 9. gain modulation. bit and frame level – presented in Chapter 5 – published in [1][2] • watermark capacity estimations for phase modulation based and frequency masking based speech watermarking – presented in Chapter 3 – recent results. 1. We experimentally demonstrate the robustness of the method against filtering. but several factors indicate that it is not a result of radio channel fading. Applying the model to select parts of the database. Contributions The main contributions of this thesis are: • a novel speech watermarking scheme that outperforms current state-of-the-art methods and is robust against AWGN. Contributions Squares. Its source could not be conclusively established.

” in Proceedings of the International Conference on Language Resources and Evaluation (LREC).” in Information Hiding. Petrik. Serbia.Chapter 1.tugraz.” in Proceedings of the IEEE Vehicular Technology Conference (VTC). Springer. Hofbauer. 2005. USA. France. “Aeronautical voice radio channel modelling and simulation—a tutorial review. K. Kubin. Montreal. Introduction – presented in Chapter 8 – published in [4] – publicly available at http://www. May 2008. and did the major part in writing the paper.” in Proceedings of the EUROCONTROL Innovative Research Workshop (INO). the author of this thesis did the major part of the theoretical work. pp. Hofbauer and G. pp. 2007. revised and resubmitted. K. 2009. Hofbauer. Marrakech. K. G. “A measurement system and the TUGEEC-Channels database for the aeronautical voice radio.spsc. and G. Speech. Dec. [1] K. Hofbauer and G. “High-rate data embedding in unvoiced speech. Hofbauer. K. pp. Hofbauer and H. Jul. B. Hering.at/TUG-EEC-Channels • a data model and an estimation technique for measured aeronautical voice radio channel data – presented in Chapter 9 – recent results. 241–244. Belgrade. S.spsc. “Noise robust speech watermarking with bit synchronisation for the aeronautical radio. Pittsburgh. 2006. conducted all experiments. Lecture Notes in Computer Science. unpublished • an air traffic control simulation speech corpus – presented in Chapter 7 – published in [5] – publicly available at http://www. Morocco.” in Proceedings of the International Conference on Spoken Language Processing (INTERSPEECH).tugraz. Kubin. K. Kubin. H.” IEEE Transactions on Audio. Hering. and W. 215–220. Brétigny-sur-Orge. [2] [3] [4] [5] [6] [7] 6 . 4567/2008. Kubin. “The ATCOSIM corpus of non-prompted clean air traffic control speech. “Speech watermarking for the VHF radio channel. Hering. and Language Processing. PY. Kleijn. Hering. Sep. Sep. 2006. and H.at/ATCOSIM 1. vol. “Speech watermarking for analog flat-fading bandpass channels. 252–266. 2006. H.” in Proceedings of the International Conference on Research in Air Transportation (ICRAT). Canada. Hofbauer. Kubin. ser. K.4. List of Publications First-Author Publications For the following papers. 1–5. pp. and G.

und Raumfahrtkongress). The work of the following two papers is not included in this thesis. Hofbauer and H. Berlin. and Operations Conference (ATIO). 2008. Hering and K. List of Publications [8] K. K. “A comparison of estimation methods for the VHF voice radio channel. Hofbauer. 2004. Hering and K. Alaska. Northern Ireland.” in Proceedings of the EUROCONTROL Innovative Research Workshop (INO). “From analogue broadcast radio towards end-to-end communication. USA. and Operations Conference (ATIO). Sep. 2007. Hofbauer. [9] Co-Author Publications For the following paper. Gruber and K. 2007.4. 2005. Anchorage. Integration. France. Hofbauer. The author helped with the implementation of the experiments and the writing of the papers. and the writing of the paper. the author of this thesis helped with and guided the theoretical work. 7 . the experiments.” in Proceedings of the AIAA Aviation Technology.1. Sep. [12] H. [10] M. Hofbauer. Belfast. Sep. Brétignysur-Orge. VA.” in Proceedings of the NASA Integrated Communications Navigation and Surveillance Conference (ICNS). [11] H. “Advanced speech watermarking for secure aircraft identification. Dec. Germany. “Digital signatures for the analogue radio.” in Proceedings of the CEAS European Air and Space Conference (Deutscher Luft. Integration. Hering. USA. “Towards selective addressing of aircraft with voice radio watermarks. Fairfax.” in Proceedings of the AIAA Aviation Technology.

Introduction 8 .Chapter 1.

Part I. Speech Watermarking 9 .

.

the allowed perceptual distortion induced by the watermark.1 Any modification to the watermarked signal in between the watermark embedding and the watermark detection constitutes a ‘channel attack’ and might render the watermark undetectable. A generic model for digital watermarking is shown in Figure 2. and the watermark detection errors induced by the channel attack. The watermark capacity is the maximum information rate of the embedded message given the host signal. Digital Watermarking Watermarking is the process of altering a work or data stream to embed an additional message that is not or hardly perceptible.1. perceptual fidelity and robustness against intentional or unintentional channel attacks. video. archiving. Practical applications for digital watermarking range from copy prevention and traitor tracing to broadcast monitoring. We only consider the case of blind watermarking. text or speech signal.1. 11 . the term ‘overt embedded communications’ is used if the existence of an embedded message is known. A brief and very accessible introduction to watermarking including a presentation of early schemes can be found in [14]. the work or data stream to be altered is called the ‘host signal’. The selection of an appropriate operating point is highly application-dependent. and legacy system enhancement. and may be an image. An exhaustive treatment of the underlying concepts of watermarking is provided in [13]. where the original host signal is not available at the watermark detector.Chapter 2 Background and Related Work 2. 1 In [13]. The ‘watermark’ is the embedded message and is expected to cause minimal perceptual degradation to the host signal. or ‘steganography’ if the existence of the embedded message is concealed. the term ‘watermarking’ is used only if the embedded message is related to the host signal. Otherwise. Most watermarking methods are subject to a fundamental trade-off between watermark capacity. audio. In digital watermarking.

However. A number of methods. The watermark channel capacity then becomes CW. 2. It allows to consider watermarking as a communication problem. Reducing the embedding dimensionality of distortion-compensated QIM to one (sample-wise embedding) and using scalar uniform quantizers in combination with a watermark scale factor results in the popular scalar Costa scheme (SCS) [23]. The capacity 2 is independent of the host signal power σH . for example the rational dither modulation scheme [24]. with potential spectral shaping of the watermark signal (Figure 2. 22]. Related Work—General Watermarking In recent years. a power-constrained watermark signal and an additive white Gaussian noise (AWGN) channel attack [20]. non-linear operations. mixing or resampling.1) assuming in- and depends solely on the watermark-to-attack-noise power ratio dependent identically distributed (IID) host data. or an additive embedding of the watermark in a transform domain of the host signal (Figure 2. 19]. have been proposed to counteract this vulnerability. As an example.2e) using different vector quantizers or dither signals that depend on the watermark data. These methods are often referred to as quantization index modulation (QIM). quantization-based methods embed the information by requantizing the signal (Figure 2. 12 .2d) or a transform domain representation (Figure 2. Striving to approach this capacity with practical schemes.g.2c). with the host signal being side information that is available to the embedder but not to the detector. Background and Related Work Host Signal Watermark Embedder Payload Message Channel Attacks Received Host Signal Watermark Detector Detected Payload Figure 2.2a). quantization-based methods are still sensitive to many simple signal processing operations such as linear filtering.: Generic watermarking model.Chapter 2. digital watermarking techniques achieved significant progress [13. Costa = σ2 1 log2 1 + W 2 2 σN 2 σW 2 . but is vulnerable to amplitude scaling. 17. σN (2. Early methods considered watermarking as a purely additive process (Figure 2. but suffered from the inherent interference between the watermark and the host signal [16. 18. and thus watermarking without host signal interference is possible [21. Costa’s work on communication with side information [20] enabled a breach with traditional additive watermarking.2b). This type of system was used in many practical implementations and proved to be highly robust. [25]. SCS is optimally robust against the AWGN attack. spread spectrum watermarking adds a pseudo-random signal vector representing a watermark message to a transformation of the original host signal.2. [26]). 15]. which are the subjects of numerous recent investigations (e.1.

and. (c) adding the data to a transform domain signal representation e(n). 13 .: A watermarked signal s(n) is generated from the original speech signal s(n) and the watermark data signal w(n) by (a) adding the data to s(n).2.2. (b) adding adaptively filtered data to s(n).2. (f) modulating e(n) with the data. ê (g) 1/T w ŝ s T ê 1/T ŝ ˆ Figure 2. as proposed in this work. (g) replacing parts of e(n) by the data. (d) quantizing s(n) according to the data. (e) quantizing the transform e(n) according to the data. Related Work—General Watermarking (a) s w (b) ŝ s w H (c) ŝ w s T e ê 1/T ŝ (d) w s (e) ŝ w Q s T e Q ê 1/T ŝ (f) w s T e Mod.

From this we develop in Chapter 4 a practical watermarking scheme that is based on the above concept of replacing signal components (see Section 2. we combine the watermarking theory presented in the preceding sections with a well-known principle of speech perception. spread-spectrum type systems used to be the method of choice for robust embedding [27. However. Many audio watermarking algorithms use auditory masking.2g). by modulating the frequency of selected partials of a sinusoidal speech or audio model [35. and the linear prediction residual [33]. Various quantizationbased methods have been proposed in recent years. While this is a valid approach also for speech signals. and most often frequency masking. Related Work—Speech Watermarking Most watermarking methods use a perceptual model to determine the permissible amount of embedding-induced distortion. In a generalization of QIM. it has previously been shown that MSE is not a suitable distortion measure for 14 . In speech watermarking. The achievable bit rates range from a few bits to a few hundred bits per second. discrete Hartley transform coefficients [32]. or by modifying the linear prediction model coefficients of the speech signal using long analysis and synthesis windows [37]. as in audio watermarking. with varying robustness against different types of attacks.Chapter 2. 2. one can even replace certain speech components by a perceptually similar watermark signal (Figure 2. or fail to thoroughly exploit state-ofthe-art watermarking theory or the characteristic features of speech perception. Methods have been proposed that exchange time-frequency components of audio signals above 5 kHz that have strong noise-like properties by a spread spectrum sequence [38].1) can be interpreted as a mean squared error (MSE) host signal distortion constraint and essentially determines the watermark capacity. some methods modulate the speech signal or one of its components according to the watermark data (Figure 2.4. we claim that capacity remains unused by not considering the specific properties of speech and the way it is perceived. leading to a substantial improvement in theoretical capacity.2f) by estimating and adjusting the polarity of speech syllables [34]. The above-mentioned algorithms are limited either in terms of their embedding capacity or in their robustness against the transmission channel attacks expected in the aeronautical application. This is due to the fact that the methods are designed for particular channel attacks (such as perceptual speech coding). 2 The watermark power σW in (2. the pitch period [31]. Our Approach In our approach. Taking quantization and modulation one step further.3) and using common speech coding and digital communications techniques. as the perceptual model for watermark embedding. or replace signal components that are below a perceptual masking threshold by a watermark signal [39. applying QIM or SCS to autoregressive model parameters [29. 40]. 28]. Background and Related Work 2. 30]. focus on security considerations concerning hostile channel attacks.3. It is thus indispensable to tailor a speech watermarking algorithm to the specific signal. 36].

we assume an autoregressive (AR) speech signal model and constrain the AR model spectrum and the temporal envelope to remain unchanged. Also. Note that in contrast to (2. We do so in non-voiced speech only.2. since in voiced speech certain phase spectrum changes are in fact audible [49. in particular for non-voiced speech (denoting all speech that is not voiced. 2 Thus.4.2) This high SNR approximation is derived in Section 3. 46]. σN 2 σW 2 . are not suitable for narrowband speech [45. While watermarking in the phase domain is not a completely new idea. There is a one-to-one mapping between the signal phase and the model’s excitation. (2. the large theoretical watermark capacity given by (2. This effect is also exploited in audio coding for perceptual noise substitution [43]. 15 .2) has not been recognized before. It is a long-known fact that. 50]. To obtain a perceptually transparent and practical implementation of our phase embedding approach.3. as for example flipping the polarity of the signal results in a large MSE but in zero perceptual distortion. Applying the concept of replacing certain signal components of Section 2. which allows us to modify the signal phase by modifying the excitation.4. Our Approach speech signals [41]. 47. Compared to MSE this PSD constraint is far less restrictive and results in a watermark channel capacity CW. 48]. instead of the MSE distortion (or watermark power σW ) constraint. we propose to constrain the watermarked signal to have the same power spectral density (PSD) as 2 the host signal (with power σH ) but allow arbitrary modification (or replacement) of the phase.1) the capacity is no longer determined by the MSE distortion to channel noise ratio but by the host signal to channel noise ratio or SNR 2 σH 2 . we exchange the model excitation by a watermark signal. the ear is insensitive to the signal’s phase [42].2. Phase ≈ 1 log2 4 2 σH 2 σN + 1 log2 4 4π e . or are restricted to relatively subtle phase modifications [14. previously proposed methods either require the availability of the original host signal at the detector [44]. comprising unvoiced speech and pauses) and blocks of short duration. σN and the watermark signal has the same power as the host signal.

Chapter 2. Background and Related Work 16 .

This chapter presents recent results that have not yet been published. Starting from the general capacity of the ideal Costa scheme.Chapter 3 Watermark Capacity in Speech This chapter investigates the theoretical watermark capacity in speech given an additive white Gaussian noise transmission channel. Parameters required for the experimental comparison of both methods are estimated from a database of ATC speech signals. it shows that a large improvement in theoretical watermark capacity is possible if the application at hand allows watermarking in perceptually irrelevant signal components. 17 . Different ways to derive approximate and exact expressions for the watermark capacity when modulating the signal’s DFT phase are presented and compared to an estimation of the capacity when modulating host signal frequency components that are perceptually masked.

Since perceptual coding aims to remove perceptually irrelevant signal components.2. ‘watermarking with side information’ or ‘informed embedding’ is the underlying principle for most quantization-based methods. The ICS. Using well-established speech perception properties. Besides watermarking in perceptually relevant components. Given that in the past most watermarking research was driven by the need of the multimedia content industry for copyright protection systems. An experimental comparison of the different capacities is performed in Section 3. A similar discussion can be found in [13]. all capacities C are given in bits per sample (or bits per independent symbol. In quantization-based watermarking the host signal is quantized using different vec- 18 . the maximum number of watermark bits transmitted per second is C times twice the channel bandwidth in Hz. many algorithms aim at a robustness against perceptual coding attacks. Thus. equivalently).2. also termed ‘dirty paper watermarking’. we aim to estimate the watermark capacity in speech given an additive white Gaussian noise (AWGN) channel attack. watermarking must occur in perceptually relevant components. watermarking in perceptually irrelevant signal components is possible.Chapter 3. the maximum number of independent symbols per second (the maximum symbol rate or Nyquist rate) is twice the channel bandwidth (e. For a band-limited channel. Given the aeronautical application presented in Chapter 6. we focus on watermarking in perceptually irrelevant components and derive the watermark capacities in speech on the basis of frequency masking (Section 3. Watermark Capacity in Speech 3. the capacity specifies how many watermark bits can be embedded in an independent host signal sample or symbol.5. we start with a brief review of the capacity of watermarking in perceptually relevant signal components and the Ideal Costa Scheme (ICS) [13]. In watermarking for analog legacy system enhancement. there is a limit to how much perceptually relevant components can be altered without degrading perceptual quality to an unacceptable level.4). In the remainder of this chapter.g. Consequently. 3.1. robustness against perceptual coding is not a concern. Watermarking Based on Ideal Costa Scheme As a basis for comparison. which opens a door to significant capacity improvements. and it is possible to derive an achievable upper bound for the watermark capacity. It is within this context where quantization-based watermarking and its capacity evolved. we show that the watermark capacity in speech far exceeds the watermark capacity of conventional quantization-based watermarking.3) and phase perception (Section 3. However. [51]). but otherwise the important distinction between watermarking in perceptually relevant or irrelevant components is seldom brought forward. We achieve this improvement by proposing a clear distinction between watermarking in perceptually relevant signal components (or domains) and watermarking in perceptually irrelevant signal components. Introduction In this chapter.. and its capacity is reviewed in Section 3.

The watermark capacity evaluates to [51.1) and is inherently limited by the MSE distortion constraint. frequency masking (or simultaneous masking) describes the effect that a signal is not 19 . which roughly means the number of ‘noise spheres’ N that fit into the ‘allowed-distortion sphere’ W. 13] CICS = 1 log2 L VY VW = σ2 1 log2 1 + W 2 2 σN bit/sample (3. 3.3. and in high dimensions the received signal vector lies with high probability inside a hypersphere N with radius rN = 2 LσN centered at s [52]. In particular. w s Q ŝ s ŝ Figure 3. which has a power constraint σW that corresponds to an MSE criterion for the distortion of the host signal.3. 2 Assuming a white Gaussian host signal vector s of length L with variance σH . Consequently. 52.1. The number of distinguishable watermark messages given the channel noise is thus VY/VW . Figure 3.3. We review in the following the watermark capacity derivation given this MSE distortion criterion. The Gamma function Γ is defined as ˆ Γ(z + 1) = zΓ(z) = z 0 ∞ tz−1 e−t dt.2). Watermarking Based on Auditory Masking Auditory masking describes the psychoacoustical principle that some sounds are not perceived in the temporal or spectral vicinity of other sounds [53]. tor quantizers that depend on the watermark data.1 depicts the quantization 2 as an addition of a host signal dependent watermark. Watermarking Based on Auditory Masking m m Watermark Signal Gen.: Quantization-based watermarking as a host-signal-adaptive additive process. all possible watermarked signals within ˆ 2 2 W lie after transmission within a hypersphere Y with radius rY = L(σW + σN ) (see Figure 3. the watermarked signal s must lie within an ( L − 1)-sphere W centered around s with ˆ volume L π L/2 rW VW = Γ ( L + 1) 2 and radius rW = 2 LσW . 2 The transmitted signal is subject to AWGN with variance σN .

[56. and thus the watermark capacity. are commonly used in practical audio watermarking schemes. 8]. The remainder of this subsection derives an estimate of the theoretical watermark capacity that can be achieved based on frequency masking. the watermark signal is inaudible if it is kept below the masking threshold..g.g. The same approach is frequently applied in perceptual speech and audio coding: The 20 .2.. the watermark signal is not audible as long as the spectral envelope of the watermark signal is at or below the masking threshold. [39]). Watermark Capacity in Speech si+1 Sphere Y with radius rY Sphere N with radius rN w ŝ s Sphere W with radius rW Reconstruction point of quantization cell si Figure 3. Perceptual models in general. If the watermark signal (or quantization error) is white. sinusoidal components below the masking threshold are not audible because they are masked by the components above the masking threshold.3 shows the spectrum of a signal consisting of three sinusoids. [55.g. Ch. However. and auditory masking in particular.. 28]). Theory In quantization-based watermarking. or by removing masked components and inserting a watermark signal in replacement (e. 57]). and the watermark power σW in (3. the principle of auditory masking is exploited either by varying the quantization step size or embedding strength in one way or the other (e.1) represents a MSE distortion criterion on the speech signal. Figure 3.mask . the watermark signal corresponds to the quanti2 zation error induced by the watermarking process. Each reconstruction point ( ) within the sphere W denotes a different watermark message. and the masking threshold (or masking curve) derived using the van-de-Par masking model [54]. and the principles are well understood [13. by spectrally shaping the watermark signal according to the signal-dependent masking threshold. 3.1.: Quantization-based watermarking and the Ideal Costa Scheme shown as sphere packing problem. One can in2 crease the watermark power to σW.Chapter 3.3. According to the masking model. the watermark is perceived as background noise and 2 the permissible level (the watermark power σW ) is application-dependent. Besides more traditional approaches such as the spectral shaping of spread spectrum watermarks (e. According to the model. audible in the presence of a simultaneous louder masker signal at nearby frequencies.

with a small experiment. the maskee signal is scaled to be a factor κ below the calculated masking threshold.2) CMask = log2 1 + 2 2 σN 2 with typically σW. in this experiment the maskee is a wideband noise signal. We generated a noise signal that is uncorrelated to the speech signal 21 . quantization step-size is chosen adaptively such that the resulting quantization noise remains below the masking threshold [58. 59]. Watermarking Based on Auditory Masking 70 DFT magnitude (in dBSPL) 60 50 40 30 20 10 0 0 500 1000 1500 2000 2500 Frequency (in Hz) Signal spectrum Masking curve 3000 3500 4000 Figure 3.mask 2 σW . In a frequency band from 100 Hz to 8000 Hz 2 we measure the average power σH of the speech signal as well as the permissible power 2 σW. Using (3. Experimental Results 2 We estimate the permissible watermark power σW. The maximum value of κ at which the maskee signal is still inaudible was determined in an informal experiment. which would not be entirely masked. While the used masking model describes the masking of sinusoids by sinusoidal or white noise maskers.5 dBSPL .mask and.3. we calculate the masking thresholds in frames of 30 ms (overlapping by 50%). the watermark capacity results in 2 σW. Using ten randomly selected utterances (one per speaker) of the ATCOSIM corpus (see Chapter 7).mask of the watermark signal. The masking thresholds are calculated based on [54].2.3. the second of them being masked and not audible. To compensate for this shortcoming. with the spectral envelope of the watermark signal set to a factor κ below the signal-dependent masking threshold.1). thus. 3. the watermark capacity.: Masking threshold for a signal consisting of three sinusoids. which corresponds to a comfortable speech signal level for normal listeners [61].3.3. using a listening-threshold-in-quiet as provided in [60]. and a listening level of 82.mask 1 bit/sample (3.

4. the permissible watermark power 2 2 is σW.3) Compared to (3. σN 22 .mask/σH using κ = −12 dB for the utterance “swissair eight zero six contact rhein radar one two seven decimal three seven”. The resulting noise signal was added to the speech signal with varying gains κ.mask ≈ κσH .mask/σH Masked−to−signal power ratio (in dB) −8 −10 −12 −14 −16 0 1 2 3 Time (in s) 4 5 6 2 2 Figure 3. (3. corresponding to one fourth in amplitude or one 16th in terms of power. or by generating in the DFT domain a complex white Gaussian noise with unit variance and multiplying this with the masking threshold. The estimated average watermark capacity using (3.2) then results in CMask ≈ σ2 1 log2 1 + κ H 2 2 σN ≈ 2 1 1 σH log2 1 + 2 2 16 σN . 2 2 The masked-to-signal power ratio σW.: Scaled masking threshold power to signal power ratio σW. and the maximum value of κ at which the noise is still masked was found to be κ ≈ −12 dB. This can be done by either setting the DFT magnitude of the watermark signal to the masking threshold and randomizing the DFT phase.mask/σH for all frames of an utterance is shown in Figure 3.1). CMask does not depend on an MSE distortion criterion but is given by the transmission channel’s signal-to-noise ratio (SNR) 2 σH 2 . Watermark Capacity in Speech (a) Time-Domain Speech Signal 1 Amplitude 0 −1 0 1 2 3 Time (in s) 4 5 6 (b) Masked-to-signal power ratio 2 2 σW. and has a spectral envelope that corresponds to the masking threshold of the speech signal.4.Chapter 3. Averaged over all frames and utterances.

Then. 3. In contrast to the DFT coefficients. Definitions of Phase The notion of a signal’s phase is frequently used in literature. Interpreting lapped transforms as filterbanks. but ˆ allow arbitrary modification of the phase.1.1. This notion of phase is used in our practical watermarking scheme of Chapter 4. DFT Phase The DFT phase denotes the argument (or phase angle) of the complex coefficient of discrete Fourier transform (DFT) domain signal model as defined in Section 3.2.1. The following capacity derivations consider only a single DFT frame of length L of the speech signal. and the intended meaning of phase widely varies depending on the underlying concepts. MLT and ELT Phase In Section 3. A mapping of an AR model phase to a DFT phase requires a short-term Fourier transform (STFT) of the signal using a well-defined transform size. two exact solutions are derived.8 modulated and extended lapped transform (MLT and ELT) signal representations are applied. we consider the variances of the subband 23 . The following subsections present four different ways to derive the watermark capacity: First. we consider a particular realization of a white Gaussian excitation signal of an autoregressive (AR) speech signal model as defined in Section 4.4. The three phase definitions have in common that they represent a signal property that can be modified without modifying some sort of a spectral envelope of the signal. Watermarking Based on Phase Modulation We consider the case of watermarking by manipulating the phase spectrum of unvoiced speech. which are interrelated but nevertheless represent different quantities. one based on the joint probability density function (PDF) between input and output phase angle. AR Model Phase Primarily. Preliminaries 3.4.1. we deal in the remainder of this thesis with three different ‘phases’. In general.4. while keeping the power spectrum unchanged. Depending on the most suitable representation.1.4.4. two high SNR approximations are presented.3. However.1 as the phase of a signal. It was previously shown that for unvoiced speech the ear is not able to distinguish between different realizations of the Gaussian excitation process as long as the spectral and temporal envelope of the speech signal is maintained [42]. This means that instead of an MSE distortion constraint we constrain s to have the same power spectrum as s. there is no standard definition. the lapped transform coefficients are real and no direct notion of phase exists. Watermarking Based on Phase Modulation 3. and one based on the conditional PDF of the complex output symbol given the complex input symbol. and one using sphere packing. window length and overlap.4. the phase of a signal is a quantity that is closely linked to a certain signal representation. one using the information theoretic definition of capacity.

e. 3.1 2 ˜ The tilde mark (˜) denotes complex variables.4. Thus. which makes the DFT a unitary transform. no Nyquist frequency coefficient exists. Watermark Capacity in Speech signals within a short window as the quantities that determine the spectral envelope. modifying the MLT or ELT phase of a signal means replacing the original subband signals by different subband signals with identical short-term power envelopes. and L + 1 coeffi2 cients are independent. the DFT representation has 2L real coefficients R. we model the AR model phase replacement performed in [42] and Chapter 4 by replacing the phase of a DFT domain signal representation.g. PDFs of the Random Variables In the following paragraphs.8) we consider the discrete Fourier transform (DFT) of a single frame of length L of an unvoiced speech signal. of which only L are independent. ˜ ˜ ˜ resulting in the noisy watermarked signal s + n with DFT coefficient Y = X + N and ˆ output phase angle Θ. x = re jϕ = a + jb. we derive the PDFs of the involved random variables (RV). each complex DFT coefficient X has variance σ2 .1. 1 For L even.4 (except Section 3. and consider the particular realizations of the subband signals as the phase of the signal. which represents the watermark signal. 3. Assuming a white host signal with variance 2 . respectively. 24 . whereas boldface letters denote real valued vectors.4. The watermarked 2 ˜ signal is subject to AWGN n with DFT coefficient N. is imposed onto a DFT coefficient of the host signal s. and L+1 coefficients are independent. resulting in the watermarked DFT coefficient X = rH e jΦ . An input phase angle Φ. 1 and modify the DFT phase while keeping the DFT magnitude constant. with s.4. the DC coefficient is real. where the L odd-numbered dimensions represent the real part Rrl and and the L even-numbered dimensions represent the imaginary part Rim of the ˜ complex DFT coefficients X = Rrl + jRim . Each ˜ frame is assumed to have L complex coefficients X.Chapter 3.2.4. the coefficients representing DC and Nyquist frequency are real-valued.3. DFT Domain Signal Model For all capacity derivations of Section 3. While its original magnitude rH is maintained. Considering ˜ the real and imaginary parts of X as separate dimensions. We use a √ L weighting for the DFT and its inverse. As a simplification we assume in the following L independent 2 2 complex coefficients.1. and each real coefficient R has ˜ σH H variance 2 σH 2 . phase angle Ψ and variance σN .1. For odd L. Watermarking Model In order to facilitate theoretical watermark capacity derivations.4. s and n representing a single frame of the ˆ host signal. which for large L leads to a negligible difference in the capacity estimates. 3. the watermarked signal and the channel noise. the phase angle is replaced by the watermark phase ˜ angle Φ..

i.4. of a complex zero-mean WGN signal with variance 2 σN is characterized by 2 ˜ N ∼ CN (0. Nb ∼ N (0. the input phase angle is restricted to the interval [0. and (3. This results in [62.5) 25 . Note that h( N ) = h( Na ) + h( Nb ). p. f Np (r. ˜ ˜ Noise Coefficient N (complex RV) Each complex DFT coefficient Nc . b) = f Na ( a) f Nb (b) 2 ˜ and h( Na ) = h( Nb ) = 1 log2 (πeσN ).e. 2π [.][63] 1 2π f Nϕ ( ϕ ) = 0 0 ≤ ϕ < 2π else f Nr (r ) = r r2 exp − 2 σ2 2σ = 2r r2 exp − 2 2 σN σN (3. To maximize its differential entropy h.Nb ( a. Φ is assumed to be uniformly distributed with PDF f Φ ( ϕ) = and differential entropy h(Φ) = log2 (2π ). σN ). the subscript c denoting Cartesian coordinates. f Nc ( x ) = ˜ ˜ ˜ 1 | x |2 exp − 2 2 πσN σN ˜ with | x |2 = a2 + b2 = r2 . ϕ) = f Nc ( x ).Nb ( a. 201 ff. The density f Nc ( x ) is not the density in polar coordinates. ˜ ˜ ˜ ˜ ˜ because a transformation of variables is required.3. But f N ( x ) is really just a short-hand notation for the joint PDF f Na . f Nc ( x ) = ˜ ˜ 2 σN 2 ) = = 1 2 πσN exp − exp − a2 2 σN b2 2 σN 2 σN 2 ) 1 2 πσN f Na . b) of the ˜ ˜ independent marginal densities Na ∼ N (0. Watermarking Based on Phase Modulation Input Phase Φ (scalar RV) Given the DFT domain signal model. which maximizes the watermark channel capacity C. since the 2 joint differential entropy is the sum of the differential entropies if the variables are independent.4) 1 2π 0 0 ≤ ϕ < 2π else 2 ˜ h( N ) = log2 (πeσN ).

ϕ) = f Nr (r ) f Nϕ ( ϕ) = ˜ r r2 exp − 2 2 πσN σN (3. with r = a2 + b2 ϕ = arctan b a and the inverse transformation a = r cos ϕ b = r sin ϕ.Nϕ (r. the PDFs of the independent RVs X ϕ and Xr are f X ϕ ( ϕ) = 1 2π 0 0 ≤ ϕ < 2π else f Xr ( r ) = δ ( r − r H ) f Xr . ϕ) = JN · f Na .6) ˜ and f Np ( x ) = r · f Nc (r ). b) = ∂(r. Also. b).4) by consid˜ ˜ ering the polar coordinates (r. Nr ˜ ˜ and Nϕ are independent.5) results in f Nr (r ) = 2πr · f Nc ( x ). comparing (3. Watermark Capacity in Speech where f Nr (r ) is the Rayleigh distribution with parameter σ (in our case with σ2 = and mean and variance µ Nr 2 σNr 2 σN 2 ) π σ √ = N π 2 2 π 2 π 2 = (2 − )σ = (1 − )σN . ϕ) cos ϕ −r sin ϕ sin ϕ r cos ϕ =r ˜ Signal Coefficient X (complex RV) We consider a single complex DFT coefficient ˜ X = rH e jΦ with rH being a deterministic constant. r sin ϕ) = r r2 exp − 2 2 πσN σN . ϕ) = 1 2π δ (r 0 − rH ) 0 ≤ ϕ < 2π else 26 .4) and (3. The Jacobian determinant J of the transform is JN = and the joint density is f Nr . ϕ) as a vector function (or transform) of the Cartesian coordinates ( a.X ϕ (r. 258] ˜ f Np ( x ) = f Nr . p. This polar representation can be derived from (3.Nb (r cos ϕ. In polar coordinates. and their joint distribution is [62.Chapter 3. 2 4 = σ In this special case. ∂( a.Nϕ (r.

With the transform a = r cos ϕ b = r sin ϕ and the inverse transform r = a2 + b2 ϕ = arctan b a and its Jacobian JX = ∂(r.8) 1 − ( rb )2 H −1 2 for |b| ≤ rH else. b) = JX · f Xr .7) and the dependent marginal densities in Cartesian coordinates are  −1 2  1 a 2 1 − ( rH ) for | a| ≤ rH f Xa ( a) = πrH 0 else f Xb ( b ) =   0 1 πrH (3. tan b a = δ a2 + b2 − rH = 2π a2 + b2 1 = δ a2 + b2 − rH 2πrH √ 1 (3.X ϕ a2 + b2 .5 can be approximated by the length of the tangent within the strip. The probability P that a point lies within the strip ∆a is the arc length ∆c times the 1 density 2πrH of the arc. Using sin β = the arc length results in ∆c = 1− A rH 2 A f = = rH ∆c c2 − (∆a)2 ∆c −1 2 ∆a.3.4. P= 1 ∆c. times a factor of 2 to also account for the lower semicircle. πrH 27 . Watermarking Based on Phase Modulation and describe a two-dimensional circularly symmetric distribution [63]. b) ∂r ∂a ∂ϕ ∂a ∂r ∂b ∂ϕ ∂b =√ 1 a2 + b2 the joint density in Cartesian coordinates is f Xa . The arc length ∆c within an infinitesimally small strip of width ∆a as shown in Figure 3.Xb ( a. The marginal densities are derived as follows. ϕ) = ∂( a.

P and the probability density ∆a gives in the limit ∆a → 0 the marginal PDF ˆ f Xa ( a ) = ∞ b=−∞ f Xa . 28 .Xb ( a. and f Y ( x ) = f Y (r ) = f X ( x ) ∗∗ f N ( x ) ˜ ˜ ˜ ˜ ˜ ˜ ˜ where ∗∗ denotes 2D-convolution. b) integrated over an interval ∆a.Chapter 3.5.Xb ( a. −rH ˜ Noisy Signal Coefficient Y (complex RV) The noise-corrupted watermarked coeffi˜ is a sum of two independent complex RV. 1 f Xa ( a)da = πrH ˆ rH a=−rH 1− a rH 2 −1 2 da = = 1 rH arcsin πrH a rH rH = 1. In Cartesian coordinates the summation of the two complex RVs can be carried out on a by component basis. with Ya = Xa + Na Yb = Xb + Nb .: The joint density f Xa . b)db =    The marginal density integrates to 1 since ˆ ∞ a=−∞  0 1 πrH 1− a rH 2 −1 2 for | a| ≤ rH else. with cient Y ˜ ˜ ˜ Y = X + N = rY e jΘ . Watermark Capacity in Speech b Δa β Δc f β A Δa rH a Figure 3.

5.9) The additional factor 1 compared to the usual definition results from only half of the 2 ˜ complex DFT coefficients X being independent.4. Even though ˜ the two capacities are information is only embedded in the phase angle Φ of X.5 and Section 3. Ya and Yb are also dependent. the conditional PDF f Y |X (y| x ) can be easily stated given ˜ ˜ ˜ ˜ ˜ ˜ and Y. the circular arrangement of X Output Phase Θ (scalar RV) ˜ signal Y is The output phase Θ of the observed watermarked Yb = Yϕ . and the joint density is not the product of the marginal densities. the density of the first sum is ˆ ∞ f Ya ( a) = f Xa ( a) ∗ f Na ( a) = f Xa (m) f Na ( a − m)dm = m=−∞ ˆ ∞ = which evaluates to 1 f Ya ( a) = πrH 1 2 πσN m=−∞ f Xa ( a − m) f Na (m)dm. and C= 1 ˜ ˜ sup I ( X. The capacity is defined as the mutual information between the channel input and the channel output. because the input phase is uniformly distributed and the channel noise is Gaussian.3. ˆ m 1− m rH 2 1 −2 exp − ( m − a )2 2 σN dm. However. Approaches to Capacity Calculation Given the DFT-based signal and watermarking model. since in the later case the watermark detector can also respond to the amplitude of the received signal. one based on the channel input/output phase angles Φ and Θ. It appears difficult to obtain closed-form ˜ ˜ expressions for the joint and conditional PDFs of X and Y. The expression for f Yb (b) is equivalent.1.8). ˜ Θ = arg(Y ) = arctan f Θ ( ϕ) = 1 2π 0 0 ≤ ϕ < 2π else and h(Θ) = log2 (2π ).4. as shown in Section 3.6. Ya and is again uniformly distributed. Θ). We calculate the capacity based on: 29 . Y ) 2 f ˜ (x) ˜ X or C= 1 sup I (Φ. 3. Thus. maximized over all possible input distributions. two watermark capacities can be calculated. Watermarking Based on Phase Modulation Using (3.4. and ˜ ˜ one based on the channel input/output complex symbols X and Y.4. Since Xa and Xb are dependent. No analytic solution to this integral is known. not identical. 2 f Φ ( ϕ) (3.

which effectively scale the SNR of the i’th subchannel from host signal.4.4.1. ∀i : pi = 1. Θ) over all distributions of Φ. the observed phase Θ is then Θ = Φ + Ψ.10) σ2 . High SNR Capacity Approximation Using Mutual Information 2 ˜ We first consider the complex DFT coefficient X of a white host signal with power σH . and the overall capacity is the sum of the subchannel capacities. We introduce band power weights pi .10) as Riemann sum results equivalently in ˆ f s /2 f =0 to pi σH .4. with 2 σN/2 σ2 Ψ∼N 0. MI-angle is the maximum mutual information I (Φ. p( f ) ≡ 1 as in the discrete case.3. The mutual information is maximized when Φ (and consequently 30 . Substituting L by f s and considering (3. the noise by N σH ˜ component of N in angular direction is a scalar Gaussian random variable N with N power 2 . the difference in differential entropy between Y and N (used in Section 3. In the case of high SNR. the extension to a colored host signal is straightforward and shown for the two high SNR approximations of Section 3.2 and 3.6). The capacity can be calculated individually for each DFT channel. 3.4. The discrete band power weights pi could also be considered as the samples of a scaled power spectral density p( f ) of the host signal. with i =1 ∑ pi = L/2 L 2 2 σH 2 σN (3.4. Given a transmitted phase Φ. The phase noise Ψ induced by the channel is then also Gaussian.11) For a white host signal. The channel capacity CPhase. ˜ ˜ 2.4.3). the formal definition of I based on the input’s PDF and the conditional PDF of the output given the input (used in Section 3.2.4. 3. wrapped into 0 . . 2 ˜ The Gaussian channel noise N with power σN changes both the amplitude and the ˜ phase of the received signal. The ‘displacement’ N of X in angular direction caused ˜ corresponds to a phase change tan( ϕ) = N .2 (used in Section 3.4. 2 (3. 2π.4.6. and we can also assume ϕ ≈ tan( ϕ). Colored Host Signal Spectrum While most of the following derivations assume a white host signal spectrum. Watermark Capacity in Speech 1. Using a white 2 N p ( f )d f = fs . the sphere packing analogy of Section 3.5 and 3. .4.2). 3. 2 σH . 3.Chapter 3.

Θ) = log2 2 4 2 σH 2 σN + 1 log2 4 4π e .3. Instead of the watermarked signal s being constrained ˆ to a hypersphere volume around s. = 1 log2 2 2 σH 2 σN + 1 log2 2 4π e ˜ and with only half of the complex coefficients X being independent we obtain CPhase.4.4. Θ) = h(Θ) − h(Θ|Φ) = h(Θ) − h(Θ − Φ|Φ) = h(Θ) − h(Ψ|Φ) = h(Θ) − h(Ψ) since Φ and Ψ are independent. 3. colored = = 1 log2 4 2 σH 2 σN + 1 log2 4 4π e + 1 2 fs ˆ f s /2 f =0 log2 ( p( f )) d f . For a non-white host signal with band power weights pi .55 bit. Watermarking Based on Phase Modulation also Θ since Ψ is Gaussian) is uniformly distributed between 0 and 2π as defined above.12) with 1 log2 4π ≈ 0. Θ) between the received phase Θ and the transmitted phase Φ then evaluates to I (Φ. MI-angle. 1 1 The factor 2L results from the factor 1 of (3. MI-angle. MI-angle.6 instead of the discrete band power weights pi results in CPhase. high-SNR = 1 1 I (Φ.4. (3. but requiring a longer proof [64]. The mutual information I (Φ. colored = 1 log2 2L ∏ pi σH 2 i =1 N 2 σH 2 σN L/2 σ2 + 1 log2 4 4π e 4π e = 1 2L L/2 i =1 = 1 log2 4 + 1 log2 4 + ∑ log2 ( pi ) . High SNR Capacity Approximation Using Sphere Packing An estimate of the watermark capacity can also be obtained using the same line of arguments as in Section 3. it is now constrained to lie on an L -dimensional 2 31 . The maximum mutual information is then given by I (Φ.1. The same result is obtained elsewhere for phase shift keying (PSK) in AWGN. Using the scaled host signal PSD p( f ) of Section 3. the capacity results in CPhase.3. MI-angle. high-SNR.12) and a factor L/2 given by the number of 4 product or summation terms. high-SNR. Θ) = h(Θ) − h(Ψ) = log2 (2π ) − σ2 1 log2 (2πe N ) 2 2 2σH . The subscript Phase.2. high-SNR denotes a high-SNR 4 e approximation of the phase modulation watermark capacity derived using the mutual information (MI) between the input and output phase angles.

For large L and using Sterling’s approximation. and one being parallel to the manifold. high-SNR = 1 log2 L AY VN Γ ( L + 1) 4 L L/4 (3. 32 . the power spectrum constraint corresponds to a fixed amplitude ˜ of the complex DFT coefficients X. one being orthogonal to the above manifold.15) which evaluates for a white host signal (pi ≡ 1) to CPhase. 3 This also reflects the fact that the channel noise changes the power spectrum of the received signal. In terms of the real and imaginary coefficients Rrl and Rim this means that for every 2-dimensional subspace of corresponding ( Rrl . (3. The noise component parallel to the manifold (with power determines the number of distinguishable watermark messages. the noise can be split up into two components N of equal power 2 . A projection of the noise onto a dimensional subspace (line. The surface of this manifold is 2 AH = ∏ 2πrH .16) The subscript SP denotes a watermark capacity derived using the sphere packing approach. Γ ( L + 1) 4 2 L σN 2 2 . We need to determine the constellation of the observed signal and the number of distinguishable watermark messages that fit onto the manifold. Γ( L + 1) can be expressed 4 2 The projection of the L-dimensional noise (with spherical normal PDF) onto the L -dimensional subspace 2 L 2 -dimensional would result in the volume of the L 2 hypersphere with radius rN = L 2 2 σN . hyperplane.13) 2 The transmitted signal is subject to AWGN with variance σN . The ‘size’ of a watermark cell on the L dimensional manifold is 2 the volume VN of the L -dimensional hypersphere with radius rN = 2 N (π L 2 ) L/4 2 VN = . and (3. SP.Chapter 3. Rim ) 2 the watermarked signal must lie on a circle with radius rH = pi σH . . high-SNR 1 = log2 4 2 1 σH + 2 2 σN 1 1 + log2 (16π ) + log2 4 L . SP. . with pi as defined above (for a white signal pi ≡ 1). Watermark Capacity in Speech manifold which contains all points in the L-dimensional space that have the same power spectrum as s. The orthogonal component effectively σ2 increases the radii of the circles from rH to rY = 2 σN 2 ) 2 pi σH + 2 σN 3 2 . plane. which is equivalent to a high SNR assumption. The watermarked signal s must thus lie on the ˆ L -dimensional manifold that is defined by these circles. .14) σ2 The capacity is CPhase.2 Averaged over all dimensions and given the circular symmetry of the distributions. i =1 L/2 (3. In the DFT domain. ) is something else than a projection onto a curved manifold. Separating the noise into a parallel and an orthogonal component requires that the manifold is locally flat with respect to the size of the noise sphere.

(3. high-SNR = 1 log2 4 2 1 σH + 2 2 σN + 1 log2 4 4π e . substituting (3.4.16) evaluates to CPhase. SP. Given For a non-white host signal.18) is negligible and the result is identical to (3.18) The difference between (3. SP. SP. 2L (3.14) into (3.17) converges to zero and (3. Again for large L.13) and (3. colored = 1 log2 2L  2 σN 2 ·Γ L 4 = ∏ i =1 L/2 σ2 1 + pi H 2 2 σN + 1 1 log2 (16π ) + log2 4 L  +1  =   Γ ( L + 1) 4 L L/4 (3. colored = 1 2L L/2 i =1 ∑ log2 σ2 1 + pi H 2 2 σN + 1 log2 4 4π e .3.4.4. CPhase. high-SNR = 1 log2 4 2 1 σH + 2 2 σN + 1) ≈ πL 2 L 4e L 4 + 1 log2 4 4π e + 1 [log2 (πL) + 1] . (3. SP. the augend 1 2 in (3.12). high-SNR.16) simplifies to CPhase. 413] f Yp ( x )| X =rH e j0 = f Yr . Exact Capacity Using Input/Output Phase Angles ˜ ˜ Assuming a deterministic input signal X = rH e j0 . that means ∃i : pi = 1.19) which for large L again using Sterling’s approximation simplifies to CPhase.20) Using the scaled host signal PSD p( f ) of Section 3.6 results in CPhase. and numerical simulations show that for L > 70 the error in C induced by the Sterling approximation is below 10−4 . the joint PDF of Y is [65. SP. p. high-SNR. ϕ)| X =rH e j0 = ˜ ˜ ˜ ˜ 2 r2 + rH − 2rrH cos( ϕ) r exp − 2 2 πσN σN 33 .4. colored = 1 = 2 fs ˆ f s /2 f =0 log2 σ2 1 + p( f ) H 2 2 σN df + 1 log2 4 4π e .15) results in 2  ∏ L/2 2π pi σH + 1 log2  i=1 L  2 L σN 4 πL 2 2 2 σH 2 σN 1. high-SNR. Watermarking Based on Phase Modulation as Γ( L 4 and (3.18) is smaller than 10−2 bit for L > 480. 3.17) Sterling’s approximation is asymptotically accurate.Yϕ (r.1.17) and (3. the last term on the right-hand side in (3.

244] ˆ 2π ˆ 2π f Φ. ϕ < 2π else. This corresponds to M-ary phase shift keying (PSK) using a constellation with M symbols.5. The complex output sample y has the conditional PDF f Y | X (y| xm ) = ˜ ˜ ˜ ˜ ˜ ˜ 1 | y − x m |2 exp − 2 2πσ 2σ2 34 . A similar derivation leading to the same result can be found in [66].21) 2 The required integrations can be carried out numerically. ϕ)dr = = r2 1 exp − H 2 2π σN + rH cos( ϕ) 2 4πσN exp − 2 rH rH cos( ϕ) . Θ) can then be calculated using [64. ϕ ) we also obtain the joint PDF f Φ. sin2 ( ϕ − ϕ ) erfc − 2 σN σN + rH cos( ϕ − ϕ ) 2 4πσN With f Θ|Φ ( ϕ| ϕ ) · f Φ ( ϕ ) = f Φ. The mutual information I (Φ. MI-angle = I (Φ. ϕ ) = 0 1 2π f Θ|Φ ( ϕ | ϕ ) if 0 ≤ ϕ.Θ ( ϕ . p. and the capacity is derived in [67. f Φ. Exact Capacity Using Discrete Input/Output Symbols Assume the input phase Φ consists of M discrete and uniformly spaced values.Θ ( ϕ.Chapter 3.Θ ( ϕ .9) with only half of the DFT coefficients being independent. Θ). We also note that f Yϕ (which is the same as f Θ ) given a deterministic input ϕ is the conditional PDF f Θ|Φ ( ϕ| ϕ ) and write f Θ|Φ ( ϕ | ϕ ) = r2 1 exp − H 2 2π σN + exp − 2 rH rH cos( ϕ − ϕ ) .Yϕ (r. Watermark Capacity in Speech and the marginal PDF f Yϕ ( ϕ)| X =rH e j0 is [66] ˜ ˆ f Yϕ ( ϕ)| X =rH e j0 = ˜ ∞ r =0 f Yr . 3. (3. Θ) = dϕ dϕ. 68].4. sin2 ( ϕ) erfc − 2 σN σN Provided the circular symmetry of the arrangement. ˜ ˜ We assume M discrete complex channel inputs xm .Θ ( ϕ. ϕ) log2 f Φ ( ϕ ) f Θ ( ϕ) ϕ =0 ϕ =0 Using (3. We briefly restate this derivation since it is the basis for the capacity derivation in Section 3. ϕ) I (Φ.4. the capacity evaluates to 1 CPhase.6. changing the assumption of a ˜ ˜ deterministic X = rH e j0 to a deterministic X = rH e jϕ corresponds to replacing ( ϕ) by ( ϕ − ϕ ) in the above two equations and represents a simple rotation.

log2 ˜∞ ˜ ˜ again with only half of the DFT coefficients X being independent and where y=−∞ . Watermarking Based on Phase Modulation N ˜ with σ2 = 2 and xm = rH exp j 2πm .4.sym.4.. discrete = 1 1 log2 ( M ) − 2 2M EW ˜ log2 M −1 m =0 ∑ M −1 n =0 ∑ exp − ˜ ˜ ˜ ˜ | w + x m − x n |2 − | w |2 2 σN .e.sym. dyb dy a . . the capacity is maximum if the channel inputs 1 ˜ xm are equiprobable with probability f X ( xm ) = M and the supremum in the above term ˜ ˜ 2 ˜ ˜ ˜ ˜ can be omitted. Exact Capacity Using Input/Output Complex Symbols We can generalize the capacity derivation of Section 3. . We further substitute Y := W + Xm with W ∼ CN (0.5 from a discrete set of input ˜ symbols to a continuous input distribution with X = rH e jΦ as defined in Section 3. MI-c. resulting in f Y | X (y| xm ) = f W (w) = ˜ ˜ ˜ ˜ ˜ ˜ and CPhase. dy = ya =−∞ yb =−∞ . σN ).22) The capacity can be calculated numerically using Monte Carlo simulations by gen˜ erating a large number of realizations for W and calculating and averaging C. i. discrete M −1 1 = sup ∑ f X ( xm ) ˜ ˜ 2 f ˜ ( x ) m =0 ˜ X ∞ ¨ ˜ y=−∞ f Y |X (y| xm )· ˜ ˜ ˜ ˜ f Y | X (y| xm ) ˜ ˜ ˜ ˜ M −1 ∑ n =0 f X ( x n ) f Y | X ( y | x n ) ˜ ˜ ˜ ˜ ˜ ˜ ˜ dy..1 35 . MI-c.3. dy ˜ denotes an improper double integral over the two dimensions of the two-dimensional ˜∞ ´∞ ´∞ ˜ ˜ variable y. (3. 3.4. . y=−∞ .sym.. The capacity of the discrete-input continuousM output memoryless channel is σ2 CPhase. . discrete = 1 M −1 1 log2 ( M ) − ∑ 2 2M m=0 ∞ ¨ M −1 ˜ ˜ ˜ ˜ | w + x m − x n |2 − | w |2 log2 ∑ exp − 2 σN n =0 ˜ 1 | w |2 exp − 2 2 πσN σN ˜ f W ( w ) dw ˜ ˜ ˜ w=−∞ ˜ where the integration can be replaced by the expected value with respect to W and CPhase..4. Given the desired signal constellation. The ˜ ˜ variables xm and xn are discrete and uniformly distributed.sym. MI-c. . . MI-c. and one can sum over all its M values.6. discrete ˜ denotes a phase modulation watermark capacity derived using the mutual information (MI) between a set of discrete complex input symbols and the complex output symbols.. The subscript Phase.

and the supremum in the above term can be omitted. the capacity is maximum if the channel inputs ˜ x are uniformly distributed on the circle with radius rH (with a density f X ( x ) as ˜ ˜ described above). σN ).Chapter 3.sym.Xb ( a. The capacity of the continuous-input continuous-output memoryless channel is [64. MI-c. = − f X ( xm )· ˜ ˜ 2 ˜ xm =−∞  ∞ ∞ ¨ ¨ ˜ ˜ ˜ ˜ | w + x m − x n |2 − | w |2 log2  f X ( xn ) exp − ˜ ˜ 2 σN ˜ w=−∞ ˜ xn =−∞ ˜ | w |2 1 exp − 2 2 πσN σN ˜ ˜ where the integration over w can be replaced by the expected value with respect to W and ∞ ¨ 1 CPhase. MI-c. Watermark Capacity in Speech ˜ with a complex channel input x and f X ( xn ) = f Xa . b) = ˜ ˜ 1 δ 2πrH a2 + b2 − rH ˜ as derived in (3. We further 2 ˜ ˜ ˜ ˜ substitute Y := W + Xm with W ∼ CN (0. resulting in f Y | X (y| xm ) = f W (w) = ˜ ˜ ˜ ˜ ˜ ˜ and ∞ ¨ 1 CPhase. p. MI-c. = − f X ( xm )· ˜ ˜ 2 ˜ x =−∞  m ∞ ¨  ˜ ˜ ˜ ˜ | w + x m − x n |2 − | w |2 EW log2  f X ( xn ) exp − ˜ ˜ ˜ 2  σN ˜ xn =−∞ ˜ ˜ ˜ d x n  f W ( w ) dw d x m ˜ ˜    ˜ ˜ d xn  d xm  36 . 1 1 ˜ ˜ = sup I ( X. Y ) = sup 2 f ˜ (x) 2 f ˜ (x) ˜ ˜ X X ∞ ¨ f X ( xm )· ˜ ˜ f Y | X (y| xm ) ˜ ˜ ˜ ˜ ˜ xm =−∞ ∞ ¨ ˜ y=−∞ f Y |X (y| xm ) log2 ˜ ∞ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ xn =−∞ f X ( xn ) f Y | X ( y | xn ) d xn ˜ ˜ dy d x m . The complex output sample y has the conditional PDF f Y |X (y| x ) = ˜ ˜ ˜ ˜ σ2 ˜ ˜ 1 | y − x |2 exp − 2 2πσ 2σ2 N ˜ with σ2 = 2 and x = rH e jϕ .sym. Given the desired signal constellation. 252] CPhase.7).sym.

Comparison of Derived Phase Modulation Capacities The derived phase modulation capacities of Section 3.4. 1 =− 2 ∞ ¨ f X ( xm )· ˜ ˜ ˜ xm =−∞ EW ˜ ˜ and the same for xm 1 CPhase.3. e. MI-c. MI-c.4.sym. S 3. MI-angle of (3.7.sym.sym. (3. whereas in the later case only the phase angle is known. resulting in CPhase.sym.23) are shown in Figure 3.6. i. = − Exm ˜ 2 EW ˜ log2 Exn ˜ exp − ˜ ˜ ˜ ˜ | w + x m − x n |2 − | w |2 2 σN ˜ ˜ ˜ ˜ | w + x m − x n |2 − | w |2 2 σN ˜ d xm log2 Exn ˜ exp − .23) The capacity C can be calculated numerically using Monte Carlo simulations by ˜ ˜ ˜ generating a large number of realizations for xm . 1 =− 4πrH  2π ˆ 2π ˆ ∞ ¨ f W (w)· ˜ ˜ ˜ ϕm =0 w=−∞ For numerical integration it is further possible to replace the integrals over ϕ by summations.  f ( ϕ) dϕ ≈ S −1 i =0 ∑ f (iϕ∆ ) ϕ∆ with ϕ∆ = 2π and. w and xn . MI-c.6 are shown for different channel SNR CPhase. The differences of the capacities relative to It is the upper bound on the information rate when ignoring any watermarking-induced constraint. The reason for CPhase.sym.. being slightly larger than CPhase. S = 256.e. MI-c. and calculating and averaging C. or by numerical integration. MI-c.7.2 to Section 3.4. 2 σH 2 σN in Figure 3. The figures also include the general capacity of the power-constrained AWGN channel given by the Shannon–Hartley theorem [64] CShannon = σ2 1 log2 1 + H 2 2 σN . 37 . we can replace the ˜ ˜ ˜ two double integrals over x by single integrals along the circles. To simplify the numerical integration.4.21) is that in the former case the detector has access to both the amplitude and the phase angle of the received symbol. Watermarking Based on Phase Modulation ˜ and the same for xn CPhase. given ˜ that f X ( x ) is zero for all x in the complex plane except on a circle. of (3.g. 2π ˆ ϕ =0  log2  ϕ n =0 ˜ 1 | w + rH exp − 2πrH e jϕm − rH 2 σN e jϕn |2 ˜ − | w |2  ˜ dϕn  dw dϕm .

MI−angle.sym. −10 0 10 Channel SNR (in dB) 20 30 40 Figure 3. MI−c.5 2 1. 38 .7.5 0 −20 CPhase. high−SNR CPhase.1 −20 CPhase.sym.6.: Comparison of the derived phase modulation watermark capacity expressions at different channel SNR 2 σH 2 σN (see also Figure 3.sym. high−SNR.sym. Watermark Capacity in Speech 4 CShannon 3.sym. MI−c.05 −0. L=200 CPhase.sym.25 0. MI−c. in bits/sample) 0.15 0. SP. 0. high−SNR. . MI−angle CPhase. SP.: Watermark capacity difference between the derived expressions and CPhase.Chapter 3. MI−c.3 CShannon (difference to CPhase. Watermark capacity C −10 0 10 Channel SNR (in dB) 20 30 40 Figure 3. discrete (M=8) CPhase. discrete (M=8) CPhase.2 0. MI−angle CPhase. MI−c.05 0 −0. SP.. high−SNR. high−SNR CPhase. Sterling CPhase.5 Watermark capacity C (in bits/sample) 3 2. SP. MI−angle.1 0. high−SNR.5 1 0.. Sterling CPhase. MI-c. L=200 CPhase.7).

Phase Watermarking in Voiced Speech Given the large capacity of phase modulation based watermarking in unvoiced speech. 70]. For example. we randomized the phase angle of the complex DFT/STFT coefficients and examined the perceptual degradation as a function of the window length. 69.8 shows the long-term high-frequency-resolution DFT spectrum (using a measurement window length of 1 s) of the original signal and the phase-randomized signal created using the 137 ms window. independent of the position of the analysis window. however. and also the pitch would not be constant 39 . Using a long window corresponds to changing the phase less frequently in time. 50. the window length determines the time/frequency resolution trade-off of the STFT. perceptually transparent phase manipulations are much more difficult to achieve.4. and 2) the spectral envelope of the signal. It is important to maintain 1) the phase continuity of the harmonic components across the analysis/synthesis frames. Watermarking Based on Phase Modulation 3. Phase Modulation with the Short-Time Fourier Transform (STFT) In this experiment. Using the same window as synthesis window.4. and also results in an increased frequency resolution.4. If the phase relation between the different DFT bins is not maintained. it would be attractive to extend this approach also to voiced speech. the phase randomization leads to significant perceptual distortion and makes the speech signal sound like produced in a ‘rain barrel’. To evaluate the suitability of the signal’s phase for watermarking. For real speech signals such long windows lead to unacceptable smearing in the time domain. We explored possibilities to manipulate or randomize the signal phase in speech while maintaining the two aforementioned perceptual requirements for voiced speech. Using a short analysis/synthesis window with a window length of 137 ms or 20 pitch cycles. since the signal’s phase is indeed perceivable [49.1.3. The perceptual degradation decreases when using much longer analysis/synthesis windows in the order of magnitude of 200 pitch cycles.8.8. 3. including the spectral peaks of the harmonics and the deep valleys in-between them. It follows a brief outline of the performed experiments. It is apparent that the phase randomization leads to a widening of the spectral peaks due to the limited frequency resolution of the analysis/synthesis window. the inverse transform is not a pure sinusoid anymore but a mixture of several sinusoids with adjacent frequencies and arbitrary phase relations. the sum of the inverse DFT transforms of the individual frames (‘overlap/add synthesis’) results in perfect reconstruction of the original signal [71]. which will show that even under idealistic assumptions it is not possible to fully randomize the phase of voiced speech without severely degrading perceptual quality or resorting to perceptual masking. However. a pure sinusoidal signal is typically represented by several DFT bins due to the spectral shape of the analysis window. In voiced speech. Figure 3. a fully periodic synthetic speech signal with constant pitch of 146 Hz was transformed into DFT domain using a pitch-synchronous short-time Fourier transform (STFT) with a ‘square root Hann’ analysis window with 50 % overlap and a window length that is an even multiple of the pitch cycle length.

the frequency coefficients are not 40 .3). for such long periods. if a DFT coefficient is masked anyhow. then there is no need to maintain the magnitude of the coefficient and one can proceed with the general masking-based approach discussed in Section 3. and the STFT representation contains twice as many coefficients as the time domain signal. To verify this. With a threshold of 6 dB (randomizing all coefficients 6 dB below the masking curve. A threshold of 0 dB (randomizing all coefficients below the masking curve) resulted in 83 % of the coefficients being randomized and a certain roughness in the sound.3. namely those peaks where a part of the initial peak is masked. We used as test signal a male speech sustained vowel ‘a’ at a sampling frequency of 16 kHz and a listening level of 82. Watermark Capacity in Speech 70 Long−term DFT magnitude (in dB) 60 50 40 30 20 10 0 0 100 200 300 Original signal Signal with randomized phase 400 500 600 Frequency (in Hz) 700 800 900 1000 Figure 3.: Long-term high-frequency-resolution DFT spectrum with a measurement window length of 1000 ms of 1) a constant pitch periodic speech signal and 2) the same signal with randomized STFT-domain phase coefficients (using an analysis/synthesis window length of 137 ms). and the previous roughness in the sound was barely noticeable anymore.9) 64 % of the phase coefficients were randomized. shown in Figure 3. and 50 % overlap) the frequency masking curve using the van-de-Par masking model [54] (see also Section 3. Independent of window length and masking. we calculated for every frame of the STFT representation (using a window length of 27 ms or approximately four pitch cycles.8. we randomized only the phase of those DFT coefficients whose magnitude is a certain threshold below the masking curve. the speech signal is oversampled by a factor of two. Then.5 dBSPL . However. which again results from a widening of spectral peaks (measured over a longer window). with using a 50% overlap to avoid discontinuities at the frame boundaries.Chapter 3. As a consequence for watermarking. The ‘rain barrel effect’ can be eliminated by modulating only those coefficients that are perceptually less important.

the ELT is a perfect reconstruction cosine-modulated filterbank defined by the same basis functions. we denote with modifying the MLT or ELT phase the replacement of the original transform domain subband signals by watermark subband signals with identical short-term power envelopes. where M denotes the transform size or number of filter channels. The orthogonality of the ELT and MLT assures 41 .3.8. Watermarking Based on Phase Modulation 80 Short−term DFT magnitude (in dBSPL) 70 60 50 40 30 20 10 0 0 1000 2000 3000 4000 5000 Frequency (in Hz) 6000 7000 8000 Signal spectrum Masking curve Masking curve − 6dB Randomized coefficients Figure 3. 74]. The idea for watermarking is to use an ELT with few coefficients (filter channels) and long windows.4. but generalized from a fixed window length L = 2M to a variable window length that is typically larger than 2M. we discuss the use of extended and modulated lapped transforms (ELT and MLT. Just as the more popular MLT.1.1.4. 74. In the following.2. since this preserves the spectral envelope of the host signal. 3. [72.: Randomization of the phase of those coefficients DFT that are masked and have a magnitude of less than 6 dB below the masking curve. 73. This is discussed in the following subsection. As discussed in Section 3.9.4. Extended Lapped Transform (ELT) The ELT is a critically sampled orthogonal lapped transform with an arbitrary window length L [73. and it is not possible to independently randomize all phase coefficients and re-obtain the same coefficients after resynthesis and analysis. Phase Modulation with Lapped Orthogonal Transforms The use of a critically sampled orthogonal transform for analysis and synthesis allows to obtain a perfect reconstruction of the embedded watermark data. An approach to solve this problem is the use of a critically sampled orthogonal transform instead of the STFT. 75]) for watermarking by phase modulation. and to modify again the signal’s phase in the frequency domain. independent.

With As = 50 dB this results in a window length of 768 ms (L = 6144). namely those with the largest power. there is still a significant widening of the spectral peaks (see Figure 3. makes the speech signal more similar to colored noise. 75]. We transformed a synthetic stationary constant pitch speech signal into the frequency domain using an ELT as described in [74] with M filter channels and with a prototype filter (window shape) with stopband attenuation As .10). Decreasing M. which is a commonly used window length for speech signal coding.6 Hz). We evaluate this approach with a small experiment. the perceptual results become almost identical for equal effective window lengths (considering at the ELT window only its time-domain ‘main lobe’). the results are very similar. At a sampling frequency of 8 kHz. A low number of coefficients (which equals the update interval or the hop size) and long windows implicate that at any given time there is a large number of overlapping windows. But even with a frequency resolution as high as this. an MLT with M = 64 has a window length of 16 ms. designed using the Kaiser window approach of [76]. L = 2M. Just as in the STFT case. There are fundamentally contradicting requirements on the parameter M. and in order to accurately maintain the sharp spectral peaks in voiced speech. M must be large.Chapter 3. such a long window length results in unacceptable smearing in the time domain. and even though there is an overlap of 24 different windows at any given time. to obtain approximately ten subbands in-between two harmonics of a speech signal with a pitch of 150 Hz. then also the sum of the overlapping windows has the desired magnitude spectrum. Using this window length and again randomizing the coefficients results in a very noise-like 42 . the randomized MLT sounds slightly more noisy than the randomized ELT. For example. Watermark Capacity in Speech that the embedded information is recoverable. The window length L is a result of M and As . the number of filter channels determines the frequency resolution of the filter bank. On the one hand. Modulated Lapped Transform (MLT) Reducing the window length L of the ELT to twice the transform size M. Applying the same phase randomization as with the ELT. This was confirmed with a small experiment that kept 10 % of the M = 256 coefficients unmodified. For identical M. The modified signal was then transformed back to time domain. one must set M = 256 at fs a sample rate of f s = 8 kHz (resulting in a subband width of 2M ≈ 15. Since there is no direct notion of phase in the frequency domain of the ELT—it is a real-valued transform—we measured the variance in each filter channel and replaced the original signal in each frequency channel by Gaussian noise with equal variance. Additionally. no matter of the position the spectrum being measured at. it should be possible to recreate the magnitude spectrum of voiced speech while continuously manipulating the phase spectrum. As such. the ‘rain barrel effect’ at large M can be eliminated by not randomizing the perceptually most important subchannels. and using a sine window results in the modified lapped transform (MLT) [74. This widening shows itself with the same ‘rain barrel effect’ as when using the STFT. For large M. The hypothesis for this experiment is that if each window has the desired magnitude spectrum. for example to M = 64 and a window length of 112 ms (As = 30 dB).

3.4. Watermarking Based on Phase Modulation
80 Long−term DFT magnitude (in dB) 70 60 50 40 30 20 10 0 0 100 200 300 400 Frequency (in Hz) 500 600 700 Original signal Signal with randomized ELT subbands

Figure 3.10.: Constant pitch speech signal with original and randomized ELT coefficients using 256 filter channels and a 768 ms-window. The randomization leads to a widening of the spectral peaks. (The plotted DFT spectrum is measured using a 1000 ms-window.) speech signal. This effect can again be mitigated by randomizing only the perceptually less important coefficients. However, even when keeping 50 % of the coefficients unmodified (again those with the largest power), a perceptual distortion is still audible. Because of the short window length, the distortion is audible as noise. It is expected that the use of a masking model instead of the power criterion to select the relevant coefficients would decrease the audibility of the distortion. 3.4.8.3. Conclusion We conclude that in voiced speech, in contrast to unvoiced speech, it is not possible to fully randomize the signal’s phase without 1) severely degrading perceptual quality or 2) resorting to perceptual masking. Time-frequency transforms allow direct access to the signal’s phase, but in order to avoid discontinuities at frame boundaries, block transforms can only be used with overlapping windows. While the STFT has the most intuitive notion of phase, it is an oversampled representation whose phase coefficients are not independent and, thus, difficult to embed and re-extract. This problem can be solved with ELTs and MLTs, but they are subject to the same fundamental trade-off between time and frequency resolution as the STFT. Perceptual artifacts can be eliminated by restricting the phase modulations to masked signal components. However, this is not a useful approach for watermarking, since masked components can be replaced altogether instead of only their phase being modulated. Note that phase modifications are possible if performed at a much lower rate than

43

Chapter 3. Watermark Capacity in Speech presented herein, for example by randomizing only the initial phase of each voiced segment. Several such methods were previously proposed but are limited in the achievable data rate [14, 47, 48, 45, 46].

3.5. Experimental Comparison
In this section, we compare the derived watermark capacities of the ideal Costa scheme (ICS), watermarking based on frequency masking, and watermarking based on phase modulation.

3.5.1. Experimental Settings
The experimental settings for each method were chosen as follows. Ideal Costa Scheme The capacity in (3.1) was evaluated using two different scenarios, 1) assuming a fixed mean squared error distortion, CICS, MSE=-25dB :
2 σW 2 σH

dB

= −25 dB,

and 2) assuming a fixed watermark to channel noise ratio (WCNR), CICS, WCNR=-3dB :
2 σW 2 σN

dB

= −3 dB.

While the first case corresponds to a fixed perceptual distortion of the clean watermarked signal (before channel), the second case corresponds to a fixed perceptual distortion of the noise-corrupted watermarked signal (after channel, assuming that the watermark signal is masked by the channel noise). Frequency Masking For watermarking based on frequency masking, the watermark 2 capacity CMask of (3.3) is used, with the watermark power σW,mask being κ = −12 dB below the masking threshold of the masking model. Phase Modulation In principle, the capacity of watermarking by phase modulation 2 2 2 is given by (3.23) and can be expressed as a function CPhase (σH , σN ), with σH being 2 equivalent to rH . However, two additional factors need to be taken into consideration when dealing with real-world speech signals: First, as we showed in Section 3.4.8, the phase modulation is possible in unvoiced speech only, and second, the energy in a speech signal is unevenly distributed between voiced and unvoiced regions. For the experiment, it was assumed that a fraction γ of the speech signal is unvoiced speech, the average power (or variance) of which is a multiplicative factor ρ above or below 2 the average speech signal power σH . The watermark capacity then evaluates to
2 2 CPhase, speech = γ · CPhase (ρσH , σN ).

(3.24)

44

3.5. Experimental Comparison
3 CShannon Watermark capacity C (in bits/sample) 2.5 CICS, MSE=−25dB CICS, WCNR=−3dB CMask 2 CPhase, ATCOSIM CPhase, TIMIT 1.5

1

0.5

0 −20

−10

0

10 Channel SNR (in dB)

20

30

40

Figure 3.11.: Watermark capacity for quantization, masking and phase modulation based speech watermarking. The abbreviations are explained in Section 3.5.1.

We determined the parameters γ and ρ with a small experiment, again using ten randomly selected utterances (one per speaker) of the ATCOSIM corpus (see Chapter 7). After stripping silent regions (defined by a 43ms-Kaiser-window-filtered intensity curve being more than 30 dB below its maximum for longer than 100ms), each utterance 2 was individually normalized to unit variance, i.e., σH = 1. Then, a voiced/unvoiced segmentation was performed using PRAAT [77], and the total length and average power measured separately for all voiced and all unvoiced segments. The result is a fraction of γ = 47 % being unvoiced, and the average power in unvoiced being −4.6 dB below the overall signal power (ρ = 0.35). As a consequence, the unvoiced segments contain 16 % of the overall signal energy. Performing the same analysis on the 192 sentences of the TIMIT Core Test Set [78] results in γ = 38 %, ρ = 0.16 (−8 dB), and only 6 % of the total energy located in unvoiced segments. This difference can be attributed to the slower speaking style in TIMIT compared to the ATCOSIM corpus. The resulting watermark capacities based on the two databases and using (3.24) are denoted CPhase,ATCOSIM and CPhase,TIMIT .

3.5.2. Results
The watermark capacities CICS , CMask and CPhase for quantization, masking, and phase modulation based speech watermarking are shown in Figure 3.11. As in Figure 3.6, the plot includes for comparison the Shannon capacity CShannon of the AWGN channel.

45

Chapter 3. Watermark Capacity in Speech

3.6. Conclusions
Watermarking in perceptually relevant speech signal components, as it is required in certain applications that need robustness against lossy coding, is best performed using variants of the ideal Costa scheme, and its capacity CICS is shown in Figure 3.11. In contrast, in the application of interest in this work—speech watermarking for analog legacy system enhancement—watermarking in perceptually irrelevant speech signal components is possible. Then, perceptual masking and the auditory system’s insensitivity to phase in unvoiced speech can be exploited to increase the watermark capacity (see CMask and CPhase in Figure 3.11). Using auditory masking leads to a significant capacity gain compared to the ideal Costa scheme. For high channel SNRs the masking-based capacity is larger than all other watermark capacities in consideration. The watermark capacity in the phase of unvoiced speech can be derived using the sphere packing analogy or using the related concept of mutual information. In unvoiced speech and for high channel SNRs, the capacity is roughly half of the Shannon capacity plus half a bit per independent sample. The overall watermark capacity is signal-dependent. For fast-paced air traffic control speech at low and medium channel SNR, which is the application of interest in this work, the phase modulation approach outperforms the masking-based approach in terms of watermark capacity. The remainder of this thesis focuses on the phase modulation approach. Some form of auditory masking is used in many, if not even most, state-of-the art audio watermarking algorithms, and is a well-explored topic [13]. In contrast, a complete randomization of the speech signal’s phase has, to our best knowledge, not been previously proposed in the context of watermarking. It offers an opportunity for a novel contribution to the area of speech watermarking and opens a window to implementations that could possibly outperform current state-of-the-art methods. Note that the phase modulation approach and the conventional masking-based approach do not contradict each other in any significant way. In fact, it is expected that a combination of the two approaches will lead to a further increase in watermark capacity.

46

“Speech watermarking for analog flat-fading bandpass channels.” IEEE Transactions on Audio. Kleijn. 2009. we present experimental results for embedding a watermark in narrowband speech at a bit rate of 450 bit/s. G. and Language Processing. amplitude modulation and additive noise. The watermark signal is embedded in a frequency subband. and W. In light of the potential application to analog aeronautical voice radio communication. but also recovers the watermark data in the presence of non-linear phase and bandpass filtering. The adaptive equalizationbased watermark detector not only compensates for the vocal tract filtering. Speech. We derive several sets of pulse shapes that prevent intersymbol interference and that allow the creation of the passband watermark signal by simple filtering. which facilitates robustness against bandpass filtering channels. A marker-based synchronization scheme robustly detects the location of the embedded watermark data without the occurrence of insertions or deletions. Kubin. Hofbauer.Chapter 4 Watermarking Non-Voiced Speech We present a blind speech watermarking algorithm that embeds the watermark data in the phase of non-voiced speech by replacing the excitation signal of an autoregressive speech signal representation. 47 . revised and resubmitted. making the watermarking scheme highly robust. Parts of this chapter have been published in K. B.

the following chapter develops a practical watermarking scheme that is aimed at analog legacy system enhancement. .3.1 proposes the watermarking scheme. the combination of established watermarking theory with a well-known principle of speech perception leads to a substantial improvement in theoretical capacity. and secondly. Consequently.4 and 4. we discuss our results and draw conclusions in Section 4. In the remainder of this chapter. the transmission channel depends on the application and is in most cases not only a simple AWGN channel.1. which models the resonances of the vocal tract and is widely used in speech coding.1. c P (n)]T and with a white Gaussian excitation e(n) with time-variant gain g(n). 60. Finally. experimental results are presented in Section 4. the human ear is only partially insensitive to phase modifications in speech. Speech Signal Model Our method is based on an autoregressive (AR) speech signal model. The new method also addresses the two major difficulties with the theoretical approach: Firstly. 4.1. this section presents a practical speech watermarking scheme that is based on this principle. speech synthesis. 4.2. also addressing many practical issues such as synchronization and channel equalization. . 79]). Although the method is of general value. The transmission channel attacks are listed in the left column of Table 4. 48 . Theory In Chapter 3 we have shown the large theoretical capacity of watermarking by replacing the phase of the host signal.5. After a discussion of certain implementation aspects in Section 4.1. Motivated by this finding. . As an approximation. We assume in the following all signals to be discrete-time signals with a sample rate f s .1. we consider certain temporal sections of a speech signal s(n) as an outcome of an order P autoregressive signal model with time-varying predictor coefficients c(n) = [c1 (n). 4. many of our design choices as well as the choice of attacks that we consider are motivated by the application of watermarking to the aeronautical voice radio communication between aircraft pilots and air traffic control (ATC) operators as described in detail in Chapter 6. Transmission Channel Model We focus on analog legacy transmission channels such as the aeronautical radio or the PSTN telephony channel.2. This is a commonplace speech signal representation. . Based on this finding. Watermarking Non-Voiced Speech As shown in Chapter 3. we assume the channel to be an analog AWGN filtering channel with a limited passband width. and speech recognition (cf. Section 4. such that s(n) = m =1 ∑ P c m ( n ) s ( n − m ) + g ( n ) e ( n ). [58.Chapter 4.

1. The signal thus forms a hidden data channel within the speech signal.1. and the embedding band width and position must be selected such that they fully lie within the passband of the transmission channel. 4. An important property of the signal model is that there is no perceptual difference between different realizations of the Gaussian excitation signal e(n) for non-voiced speech [42].1 is accurate only for non-voiced speech. the replacement of e(n) by e(n) can take place only when the speech signal is not voiced. since the model as defined in Section 4.1. Then.: Watermark embedding: Non-voiced speech (or a subband thereof) is exchanged by a watermark data signal. only a subband bandpass component ePB (n) of e(n) can be ˆ replaced by the data signal e(n).1.4. the predictor coefficients c(n) are calculated and regularly updated using linear prediction (LP) analysis. This is realized by adding an unmodified secondary path for the stopband component sSB (n). Theory mWM Non-Voiced SNV Detection sPB sFB LP Analysis sSB Hc-1 Gain Extraction Watermark Signal Gen.4. The signal sFB (n) is − decomposed into two subbands (without downsampling).1 shows an overview of the embedding scheme. and a LP error filter (Hc 1 ) 49 . We denote with ‘non-voiced speech’ all speech that is not voiced. The stopband component eSB (n) = e(n) − ePB (n) must be kept unchanged. In the spectral domain. This accommodates the bandpass characteristic of the transmission channel. since in many practical applications the exact channel bandwidth is not known a priori. It is possible to exchange the white Gaussian excitation signal e(n) by a white Gaussian ˆ data signal e(n) that carries the watermark information. comprising unvoiced speech and pauses. Figure 4. First.1.1). w g r c Hc SNV ˆsF ˆs sPB sFB Figure 4. Watermark Embedding Concept The presented speech signal model facilitates the embedding of a data signal by replacing the AR model phase of the host speech signal (see Section 3.3. The speech parts that are voiced and should remain unmodified are detected based on their acoustic periodicity using an autocorrelation method [80]. the voiced and the non-voiced time segments of the fullband input speech signal sFB (n) are identified and indicated by a status signal SNV . The hidden data channel is restricted both in the temporal and in the spectral domain. ˆ In the temporal domain.

4. since otherwise the LP error filter would try to compensate the bandpass characteristic of sPB (n). While the system-inherent attacks are deterministic and known to the watermark embedder. the watermark channel carrying the hidden data signal is subject to a number of channel attacks. The LP analysis is based on sFB (n).1. resulting in the passband error signal r (n) = g(n)e(n). Watermark Signal Generation In the following. the passband signal is resynthesized using the original error signal r (n) and the LP synthesis filter Hc . Watermarking Non-Voiced Speech Table 4. In voiced speech segments. It is subject to conditions that result from the perceptual transparency. we describe the composition of the real-valued watermark signal w(n). resulting in ˆ the watermarked speech sFB (n). Hidden Data Channel Model The watermark detector is ultimately interested in the payload watermark mWM .3 to counteract the channel attacks defined in Table 4. They are either inherent to the embedding procedure of Section 4. 4. 4.1.3. In principle. The unmodified stopband ˆ component sSB (n) is added to the watermarked passband signal sPB (n). or induced by the transmission channel. the non-voiced speech segments (including unvoiced speech and inactive speech) are resynthesized from a spectrally shaped watermark data signal w(n). In contrast.1. which are listed in Table 4. data transmission rate and robustness requirements. 50 . However. they are randomly varying quantities and unknown to the watermark detector.1.1. on which the original gain g(n) of the error signal and the LP synthesis filter Hc is applied.5.: Transmission Channel and Hidden Data Channel Model Transmission Channel Attacks DA/AD conversion (resampling) Desynchronization AWGN Bandpass filtering Magnitude and phase distortion System-Inherent Attacks Random channel availability SNV Time-variant gain g Time-variant all-pole filtering Hc Additive stopband signal sSB with predictor coefficients c(n) is applied on the passband component sPB (n).Chapter 4. The following sections further refine the embedding concept of Section 4. the payload watermark mWM must be transformed into a watermark signal w(n) using a modulation scheme that is either robust against the channel attack or enables the detector to estimate and revert the attack.1.1.

LAc preamble symbols mAc for marking active frames (Chapter 5).3. However. we previously showed that violating the requirement of Gaussianity does not severely degrade the perceptual quality [3].1.1. LTr training symbols mTr for signal equalization in the receiver (Section 4. and Gaussian. Pulse Shaping The aim of pulse shaping is to transform the data sequence A into a sequence of samples w(n) that 51 .2).1. When using the signal model presented in Section 4.p M mSh ab+1.p mSh N Figure 4.p mSh WM ab. active frame (Ac) and frame identification (ID) regions.1.: Frame and block structure including watermark (WM).p the individual terms of A as defined in the next section.1.5.1.2. we use in the implementation of Section 4. trading data rate against robustness. While mWM .1.p interleaved with the spectral shaping symbols mSh . which carry the data symbols ab. Bit Stream Generation The encoded payload data is interleaved with additional auxiliary symbols in order to facilitate the subsequent processing steps. training (Tr). The symbols are real-valued. LID preamble symbols mID for sequentially numbering active frames with a consecutive counter (Chapter 5). As will be discussed in Section 5. and LSh spectral shaping symbols mSh for pulse shaping and fulfilling the bandpass constraint (Section 4.2. Theory Voiced Active Frame Non-Voiced Voiced Inactive Frame LD Tr Ac ID Tr WM Tr WM ab-1. and denote with A the sequence of these interleaved data symbols and with ab.4.5.2 single-level bipolar one-dimensional symbols. the set of pseudo-random data.2.7). 4. unit variance. The spectral shaping symbols mSh are multi-level real scalars and not part of the data sequence A. training and preamble symbols must consist of independent symbols. we use a fixed frame grid with frames of equal length LD .1.1. and have an amplitude probability distribution that is zero mean. mAc and mID could in principle be multi-level and multi-dimensional symbols. Each frame that lies entirely within a non-voiced segment of the input signal is considered as active frame and consists of LWM watermark symbols mWM representing encoded payload data.5. mTr . 4. and interleaved with each other as shown in Figure 4.

has a defined boxcar (bandpass) power spectral density with lower and upper cut-off frequencies f L and f H that lie within the passband of the transmission f channel and with f H ≤ 2s . the values of the samples mSh are set such that the resulting signal w(n) has the desired spectral shape. any set of samples A defines a unique signal w(t) with the desired properties. and the theories of non-uniform sampling and sampling and interpolation of band-limited signals can be applied to this problem (cf. A common approach for a practical implementation is to choose the parameters such that in a block of N samples on a sampling grid with uniform spacing Ts = 1/ f s 52 . Additionally. carries as many data symbols as possible per time interval (i. Finding the correct quantities.e. and 4. the given data symbol sequence A with an average symbol rate f a as the non-uniformly spaced samples of this signal. requiring the samples of the signal to be algebraically independent. We consider the desired watermark signal w(t) as the band-limited signal.. requiring the signal to be band-limited to f L and f H . To solve this problem. f L and f H (e. [81. the symbol embedding rate f a is maximum). transforming a data sequence into a band-limited signal is in many aspects the inverse problem to representing a continuous-time band-limited signal by a sequence of independent and possibly non-uniformly spaced samples.g. distributions and values for the spectral shaping samples mSh given specified cut-off and sampling frequencies is a non-trivial task. 3. However. [83]). and 4. The above four requirements on w(n) then translate to 1. i. The wanted symbols mSh are obtained by sampling the reconstructed signal w(t) at defined time instants. is exceeded.. Watermarking Non-Voiced Speech 1. With a simple one-to-one mapping from A to w(n) the signal w(n) does not have the desired spectral shape and the data is not reliably detectable since the Nyquist rate. we interleave the signal w(n) with additional samples mSh to reduce the rate f a at which the data symbols A are transmitted to below the Nyquist rate. 2..Chapter 4. has the same predefined and uniform sampling frequency f s as the processed speech signal. and the required pulse shape y(t) as the interpolation function to reconstruct the band-limited signal from its samples. 3. It is known from theory that these requirements can be fulfilled only for certain combinations of f s . which is twice the available transmission channel bandwidth.e. requiring the symbol rate f a to be close to the Nyquist rate of 2( f H − f L ). 83]). The permission of non-uniform sampling loosens the constraints on the permissible frequency constellations. constraining the sampling instants onto a fixed grid with spacing 1/ f s or a subset thereof. 2. 82. has no intersymbol interference (ISI) between the data symbols A.

The channel delay is estimated and the signal is realigned with the frame grid.2.3 gives an overview on the components of the watermark detector.1 and a denote the terms of this sequence. A is divided into non-overlapping and consecutive blocks of M terms.1.p = t p + bNTs with t p = (1 − p) Ts and Ts = 1/ f s . [84]). which follows a classical front-end–equalization–detection design [86]. M) in the b’th block.p denotes the p’th term (p = 1 .1.p of A are considered as the samples of the band-limited signal w(t). These different aspects of synchronization are discussed in Chapter 5. 4.4. determined by the interpolation function.1. sampling timing and bit synchronization as well as signal synthesis and analysis synchronization are inherently taken care of by the embedded training sequences and the adaptive equalizer in the watermark detector. Theory the first M samples are considered as independent samples and fully describe the band-limited signal (e.1. In the presence of an RLS adaptive equalizer as proposed in the next subsection. In summary. let A be the sequence of interleaved watermark. Watermark Detection Figure 4. The remaining N − M samples are determined by the bandwidth-constraint and can be reconstructed with suitable interpolation functions. f L . Data frame synchronization is achieved by using a fixed frame grid and the embedding of preambles.p yb. 4.1. The desired signal w(t) can then be constructed with [85] w(t) = b=−∞ p=1 ∑ ∑ ab.. M. and with subsequent resampling of w(t) at the time instants τ = nTs . Synchronization We deal with the different aspects of synchronization between the watermark embedder and the watermark detector extensively in Chapter 5.p (t).p (t) is the reconstruction pulse shape that is time-shifted to the b’th block and used for the p’th sample in each block. f s .5. In the remainder of this chapter we assume for the received signal path that perfect synchronization has been achieved. To formalize the problem. We provide suitable parameter sets and pulse shapes for different applications scenarios in Section 4. ∞ M (4. . which is non-uniformly sampled at the time instants τb.1) where yb.g. 53 . training and preamble symbols as defined in Section 4. f H and y p must be chosen accordingly. The terms ab. N.7. and the remaining N − M samples are the spectral shaping samples. Detector Front End In the first step the received analog watermarked speech signal is sampled. the sampling operation does not need to be synchronized to the watermark embedder.1.2. To have all previously stated requirements fulfilled.6. Analogously. and ab. the first M samples are the information-carrying data samples. 4.7. .

. . . e' Watermark m'WM Detection Timing+Bit Synchroniz. We initialize the inverse correlation matrix with P(0) = δ−1 I. with w(0) = 0. sPB (n)]T be the K-dimensional tap input vector and w(n) the K-dimensional tap-weight vector at time n. We use an exponentially weighted and non-uniformly updated RLS adaptive filter in an equalization configuration.3.2. and the availability of efficient recursive implementations that do not require matrix inversion [87].1. The same effect can be achieved computationally more efficient by calculating ˜ only the upper or lower triangular part of P(n) and filling in the rest such that 54 . equalization and watermark detection. The RLS method is chosen over other equalization techniques such as LMS because of its rate of convergence and tracking behavior. ∞ we iteratively calculate q(n) = P(n)u(n) k(n) = q(n) λ + u H (n)q(n) ˜ P(n) = λ−1 (P(n) − k(n)q H (n)) ˜ ˜ P(n + 1) = 1 (P(n) + P H (n)) 2 where ∗ denotes complex conjugation and H denotes Hermitian transposition.: Watermark detector including synchronization. Frame Grid Synchroniz. From the discrete-time speech signal sFB (n) the out-of-band components sSB (n) are again removed by an identical bandpass filter as in the embedder. We introduced the last equation to assure that P(n) remains Hermitian for numerical stability. Also let u(n) = [sPB (n + K − 1). Frame Synchroniz.7. . using the regularization parameter 2 2 δ = σs (1 − λ)α with σs being the variance of the input signal sPB (n). . For each time instant n = 0. . 1. Channel Equalization The received signal is adaptively filtered to compensate for the all-pole filtering in the embedder as well as for the unknown transmission channel filtering. 2. resulting in the passband signal sPB (n). We make use of the training symbols mTr embedded in the watermark signal w(n) in order to apply a recursive least-squares (RLS) equalization scheme.Chapter 4. Figure 4. which tracks the joint time-variation of the vocal tract filter and the transmission channel [87]: Let K be the number of taps and λ (0 < λ ≤ 1) be the exponential weighting factor or forgetting factor of the adaptive filter and α be a fixed constant that depends on the expected signal-to-noise ratio in the input signal. . . 4. Watermarking Non-Voiced Speech s' Sampling z-∆ s'FB s'PB Adaptive Filtering Act.

and w(n + 1) = w(n). 4. In this section a number of equations are acausal to simplify notation.2. Implementation Hermitian symmetry is preserved. 55 . the embedded watermark data is detected after synchronization and equalization using a simple minimum Euclidean distance metric.2.7. The active frame markers mAc can be used as additional training symbols.2.2. 4. the relationships can be made causal by the addition of an input signal buffer and appropriate delays. .3) an equalizer retraining is performed.1.5.3.3.1. .2. . The perfect-reconstruction subband decomposition of the discrete-time speech signal sFB (n) with sample rate f s into the embedding-band and out-of-band speech components sPB (n) and sSB (n) is obtained with a Hamming window design based linear phase finite impulse response (FIR) bandpass filter of order NBP (even) with filter coefficients h = [h0 . Signal Analysis We first address the decomposition of the input signal into the model components. 4. The filter bandwidth WPB = f H − f L must be chosen in accordance to the transmission channel and the realizable spectral shaping configuration as described in Section 4. If a training symbol is available for the current time instant.2). Consequently. h NBP ]T . After performing the active frame detection (see Section 5. the tap-weights are not updated.2. If no training symbol is available. eR (n) = mTr (n) − w H (n)u(n) The output e (n) of the RLS adaptive filter is e ( n ) = w H ( n ) u ( n ). Implementation This section presents details of our signal analysis and pulse shaping implementation. . which are required to reproduce the experimental results presented in Section 4.1. resulting in sPB (n) = NBP m =0 ∑ hm sFB (n − m + NBP 2 ) and sSB (n) = sFB (n) − sPB (n). which improves the detection of the subsequent frame ID symbols mID (see Figure 4.1. the error signal eR and the tap weights are updated with ∗ w(n + 1) = w(n) + keR (n). Detection An equalization scheme as presented in the previous subsection avoids the necessity of an expensive signal detection scheme. For implementation. Complete parameter sets for three different transmission channel types are provided in Section 4.4. effectively shifting complexity from the detection stage to the equalization stage [86].

there is at present no automatic procedure to obtain ‘good’ parameter sets which fulfill all stated requirements such as being bandwidth-efficient and free of inter-symbol interference. and g(n) = 1 Lg Lg −1 m =0 1 2 ∑ r2 n − m + Lg −1 2 .p (t) is sampled at the time instants t = nTs . Watermark Signal Spectral Shaping The pulse shaping requirements described in Section 4.g.2. 4. 82. The derived reconstruction pulse shapes yb. ss (4. − The LP error signal r (n) is given by the LP error filter Hc 1 with r (n) = sPB (n) − m =1 ∑ P cm (n)sPB (n − m) = g(n)e(n). and truncated or windowed.Chapter 4. To implement the pulse shapes as digital filters. n ∈ Z. the update interval can be increased and intermediate values can be obtained using line spectral frequency (LSF) interpolation [58]. yb.5. [81. with c(n) = R−1 (n)rss (n). Watermarking Non-Voiced Speech The linear prediction coefficient vector c(n) is obtained from solving the Yule-Walker equations using the Levinson-Durbin algorithm. We do so for a sampling frequency f s = 8000 Hz.2. Thus.1.. (4.2 and the constraints on N. which maintains the perceptually important short-term temporal waveform envelope. We update c(n) every 2 ms in order to obtain a smooth time-variation of the adaptive filter. To determine the segmentation of the speech signal into voiced and non-voiced components. f H and y p given a certain transmission channel passband and a certain sampling frequency f s are theoretically demanding and difficult to meet in practice. f L . which detects the pitch pulses in the speech signal with a local cross-correlation value maximization [77]. If computational complexity is of concern. M.2) where Rss (n) is the autocorrelation matrix and rss (n) is the autocorrelation vector of the past input samples sFB (n − m) using the autocorrelation method and an analysis window length of 20 ms. a narrowband telephony channel. namely a lowpass channel.p (t) are continuous-time and of infinite length. we now provide suitable parameter sets and interpolation filters for three application scenarios of practical relevance. 83]). we use the PRAAT implementation of an autocorrelation-based pitch tracking algorithm. Even though the topic is well explored in literature (e. We use a short window duration of 2 ms (corresponding to L g samples).3) where we define the gain factor g(n) as the root mean square (RMS) value of r (n) within a window of length Lg samples. and an aeronautical radio channel scenario. 56 . which is common practice in speech processing [79].

4) such that at each information-carrying sampling position a zerocrossing of one of the terms occurs. using for example a simple sinc function for yb.q = p sin · M ∏q=1 sin π fs N (t − tq ) − tq ) π fs N (t p . 57 .1. 4.2. Figure 4.4) to obtain a watermark signal w with a bandwidth from 666 Hz to 3333 Hz. Narrowband Telephony Channel For a narrowband telephony channel with a passband from 300 Hz to 3400 Hz we select M = 4 and N = 6 and define the reconstruction pulse shape yb.3.2. which is a general formula for bandwidth-limited lowpass signals and achieves the Nyquist rate [85].p (t) would result in either ISI or the sample rate not being equal to f s . Lowpass Channel For a lowpass channel with bandwidth WC and choosing M. using the aforementioned approach did not lead to a configuration that would achieve the Nyquist rate. N = 8).2.4. Figure 4. This passband width WC = 2666 Hz is as small as the Nyquist bandwidth that is required for the chosen symbol rate M f s . N = 8) and 500 Hz–3500 Hz (M = 6.2.5 shows the M pulse shapes and demonstrates how each pulse creates ISI only in the N − M non-information carrying samples.2. N and f s according to f WC = M 2s . [84] define similar parameter and signal sets for channels bandlimited to 0 Hz–3500 Hz (M = 7.2.p (t) = (−1)bM π fs N (t M − t p − bNTs ) ∏q=1. which is only N achievable for distinct parameter combinations. 4. sinc( x ) = . Aeronautical Radio Channel Given the particular passband width and position of the aeronautical radio channel (300 Hz to 2500 Hz). This is achieved by carefully selecting the frequencies and phases of the terms in (4. Implementation 4. x (4.2.p (t) = y p (t − bNTs ) = y p (t(b) ) = sinc 2π f s (b) t − ( p − 1) Ts 12 f s (b) p · sin 2π t − p − 1 + 2 2 Ts 12 f s (b) · cos 2π t − ( p − 3) Ts . In contrast. Ayanoglu et al.2. 4 sin( x ) t(b) := t − bNTs . one can obtain the desired signal w(t) that meets all requirements with the N interpolation function yb.4 shows this by illustration.

p .: Generation of a passband watermark signal for a telephony channel using (4. − 2 . Watermarking Non-Voiced Speech (a) cos -3 -2 sin -1 |X(f )| sin 0 1 2 cos 3 f in kHz sinc -4 (b) |X(f )| sinc·sin -4 (c) -3 -2 -1 |X(f )| sinc·sin 0 1 2 3 f in kHz sinc·sin·cos -4 -3 -2 -1 0 1 sinc·sin·cos 2 3 f in kHz Figure 4. a0. Frequency domain representation (a) of the individual sinc.4 = 1. the ISI induced by the band-limitation falls into the spectral shaping samples at t = b · 6Ts + 4Ts and t = b · 6Ts + 5Ts .4). resulting in the desired bandwidth and position.: Pulse shapes for zero ISI between the data-carrying samples ab. 58 .4). and cos terms.p=1. 1.3 y0.p 1 Amplitude w(t) w(n) 0 y0.4 −6 −5 −4 −3 0 1 2 3 6 7 8 9 12 13 14 15 Time t (in multiples of Ts ) and sample index n Figure 4.2 y0.5..1 −1 y0. sin. By design. with b ∈ Z.. The 1 example data a0.Chapter 4. (b) of the product of the first two terms.4. −1 is located at t p = ( p − 1) Ts . and (c) of the product of all three terms in (4.

4) by w(t) = with t(b) = t − bNTs and s(t) = cos [2π (W0 + W )t − (r + 1)πWk] − cos [2π (rW − W0 )t − (r + 1)πWk] 2πWt sin [(r + 1)πWk ] cos [2π (rW − W0 )t − rπWk] − cos [2πW0 t − rπWk ] + 2πWt sin [rπWk ] b=−∞ ∑ ∞ ab. We select a = a1 = a2 = W .1) and (4. Experimental Settings Motivated by legacy telephony and aeronautical voice radio applications. Watermark Embedding As input signal we used ten randomly selected speech utterances from the ATCOSIM corpus [5].3. The signal was resampled to f s = 8000 Hz. and the watermark embedded in a frequency band from 666 Hz to 3333 Hz using M = 4 and N = 6 (Section 4. They were chosen by a script such that there was one utterance per speaker (six male and four female) and such that each utterance had a length between 5 and 7 seconds. 4. which corresponds to a shift of one sample in terms of f s .1. LWM = 34 symbols were allocated for watermark data. The desired passband signal w is then given similar to (4.1 s(t(b) ) + ab. LID = 7 for frame 59 .2.5) as defined in [88. LTr = 72 for training symbols.3. resulting in a total of 57 s of speech. which results in the 1 parameter r = 1 and fulfills 2W0 < r < 2W0 + 1 [81]. Eq. an efficient reconstruction pulse set with a watermark bandwidth from 500 Hz to 2500 Hz can be created using Kohlenberg’s second-order sampling [88].1. capacity. LAc = 7 for active frames markers.4. The 1 phase shift for the second-order sampling is chosen to be k = 4W .1.3.3. with k ≡ K]. and the symbol rate achieves the Nyquist rate. the system was evaluated using a narrowband speech configuration and various simulated transmission channels. We omit the exact description of the method and only show its practical application: Using the notation of [88]. 31. 4.2 s(k − t(b) ) (4. 4. analyzed with an LP order P = 10. we define W0 = 500 Hz and W = 2000 Hz. 2) in each block of N = 4 samples.2. Experiments This section presents experimental results demonstrating the feasibility. In each active frame of length LD = 180 samples. and robustness of the proposed method.2). which W W represents every fourth sample in terms of the sample rate f s = 8000 Hz = 4W. This results in M = 2 consecutive information carrying samples (indexed by p = 1. Experiments However.

we simulated the following transmission channels: 1. The simulated transmission channels are 60 .5.1. 6. Additive white Gaussian noise (AWGN) with a constant segmental SNR of 30 dB.5z−1 8. only. 3. 4. Filtering with an IIR allpass filter with non-linear and non-minimum phase and z-transform 1 − 2z−1 H (z) = . Filtering with an FIR linear phase Intermediate Reference System (IRS) transmission weighting filter of order N = 150 as specified in ITU-T P.2.2. Ideal channel (no signal alteration).191 [90]. 4. frame synchronization. an FIR filter with non-linear and non-minimum phase [4]. the shown configuration is a worst-case scenario and tests if the watermark detector can handle the high attenuation within the embedding band. .1. Transmission Channels ˆ ˆ The watermarked speech signal sFB (n) = sPB (n) + sSB (n) was subjected to various channel attacks. 7. Motivated by the application to telephony and aeronautical voice radio communication. The parameters were picked manually such that the corresponding subsystems (equalization. IRS and aeronautical channel filters are shown in Figure 4. Watermarking Non-Voiced Speech indices.Chapter 4. 1}.6 with respect to the watermark embedding band.2. 5. . which form a trade-off between data rate and robustness. 2. Sinusoidal amplitude modulation (flat fading) with a modulation frequency of f AM = 3 Hz and a modulation index (depth) of hAM = 0. A combination of AWGN. ) showed satisfactory performance given the channel model in Table 4. Filtering with a measured aeronautical voice radio channel response from the TUG-EEC-Channels database. The magnitude responses of the bandpass.48 [89] and implemented in ITU-T STL G. and LSh = 60 symbols for spectral shaping. detection. While for the aeronautical channel the applied embedding bandwidth is too large and should instead be chosen according to the specified channel bandwidth as in Section 4. each as described above.3. For all but the spectral shaping symbols we used unit length binary symbols with alphabet {−1. 1 − 0. using a window length of 20 ms. The experimental settings are as such not carefully optimized and serve as illustrative example. sinusoidal amplitude modulation and aeronautical voice radio channel filtering. . There is a large number of possibilities for the above parameter choices. Filtering with a linear phase FIR digital bandpass filter of order N = 200 and a passband from 300 Hz to 3400 Hz.3.

2. for example a higher data symbol dimensionality.: Magnitude responses of the bandpass.1. and the SNR is as such related to the embedding parameters of Section 4. channel 1500 2000 2500 Frequency (in Hz) 3000 3500 4000 Figure 4. might be preferable in the aeronautical application.5.3. In contrast to the description in Section 4.1. in the experiments presented herein we updated the inverse correlation matrix P(n) only when a training symbol was available. in comparison with the watermark embedding band.3. We measured the robustness in terms of raw bit 61 . Overall System Evaluation 4. an (RLS) FIR filter with K = 11 taps can constitute the perfect inverse of a LP all-pole filter of order P = 10.3. and different parameter settings.1. Robustness against desynchronization and D/A-A/D conversion is evaluated independently in Chapter 5.3.1. 4.4. the LP filter is time-variant and the observation distorted and noise-corrupted by the radio transmission channel.7.6. which results in imperfect inversion. It is likely that a real-world aeronautical channel has at times a worse SNR. in the digital domain.7 and α = 0.1. Robustness We evaluated the overall watermarking system robustness (including frame synchronization) in the presence of the channel attacks listed in the previous subsection and using the input signal defined therein.2.3. 4. The number of filter taps is a trade-off between the channel’s memory and the adaptive filter’s tracking ability in light of the fast time-variation of the vocal tract filter and the radio channel.2. Experiments 5 Magnitude response (in dB) 0 −5 −10 −15 −20 −25 0 500 1000 Embedding band Bandpass filter IRS filter Aeron. Watermark Detection We used an adaptive RLS equalizer filter with K = 11 taps.3. In the noise-less and time-invariant case. In practice. IRS and aeronautical transmission channel filters. λ = 0. We set the SNR of the AWGN channel to a level that results in an acceptable overall BER.

7.9% 5.01 0.1 in a measured average payload data rate of R ≈ 690 bit/s in terms of uncoded binary symbols mWM .5% 0.05 [540] [540] [537] [499] [494] [486] [420] 7.7.Chapter 4. Data Rate Out of the 57 s of speech a total of 31 s or γ = 53 % were classified as non-voiced.3.03 Flat fading 3Hz 3.: Overall system robustness in the presence of various transmission channel attacks at an average uncoded bit rate of 690 bit/s.2. it is possible to express the achievable watermark capacity at which error-free transmission is possible by considering the payload data channel as a memoryless binary symmetric channel (BSC).3.00 3.2. The results are shown in Figure 4. Given the transmission rate as well as the BERs shown in Figure 4.1. resulting in a mean opinion score MOS-LQO of 4.6) in bit/s and asymptotically achievable with appropriate channel coding.08 0.06 0.7% [449] 3.862 (PESQ) [92]. Watermarking Non-Voiced Speech 0.02 0.3. resulting with the bit allocation of Section 4.2% Bandpass Aeronaut. Audio files with the original and the watermarked speech signals 62 Aero+N+FF AWGN 30dBseg IRS filter Ideal channel 6.6% 4.07 Bit error ratio (BER) 0. 4. Listening Quality ˆ The listening quality of the watermarked speech signal sFB (n) was evaluated objectively using ITU-T P. The channel capacity CBSC as a function of the given rate R and BER pe is (e.2.2. [91]) CBSC ( R. channel Allpass filter Figure 4.09 0.5).6% .04 0.3.05 (on a scale from 1 to 4. The rate R is f γLWM approximately s LD times a factor that compensates for the unused partial frames in non-voiced segments as shown in Figure 4.g. 4. The numbers in the top row denote the corresponding BSC capacity CBSC in bit/s. without any forward error correction coding being applied.8% 4. error ratio (BER) in the detected watermark data. pe ) = R (1 + pe log2 ( pe ) + (1 − pe ) log2 (1 − pe )) (4.7.5% 0.

4.3. but only so if the transmission channel is linear phase.3. it serves as an upper bound on how well the RLS adaptive filter would theoretically do when neglecting the transmission channel and perfectly inverting the time-variant all-pole filtering Hc . noise and flat fading channels. 4. Detection Using LP Analysis We previously proposed to determine the adaptive filter coefficients in the detector using LP analysis.3. Detection Using RLS Equalization Using the proposed RLS equalization scheme. like in the embedding [3].7. Discussion This section presents a qualitative summary and interpretation of the experimental findings. only a spectral estimation technique that does not yield the actual filter coefficients required in this application.4.3. SLP is. To achieve sufficiently similar coefficients in the embedder and the detector. 4.2.3. The BERs are shown in Figure 4. the watermark data can be detected for all channels. we compared its performance to two alternative adaptive filtering strategies by evaluating their robustness in line with Section 4.3. bandpass. since in most applications the coefficients of the embedder are not available in the detector. Discussion ˆ sFB (n) and sFB (n) are available online [93]. and for the IRS and the non-linear phase channels the BER is 50 % and watermark detection fails altogether.3.7 the BER approximately triples for the ideal. 4. Informal listening tests showed that the watermark is hardly perceptible in the presence of an ideal transmission channel.3. however. amplitude modulation and channel filtering as defined above.2. However.1. 1 Given that sPB (n) is a passband signal. 4. Evaluation of Alternative Detection Schemes To gain insight into the proposed RLS channel equalization. in the presence of a bandpass channel the LP analysis in the embedder must then be based on sPB (n) instead of sFB (n). Detection Using Embedder’s Predictor Coefficients Using the embedder’s predictor coefficients c(n) for adaptive filtering in the detector is a purely hypothetical experiment.3. it might seem beneficial to use selective linear prediction (SLP) [94] to estimate the coefficients.4. and watermark detection fails. In the case of the non-linear phase channels the BER is 50 %. The obtained BERs are lower by approximately one magnitude compared to the RLS results in Figure 4. 4.1.7.3.1 In comparison to Figure 4. and not perceptible in the presence of the simulated radio channel with AWGN. 63 .

to adjust the trade-off between perceptual fidelity.2. Comparison with Other Schemes First and foremost.2.3.Chapter 4. only. For a numerical comparison of our results with the reported performance of other methods we use as measure CBSC/W .4. and in general a large number of factors contribute to the ‘goodness’ of each system. 4. because almost all methods are geared towards different channel attacks and applications. we achieve the Nyquist rate 2WC with all methods presented in Section 4.2) with an SNR of 30 dB as applied in Section 4.4.3 summarizes these and other factors that contribute to the capacity gap. Comparison with Channel Capacity In the following we compare the achieved embedding rate with the theoretical capacity derived in Chapter 3.2 indicates that our method outperforms most of the current state-of-the-art speech watermarking methods. capacity and robustness [2]. but also shows that (2. However. Table 4.2. the addition of a watermark floor (that is. In non-voiced speech (comprising unvoiced speech and pauses). While the method presented in [32] appears to perform similar to our method.3. a fixed minimum watermark signal gain g).2 also includes our previous work and shows that there are multiple options for adapting our approach to different channel conditions. It is the combination of these three aspects that differentiates our proposal from current state-of-the-art methods and our own previous work.6)) and per kHz of embedding or channel bandwidth W. because our current method does not account for the time-variation and the spectral non-uniformity of the SNR that result from the spectral characteristics of the host signal. We have previously applied various measures. or CW = 8500 bit/s considering embedding in non-voiced speech. Watermarking Non-Voiced Speech 4. are robust with respect to the transmission channel model of Section 4. It is. the time-variant and colored spectrum of the host 64 . and have an embedding rate that is comparable to the one shown in Section 4. Table 4. The table shows where there is room for improvement.3. such as the use of a different data symbol dimensionality.2. This makes the method impractical for the considered aeronautical application. difficult to obtain an objective ranking. based on the number of embedded watermark bits R (in terms of CBSC using the reported BER and (4.2.79 % BER. Table 4.1 and the observed system performance. In our practical scheme. or a pre-processing of the host signal. the theoretical capacity evaluates to CW = 16000 bit/s. which corresponds to a BSC capacity of 500 bit/s.2.1. however. The given rate losses are multiplicative factors derived from the experimental settings of Section 4. no other schemes are known at present that are complete (in the sense of most practical issues such as synchronization being addressed). we embed 690 bit/s at 4. With a maximum symbol transmission rate or Nyquist rate 2WC = 5333 symbols/s and CW.2) is an over-estimation since it does not account for the perceptually required temporally constrained embedding.2.1. we embed only one bit per symbol instead of three.1.1). it assumes a time-invariant transmission channel and requires before every transmission a dedicated non-hidden 4 s long equalizer training signal. Phase ≈ 3 bit/symbol (using (2.

0 600 0.5% 3 4. Proposed Method Phase Watermarking 4. WM (scalar sym. Repl. WM (vector sym.0 24 n/a 24 8.0 600 0. of partial traj. [28] [27] [29] [37] [33] [31] [32] [32] [35] [39] [26] [26] [45] [45] Alternative Speech Methods 24 0.1 3 1.8 800 28.0% 169 4.0 420 0.0 Ph.0 Ph. of maskees QIM of DCT of DWT QIM of DCT of DWT Allpass phase mod.2. Mod.) [2] 2000 2.0 348 0.0 Alternative Audio Methods 420 0. 690 4.: Comparison With Reported Performance of Other Schemes Method Ref.0% 24 2.0% 418 2.3.4. R (bit/s) BER CBSC (bit/s) W (kHz) CBSC/W (bit/(s kHz)) SNR Simultan. X X X X X X X X LP or BP DA-AD cnv.0 Ph. WM (scalar sym.7 Ph.7 Phase Watermarking 4.0 Ph.4% 111 4. Discussion Table 4.8% 1866 4. WM (vector sym.0 243 2.0% 130 4. QIM of DHT coeff.) [3] 130 0.8% 499 2.0% 1000 4. WM (scalar sym.0% 3 6.4.0% 418 22.0% 184 10.0 187 168 42 64 250 416 32 466 9 28 1 3 97 1 200 200 18 86 209 19 43 30 30 dBsg X 30 dBsg X 10 dB 15 dB 20 dB 30 dB ∞ ∞ 20 dB ∞ ∞ n/a n/a n/a 35 dB V.56bis ∞ 25 dB ∞ 10 dB ∞ 5 dB X X X X X X X X 65 .0 4 3. 690 6. Allpass phase mod.0% 600 3.0% 209 7.5 243 10.) [2] 2000 11. WM (vector sym.0 200 1.) [2] 300 2.1% 600 3.0% 129 3.0% 258 4. phs. QIM of AR coeff.) [3] 2000 0.0 Ph.6% 449 2. QIM of AR residual QIM of pitch QIM of DHT coeff. attacks Nonlin.0 Spread spectrum Spread spectrum QIM of AR coeff. ch.0 300 n/a 300 3.) [2] 300 9.3.1% 344 4.5% 1663 4. Fading ch.

All methods are invariant against the bandpass filtering and the flat fading transmission channel.3. The hypothetical experiment of using the embedder coefficients in the detector shows that approximating the embedder coefficients is a desirable goal only if the channel is linear phase. 66 . Equalization and Detection Performance Three different methods to obtain the filter coefficients for the adaptive filtering in the detector were evaluated. Using linear prediction in the detector results in a bit error ratio of approximately 10 % even in the case of an ideal channel.3. In contrast. It can compensate for a non-flat passband and for non-linear and non-minimum phase filtering due to the embedded training symbols. Over a channel that combines AWGN.5 %. The same RLS method is applied in [32] for a time-invariant channel using a 4 s long initial training phase with a clearly audible training signal. which corresponds with (4. which is caused by the band-limited embedding introduced herein. A further optimization seems possible by developing an equalization scheme that is not solely based on the embedded training symbols.4. and a measured aeronautical radio channel response. we embed the training symbols as inaudible watermarks and perform on-the-fly equalizer training and continuous tracking of the time-variant channel. but that also incorporates available knowledge about the source signal and the channel model.Chapter 4. the method transmits the watermark with a data rate of approximately 690 bit/s and a BER of 6. such as the spectral characteristics. Additive WGN at a moderate SNR does have an influence on the bit error ratio but does not disrupt the over-all functioning of the system. and the filtering characteristic of the transmission channel. The proposed RLS adaptive filtering based detection is the only method robust against all tested transmission channels. flat fading.6) to a BSC capacity of 450 bit/s. Compared to our previous results [2] this is a degradation. 4.: Causes for Capacity Gap and Attributed Rate Losses Cause No embedding in voiced speech Embedding of training symbols Embedding of frame headers Use of fixed frame position grid Embedding in non-white host signal Imperfect tracking of time-variation Loss in Rate 47 % 60 % 29 % 14 % 67 % 28 % signal. Watermarking Non-Voiced Speech Table 4.

Conclusions 4. The large gap between the presented implementation and the theoretical capacity shows that there is a number of areas where the performance can be further increased. or a joint synchronization. but depends only on the host signal to channel noise ratio. We presented a proof-of-concept implementation that considers many practical transmission channel attacks. equalization. On a larger scale. one could account for the non-white and time-variant speech signal spectrum with an adaptive embedding scheme. a decision feedback scheme for the RLS equalization. and apply the same embedding principle on all time-segments of speech for example using a continuous noise-like versus tonal speech component decomposition [95].5.4. While not claiming any optimality.5. Conclusions The experimental results show that it is possible to embed a watermark in the phase of non-voiced speech.7 kHz. Compared to quantization-based speech watermarking. For example. 67 . the principal advantage of our approach is that the theoretical capacity does not depend on an embedding distortion to channel noise ratio.6 % in a speech signal that is transmitted over a time-variant non-linear phase bandpass channel with a segmental SNR of 30 dB and a bandwidth of approximately 2. it embeds 690 bit/s with a BER of 6. detection and decoding scheme could increase the reliability of the watermark detector.

Chapter 4. Watermarking Non-Voiced Speech 68 .

Hering. revised and resubmitted. as well as the data frame synchronization. We examine the issues of timing recovery and bit synchronization. For the simpler linear prediction-based detector we present a timing recovery mechanism based on the spectral line method which achieves near-optimal performance. Hofbauer. the active frame detection performs near-optimal with the overall bit error ratio increasing by less than 0. K. 2007. vol. “Noise robust speech watermarking with bit synchronisation for the aeronautical radio. “Speech watermarking for analog flat-fading bandpass channels. ser. G. Bit synchronization and synthesis/analysis synchronization are not an issue when using the adaptive equalization-based watermark detector of Chapter 4. the synchronization between the synthesis and the analysis systems. pp.5 %-points compared to ideal synchronization.” in Information Hiding. Speech. 4567/2008. 69 . 2009. B. Using a fixed frame grid and the embedding of preambles. Kleijn. and Language Processing.” IEEE Transactions on Audio. Lecture Notes in Computer Science. the informationcarrying frames are detected in the presence of preamble bit errors with a ratio of up to 10 %. Springer. and W. Hofbauer and H. Kubin. Evaluated with the full watermarking system.Chapter 5 Watermark Synchronization This chapter discusses different aspects of the synchronization between watermark embedder and detector. Parts of this chapter have been published in K. 252–266.

it is necessary that the detector clock synchronizes itself to the incoming data sequence or compensates for the sampling phase shift. Data-Aided Bit Synchronization In data-aided bit synchronization.2 (which is necessary for a channel with linear transmission distortion). In Section 5.Chapter 5. as long as the error is smaller than the RLS filter length. and present a practical synchronization scheme. it is still a major challenge in most modern digital communication systems. Bit synchronization can be achieved with data-aided or non-data-aided methods. and have a different timing phase (which is the choice of sampling instant within the symbol interval). but also reduces the channel capacity available for information transmission.1. Therefore.3. have a slightly different frequency.2 we describe an implementation of the synchronization scheme. 5. the issue of bit or symbol synchronization in-between digital systems arises. a watermark detector as shown in Figure 4. and 3) the position of the data-carrying symbols. Given the RLS equalization based watermark detection presented in Section 4. This enables simple synchronizer designs. While this training sequence could also be used to drive a data-aided bit synchronization method. It is then possible to either transmit a clock signal.1.1.1.1. and certainly so for an analog radio channel. Timing Recovery and Bit Synchronization When transmitting digital data over an analog channel.1. and even more so in watermarking due to very different conditions. there is already a known sequence used for equalizer training embedded in the signal. We discuss in this chapter the different aspects of synchronization between the watermark embedder and the watermark detector. The RLS equalizer inherently compensates for a timing or sampling phase error. Although bit synchronization is a well explored topic in communications. Theory Due to the analog nature of the transmission channel and the signal-dependent embedding of the data in the speech signal.1. Section 5.7. 2) the transformations applied to the data.3 needs to estimate 1) the correct sampling time instants. or to interleave the information data with a known synchronization sequence [65].1 presents related work and our approach to the different levels of synchronization from the shortest to the longest units of time. It is a multilayered problem. 5. which is tailored to the watermarking algorithm of Chapter 4. the transmitter allocates a part of the available channel capacity for synchronization. Experimental results are shown and discussed in Section 5. 5. The digital sampling clocks in the embedder and detector are in general slowly time-variant. this is in fact not necessary for the RLS based watermark detector. Watermark Synchronization “Synchronization is the process of aligning the time scales between two or more periodic processes that are occurring at spatially separated points”[96]. and we will use one or the other depending on the used watermark detector structure. 70 .

the radio-internal frequency references are specified to be accurate within ±5 ppm (parts per million) for up-to-date airborne transceivers and ±1 ppm for up-to-date ground transceivers [97]. this leads to a timing phase error of 0. In non-data-aided bit synchronization the detector achieves self-synchronization by extracting a clock-signal from the received signal. 71 . which we reduce using a phase-locked loop that provides a stable single-frequency output without following the timing jitter. we assume that the received analogue watermark signal is a 2 baseband PAM signal R(t) based on white data symbols with variance σa . To demonstrate the basic idea of the spectral line method. 5. instead. We compensate this phase shift with a Hilbert filter.3. minimum mean-square-error methods.1.3 the use of linear prediction (LP) instead of the RLS equalizer as adaptive filter in the watermark detector. no equalizer training sequence is embedded. A narrow bandpass filter at f s = T is used 2 ( t ).2. In this case. including early-late gate synchronizers. which accumulates over time but is less than one sample (and well below the RLS filter length) given the short utterance durations in air traffic control. which is ideally non-data-aided in order to obtain a high payload data rate.1. 86]. a sampling timing recovery method is required. Theory In the aeronautical application discussed in Chapter 6. maximum likelihood methods and spectral line methods [98.1.048 samples per second. we use the highly popular nonlinear spectral line method because of its simple structure and low complexity [65. 1 which is periodic with the symbol period T. and even though most of the noise is filtered 1 out by the narrow bandpass filter at f s = T . which results in a timing to extract the fundamental of the squared PAM signal R tone with a frequency corresponding to the symbol period and a phase such that the negative zero crossings are shifted by T in time compared to the desired sampling 4 instants [65]. in case the transmission channel has no linear filtering distortion. Since the LP error filter does not compensate for timing phase errors. For our watermarking system. With p(t) being the band-limited interpolation pulse shape. Non-Data-Aided Bit Synchronization We discussed in Section 4. 65]. and payload data can be transmitted.5. Then R(t) has zero mean but non-zero higher moments that are periodic with the symbol rate. This leads to a timing jitter. Due to the noisy transmission channel. it derives from the received signal a timing tone whose frequency equals the symbol rate and whose zero crossings indicate the desired sampling instants. the timing tone is noise-corrupted. It belongs to the family of deductive methods. A wide range of methods exist. For non-voiced speech our watermarking scheme is essentially a pulse amplitude modulation (PAM) system with a binary alphabet. the second moment of R(t) is [86] 2 E | R(t)|2 = σa m=−∞ ∑ ∞ | p(t − mT )|2 . Assuming a sampling frequency of 8 kHz and a worst-case frequency offset of 6 ppm. Based on higher-order statistics of the received signal.

using LP parameters as given in Chapter 4 and real speech signals. As a consequence.1). or if they are considered as part of the hidden data channel as shown in 4. We address these issues with different approaches.e. one might intuitively assume that the block boundaries for the linear prediction analysis in the embedder and detector have to be identical. Second. as well as the gain extraction in the detector would have to be perfectly in sync with the same processes in the watermark embedder. as well as the the gain estimation mismatch between watermark embedder and detector are mostly a result of the transmission channel attacks. Watermark Synchronization 5. First. the predictor coefficients do not change rapidly in between the update interval of 2 ms. we accept that there is a residual mismatch such that we are unable to exactly recuperate the watermark signal w(n). When using the RLS-based watermark detector. deletions and substitutions. the watermark channel appears to the detector as a channel with random insertions. We deal with this issue in combination with the frame synchronization described in the next subsection. Therefore. LP frame synchronization does not exist since the RLS equalizer operates on a per sample basis. Synthesis and Analysis System Synchronization In principle. a synchronization offset in the LP block boundaries is not an issue.1. In [2] we also showed experimentally that the bit error rate is not affected when the LP block boundaries in the detector are offset by an integer number of samples compared to the embedder. the LP analysis.. LP Frame Synchronization When using an LP-based watermark detector.Chapter 5. Voiced/Unvoiced Segmentation Due to the transmission channel attacks outlined in 4. We address this mismatch in two ways. While the embedder knows about the availability of the hidden watermark channel (i. In the context of our watermarking system it is only a matter of definition if one considers these issues as a synchronization problem. the voiced/unvoiced segmentation. Adaptive Filtering and Gain Modulation Mismatch The mismatch between the LP synthesis filter in the embedder and the adaptive filtering in the detector. the signal analysis in the watermark detector needs to synchronize to the signal synthesis in the watermark embedder. However. we reduce the mismatch using the embedded training sequences and the RLS equalizer in the receiver.1.2. In order to perfectly detect the watermark signal w(n) (see Figure 4. 72 . This is difficult to achieve because these processes are signal-dependent and the underlying signal is modified by a) the watermarking process and b) the transmission channel. the host signal being voiced or non-voiced) and embeds accordingly. we only use a binary alphabet for w(n) to achieve sufficient robustness.1 it appears difficult to obtain an identical sample-accurate voiced/unvoiced segmentation in the embedder and the detector.

To achieve synchronization at the watermark detector. [99.7.1 The received watermarked analog signal s(t) is oversampled by a factor k compared to the original sampling frequency f s . the use of IDS-capable channel codes.2.2. Implementation 5.1.1. Therefore. Alternative schemes. Due to the host signal dependent watermark channel availability.3. Data Frame Synchronization In this subsection we address the localization of the information-carrying symbols in the watermark detector. a preamble is embedded in frames that carry a watermark. As outlined in Section 4. Thus. Spectral Line Synchronization Figure 5. 101. 73 .1. After the adaptive equalization of the received signal as described in Section 4.5. a threshold-based correlation detection of the active frame markers mAc over a span of multiple frames detects the frame grid position and as such the overall signal delay.5. including embedding in an invariant domain. which perform similarly. embedding of preambles or pilot sequences. are possible. host signal feature based methods. and exhaustive search (cf. since this allows the exploitation of the watermark embedder’s knowledge of the watermark channel availability. An introduction to frame synchronization is provided in [103]. so-called active frames. the frame indices mID .1. deletions and substitutions (IDS). 102]). the use of convolution codes with multiple interconnected Viterbi decoders. we focus in the following on the non-dataaided bit synchronization for the LP-based watermark detector using the spectral line method. Various schemes have been proposed for watermark synchronization. 5.2. and the training symbols mTr . and does not require a synchronized voicing decision in the detector. the hidden data channel appears to the detector as a channel with insertions. Implementation 5. Frame synchronization is considered as a secondary topic in the scope of this thesis and does not significantly influence the overall performance.1 shows the block diagram of the proposed spectral line synchronization system.1. the active frame detection is based on a dynamic programming approach that detects IDS errors and minimizes an error distance that incorporates the active frame markers mAc . We use a fixed frame grid and an embedding of preambles. Bit Synchronization For the RLS-based watermark detector data-aided bit synchronization is inherently achieved by the RLS equalizer. 100. there remain errors in the detection of the preambles due to channel attacks. In the 1 Whether the proposed structure could be implemented even in analog circuitry could be subject of further study. and consists of 1) a fixed marker sequence mAc that identifies active frames and 2) a consecutive frame counter mID used for addressing or indexing among the active frames.

The coefficients ai have been experimentally determined to be a0 = 0. Filter (opt. fraction of a sampling interval T = f1s ) are removed in a first filtering step. LP e Square Anal. Polyn. In a second step all gaps larger than 1. which is synchronized to the watermark embedder and which serves as input to the watermark detector. The Hilbert filter can be designed with a large transition region given that r (n) is a bandpass signal. a2 = −7 and a3 = 16 . a1 = 1. The DPLL still provides a stable output in the case of the synchronization signal r H (n) being temporarily corrupt or unavailable. It was found that the LP framework used in the simulations introduces a small but systematic fractional delay which depends on the oversampling factor k and results in a timing offset. Point s(t) Sam. Watermark Synchronization s(t) Oversampl. The output is a discrete-time signal with rate f s .1. We extract the periodic component r (n) at f s from the squared residual (e(n))2 with an FIR bandpass filter with a bandwidth of b=480 Hz centered at f s . This results in a window length of k · 160 samples and an update interval of k · 15 samples after interpolation. 74 . all points whose distance to a neighbor is smaller than 0.5 UI are filled with new estimation points which are based on the position of previous points and the observed mean distance between the previous points. There is a vast literature on the design and performance of both analog and digital 2 To improve numerical stability and complexity one could also perform the LP analysis at a lower sample rate and then upsample the resulting error signal. The output r (n) of the bandpass filter is a sinusoid with period T. We found that this timing offset can be corrected using a third-order polynomial t = a0 + a1 k−1 + a2 k−2 + a3 k−3 . the use of a DPLL renders the previous point filtering and gap filling steps unnecessary. Values down to k = 8 are possible at the cost of accuracy. Figure 5. Digital Phase-Locked Loop The bit synchronization can be further improved by the use of a digital phase-locked loop (DPLL).75 UI (unit interval. but instead of with a fixed sampling grid it is now sampled at the estimated positions.: Synchronization system based on spectral line method. Since the estimated sampling points contain gaps and spurious points.2 We exploit the fact that the oversampled residual shows some periodicity with the embedding period T = f1s due to the data embedding at these instances. The received analog signal s(t) is again sampled. and is phase-shifted by π with an FIR Hilbert filter 2 resulting in the signal r H (n). pling (opt. The linear prediction residual e(n) of the speech signal s(t) is computed using LP analysis of the same order P = 10 as in the embedder and with intervals of equal length in time. The positions of the zero-crossings on the rising slope are a first estimate of the positions of the ideal sampling points of the analog signal s(t).5. In addition. later simulations a factor k = 32 is used. The zero-crossings of r H (n) are determined using linear interpolation between the sample values adjacent to the zero-crossings.s' Smooth.) Correct.Chapter 5. Law BP r Hilbert rH DPLL rD Zero Filt.) Cross.

We then gradually decrease the loop natural frequency f n. so that when the input is corrupted. which is given by the complex argument of the analytic signal of r (n). The parameters c1 and c2 are related to the natural frequency f n = ωn of the loop. We consider the first loop as locked when the sample variance σ1 (n) of the phase error θerr (n) in a local window of ten samples is below a given threshold for a defined amount of time. with k being the oversampling factor of the synchronizer. also the output becomes 2 corrupted. However. The structure of the two loops is identical and is shown in Figure 5. by 2 θout (n) = c3 + θ f (n − 1) + c2 θerr (n − 1) + θout (n − 1) θ f (n) = c1 θerr (n) + θ f (n − 1) c1 = The loop is stable if [106] c1 > 0.: Second-order all digital phase-locked loop. We start with a regular second-order all digital phaselocked loop [106]. This increases the loop’s robustness against noise and jitter. The input signal θin (n) of the DPLL is the phase angle of the synchronization signal r (n). We also dynamically adapt the bandwidth of the second loop in order to increase its robustness against spurious input signals. The phase output θout (n) is converted back to a sinusoidal signal rd (n) = − cos(θout (n)). When the c1 > 2c2 − 4. 2π f n k fs 2 and c2 = 4πη f n .2. and 2π 1 the damping factor η = √ .2 of the second loop (which runs in parallel with the first loop) from f n. the output θout (n) then closely resembles the input θin (n) of the loop. Implementation phase-locked loops [104. k fs c2 > c1 and In the first loop we use a natural frequency f n. which is a measure of the overshoot and ringing.2 = 50 Hz.5. The loop is computed and periodically updated with the following set of equations: θin (n) = arg(r (n) + jr H (n)) θerr (n) = θin (n) − θout (n) The input and output phases θin (n) and θout (n) are always taken modulo-2π. c1 in out err D f c3 D out c2 Figure 5. which enables a fast locking of the loop.2. 75 .1 = 5000 Hz. The 2π f parameter c3 = k f s s specifies the initial frequency of the loop. which is a measure of the response time. Inspired by dual-loop gear-shifting DPLLs [107] we extended the DPLL to a dual-loop structure to achieve fast locking of the loop.2 = 5000 Hz to f n. 105].2.

Watermark Synchronization input signal is corrupt. the received and sampled signal s (n) is correlated with the embedded training sequence mTr .2. when the variance of the phase error in the first loop increases.3 are based on a binary symbol alphabet.1.3) LAc LTr 76 .2. the polarity adjusted. To determine the active frames in the presence of bit errors. which is the case in the aeronautical radio application.2. It can be short if a defined utterance start point exists. and the detected binary symbol sequence md of the d’th frame is md (k) = sgn e (dL D + k ) . we denote with nTr (d) the number of samples that equal the training sequence mTr . and sFB (n) = sgn[ρs mTr (∆)]s (n + ∆). and because all experiments in Section 4. Data Frame Synchronization This subsection describes the implementation and experimental validation of the frame synchronization method outlined in Section 5. we use binary symbols herein. 5. Therefore.3. we denote with nAc (d) the number of samples in frame md at the corresponding position that are equal to the marker sequence mAc .2. The sign of the cross-correlation at its absolute maximum also indicates if a phase inversion occurred during transmission.2) There are bit errors in md due to the noise and the time-variant filtering of the channel.2 (n) = 10−σ1 (n) 50 Hz. Active Frame Detection The frame synchronization scheme must be able to cope with detection errors in the frame preambles.2.1.2. (5. or ∆ = argmax |ρs mTr (k )| = argmax k k n=−∞ ∑ ∞ s (n)mTr (n − k ) . 5. 5. The delay ∆ between the embedder and the receiver can be estimated by the lag between the two signals where the absolute value of their cross-correlation is maximum. we reduce the natural frequency of the second 2 loop to f n.2. (5. A simple minimum Euclidean distance detector for the binary symbols in the equalized signal e (n) is the sgn function. the DPLL should continue running at the same frequency and not be affected by the missing or corrupt input.Chapter 5. We then consider those frames d a as potentially active frames where nAc (d) nTr (d) ≥ τS (5. To simplify the description.1) The range of k in the maximization is application-dependent. too. in case a phase inversion occurred. Frame Grid Detection To estimate the delay of the transmission channel and localize the frame grid position in the received signal. Equivalently. The delay is compensated and.

. there are by definition no more insertion or deletion errors. . To improve the detection accuracy of the frame indices mID . over all frames in the candidate set. 2. but only substitution errors in the set of detected active frames d a . The one active frame candidate is removed from the sequence that minimizes as error measure the sum of the number of bit errors in the frame indices of the remaining frame candidates. The sequence of decoded frame indices Υ detected in the frames d a is not identical to the original sequence of indices Υ = 1. either the insertion or an adjacent substitution in Υ must be deleted. . the active frame marker and the training sequence.6. Matching frame indices are immediately assumed to be correct active frames. the product in (5. We use a decision threshold τS = 0. To accommodate for this. Implementation with τS being a decision threshold (with 0 ≤ τS ≤ 1). and to 0. which sequentially number the active frames and are encoded so as to maximize the Hamming distance between neighboring frame indices. in order to determine where the sequences match. the resulting ‘hole’ of missing active frames must be filled with a set of new active frames. 2. Out of all possible frame combinations the one set is selected that minimizes as error measure the sum of the number of bit errors in the frame identification. To give higher weight to consecutive frames. Using the above detection scheme. In the case of zero bit errors. too. because of the IDS errors in d a and the bit errors in md . In the case of a deletion in Υ . .25 on average in non-active frames. 77 . we incorporate the frame indices mID . first all substitutions surrounding the deletion are removed. which was determined experimentally.2. the signals r (n) and md are recomputed using the active frame markers mAc in the frame candidates d a as additional training symbols for equalization. and where insertions. which considerably simplifies the error correction coding of the payload data since a standard block or convolution code can be applied. a fixed number of frames (given by the size of the hole) needs to be selected as active frames. 3. Then. deletions and substitutions occurred [108].5. We process the result in the following manner: 1. 3. deletions and substitutions (IDS) occur in the detection of the active frame candidates d a . and assuming the total number of frames embedded in the received signal to be known. This then also applies for the embedded watermark message. we reduce the error distance by one if a frame candidate has a direct neighbor candidate or is adjacent to a previously matched active frame. We use a dynamic programming algorithm for computing the Levenshtein distance between the sequences Υ and Υ . Insertions. Out of all possible frames that lie between the adjacent matching frames. In the case of an insertion in Υ .3) would evaluate to unity in active frames.

1. the watermarking system is inherently vulnerable to an offset of the sampling phase and frequency between the embedder and the detector. We measure this timing phase error in ’unit intervals’ (UI). assumes ideal frame synchronization. The nodes of these data points would ideally reside on an evenly spaced grid with intervals of f1s . and performs LP-based watermark detection. which is the optimal position for resampling the continuous-time signal 78 . Non-Data-Aided Bit Synchronization Experimental Setup The spectral line based timing recovery algorithm presented in this chapter is necessary only when using an LP-based watermark detector. We use a piecewise cubic Hermite interpolating polynomial (PCHIP) to simulate a reconstruction of a continuous-time signal and resample the resulting piecewise polynomial structure at equidistant sampling points at intervals of f1s or k1s respectively. Results Without synchronization. Figure 5. Experimental Results and Discussion 5.Chapter 5.3 shows the adverse effect of a timing phase offset if no synchronization system is used.1.3. Jitter: The position of each single node is shifted randomly following a Gaussian 2 distribution with variance σJ . An unsynchronized resampling of the received signal leads to a timing phase error. The system omits the bandpass embedding and pulse shaping parts. Watermark Synchronization 5. which is the fraction of a sampling interval T = f1s . 1 fs Sampling frequency offset: The distance between all nodes is changed from one unit interval to a slightly different value. The proposed synchronization system aims to estimate the original position of the sample nodes. Since in this case many components of the watermark embedding scheme presented in Chapter 4 can be omitted. we evaluate the timing recovery using the simpler watermark embedding system that we presented in [2]. which is a phase shift of a fractional sample in the sampling instants of the watermark embedder and detector. The experiments presented in this subsection with the LP-based watermark detector use as input signal a short sequence of noisy air traffic control radio speech with a length of 6 s and a sampling frequency of f s = 8000 Hz.3. UI). LP-based watermark detection is only possible when the transmission channel has no filtering distortion. In order to simulate an unsynchronized system we move the nodes of the data points to different positions according to three parameters: Timing phase offset: All nodes are shifted by a fraction of the regular grid interval (unit interval. Timing Recovery and Bit Synchronization 5.3. f ˆ In this reconstruction process.1. each sample of the watermarked speech signal s(n) serves as a data point for the interpolation.

6(b).4 −0.1). 110]. The remaining increase in BER for non-zero timing phase offsets is a result of aliasing. Its root-mean-square value across the entire signal is shown in Figure 5. Figure 5.1 0.3.6(a) shows the resulting BER when introducing a timing phase offset of a fractional sample in the simulated channel as described above. Section 5. Figure 5. The discontinuity in the BER curve at 0. we use the system implementation and experimental settings of Section 4. a fractionally spaced equalizer should be used instead of the sample-spaced equalizer of Chapter 4 [109.3. The phase estimation error is the distance between the original and the estimated node position in unit intervals.3.2. the BER is then below 8 % for any timing phase offset. This results from the fact that the RLS equalizer is a causal system and cannot shift the received signal forward in time. As shown in Figure 5. the raw BER increases by less than two percentage points across all SNR values.3.6 UI). again using ten randomly selected utterances of the ATCOSIM corpus as input speech.4 UI (equivalent to an offset of 0. The increased error rate can be mitigated by delaying the training sequence that is fed to the equalizer by one sample in time and adding one equalizer filter tap (K = 12). The error ratio significantly increases for timing phase offsets around −0.3 0. Experimental Results and Discussion 30 25 Bit error ratio (in %) 20 15 10 5 0 −0.6 UI results from the shift of the detected frame grid towards the next sample (cf. Compared to the case where ideal synchronization is assumed.5.3 −0. To overcome this limitation. The figure also shows the bit error ratio for different sampling frequency offsets. at the detector.5 shows the raw bit error ratio of the overall watermarking system (in two different configurations and including a watermark floor and residual emphasis as described in [2]) at different channel SNR and compares the proposed synchronizer to artificial error-free synchronization. The bit error ratio as a function of the uncorrected timing phase offset is shown in Figure 5.1.: Robustness of the LP-based watermark detector without synchronization with respect to a timing phase error.2 −0.4 Figure 5.1 0 0. 79 . Data-Aided Bit Synchronization We experimentally evaluate the capability of the RLS equalization based watermarking system (Chapter 4) to compensate a timing synchronization offset.2.2. 5.2 Timing phase offset (in UI) 0. In contrast to the previous experiments.4 for the above three types of synchronization errors.3.

06 0.015 0.04 0.: Synchronization system performance for LP-based watermark detection: Phase estimation error and bit error ratio for various types of node offsets.04 0.02 0.025 0.08 0.4.01 0.Chapter 5.2 Timing phase offset (in UI) 0.005 0 No synchronization Spectral line (SL) SL with DPLL −0.2 0 0.03 0.06 Jitter (in UI RMS) 0.4 −5 0 5 Sampling frequency offset (in Hz) 3 Phase estimation error (in UI RMS) 0.005 0 −0.015 0.5 Bit error ratio (in %) 2 1. Watermark Synchronization Phase estimation error (in UI RMS) 0.08 No synchronization Spectral line (SL) SL with DPLL 2.02 0.025 0.5 1 0.4 No synchronization Spectral line (SL) SL with DPLL Phase estimation error (in UI RMS) 0.01 0.03 0.02 0 0 0. 80 .1 0.02 0.5 0 No synchronization Spectral line (SL) SL with DPLL −5 0 5 Sampling frequency offset (in Hz) Figure 5.

7). This effect could be avoided by considering only the non-voiced segments in the correlation sum (5. .2. . evaluated over 200 randomly selected utterances of the ATCOSIM corpus and a clean transmission channel. 4 s of each utterance to detect the frame grid. The experiment shows that a correct frame grid detection can be achieved for correlation windows as short as TFG = 50 ms (Figure 5.14 bit/sample) Ideal Synchronization (0.1). Experimental Results and Discussion 50 45 40 Bit error ratio (in %) 35 30 25 20 15 10 5 0 −10 0 10 20 30 Channel SNR (in dB) 40 50 60 Synchronizer with DPLL (1 bit/sample) Ideal Synchronization (1 bit/sample) Synchronizer with DPLL (0. Data Frame Synchronization We evaluate the data frame synchronization in the context of the RLS equalization based watermarking system discussed in Chapter 4.14 bit/sample) Figure 5.5. To evaluate the time needed until frame grid synchronization is achieved.1.2 s.7 shows the fraction of utterances with correct frame grid detection for a given signal segment length TFG . which. 5.7) explains the decrease of the detection ratio for TFG > 0. Frame Grid Detection In certain applications a near-real-time availability of the watermark data is required. requires a voiced/non-voiced segmentation in the detector. correlation is low when the window is still relatively short and the fraction of non-voiced samples within the window is low. 81 . 5. in the following experiment we used only the first TFG = 0 s . however. Figure 5.5.3. Because mTr is embedded in the non-voiced segments only.: Comparison of overall system performance (using the LP-based watermark detector) with ideal (assumed) synchronization and with the proposed synchronizer at two different embedding rates (in watermark bits per host signal sample). The detection ratio locally decreases for longer correlation windows because most utterances in the used database start with a non-voiced segment.3. The mean fraction of non-voiced samples in the corresponding segments (also shown in Figure 5.2.3.

5 4 TFG (in s − signal segment length used for frame grid detection) Fract.2 0 0.5 2 2.6 −0.8 Fraction 0.2 Timing phase offset (in UI) 0.2 0 0.6 0.6 (b) Enhanced System (additional delay and equalizer tap) 25 Bit error ratio (in %) 20 15 10 5 0 −0.4 0.2 0 0 0.5 1 1. of non−voiced frames in segment Figure 5.5 3 3.4 0.2 Timing phase offset (in UI) 0. 1 0. of correctly synchronized utterances Fract.4 −0.: Fraction of successfully synchronized utterances as a function of the signal segment length used for the frame grid detection.4 −0. 82 .4 0. Watermark Synchronization (a) Initial System 25 Bit error ratio (in %) 20 15 10 5 0 −0.: Overall system robustness in the presence of a timing phase offset at an average uncoded bit rate of 690 bit/s.Chapter 5.7.6 −0.6 Figure 5.6.

the RLS equalizer resolves many problems occurring otherwise in the synthesis/analysis system synchronization. The experiment shows that the active frame detection is highly reliable up to a bit error ratio of 10 % in the detected binary sequence.2). the active frame detection works reliably up to a bit error ratio of 10 % and does not significantly contribute to the overall watermark detection errors. but leaves room for further improvements for example by incorporating a voiced/nonvoiced detection in the detector.2.05 0. 5.4.15 0.7 0.: Robustness of active frame detection against bit errors in the received binary sequence md .5.4.6 0. It was further observed that active frame detection errors only contribute to a small extent to the overall system bit error ratio: Using ideal frame synchronization instead of the proposed scheme. 83 .3. In combination with a digital PLL. Active Frame Detection In order to evaluate the performance of the active frame detection subsystem. Frame grid synchronization can be achieved with windows as short as 50 ms. Conclusions Frame ID detection ratio 1 0. the synchronization method provides highly accurate timing recovery. the change in the bit error ratios of Figure 4. Conclusions We conclude that timing recovery and bit synchronization can be achieved without additional effort using the RLS equalization based watermark detector of Chapter 4.2 0.25 Bit error ratio Figure 5. resulting from ten randomly selected utterances of the ATCOSIM corpus. The fraction of spurious frames is one minus the given ratio.2. In addition.8 0.8.1 0. we introduced artificial bit errors into an error-free version of the detected binary sequence md of (5. 5.5 %-points for all channels. We demonstrated the application of classical spectral line synchronization in the context of watermarking for a simpler but less robust linear prediction based watermark detector.8 shows the ratio of correctly detected frames as a function of the bit error ratio (BER) in md . On the contrary.5 0 0.7 is < 0. Figure 5.9 0.

Chapter 5. Watermark Synchronization 84 .

Air Traffic Control 85 .Part II.

.

Hofbauer and H. which is in operation worldwide. The AM radio is based on the double-sideband amplitude modulation (DSB-AM) of a sinusoidal carrier. Fairfax. Hering. For the continental air-ground communication. VA.33 kHz or 25 kHz [97]. Besides a set of fixed flight rules and a number of navigational systems.” in Proceedings of the NASA Integrated Communications Navigation and Surveillance Conference (ICNS). 2005. Although digital data communication links between controllers and aircraft are slowly emerging. “Digital signatures for the analogue radio. and gives flight instructions to the aircraft pilots in order to maintain a safe separation between the aircraft. or controllers). The amplitude-modulation (AM) radio. 112]. air traffic control relies on human air traffic control operators (ATCO. Parts of this chapter have been published in K. Given the aeronautical life cycle constraints. USA.Chapter 6 Speech Watermarking for Air Traffic Control The aim of civil air traffic control (ATC) is to maintain a safe separation between all aircraft in the air in order to avoid collisions. has basically remained unchanged for decades. most of the communication between controllers and pilots is verbal and by means of analog voice radios. with a channel spacing of 8. the carrier frequency is within a range from 118 MHz to 137 MHz. the ‘very high frequency’ (VHF) band. it is expected that the analog radio will remain in use for ATC voice in Europe well beyond 2020 [111. and to maximize the number of aircraft that can fly at the same time. The controller monitors air traffic within a so-called sector (a geographic region or airspace volume) based on previously submitted flight plans and continuously updated radar pictures. Air traffic control (ATC) has relied on the voice radio for communication between aircraft pilots and air traffic control operators since its beginning. 87 .

and is most likely to happen when aircraft with similar call-sign are present in the same sector. Instead. certain rules are in place to establish this identification. 3. then on the radar screen. The controller needs to identify the aircraft first in the radio call.Chapter 6. An aircraft call-sign can be wrongly understood by the controller or the pilot. This highly compromises safety. the two parties use a restricted. In order to avoid misunderstandings and to guarantee a common terminology. very simple. In the standard situations. and the pilot starts the message with the flight’s call-sign to identify the addresser of the message. which creates additional controller workload. Problem Statement The avionic radio is the main tool of the controller for giving flight instructions and clearances to the aircraft pilot. Channel monitoring workload. and there are a number of problems associated with this identification process: 1. 2. common language. recipient. every voice message of the aircraft pilot is inherently addressed to the air traffic controller. acceptor). In case not otherwise explicitly specified. Aircraft identification workload.1. the controller starts the message with the call-sign of the aircraft. The correct identification of addresser and addressee is crucial for a safe communication. Therefore the identification of the addressee (the controller in this case) is omitted. Speech Watermarking for Air Traffic Control 6. which means that all aircraft within a sector as well as the responsible controller transmit and listen to all messages on the corresponding radio channel frequency. There is no convergence between the radio call and the radar display system. it has to be clear who is talking (the addresser. In order to establish a meaningful communication in this environment. This creates significant workload. From a technical point of view. For the ATC air-ground communication. It is crucial for a secure and safe air traffic operation to have a reliable and fail-safe radio communication network to guarantee the possibility of communication at any given time. the terms and structure of this language are clearly defined in the corresponding standards [113]. originator) and to whom the current word is addressed to (the addressee. Wrongly understood identification. the air traffic controller is the only person on the channel that does not identify himself as addresser on the beginning of the message. Once this ‘technical’ link between ground and aircraft is established (which we assume further-on). This potential risk is usually referred to as call-sign confusion. Every voice communication on the aeronautical channel is supposed to take place according to this ICAO terminology. high effort is put into the systems in order to provide permanent availability through robust and redundant design. The radio communication occurs on a party-line channel. sender. 88 . the addressee of the message. Although most of the words are borrowed from the English language. the verbal communication between pilot and controller can start. The aircrew needs to carefully monitor the radio channel to identify messages which are addressed to their aircraft.

2. as call-sign confusion and ambiguity is currently an unresolved issue and will become even more problematic as air traffic increases.2. improve the safety and security in air traffic control. with the future demand only increasing. the entire message is declared void and has to be repeated. Bandwidth Efficiency Ideally the system should not create any additional demands on frequency bandwidth. analog radio will continue to be used worldwide for many years to come. An automatic identification and authentication of the addresser and the addressee could solve these issues and. Missed identification. Especially in Europe there is a severe lack of frequency bandwidth in the preferable VHF band.6. High-Level System Requirements Developing an application for the aviation domain implies some special peculiarities and puts several constraints on the intended system. From the above scenario we can derive a list of requirements which such an authentication system has to fulfill.1. thus. Even though a long-term solution for avionic air-ground communication will probably lie outside the scope of analog voice radio. This is additional workload for both the aircrew and the controller. This security threat is currently only addressed through operational procedures and the party-line principle of the radio channel. Neither is it possible to change all aircraft and ground equipment from one day to another. The user-driven requirements specify the functions and features that are necessary for the users of the system. The basic approach is to improve the current communication system. Deployment-Driven Requirements Rapid Deployment First and foremost the development should keep a rapid implementation and a short time to operational use in mind. but this being a temporary solution. There is currently no technical barrier preventing a third party to transmit voice messages with a forged identification. However. High-Level System Requirements 4. 6. If the identification is not understood by the addressee. 89 . Legacy System Compliance The system should be backward compatible to the legacy avionic radio system currently in use. At the same time various aspects have to be considered to enable the deployment of the system within the current framework of avionic communications. 5.33 kHz.2. a system that operates within the existing communication channels would be highly preferable. it is very difficult to obtain a seamless transition to a new system. nor is it trivial to enforce deployment of new technologies among all participants. Forged identification. Co-existence among old and new systems is necessary and they should ideally complement each other. This led to the reduction of the frequency spacing between two adjacent VHF channels from 25 kHz to 8. That is. 6. Changing to a completely new radio technology would have many benefits.

A GPS position report. 6.2. Error Rate Robustness is a basic requirement. and for example become annoying to the air traffic controllers. requires approximately 50 bit of payload data. Real-Time Availability The identification of the addresser has to be detected immediately at the beginning of the voice message. ideally in less than one second. Every new or modified piece of equipment. An unreliable system would not be accepted by the users. The automatic identification should be completed before the call-sign is completely enunciated. most likely a much higher data rate is required. On the other hand 90 . leaving some room for potential extensions. shift costs from on-board to ground systems. A too severe degradation of the sound quality would not be accepted by both the certification authorities and the intended users. too frequent occurrence would lead to user’s frustration. which would be advantageous in various scenarios. the total costs of necessary on-board equipment should be kept as low as possible. entails a long and costly certification process. the system can fail to announce the addresser altogether. Not only that there are much higher numbers of airborne than ground systems.2. This might be the aircraft’s tail number or the 27 bit Data Link Service Address for aircraft and ground stations used in the context of Controller Pilot Data Link Communication (CPDLC). Two types of errors are possible. Speech Watermarking for Air Traffic Control Minimal Aircraft Modifications Changes to the certified aircraft equipment should be avoided as much as possible. wherever possible. If the identification would display only after the voice message. Therefore the development process should. User-Driven Requirements Perceptual Quality Several potential identification systems affect the perceptual quality of the speech transmission. which are generally less cost-sensitive than airlines. might it be a different radio transceiver or a connection to the aircraft data bus. the certification process can be highly simplified. and although this not safety critical. On the one hand. it is necessary to transmit a certain amount of data. For this reason the perceptual distortion needs to be kept at a minimum. Ideally the participants of the communication should not notice the presence of the system. this would be of little use to the air traffic controller. as he is then already occupied with another task. Altogether a data rate of 100 bit/s is desired. Data Rate In order to serve the purpose of providing identification of the addresser. For a secure authentication. Cost Efficiency Finally. Through minimizing the changes to on-board equipment and maintaining low complexity.Chapter 6. but as well most of the ground radio stations are operated by the air traffic authorities.

whenever changes to the procedures are proposed. only. it is indispensable to assure robust transmission and to verify the data’s validity. For the controller. Therefore. the system should support his work by providing additional information.1. No User Interaction The system should be autonomous and transparent to the user. Maintaining Established Procedures Due to the strong focus on safety.3.3. The watermark embedder could be fitted into a small adapter box between the headset and the existing VHF 91 . the announcement of a wrong addresser can compromise safety. and should not add another task. The general components of a speech watermarking system for the aeronautical application are shown in Figure 6. For the pilots. the organization and work-flow in commercial aviation and air traffic control is highly based on procedures. As a consequence. Figure 6. 6. Inband Data Transmission Using Watermarking In order to fulfill the above requirements for an automatic identification system. Background and Related Work 6. The watermark. 28]. For a rapid deployment it therefore seems beneficial not to change or replace these procedures. of which some are in place for decades. could for example consist of the 24 bit aircraft identifier and—depending on the available data rate—auxiliary data such as the aircraft’s position or an authentication.1. the use of speech watermarking was previously proposed [114. but to provide supplementary services.2. a need for additional training and additional workload is unwanted.6. the data that is embedded into the voice signal.: Identification of the transmitting aircraft through an embedded watermark. The voice signal is the host medium that carries the watermark and is produced from the speaker’s microphone or headset. Therefore. the system should work without any human intervention. there is a thorough and time-demanding review and evaluation process.3.1 shows the overall idea. Background and Related Work Innovative functions in the ODS Special displays Security Figure 6.

.3. radio [115]. This thesis ultimately focuses on speech watermarking for the noisy and time-variant analog aeronautical radio channel. Speech Watermarking for Air Traffic Control Voice Signal (pilot or ATCO) Watermark Embedder Data (Aircraft ID. as well as a cost-benefit analysis [116]. Figure 6. As the transmission should occur simultaneously within the legacy voice channel.: General structure of a speech watermarking system for the aeronautical voice radio. Transmitting the identification tag is first and foremost a data communications problem. the body of research concerning watermarking tailored to speech signals is small.) R a d i o VHF Channel (antennas. It converts the analog speech signal to the digital domain. Related Work This thesis focuses on the algorithms for the watermark embedding and detection. assures the validity of the data and displays it to the user.Chapter 6. and we give in the following a short summary on the few existing systems tailored to the aeronautical application. Although the transmitted signal contains a watermark. As shown in Section 2.. potential benefits.3 shows how the area of research narrows down by considering the constraints as outlined in Section 6. This allows a stepwise deployment and parallel use with the current legacy system.. etc. .2. antenna systems. e..3. and converts the watermarked digital signal back to an analog speech signal for transmission with the standard VHF radio. it is technically and perceptually very similar to the original speech signal and can therefore be received and listened to with every standard VHF radio receiver without any modifications. propagation) R a d i o Received Voice Signal (received with legacy radio) Watermark Detector Extracted Data Figure 6. The watermark detector extracts the data from the received signal.2. this requirement does not apply to watermarking for the aeronautical radio communication. applications and packages. a speech watermarking system has to be applied. by highlighting the radar screen label of the aircraft that is currently transmitting. The transmission channel consists of the airborne and ground-based radio transceivers. embeds the watermark data. The channel has crucial influence on the performance of the system and is therefore the subject of investigation in subsequent chapters. 6. The information could also be integrated into the ATC systems. equipment and implementation scenarios. 92 . An industrial study produced a detailed overview on operational concepts. corresponding wiring. While most of the existing watermarking schemes are tailored to be robust against lossy perceptual coding. and the aeronautical radio channel.g.2. and the VHF radio propagation channel.

Speech WM for Analog Channels Speech WM for Compressed Ch. Speech WM for Aeron. which provides high robustness and data rates.: Research focus within the field of data communications.. In order to fully suppress the audibility of the data sequence. the system removes some parts of the voice spectrum around 2 kHz as to operate an in-band modem. Data-in-Voice (DiV) Modem The ‘Data-in-Voice’ system presented in [118] tries to decrease the audibility of the multitone sequence by reducing its power and spectral bandwidth.. This is a very simple and fieldproven technology. Figure 6. Systems Watermarking Systems Analog Comm. Multitone Sequence Transmission The most elementary technique to transmit the sender’s digital identification is a multitone data sequence at the beginning of the transmission [117].. However.2. Legacy Channel Digital Comm.. VHF Radio Speech WM for Digital Ch. 6. Aeronautic Speech WM for Time-Invariant Ch.2. The transmission is clearly audible to the listener as a noise burst at the beginning of the message. which occupies a frequency bandwidth of approximately 300 Hz.. the system also requires a filter on the side of the voice receivers. The noise burst.2. which would hardly be accepted by the user base.3.6. Sound Image Watermarking Audio Watermarking Video Watermarking . Background and Related Work Data Communications .. A standard 240 bit/s MSK (Minimum Shift Keying) modem technique is used. Systems .3..3. can be removed by an adaptive filter. Channels Speech WM for Noiseless Channels . Speech Music Watermarking Speech Watermarking Broadcast Radio Watermarking . the fact that all of the receivers would have to be equipped with this technology renders the entire system undesirable...3. . 6. 93 . Standard analog modulation and demodulation schemes can be applied to create a short high power data package which is transmitted before each message..1. Using a bandreject filter.

and we focus in the following chapters on two important aspects. The embedder first adds redundancy to the data by an error control coding scheme. Therefore. The watermark signal is then spectrally shaped with a linear predictive coding filter and additively embedded into the digitized speech signal. the aeronautical transmission channel has a crucial influence on the data capacity and robustness of the overall system.3. the embedding capacity of the proposed algorithm is host speech signal dependent. The coded data is spread over the available frequency bandwidth by a well-defined pseudo-noise sequence.3.3. The resulting ATCOSIM corpus is not only used for evaluating the watermarking method. The voice signal acts as additive interference (additive noise) that deteriorates the detector’s ability to estimate the watermark. even in the case of an ideal noiseless transmission channel. 6. thus exploiting frequency masking properties of human perception. or a variant thereof. we produced and present in Chapter 7 a speech database of ATC operator speech.Chapter 6. Chapter 10 will validate the suitability of the proposed watermarking scheme for the aeronautical application. we propose to use the speech watermarking algorithm presented in Chapter 4. A maximum-length pseudo random sequence (ML-PRS). detected with a matched filter. is used to ensure synchronization between embedder and detector. 94 . We therefore study the aeronautical voice channel properties both in terms of stochastic channel models as well as using empirical end-to-end measurements and derive a simulation model based on the measurements. Proposed Solution For the embedding of an identification or other data in the ATC radio speech.3. Given the speech corpus and the results of the channel analysis. To substantiate the practical feasibility of our approach. In order to obtain realistic simulation results given the particular ATC phraseology and speaking style. The detector relies on a whitening filter to compensate for the spectral shaping of the watermark signal produced by the embedder. a thorough evaluation and a large number of further tests is required. 119] is based on direct sequence spread spectrum watermarking and embeds the watermark as additive pseudo-random white noise. The signal is then de-spread and the watermark data extracted. the data rate is inherently limited. Spread Spectrum Watermarking The ‘Aircraft Identification Tag’ (AIT) system presented in [28. Second. but is also of general value for ATC-related spoken language technologies. First.2. Speech Watermarking for Air Traffic Control 6.

Morocco. Hofbauer. 95 . “The ATCOSIM corpus of non-prompted clean air traffic control speech. as well as listening tests within the ATC domain.Chapter 7 ATC Simulation Speech Corpus The ATCOSIM Air Traffic Control Simulation Speech corpus is a speech database of air traffic control (ATC) operator speech.” in Proceedings of the International Conference on Language Resources and Evaluation (LREC). ATC language studies. S. speech recognition and speaker identification. May 2008. Marrakech. Possible applications of the corpus are. ATCOSIM is a contribution to ATC-related speech corpora: It consists of ten hours of speech data. and H. The corpus is publicly available and provided free of charge. Parts of this chapter have been published in K. Petrik. the design of ATC speech transmission systems. The database includes orthographic transcriptions and additional information on speakers and recording sessions. which were recorded during ATC real-time simulations. Hering. More detailed information is available in [120]. among others.

The NIST Air Traffic Control Complete Corpus [121] consists of recordings of 70 hours of approach control radio transmissions at three airports in the United States. The use of the French. In contrast. In the development of practical systems the need for appropriate corpora comes into place. ATC Simulation Speech Corpus 7. The constant progress in the development of spoken language technologies opens a door to the use of such techniques for certain applications in the air traffic control domain. which fills a gap that is left by the existing corpora. considering the good signal quality (close-talk microphone. The phraseology that is used for this communication is strictly formalized by the International Civil Aviation Organization [113]. This is even more the case considering the high accuracy and robustness requirements in most air traffic control applications. there is a need for speech corpora that are specific to air traffic control. Spanish or Russian language is also permitted if it is the native language of both pilot and controller involved in the communication.3 and 7. In practice however. This is in parts due to the high reliability requirements that are naturally present in air traffic control. The transcription and production process is described in Section 7. known speaker) and the restricted vocabulary and grammar in use. and even defines non-standard pronunciations for certain words in order to account for the band-limited transmission channel. This is particularly the case for the controller speech on ground. It mandates the use of certain keywords and expressions for certain types of instructions. The corpus contains an orthographic transcription and for each transmission the corresponding flight number 96 . doing for example speech recognition for the incoming noisy and narrowband radio speech is still a quite difficult task. nor is it read. The recordings are narrowband and of typical AM radio quality.1. Section 7.Chapter 7. ATC Related Corpora Despite the large number of existing speech corpora. nor is it a pure command and control speech (in the sense of controlling a device).2 outlines the difficulty of obtaining realistic air traffic control speech recordings and shows the path chosen for the ATCOSIM corpus. The international standard language for ATC communication is English. both controllers and pilots deviate from this standard phraseology. The quality of air traffic control speech is quite particular and falls in-between the classical categories: It is neither spontaneous speech due to the given constraints. only a few corpora are in the air traffic control domain. After a review of the existing ATC related corpora known to the authors. low background noise. the subsequent sections present the new ATCOSIM Air Traffic Control Simulation Speech corpus.4. Due to this and also due to the particular pronunciation and vocabulary in air traffic control. Introduction Until today spoken language technologies such as automatic speech recognition are close to non-existing in operational air traffic control (ATC). gives clear rules on how to form digit sequences. We conclude with a proposal for a specific automatic speech recognition (ASR) application in air traffic control.

2. also on a 97 . It consists of ten hours of military naval radio communication speech recorded during training sessions. even if one would obtain the recordings.2. Another corpus that is similar to ours.1). However. In most air traffic control centers the controller pilot radio communication is recorded and archived for legal reasons. background noise. And third. Second. ATCOSIM Recording and Processing is listed. ATCOSIM Recording and Processing The aim of the ATCOSIM corpus production was to provide wideband ATC speech which should be as realistic as possible in terms of speaking style. the required effort to set up realistic ATC simulations (including facilities. stress levels. is the NATO Native and Non Native (N4) database [126].g.100 English utterances pronounced by non-native speakers without air traffic control experience. such simulations are performed for the sake of evaluating air traffic control and air traffic management concepts. The aim of the ATCOSIM corpus is to fill the gap that is left by the above corpora: ATCOSIM provides 50 hours of publicly available direct-microphone recordings of operational air traffic controller speech in a realistic civil en-route control situation. but outside the domain of ATC. However. most recordings are based on a received radio signal and thus not wideband. The non-native Military Air Traffic Control (nnMATC) database [123] is a collection of 24 hours of military ATC radio speech. hard. scope. approach and tower control. . The database contains 8. First. The corpus is available on request. The corpus was produced in 1994 and is commercially available. The VOCALISE project [124.7. language use. trained personal. their public distribution would be legally problematic at least in many European countries. The database is not available for the public and its use is restricted to research groups affiliated with the French ‘Centre d’Études de la Navigation Aérienne’ (CENA)— now part of the ’Direction des Services de la Navigation Aérienne’ (DSNA). 125] recorded and analyzed 150 hours of operational ATC voice radio communication in France. 7. ) is large and would far exceed the budget of a typical corpora production. The corpus includes an utterance segmentation and an orthographic transcription. but its use is restricted to the NATO/RTO/IST-031 working group and its affiliates. these legal recordings are problematic for a corpus production for a multitude of reasons. The logical resort is the conduction of simulations in order to generate the speech samples. technical conditions and public availability (Table 7. etc. . it is in general difficult to get access to these recordings. The aforementioned corpora vary significantly among each other with respect to e. The recordings are narrowband and of varying quality depending on the speaker location (control room or aircraft).and software. The HIWIRE database [122] is a collection of read or prompted words and sentences taken from the area of military air traffic control. ATCOSIM is meant to be versatile and is as such not tailored to any specific application. However. including en route. and cockpit noise was artificially added afterwards. wire-tapping the actual radio communication during military exercises. The database was published in 2007. The recordings were made in a studio setting. The recordings were made in a military air traffic control center. .

Level of English .Field of prof.: Feature Comparison of Existing ATC-Related Speech Corpora and the ATCOSIM Corpus NIST Recording Situation .Speech bandwidth .Control position .Duration [active speech] Speaker Properties .4 h [10.Speaking style (context) Recording Setup .1.Operational .7 h] non-native mixed yes civil 10 yes Large-scale real-time simulation • CO: Cockpit • CR: Control Room • ATCO: Controller • PLT: Pilot 98 .Transmission channel .Radio transmission noise .Signal source .Table 7.Recording content .Number of speakers Publicly Available (1) direct microphone 51.Acoustical noise high CO & CR VHF radio 70 h [?] mostly native mixed yes civil unknown (large) yes radio narrowband wideband none none CO (artificial) direct microphone (8100 utterances) non-native mixed no (!) N/A 81 yes (?) mostly narrowband none / radio mixed CO & CR mixed 700 h [20 h] mostly non-native mostly male yes military unknown (large) no unknown none / radio mixed CO & CR unknown 150 h [45h] mixed mixed yes civil unknown (large) no operational USA N/A prompted text approach N/A civil ATCO & PLT N/A military ATCO & PLT civil ATCO & PLT military Europe (BE) operational mixed Europe (FR) operational HIWIRE nnMATC VOCALISE ATCOSIM civil ATCO en-route Europe (DE/CH/FR) operational (1) wideband none none CR Chapter 7. ATC Simulation Speech Corpus . operation .Geographic region .Gender .

7.2 shows an example of the recorded signals. Swiss German or Swiss French native language. but not the pilots’. 7. Several controllers operate at the same time. The microphone signal and a push-to-talk (PTT) switch status signal were recorded onto digital audio tape (DAT) with a sampling frequency of 32 kHz and a resolution of 12 bit. Recording Setup The controller’s speech was picked up by the microphone of a Sennheiser HME 45-KA headset. France (Figure 7. Speakers The participating controllers were all actively employed air traffic controllers and possessed professional experience in the simulated sectors. The ATCOSIM speech recordings were made at such a facility during an ongoing simulation. and were asked to show their normal working behavior. The speech signal was automatically muted when the push-button was not pressed.2. was recorded. because the working environment of the pseudopilots.2. 99 . This results in a truncation of the speech signal if the push-button was pressed too late or released too early.1).1 The room and its controller working positions closely resemble an operational control center room. the status signal of the push-to-talk button could after some basic processing be used to reliably perform an automatic segmentation of the recorded voice signal into separate controller utterances. After the digital transfer of the DAT tapes onto a personal computer and an automatic format conversion to a resolution of 16 bit by setting the less significant bits to zero. and as such the speaking style. Figure 7. The push-to-talk switch is the push-button that the controller has to press and hold in order to transmit the voice signal on the real-world radio. During the simulations only the controllers’ voice. 7. The six male and four female controllers were of either German or Swiss nationality and had German.3. in order to simulate also the inter-dependencies between different control sectors. 1 The raw recordings were collected by Horst Hering (EUROCONTROL) for a different purpose and prior to the work presented herein. The controller communicates via a headset with pseudo-pilots which are located in a different room and control the simulated aircraft.1.2. The controllers had agreed to the recording of their voice for the purpose of language analysis as well as for research and development in speech technologies. did not to any extent resemble reality. ATCOSIM Recording and Processing large scale involving tens of controllers. 7. Recording Situation The voice recordings were made in the air traffic control room of the EUROCONTROL Experimental Centre (EEC) in Brétigny-sur-Orge. The simulations aim to provide realistic air traffic scenarios and working conditions for the air traffic controller. The recorded PTT signal is a high-frequency tone that is active when the PTT switch is not pressed.2.2.

Chapter 7.1. ATC Simulation Speech Corpus Figure 7.: Control room and controller working position at the EUROCONTROL Experimental Centre (recording site). 100 .

: A short speech segment (transwede one zero seven rhein identified) with push-to-talk (PTT) signal. ATCOSIM Recording and Processing Figure 7.2.7. 101 .2. Time domain signal and spectrogram of the PTT signal (top two) and time-domain signal and spectrogram of the speech signal (bottom two).

pilot’s radio)3 and for standard English contractions (e.3). All standard text is written in lower-case. letters. Transcription Environment The open-source tool TableTrans was chosen for the transcription of the corpus [127]..1. Table 7. 102 .2 gives several examples of transcribed 2 The duration of each utterance was stretched by a factor of 1. both for convenience and in order to avoid typing mistakes. Orthographic Transcription The speech corpus includes an orthographic transcription of the controller utterances. In the lower half of the window the waveform of the utterance that is currently selected or edited in the table is automatically displayed (Figure 7. laughing and sighs ([HNOISE]). The orthographic transcriptions are aligned with each utterance. nonsensical words ([NONSENSE]). and unknown words ([UNKNOWN]). Stand-alone technical mark-up and meta tags are written in upper case letters with enclosing squared brackets. all utterances are transcribed word-for-word in standard British English. navigational aids and radio call signs are transcribed following a given definition based on several aeronautical standards and references. don’t). 7.2. Transcription Format The orthographic transcription follows a strict set of rules which is given in Appendix A. which is also transcribed. A number of keyboard shortcuts were provided to the transcriptionist using the opensource tool AutoHotKey [128]. individually pronounced letters (~) or unconfirmed airline names (@). TableTrans was selected for its table-based input structure as well as for its capability to readily import the automatic segmentation.3. The transcriptionist can play. Groups of words are embraced by opening and closing XML-style tags to mark off-talk (<OT> .3. In general. </OT> ).7 using the PRAAT implementation of the PSOLA method [77]. These were used for conveniently accessing alternative time-stretched sound files2 and for entering frequent character or word sequences such as predefined keywords. The transcriptionist fills out a table in the upper half of the window where each row represents one utterance.Chapter 7. Numbers. for which a transcription could be added at a later stage.3. and foreign language (<FL> </FL>). Apostrophes are used only for possessives (e. ATC Simulation Speech Corpus 7. empty utterances ([EMPTY]). fragments of words ([FRAGMENT]). These time-stretched sound files were used only when dealing with utterances that were difficult to understand. Punctuation marks including periods. A small number of minor modifications to the TableTrans applications were made in order to lock certain user interface elements and to extend its replay capabilities.g. 3 Corpus transcription excerpts are written in a mono-spaced typewriter font. it’s. ICAO alphabet spellings and frequent commands.. commas and hyphens are omitted. Regular letters and words are preceded or followed by special characters to mark truncations (=). pause and replay the currently active utterance by a single key stroke or as well select and play a certain segment in the waveform display.g. They denote human noises such as coughing. 7.

: Screen-shot of the transcription tool TableTrans. 7.Figure 7.3. Orthographic Transcription 103 .3.

3. Remaining unclear cases were shown to an operational air traffic controller and most of 104 . and beginning and end of off-talk is labeled with <OT> and </OT>.2.g. Silent pauses both between and within words are not transcribed. Table 7. This training transcription was also used to validate the applicability of the transcription format definition and minor changes were made. ATC Simulation Speech Corpus utterances. After the transcription was finished.3. For consistent results this would require an objective measure and criterion and is thus easier to integrate in combination with a potential future word segmentation of the corpus. verified the transcriptions and applied corrections where necessary.. Roughly three percent of all utterances were randomly selected across all speakers and used for training of the transcriptionist. Truncations are marked with an equals sign (=). word fragments with [FRAGMENT] and unintelligible words with [UNKNOWN]. [129]). individually pronounced letters with a tilde (~). Also technical noises as well as speech and noises in the background—produced by speakers other than the one recorded—are not transcribed.Chapter 7. aero lloyd five one seven proceed direct to frankfurt speedway three three five two contact milan one three four five two bye ah lufthansa five five zero four turn right ten degrees report new heading hapag lloyd six five three in radar contact climb to level two nine zero scandinavian six one seven proceed to fribourg fox romeo india ind= israeli air force six eight six resume on navigation to tango good afternoon belgian airforce forty four non ~r ~v ~s ~m identified [HNOISE] alitalia six four seven four report your heading [FRAGMENT] hapag lloyd one one two identified sata nine six zero one is identified <OT> oh it’s over now </OT> aero lloyd [UNKNOWN] charlie papa alfa guten tag radar contact 7. Transcription Process and Quality Assurance The entire corpus was transcribed by a single person.: Examples of Controller Utterance Transcriptions in the ATCOSIM Corpus Human noises are labeled with [HNOISE]. Clear transcription guidelines were established and new cases that were not yet covered by the transcription format immediately discussed. The native English speaker was introduced to the basic ATC phraseology [113] and given lists covering country-specific toponyms and radio call signs (e. the transcriptionist once again reviewed all utterances. as they are virtually always present and are part of the typical acoustical situation in an air traffic control room. which promises high consistency of the transcription across the entire database. The transcriptions collected during the training phase were discarded and the material re-transcribed in the course of the final transcription.

ATCOSIM Structure. which were successfully performed without 105 .4. but only carried out an informal pre-validation and the formal final validation of the corpus. Last. The validation procedure followed the guidelines of the Bavarian Archive for Speech Signals (BAS) [130]. utterance and session IDs and transcriptionist comments. a lexicon of all occurring words was created. Each file corresponds to one controller utterance. The files in the ‘HTMLdata’ directory are HTML files which present the transcriptions and the meta data in a table-like format. and special mark-up codes.4. The examiner and author of the validation report has not been involved in the production of the ATCOSIM corpus.078 files are located in a sub-directory structure with a separate directory for each of the ten speakers and sub-directories thereof for each simulation session of the speaker. Corpus Structure and Format The ATCOSIM corpus data is composed of four directories. truncated words. Validation and Distribution them resolved. reviewing and replaying of utterances and transcriptions from within a standard HTML web browser. Validation The validation of the database was carried out by the Signal Processing and Speech Communication Laboratory (SPSC) of Graz University of Technology. 7. 7.3.7. They enable immediate sorting. 7. Due to the frequent occurrence of special location names and radio call signs an automatic spell check was not performed.2. which includes a count of occurrence and examples of the context in which the word occurs. The ‘TXTdata’ directory contains single text files with the orthographic transcription for each utterance. Instead of this. this list consists of less than one thousand entries including location names. readability. and parsability of data.5 gigabyte and is available in digital form on a single DVD or an electronic ISO disk image. Some statistics of the corpus are given in Table 7. The directory also contains an alphabetically sorted lexicon and a comma-separated-value file which includes not only the transcription of all utterances but also all meta data such as speaker. Austria. call signs. Validation and Distribution The entire corpus including the recordings and all meta data has a size of approximately 2. Every item of the list was manually checked and thus typing errors eliminated. the ‘DOC’ directory contains all documentation related to the corpus.4. It included a number of automatic tests concerning completeness.1. The ‘WAVdata’ directory contains the recorded speech signal data as single-channel Microsoft WAVE files with a sample rate of 32 kHz and a resolution of 16 bits per sample. They are organized in the same way as the audio files. ATCOSIM Structure. The 10. Due to the limited vocabulary used in ATC.4.

8 s. 1848. 1. 808. 238. 38. 8.<OT> </OT> . 1739.4 h (10.04 s. 0. deviation. containing . 2.6 s. 2. 638 3. max) Utterances.: Key Figures of the ATCOSIM Corpus Duration total (thereof speech) Data size Speakers (thereof female/male) Sessions total Sessions per speaker Utterances total Utterances per speaker Utterance duration (mean. 9. 1.[EMPTY] . std. 5. 6.4 GB 10 (4/6) 50 7.Meta tags .Unique words 51.Chapter 7.[UNKNOWN] Words Characters (thereof without space) Lexicon entries .Total .3. 7.[HNOISE] . 384. ATC Simulation Speech Corpus Table 7. min.<FL> </FL> .9 s 182 84 319 35 62 11 11 108883 626425 (517542) 858 9 106 13 730 106 .Compounds . 1716. 378.Truncations .[NONSENSE] .[FRAGMENT] . 3 10078 1167. 1162.7 h) 2.

7.at/ATCOSIM and provided free of charge. showing a transcription accuracy on word level of 99.3. transcriptions. Finally.4. meta data. Examples where the corpus was applied include the work presented in this thesis and the pre-processing of ATC radio speech to improve intelligibility [131]. License and Distribution The ATCOSIM corpus is available online at http://www. development and testing of ATC speech transmission and enhancement systems. also in a commercial environment. Conclusions revealing errors. ATC Speech Transmission System Design The corpus can be used for the design. and the lexicon were done. a re-transcription of 1% of the corpus data was made by the examiner.5. manual inspections of documentation. ATC Language Study Analysis can be undertaken on the language used by controllers and on the instructions given. Applications The application possibilities for spoken language technologies in air traffic control are manifold. The ATCOSIM corpus was therefore declared to be in a usable state for speech technology applications.tugraz. proving the transcriptions to be accurate. Speaker Identification The corpus can be used as testing material for speaker identification and speaker segmentation applications in the context of ATC. 7.spsc. To our best knowledge currently no other speech corpus is publicly available that contains non-prompted air traffic control speech with direct microphone recordings.4%. as it is difficult to produce such recordings for public distribution.5. Conclusions The ATCOSIM corpus is a valuable contribution to application-specific language resources. Speech Recognition The corpus can also be used for the development and training of speech recognition systems. and the corpus can be utilized within different fields of speech-related research and development. 7. The corpus is also foreseen to be distributed on DVD through the European Language Resources Association (ELRA). Such systems might become useful in future 107 . Listening Tests The speech recordings can be used as a basis for ATC-related listening tests and provide real-world ATC language samples. which showed minor shortcomings that were fixed before the public release of the corpus. The large-scale real-time ATC simulations exploited for this corpus production provide an opportunity to record ATC speech which is very similar to operational speech. while avoiding the legal hassle of recording operational ATC speech. It can be freely used for research and development. Furthermore.

an automatic speech recognition (ASR) system that recognizes the controller’s voice radio messages could perform various tasks: In case of extremely high accuracy the system could gather the information directly from the voice message without any user interaction. We expand on the speech recognition example: The controller sees among other information the current flight level of the aircraft on the radar screen in the text label corresponding to the aircraft.Chapter 7. the controller now enters this information (‘climb to 340’) into the system and it shows up in the aircraft label. The ASR system can be speaker-dependent and pre-trained. For accurate results it requires transcriptionists with a certain amount of experience in phonetic transcription or at least a solid background in phonetics and appropriate training. The ASR system could otherwise provide a small list of suggestions. such 108 . Semantic Transcription A semantic transcription would describe in a formal way the actual meaning of each utterance in terms of its functionality in air traffic control. ATC Simulation Speech Corpus ATC environments and for example be used to automatically input into the ATC system the controller instructions given to pilots . to ease the process of entering the instructions into the system. Depending on the achievable robustness. Compared to other ASR applications.g. Phonetic Transcription A phonetic transcription is beneficial for speech recognition purposes. In certain ATC display systems. the voice radio message sent to the pilot already contains all information required by the system. and give a warning in case of apparent discrepancies. Alternatively. However. The vocabulary is limited. For example. It can often be performed semi-automatically. but still requires manual corrections and substantial effort. and additional side information. such as the aircraft present in the sector and context-related constraints. can be exploited. “sabena nine seven zero climb to flight level three four zero”). Word and Phoneme Segmentation A more fine-grained segmentation is also beneficial for speech recognition purposes. as this information is relevant later on when routing other adjacent aircraft. namely the aircraft call-sign and the instruction. the system could compare in the background the voice messages sent to the pilot and the instructions entered into the system. Possible Extensions Depending on the application. a controller issues an instruction to an aircraft to climb from its current flight level 300 to flight level 340 (e. the conditions would be comparably favorable in this scenario: The signal quality is high due to the use of a close-talk microphone and the absence of a transmission channel. and continue training due do the constant feedback given by the controller during operation. further annotation layers might be useful and could be added to the corpus.

Likewise. phonological distribution and speaking style. . This can be considered as a sub-part of the semantic transcription which can be achieved with significantly less effort and requires little ATC related expertise.5. 109 . The production of such a transcription layer requires good background knowledge in air traffic control.7. There are two reasons that would support the collection and transcription of more recordings. the following chapter presents a database of measurements of the ATC voice radio channel. The second reason is the need to extend the coverage of the corpus in terms of speakers. request. This would support speech recognition and language interface design tasks. Conclusions as clearance and type of clearance. This chapter presented a database of ATC controller speech. read-back. certain utterances might appear ambiguous. Due to lack of contextual information. Call Sign Segmentation and Transcription This transcription layer marks the signal segment which includes the call sign and extracts the corresponding part of the (already existing) orthographic transcription. The first reason is the pure amount of data that is required by the training algorithms in modern speech recognition systems. which constitutes the transmission channel of the application at hand. as well as ATC language studies. . such as the pilots’ utterances. Nevertheless this might be beneficial for certain applications such as language interface development. which represents the input signal of the watermarking-based enhancement application considered in this thesis. . Extension of Corpus Size and Coverage With an effective size of ten hours of controller speech the corpus might be too small for certain applications. as well as control task and controlled area.

Chapter 7. ATC Simulation Speech Corpus 110 .

Kubin. Sep. Canada. For the purpose of synchronization.Chapter 8 Aeronautical Voice Radio Channel Measurement System and Database This chapter presents a system for measuring time-variant channel impulse responses and a database of such measurements for the aeronautical voice radio channel. Maximum length sequences (MLS) are transmitted over the voice channel with a standard aeronautical radio and the received signals are recorded. The flight path of the aircraft is accurately tracked. Montreal. Hering. The measurements cover a wide range of typical flight situations as well as static back-to-back calibrations. H. “A measurement system and the TUG-EEC-Channels database for the aeronautical voice radio. The resulting database is available under a public license free of charge. Hofbauer. 111 . A collection of recordings of MLS transmissions is generated during flights with a general aviation aircraft. 2006. the transmitted and received signals are recorded in parallel to GPS-based timing signals. and G.” in Proceedings of the IEEE Vehicular Technology Conference (VTC). Parts of this chapter have been published in K. 1–5. pp.

The 112 .1 shows the basic concept of the proposed system: A known audio signal is transmitted over the voice radio channel and the received signal is recorded. Section 8. The aim of the experiments presented in this chapter is to obtain knowledge about the characteristics of the channel through measurements in realistic conditions.1. which is presented in Section 8. Stochastic reference models in the equivalent complex baseband facilitate a compact mathematical description of the channel’s input-output relationship. Doppler effect. The propagation channel is time-variant and dominated by multipath propagation. These measurements shall support the insight obtained through the existing theoretical channel models discussed in Appendix B.1. which moreover allows a simpler and less expensive setup. the basic concepts in the modeling and simulation of the mobile radio channel are reviewed. The collected data form the freely available TUG-EEC-Channels database. The outline of this chapter is as follows: Section 8. Thus a narrowband method is presented. 133. It requires a large number of frequency channels. very few measurements of the aeronautical voice radio channel that could support and quantify the theoretical models are available to the public [134]. based on which the behavior of a channel can be derived and simulated [132. 6]. Three different small-scale area simulations of the aeronautical voice radio channel are presented in Appendix B.4. Introduction A large number of sophisticated radio channel models exist. we measure the time-variant impulse response of the aeronautical VHF voice radio channel between the aircraft and the ground-based transceiver station under various conditions. 8. Figure 8. Unfortunately. 8.2 shows the design and implementation of a measurement system for the aeronautical voice channel. In Appendix B.2. Voice Radio Channel Measurements Audio Channel Audio Player TX Radio Channel RX Radio Audio Rec. and we demonstrate the practical implementation based on a scenario in air/ground communication. Sync Figure 8.3 gives an overview of the conducted measurement flights. Measurement System for the Aeronautical Voice Radio Channel Conventional wideband channel sounding is virtually unfeasible in the aeronautical VHF band. path loss and additive noise. which are reserved for operational use and are therefore not available. Therefore.: A basic audio channel measurement system.Chapter 8.

This accuracy is in general more than sufficient for serving as clock reference for a baseband measurement system. 113 . but come with a considerable financial expense. Synchronization using GPS Signals The global positioning system (GPS) provides a time radio signal which is available worldwide and accurate to several nanoseconds. due in part to its temperature-dependency. In modern channel sounders atomic clocks provide so stable a frequency reference that the systems stay in sync for a very long time once aligned. a baseband measurement does not reveal a number of parameters that could be obtained with wideband channel sounding.2. the systems still stayed in sync for a certain amount of time before they started drifting apart again.8. Once they were separated. the different clocks cannot be linked and a direct synchronization is not possible. in a setup where one unit is on the ground and the other one is in an aircraft. would be implemented. accurately matched and temperature-compensated oscillators were aligned before measurement. 8. the measurement already takes place in the same domain as where our target application.2. and are relatively large and complex in setup. It can therefore also serve as an accurate frequency and clock reference [135]. In high-grade wideband channel sounders a lot of effort is put into achieving synchronization. 8.2. such as distinct power delay and Doppler spectra. direct synchronization between the transmitter and the receiver-side is in our application again not possible. Moreover. On the one hand. The rising edge of each pulse is synchronized with the start of each GPS and UTC (coordinated universal time) second with an error of less than 1 µs. it is necessary to synchronize the measurement equipment on the transmitter and the receiver side. Measuring the audio channel from baseband to baseband has certain advantages and disadvantages. Although these parameters cannot be directly measured with the proposed system. Certain off-the-shelf GPS receivers output a so-called 1PPS (one pulse per second) signal. The signal is a 20 ms pulse train with a rate of 1 Hz. Synchronization between Aircraft and Ground As already indicated in Figure 8.1. besides its simplicity. This linking is necessary as the internal clock frequency of two devices is never exactly the same and the devices consequently drift apart. Measurement System for the Aeronautical Voice Radio Channel time-variant impulse response of the channel can then be estimated from the known input/output pairs. This problem is common to all channel sounding systems. the clock-frequency is time-variant. But. As a consequence.2. on the other hand. To the authors’ knowledge no battery-powered portable audio player/recorder with an external synchronization input was available on the market at the time of measurements in 2006. a speech watermarking system. their effect on the audio channel can nevertheless be captured. Such systems are readily available on the market.1. In former days. even though a suitable clock reference signal would be available.

are accurately recorded by a hand-held GPS receiver once every two seconds [138]. in parallel with the 1 PPS synchronization signal which is provided by the GPS module [137]. The design considers the necessary blocking of the microphone power supply voltage.2. The aircraft position. All audio files are played and recorded at a sample rate of 48 kHz and with a resolution of 16 bit.875 ms at a chip rate of 8 kHz [139]. Voice Radio Channel Measurements We propose in the remainder of this section a measurement method and setup with which synchronization can nevertheless be achieved by recording the clock reference signal in parallel to the received and transmitted signals and appropriate post-processing.4). the appropriate shielding and grounding for electromagnetic compatibility (EMC) as well as noise and cross-talk suppression.3. status indication and configuration interface for the GPS receiver. The box serves as the central connection point and provides potentiometers for signal level adjustments. The anticipated rate of channel variation at the given maximum aircraft speed and the bandwidth of the channel are below these values [6]. The transmitted and received signals are recorded with a second unit. The measurement signal is therefore upsampled to 48 kHz by insertion of zeros after each sample in the time domain and low-pass interpolation with a symmetric FIR filter (using Matlab’s interpfunction). This length promises a good trade-off between signal-to-noise ratio. The setup is portable.3 shows the final hardware setup on-board the aircraft. test points for calibration and a switch to select transmit or receive signal routing. 8.. The interface box provides a pushto-talk (PTT) switch remote control together with status indication and recording.Chapter 8. On the aircraft. altitude. Measurement Signal The measurement signal consists of a continuously repeated binary maximum length sequence (MLS) of length L = 63 samples or T = 7. speed. rigid. An identical setup is used on the ground. The only requirement is a connection to the aircraft radio headset connectors.2. It results in an MLS frame repetition rate of 127 Hz and an excitation of the channel with a white pseudo-random noise signal with energy up to 4 kHz.4. All devices are interconnected in a star-shaped manner through a custom-built interface box. 8. heading.2. and power supply. The transmitted MLS sequence is interrupted once per minute by precomputed dual-tone multi-frequency (DTMF) tone pairs. 114 . frequency resolution and time resolution. However. Figure 8. etc. The signal is continuously transmitted during the measurements. and can be easily integrated into any aircraft. the signals are transmitted and received via the aircraft-integrated VHF transceiver. The measurement audio signal is replayed by a portable CompactFlash-based audio player/recorder [136]. in order to ease rough synchronization between transmitted and received signals. battery-powered. Hardware Setup The basic structure of the measurement setup is shown in Figure 8. which is fitted into a metal housing unit containing some passive circuitry and standard XLR and TRS connectors (Figure 8.

: Overview of the proposed baseband channel measurement system. 115 .2.8. Measurement System for the Aeronautical Voice Radio Channel Audio Player/Recorder GPS receiver with time output (1PPS) Interface Circuit 2ch Audio Recorder GPS receiver Aircraft integrated VHF radio transceiver GPS data logger Aircraft integrated antenna system Auxiliary VHF radio for pilot Aircraft system VHF radio channel 8. using GPS timing signals for synchronization.33kHz Ground-based receiver antenna Ground-based VHF radio transceiver 2ch Audio Recorder Interface Circuit Audio Player/Recorder GPS receiver with time output (1PPS) Ground system Figure 8.2.

3. and tracking GPS (bottom right).: Complete voice channel measurement system on-board an aircraft. 8. and is mostly automated in Matlab.4. the corresponding files.2. Incorporation of Side Information The original files are manually sorted into a directory structure and labeled with an approximate time-stamp. Data Processing The post-processing of the recorded material consists of data merging. with audio player and recorder. synchronization. synchronization GPS (top left).5. the transmission directions. from which longitude. Based on the data structure. the time-stamps and durations are transferred into a data structure to facilitate automatic processing. It follows an outline of the different processing steps. alignment. 116 . also considering the GPS coordinates of the airport tower. processing. latitude.1. The handwritten notes about scenarios. which is described in Section 8.Chapter 8. The speed values are smoothed over time using robust local regression smoothing with the RLOESS method with a span of ten values or 20 s [141]. the start time of each file is estimated. altitude and time-stamp can be extracted [140]. The aim is to build up a database of annotated channel measurements. The remaining parameters are computed out of these values. and annotation.2. small interruptions are made every 30 seconds and every couple of minutes due to technical and operational constraints. interface box. The GPS data is provided in the widely used GPX-format and is imported into Matlab by parsing the corresponding XML structure. 8.5. Voice Radio Channel Measurements Figure 8.

8.: Overview and circuit diagram of measurement system and interface box.2.4. Measurement System for the Aeronautical Voice Radio Channel VHF Voice Radio Mic IN Phones OUT Mic IN + PTT Audio IN/OUT T R S Signal Wiring Chassis and Signal GND Battery V+ Battery GND Test Wiring GPS Config Audio OUT T R S Audio IN T R S Audio OUT PTT OUT T R S Battery + - 100k PTT G 1 2 3 4 5 6 7 8 9 G 330R PTT 6k8 Tx Rx 22µ 100k 22n 100k GPS T S T R S T R S 1 2 3 4 5 6 7 Audio IN Audio OUT (selected) SyncOUT GPS I/O Audio OUT Audio IN 1 Audio IN 2 Audio Player Audio Recorder GPS Receiver Figure 8. Serial I/O GPS and Test 117 .

2. which is one sixth of a sample in terms of the original MLS sequence. Frame Segmentation and Channel Response The recording of the received signal with a sample rate of 48 kHz is split up into frames of 378 samples. For numerical stability the 118 . and then with the local position in-between two 1PPS pulses. | X (ωk )|2 (8. The alignment is accurate to one 48 kHz sample. the actual DTMF symbol is determined by the maxima of the NDFT output values at the corresponding position. Voice Radio Channel Measurements 8. From the 1PPS recording. The corresponding frame in the recording of the transmitted signal is found by the previously computed time-variant offsets between the 1PPS locations in the two recordings. the non-uniform discrete Fourier transform (NDFT) is computed at the specific frequencies of the DTMF tones [142].Chapter 8. Through this alignment the synchronization between transmitted and recorded signal is reestablished. As it is known that the pulses are roughly one second apart. The time difference in samples between two adjacent pulses defines the effective sample rate at which the file was recorded.5.2. Y. respectively [143]. The tones are detected in the power spectrum of the NDFT by correlation with a prototype shape. the sample rate can be recovered by means of the 1PPS track. Again after filtering for false hits.1) where X. Detection of DTMF Tones and Offset For detecting the DTMF tones. Their average offset in samples is computed and verified with the offset that is obtained from the data structure with the manual annotations. and H are the discrete Fourier transforms of the channel’s input signal. For the aircraft and ground recordings. output signal and impulse response. Based on the realigned transmission input-output pairs. 8. Based on those the sample rate of the audio player can also be estimated. and missing pulses can be restored by interpolation.3. the positions of the pulses are identified by picking the local maximum values of a correlation with a prototype shape. which is the length of one MLS. It is therefore necessary to compensate for the clock rate differences.2. The detected DTMF sequences of corresponding air/ground recording pairs are then aligned with approximate string matching using the Edit distance measure [108]. 8.5. false hits can be removed.2. the channel frequency response for every frame is estimated based on FIR system identification using the cross power spectral density (CPSD) method with H ( ωk ) = X ( ωk ) · Y ( ωk ) . as the duration and relative position of the tones is known from the input signal. These rough offsets are refined using the more accurate 1PPS positions of both recordings. It was found that the sample rate is fairly constant but differs among the different devices by several Hertz.5. Analysis of 1PPS Signal and Sample Rate The recorded files are converted into Matlab-readable one-channel Wave files using a PRAAT script [77].4.

The I/Q monitoring receiver continuously recorded every transmission.3. a Rohde&Schwarz EM-550 monitoring receiver which provides I/Q demodulation of the received AM signal was applied.5.5). respectively. The I/Q data with a bandwidth of 12 kHz was digitally recorded onto a laptop computer. Conducted Measurements A number of ground-based and in-flight measurements were undertaken with a general aviation aircraft at and around the Punitz airfield in Austria (airport code LOGG). as the channel was excited only up to a frequency of 4 kHz.: General aviation aircraft and onboard transceiver used for measurements. For the monitoring receiver a λ -ground plane antenna with a magnetic base was mounted onto the roof of 4 a van. All frames are annotated with the flight parameters and the contextual information of the meta data structure using the previously computed time-stamps and offsets. (b) Aircraft transceiver. 119 . measurements were undertaken for two days in May 2006. 8. It is permanently installed in the airport tower in a fixed frequency configuration.3.20 MHz. Figure 8. On ground. Synchronization is to a certain extent nevertheless possible using the DTMF markers in the transmitted signal. with the tower or the aircraft radio transmitting. For reference and comparison. After initial trials at the end of March 2006. signals are downsampled to a sample rate of 8 kHz before system identification. The aircraft used was a 4-seat Cessna F172K with a maximum speed of approximately 70 m/s and a Bendix/King KX 125 TSO communications transceiver (Figure 8.6). the communications transceiver Becker AR 2808/25 was used (Figure 8. Measurements were taken both in uplink and downlink direction.8. Conducted Measurements (a) Aircraft during measurements. A more accurate channel model and estimation scheme based on the measurement results is presented in Chapter 9. The measurements occurred on the AM voice radio channel of Punitz airport at a carrier frequency of 123. however without a 1PPS synchronization signal. The initial trials were undertaken with a different aircraft.

Chapter 8.6. Figure 8. Voice Radio Channel Measurements (a) I/Q measurement receiver. 120 . (b) Airport tower transceiver.: Ground-based transceivers used for measurements.

at/TUG-EEC-Channels/ The TUG-EEC-Channels database covers the channel measurements described in Section 8. The ground based measurements were concluded with the rolling of the aircraft on the paved runway at different speeds. For each frame of 8 ms (corresponding to 378 samples at 48 kHz) the following data and annotation is provided: • Time-stamp • Transmission recording segment • Reception recording segment • Type of transmitted signal 121 . the TUG-EEC-Channels database. Static measurements were undertaken with the aircraft at several positions along a straight line towards the tower.4.8. The ground measurements consisted of static back-to-back measurements with the aircraft engine on and off. landings. These covered the following situations and parameter ranges: • Low and high altitudes (0m to 1200m above ground) • Low and high speeds (up to 250 km/h) • Ascends and descends • Headings perpendicular and parallel to line-of-sight • Overflights and circular flights around airport • Take-offs.4. The TUG-EEC-Channels Database A series of measurement scenarios was set up which covered numerous situations. spaced out by 30 cm. approaches and overshoots • Flights across. A variety of flight maneuvers was undertaken. coordination and cooperation between all parties involved proved to be crucial for a successful accomplishment of measurements in the aeronautical environment. It is available to the research community under a public license free of charge online at the following address: http://www. along and behind a mountain • Flights in areas with poor radio reception (no line-of-sight) Thorough organizational planning.TUGraz. 8. The TUG-EEC-Channels Database Based on the above measurements we present an aeronautical voice radio channel database. Additional measurements were taken with a vehicle parked next to the aircraft.3 with three extensive flights and several ground measurements with a total duration of more than five hours or two million MLS frames.spsc.

8. but the magnitude varies over time. Upon request. Figure 8.: Normalized frequency response of the static back-to-back voice radio channel in up. This time variation results from a synchronization error which is compensated in the more accurate model of Chapter 9.7. the raw MLS recordings as well as a number of voice signal transmission recordings and voice and background noise recordings made in the cockpit using conventional and avionic headset microphones can also be provided.5. longitude. ) • Plain-text comments An exact description of the data format of the TUG-EEC-Channels and additional information can be found on the website.Chapter 8. The following figures present several excerpts of the database.and downlink direction. distance to tower. Figure 8. . latitude. speed. The basic shape of the frequency response is maintained. In Figure 8.9 illustrates a GPS track. • I/Q recording segment • Link to original sequence • Uplink/downlink direction • Estimated channel impulse response • Flight parameters elevation.8. . which is mostly determined by the radio transceivers. Voice Radio Channel Measurements Downlink Channel 5 0 Magnitude in dB −5 −10 −15 −20 −25 −30 0 1000 2000 3000 Frequency in Hz 4000 Magnitude in dB 5 0 −5 −10 −15 −20 −25 −30 0 1000 2000 3000 Frequency in Hz 4000 Uplink Channel Figure 8. and azimuth relative to line-of-sight • Experimental situation (engine on/off.7 shows the magnitude response of the transmission chain in both directions. aircraft on taxiway or flying. Conclusions We presented a system for channel measurements and the TUG-EEC-Channels database of annotated channel impulse and frequency responses for the aeronautical VHF voice 122 . the magnitudes of selected frequency bins of the downlink channel are plotted over time. .

5.5 1000 1500 2000 2500 3000 0 Time in s Frequency in Hz Figure 8. 1778. 1397. Conclusions Downlink Channel 0 Magnitude in dB −5 −10 1 −15 500 0.: Visualization of the aircraft track based on the recorded GPS data. 2540 and 2921 Hz over time.7 m at a position far behind the s mountain and an altitude of 1156 m. Aircraft speed is 44.9.8. Figure 8. 2159. 1016.8.: Evolution of the frequency bins at 635. 123 .

for taking charge of the local organization of the measurements. Beyond that. Voice Radio Channel Measurements channel. we want to recognize all the other parties that helped to make the measurements possible. We hope that the aeronautics community will make extensive use of our data for future developments. Flyers GmbH and Steirische Motorflugunion for providing the aircraft. it would be interesting to see what conclusions with respect to parameters of the aeronautical radio channel models can be drawn from the data. and Dr. Acknowledgments We are grateful to FH JOANNEUM (Graz. we present a further analysis and abstraction of the data and propose a model of the aeronautical VHF voice channel based on the collected data. We also want to thank the Union Sportfliegerclub Punitz for the kind permission to use their facilities. 124 . The campaign was funded by EUROCONTROL Experimental Centre (France). Holger Flühr and Gernot Knoll in particular. and Rohde & Schwarz for contributing the monitoring receiver. Finally. Austria).Chapter 8. which would also help to verify the validity of our measurement scenarios. The current data can be directly applied to simulate a transmission channel for system test and evaluation. In Chapter 9.

a time-variant gain. We also provide simple methods to estimate and compensate the sampling frequency offset and the time-variant gain. and a corresponding estimation method. To estimate the model parameters. we compare six well-established FIR filter identification techniques and conclude that best results are obtained using the method of Least Squares. such as different filter responses. a DC offset and additive noise. we propose a data model to describe the data in the TUGEEC-Channels database. a sampling frequency offset. Applying the model to select parts of the database.Chapter 9 Data Model and Parameter Estimation In this chapter. we conclude that the measured channel is frequency-nonselective. The data contains a small amount of gain modulation (flat fading). The model is derived from various effects that can be observed in the database. The data model achieves a fit with the measured data down to an error of -40 dB. Its source could not be conclusively established. The observed noise levels are in a range from 40 dB to 23 dB in terms of SNR. with the modeling error being smaller than the channel’s noise. 125 . This chapter presents recent results that have not yet been published. but several factors indicate that it is not a result of radio channel fading.

1). Time Variation There is a systematic time-variation of the impulse response estimates of (8. A manual inspection of the recorded signals and the estimated channel responses revealed that the simple time-variant filter channel model underlying the CPSD estimates of (8. which fully affects the CPSD channel estimates.Chapter 9.1.. and provides a mean to estimate the noise in the measured channel by incorporating neighboring frames. Figure 9. in the evaluation of the watermarking system in Chapter 10. We will apply this channel description. Introduction The TUG-EEC-Channels database contains an estimated channel impulse response for every measurement frame based on the cross power spectral density of the realigned channel input and output signal of the corresponding frame. k ). Measuring the recordings of the received signal with the same sliding window reveals an amplitude (gain) modulation. we propose an improved channel model and a corresponding estimation method that provides stable and accurate estimations with low errors. addresses the irregularities in the channel estimates contained in the database. This set of impulse responses can itself form a channel description based on the assumption of a timevariant filter channel model. 126 . The DC offset of each impulse response has been removed a-priori by subtracting the corresponding mean value. In the remainder of this chapter. We found the following signal properties that are not represented in the basic model. Gain Modulation Measured in a sliding rectangular window with a length of one frame (i.1) may not be an adequate representation for the input/output relationship of the data and may lead to wrong conclusions. even in the case of the aircraft not moving. there is a sampling frequency offset between the recordings of the transmitted and received signals. respectively). among others. the transmitted signal is due to its periodicity of exactly constant power. DC Offset There are significant time-variant DC offsets in the recorded signals and the estimated impulse responses.1 shows this for a short segment of a static back-to-back measurement with the aircraft on ground (frame number 2012056 to 2013055). 63 samples or 378 samples at a sample rate of 8 kHz or 48 kHz. Additive Noise The recorded signals are corrupted by additive noise. Sampling Frequency Offset Albeit small. We denote the n’th sample of the estimated impulse response of the k’th frame with h(n. We apply the method to select parts of the measurement database to demonstrate the validity of the model and to estimate its parameters. which appears to be approximately sinusoidal with a frequency of 40–90 Hz. Data Model and Parameter Estimation 9.e.

k) Figure 9.1 0 −0.3 −0.1.: Time-variation of the estimated impulse responses h(n.3 0.k) 0.1 −0.k) h(5. Introduction (a) Mean and standard deviation of the coefficients over all frames.2 −0.1 0 −0.1 −0.k) 0.1. k ) in the TUG-EECChannels database for a static back-to-back measurement over 1000 frames (8 s).4 0 200 400 600 Frame index k 800 1000 h(3. 127 . 0.4 0 10 20 30 40 Sample index n 50 60 h(n.k) h(7.2 −0.3 Impluse response h(n.3 −0.4 0.9.4 Impluse response sample h(n.2 0.2 0. 0.k) (µ±σ over all frames k) (b) Time variation of selected coefficients over all frames.

no matter their content.2 by separating the measurement error and voice radio channel models. Data Model and Parameter Estimation TX Measurement Errors (Transmitter-Side) Voice Radio Channel Measurement Errors (Receiver-Side) RX Figure 9.1. and to the interface box circuits in particular. and the VHF wave propagation channel. We attribute the DC offset to deficiencies in the measurement hardware. the radio receiver.1 results from an insufficient synchronization between the recordings of the transmitted and the received signals. it is a bandpass channel and does not transmit DC signal components. One can observe two particular filter characteristics in the database that solely depend on the transmission direction (air/ground or ground/air) and.Chapter 9. and we make the following assumptions. It successfully does so with an accuracy of one sixth of a sample in terms of the original MLS chip rate. As such. In fact.: Separation between voice radio channel model and measurement errors. using the GPS timing signals. the results will show that this accuracy is not sufficient for obtaining low estimation errors. It is often difficult to conclusively establish if an observed effect is caused by the measurement system or the radio channel. since in laboratory measurements the frequency response of the measurement system was verified to be flat compared to the observed filtering. The database aims to synchronize all signal segments. The sampling frequency offset can be clearly attributed to the measurement system. By specification. 9. and consider the time-variant DC offset (measured within a window of one measurement frame) as a measurement error. the voice radio channel includes the radio transmitter. Assumptions It is important to take into consideration that part of the observed effects might be measurement errors.1 results from a slight drift between the channel input/output frame pairs. on the transmitter and receiver used. When the offset between a pair is larger than half a sample 128 . since the sampling clocks of the audio devices were not synchronized and the signals replayed and recorded with slightly offset sampling frequencies. the constant slope in the lower plot of Figure 9.2.2. We assume that the filtering characteristics of the system can be mostly attributed to the voice radio channel. as such. We conjecture—and later results will confirm—that the systematic time-variation of the TUG-EEC-Channels estimates shown in Figure 9. However. 9.2. This is represented in Figure 9. It appears that the filtering is dominated by the radio transmitter and receiver filters. Proposed Data and Channel Model We propose an improved channel model that better represents the measured data by incorporating the aforementioned effects.

1). Nevertheless.: Proposed data and channel model with a linear and block-wise timeinvariant filter H. Parameter Estimation Implementation Voice Radio Channel x yH w(n) yHw g(n) yHwg c(n) yHwgc yHwgcR TX H Resampling RX Measurement Errors Figure 9. the offset is automatically compensated by shifting the input by one sample. 9. Parameter Estimation Implementation Before Section 9.2. We present a simple sampling clock synchronization method using resampling (Section 9. For example. We merge the transmitter.3.2). additive noise w. For example.9. and there is electromagnetic interference from the aircraft avionics.3. The observed gain modulation could be a measurement effect or a channel property. 129 .2.3. there is small cross-talk between the recorded MLS and GPS signals due to deficiencies in the measurement hardware. Also.3 shows the implementation of an estimation method for the channel model shown in Figure 9.1.3. k ) visible in Figure 9. y Hw denotes the input signal after having passed the filter H and the addition of the noise w.3.3. we first address two sub-problems. and the depicted arrangement represents one possible choice. time-variant gain g and time-variant DC offset c.3. which leads to the discontinuities in h(n. and we will discuss this issue in more detail at the end of this chapter.and receiver-side measurement errors into one block. 9. the measurement system also contributes to the overall noise level. The overlap between the voice radio channel model and the measurement error model represents the uncertainty in the attribution of the components in the overlapping segment to one or the other category. we propose a data model as shown in Figure 9. We expect that most of the additive noise results from the voice radio channel. and compare different methods to estimate a linear and time-invariant channel filter given a known channel input and a noise-corrupted channel output (Section 9.3. The subscripts of the signal y denote the model components that the signal has already passed. Proposed Model Based on the above assumptions. there is an uncertainty in the order of the model elements. (at a sample rate of 48 kHz).

the presented method is simple and provides sufficient performance given the task at hand. . Since the signal was upsampled to a sampling frequency f 0 = 48 kHz. x M −4 x M −2 x0 x1 . . . . A manual inspection of the recordings showed that the sampling frequencies of the different devices are different among each other.875 ms. and 2 are plotted in Figure 9. which is advantageous for numeric maximization. and present a simple resynchronization method in which we leave aside the transmitted signal recordings. The signal was replayed at an unknown sampling frequency close to 48 kHz. assuming M frames within one block of y1 (n). We determine the optimal f 2 or. . . In order to compensate for the two unknown sampling frequencies. .4 for an exemplary block of the database (frame-ID 2011556 to 2013057). Self-Synchronization of the Received Signal by Resampling In the following. We define two scalar error functions g1 ( γ ) = g2 ( γ ) = x 0 x 2 .Chapter 9. x M −3 x M −1 T T x M/2 x M/2+1 . In the following we assume that the sampling frequencies are constant within one block. y2 (kN0 + 1). we focus on signal segments in the database containing MLS signals. but exhibits many local maxima. In contrast. f2 We maximize g1 and g2 as a function of the resampling factor γ = f0 in the following way: 130 . The functions g1 and g2 are similar to an autocorrelation of y2 at lags N0 and N0 M . . . Let the vector xk = [y2 (kN0 ). y2 (kN0 + N0 − 2). A block denotes a recording fragment that contains a continuous MLS signal.3. the MLS signal is periodic with a signal period N0 = 6 · 63 = 378 samples 378 or T0 = 48000 s ≈ 7. y2 (kN0 + 2). x M−2 x M−1 . and accurately resample the received signal recordings to the chip rate of the original MLS.1. the resampling f2 factor γ = f0 using two autocorrelation-based periodicity measures and a three-step local maximization procedure. its maximum peak is very wide and thus prone to the influence of noise. y2 (kN0 + N0 − 1)] denote the k’th frame of the resampled signal (with k in the range from 0 . . M-1. and typically has a length of 20 s to 30 s. . g2 has a very narrow and distinct peak. While we do not claim any optimality herein. respectively). . and recorded at a sampling frequency that is again close to 48 kHz but unknown. transmitted over an analog channel. and M even). we transform the recorded received signal y1 (n). denoted RX in the database and corresponding to y HwgcR in Figure 9. While g1 has a smooth and parabolic shape. x M/2−2 x M/2−1 x 1 x 3 .3. . . Data Model and Parameter Estimation 9. equivalently. into a cubic spline representation and resample the spline with a sampling frequency f 2 such that the signal period of the resulting y2 (n) is again exactly N0 (or T0 . The original MLS signal used in the measurements has a length of 63 chips at a chip-rate of 8000 chips/s. but stable within segments of several minutes.

opt 0.98 0.998 0.001 1.995 1 1.015 1. opt −2 0.999 0.4.01 Frequency scaling f2 : f0 1.005 1.02 (b) Periodicity Measure g2 x 10 4 Periodicity measure g2 3 2 1 0 −1 12 g2 f2.99 0.002 Frequency scaling f2 : f0 f2 f0 .985 0. Figure 9.: Two periodicity measures as a function of the resampling factor 131 .0005 1.9995 1 1.0015 1.9.3. Parameter Estimation Implementation (a) Periodicity Measure g1 5 Periodicity measure g1 4 3 2 1 x 10 12 g1 f2.9985 0.

Hofbauer. 1. Germany. Estimation Methods We compare the estimation accuracy of the following methods. and with the circumflex or hat accent ( ˆ ) estimated variables.und Raumfahrtkongress). γ2 + δ]. Gruber and K. 2007.” in Proceedings of the CEAS European Air and Space Conference (Deutscher Luft. 9.06 and a grid search with a search interval of 0. Comparison of Filter Estimation Methods We compare six well-established methods to estimate the channel impulse response of the filter component H of the model in Figure 9.opt at which the spline is resampled to form y2 (n). Maximization of g2 over γ in the interval γ = [γ1 − 2δ. and resulting in the final estimates γopt and f 2. 1 Parts of this subsection are based on M.02] using golden section search and parabolic interpolation (Matlab function ‘fminbnd’. 132 . The experiments were performed by Mario Gruber in the course of a Master’s thesis [145]. again using golden section search and parabolic interpolation. Sep.5.2.5). we assume a linear and time-invariant (LTI) filter plus noise model to characterize the relation between input and output. Data Model and Parameter Estimation x H yH w yHw Figure 9. 1. the output signal y Hw (n) and additive white Gaussian noise (AWGN) w(n) (see Figure 9. resulting M in the estimate γ2 . [144]). The frequency offset f 2 − f 0 for different measurement signal segments is included in Table 9. Maximization of g1 over γ in the interval γ = [0. The model is characterized by the filter’s impulse response h(n) or frequency response H (e jω ). γ2 + 2δ] using a manually tuned δ = 5 · 10−5 + 0. resulting in the estimate γ1 .1 Filter Plus Noise Channel Model For the comparison of the different estimators.Chapter 9. the input signal x (n).3. Berlin. 3. “A comparison of estimation methods for the VHF voice radio channel.98. in terms of accuracy and noise robustness.: LTI filter plus noise channel model.3.2. with N the number of input and output samples used for the channel estimation. Maximization of g2 over γ in the interval γ = [γ2 − δ.1δ. 2. We denote with P the order of the filter.

  . . ˆ y Hw (n) = x (n) ∗ h Power Spectral Densities [146] ˆ The frequency response H (k ) can also be obtained using c Frequency domain s Gxy (k ) = Gxx (k) H (k ). Parameter Estimation Implementation ˆ Deconvolution The filter’s impulse response h(n) is estimated recursively using [146] p 1 ˆ ˆ h( p) = y Hw ( p) − ∑ x (m)h( p − m) . . which is obtained using [146] Time domain Frequency domain ˆ (n) c s Y (k) = X (k) H (k) . the Toeplitz structured convolution matrix X is   x ( P − 1) · · · x (1) x (0)  x ( P)  ··· x (2) x (1)   X=  . x ( N − 1) · · · x ( N − P + 1) x ( N − P )  y Hw ( P−1 ) 2 y Hw (1 + P−1 ) 2 .  and the output vector y is   y=  y Hw ( N − 1 − P −1 2 )   . .1) Using the covariance windowing method.. . x (0) m =1 ˆ Spectral Division The impulse response h(n) is estimated by an inverse discrete ˆ Fourier transformation (DFT) of the frequency response H (k ). ˆ Time domain ˆ R xy (n) = R xx (n) ∗ h(n) N −1 where R xx (n) = ∑m=0 x (m) x (m + n) is the auto-correlation of the input x (n) and N −1 R xy (n) = ∑m=0 x (m)y Hw (m + n) the cross-correlation between input x (n) and the output y Hw (n). . .9. (9. . 133 . or [148] ˆ ˆ ˆ h(n) = δ(n) ∗ h(n) ≈ R xx (n) ∗ h(n) = R xy (n).3. then R xx (n) ≈ δ(n) and the cross-correlation R xy (n) between input x (n) and output y Hw (n) approximates the channel’s impulse response. .  Maximum Length Sequence If the input signal x (n) is a maximum length sequence (MLS). . . . and Gxx (k ) and Gxy (k ) the corresponding power spectral density (PSD) and cross-PSD. Method of Least Squares the normal equation [147] ˆ The impulse response vector h is given by the solution of ˆ h = (XT X)−1 XT y.

The estimation error for all methods and for the two input signals with and without noise w(n) is shown in Table 9.Chapter 9. the estimation error has a minimum at P ≈ PLP . In the case of large N (i. Experimental Results We use as error measure the error in the filtered signal obtained with the original and the estimated filter. Estimation Performance Comparison Experimental Settings We estimate P = 63 filter coefficients using as input N = 214 samples of an audio signal with a sample rate f s = 22050 Hz or a repeated MLS signal. In the presence of noise.4 kHz. The approximation error can be fully compensated with [148] ˆ h(n) = 1 L+1 R xy (n) + P −1 m =0 ∑ ˆ h(m) . The index RMS indicates the root mean square value of the corresponding signal.1. it is possible 134 . the estimation accuracy increases with increasing N (Figure 9.7). and AWGN w(n) added at an SNR of 10 dB. N P) and using the method of Least Squares. Data Model and Parameter Estimation The approximation R xx (n) ≈ δ(n) is only valid for long MLS. affects the observed estimation error.. In the presence of noise the method of Least Squares outperforms all other methods.6 shows that in the noise-free case the estimation is error-free as soon as P > PLP (using the same experimental settings as above. and Least Squares estimation). This result is consistent with estimation theory.e. since for large P the estimator begins to suffer more from the channel noise due to the random fluctuations in the redundant filter coefficient estimates. RMS y H. it attains the Cramer-Rao lower bound) given the linear model and additive white Gaussian observation noise [147]. influences the observed estimation error. so far set to P = 63. as the method of Least Squares is an efficient estimator (i. Estimation Parameters and Noise Estimation The number of estimated filter coefficients. with SNR = 20 log10 ( x (n) ∗ h(n))RMS w RMS dB.e. Figure 9..RMS dB. The signal x (n) is filtered using an FIR lowpass filter of order 60 (PLP = 61) and a cutoff-frequency of 4. It is given by ˆ ˆ ey ( n ) = y H ( n ) − y H ( n ) = x ( n ) ∗ h ( n ) − x ( n ) ∗ h ( n ) and expressed as the ratio Ey = 20 log10 ey. Assuming a time-invariant channel and using the same experimental settings as before. The number of input samples. so far set to N = 214 samples.

: Filter Estimation Results Estimation error to output ratio Ey for different estimation methods. Parameter Estimation Implementation Table 9. The original filter has PLP = 61 coefficients.7 dB -10 dB (a) Noise-Free Channel 0 −50 −100 Ey in dB −150 −200 −250 −300 −350 0 20 40 60 80 100 P (number of estimated coefficients) 120 Ey PLP (b) Noise-Corrupted Channel 0 Ey −10 Ey in dB PLP −20 −30 −40 0 20 40 60 80 100 P (number of estimated coefficients) 120 Figure 9.: Estimation error to output ratio Ey as a function of the number of estimated coefficients. and with and without observation noise w. 135 . Estimation Method Deconvolution Spectral Division PSD Least Squares MLS MLS (compensated) Audio < -150 dB < -150 dB < -150 dB < -150 dB n/a n/a Audio+Noise > +100 dB -16 dB -21 dB -34 dB n/a n/a MLS < -150 dB -68 dB < -150 dB < -150 dB -27 dB < -150 dB MLS+Noise > +100 dB > +100 dB +8 dB -34 dB -9.1.3.6.9. input signals.

1. we have developed an estimation method for the channel model of Figure 9. the estimation consists of the following steps.Chapter 9. The resampled signal is decimated (including anti-alias filtering) by a factor of six to the chip-rate of the original MLS sequence.2) has an estimation error ˆ e w ( n ) = w ( n ) − w ( n ). Ew = 20 log10 ew.3. in comparison to the frame-based PSD estimator used for the database. which is needed in the channel estimation method presented hereafter. is resampled with the resampling factor γopt to 48000 Hz. 9.1.8. Channel Analysis using Resampling.7 as a function of the number of samples used for estimation. RMS wRMS which is again expressed as an estimation error ratio dB and shown in Figure 9. In combination with using long signal segments it also allows to estimate the channel noise.3. The recorded received signal y HwgcR (n). denoted RX in the database. using the method described in Section 9. The noise estimate ˆ ˆ ˆ w(n) = y Hw (n) − y H (n) = y Hw (n) − x (n) ∗ h(n) (9. We conclude that. As depicted in Figure 9.3. to estimate not only the filter coefficients h(n). 2.7. but also the channel noise w(n). Data Model and Parameter Estimation 0 −5 −10 E in dB −15 −20 −25 −30 −35 0 1000 2000 3000 4000 5000 N in samples 6000 7000 Ew Ey Figure 9. Gain Normalization and Least Squares Estimation Based on the results of the previous two subsections. 136 . the Least Squares method is a more suitable estimator. Pre-Filtering.: Signal and noise estimation errors Ey and Ew at an SNR of 10 dB as a function of the number N of input and output samples ( f s = 22050 Hz) used for estimation.3. We focus only on measurement data blocks containing MLS signals.

2 and computed with Matlab’s backslash operator.9. and additive noise based on measured data and with compensation for measurement errors.: Implementation of the gain normalization.: Estimation of the channel’s impulse response.3.. we apply the analysis system presented in the previous section to select parts of the database that cover a wide range of experimental conditions. y Hw (n) as noise-corrupted output. The channel noise w(n) is estimated using (9. The DC offset c(n) is estimated using a lowpass filter of order 1000 and a cutoff frequency of 40 Hz.)-1 Figure 9. To remove the DC offset. The channel filter’s impulse response h(n) is estimated using the original binaryˆ valued MLS signal xMLS (n) of the database as input. 2 It is assumed that the filter is time-invariant within one block. ˆ Hwg y ˆ g ˆ -1 g ˆ Hw y |. 5.| (.8. 3. The time-variant gain g(n) is estimated with a conventional envelope follower (shown in Figure 9. Experimental Results and Discussion MLS xMLS RX ˆ yHwgcR Resampling (48.0 kHz) ↓6 ˆ Hwgc y ˆ Hwg y ˆ Hw y Gain Normalization Filter Estimation ˆ w ˆ h ˆH y Figure 9.2) and ˆ ˆ ˆ ˆ ˆ w(n) = y Hw (n) − xMLS (n) ∗ h(n) = y Hw (n) − y H (n). 137 . 6. y Hwgc (n) is bandpass filtered using a bandpass filter with a passband from 100 Hz to 3800 Hz. and the method of Least Squares for estimation. time-variant gain. and using a lowpass filter with a cutoff frequency of 100 Hz). 4.4 will show that the assumption of a time-variant filter in combination with a tracking estimation method leads to overall inferior results. 9. as discussed in Section 9.2 7..9. Section 9. ˆ ˆ The signal y Hwg (n) is then normalized to unit gain using the gain estimate g. Experimental Results and Discussion To show the validity of the analysis system and to obtain insight in the channel properties.4. and low and high frequency noise outside the frequency ˆ range of interest.9.4...

Chapter 9. and calculates an ˆ estimate h(n. the ‘Windowed LS’ method performs a Least Squares estimation within a window of five frames (315 samples). is a strong measure for the suitability of the entire channel estimation ˆ scheme.998. The table also shows the estimation error when tracking h(n. Assuming N P. we conclude that the bursts are not an estimation error but are present in the signal. there is a sampling frequency drift or gain modulation that is not appropriately compensated. a manual inspection of the error signals did not reveal any significant temporal changes within a data block. which is at the same time the error signal of the Least Squares estimation. the full data block is used for the estimation of h(n).4. as well as its spectral and acoustical quality. We define the estimated signal-to-noise ratio (and error measure) as ˆ wRMS SNRest = 20 log10 dB. At low noise levels (e. If. which is another justification for the assumption of a time-invariant channel filter h(n). the timevariation of the impulse response is smaller than what can be measured given the channel noise. However. Due to their periodic occurrence it appears likely that the bursts result from electromagnetic interference within the measurement system itself or between the measurement system and other components of the aircraft or ground systems. Apart from this. data block number 1). This means that the performance improvement obtained by using as many samples as possible for the estimation (cf. for example. Data Model and Parameter Estimation 9. Table 9. The origin of these noise bursts is unknown.2 and the estimated frequency responses as shown in Figure 9. ˆ y H. At medium noise levels 138 . Filter and Noise Estimation ˆ The noise estimate w(n). since the same noise bursts are audible in recordings of signal pauses of the database when no MLS signal was transmitted. we use as error measure the overall power of the ˆ error signal w(n). Figure 9. which assumes that h(n) is time-invariant. t) for every sample of the data block. results in the estimated SNRs as summarized in Table 9.g. In other words. t). Thus. but only noise. t) with an exponentially weighted recursive-least squares (RLS) adaptive filter [87] with a forgetting factor λ = 0. and as such on the transmitting and receiving voice radios. the error signal w(n) is expected to contain no MLS signal components. very little other variation of the frequency response among the different blocks was observed.. A manual inspection of the estimated responses showed that the frequency response mainly depends on the transmission direction (air/ground or ground/air). there is no good overall Least Squares fit. In fact. Also. To accommodate for potential slow time-variations of h(n. The results show that the assumption of a time-invariant h(n) leads to the lowest estimation error. and a significant fraction of the (modulated) ˆ MLS signal resides in w(n). In the ‘Least Squares’ (LS) method described so far.1.7) outweighs the performance degradation induced by the time-invariance assumption. parts of the proposed channel model were inspired ˆ by residuals observed in w(n).10.2 also provides estimation results for two alternative estimation methods. RMS Applying the proposed estimation method to the database using P = 63 and always the full data block for input and output. the error signal contains short repetitive noise bursts with a constant frequency of approximately 6 Hz.

139 . Experimental Results and Discussion (a) Air/Ground Transmission 10 Magnitude Response (in dB) 0 −10 −20 −30 0 500 1000 1500 2000 2500 Frequency (in Hz) 3000 3500 4000 Max.4. Response Min. (all blocks) (data block nr.: Estimated frequency responses. 1) (all blocks) Phase Response (in radians) 0pi −2pi −4pi −6pi 0 500 1000 1500 2000 2500 Frequency (in Hz) 3000 3500 4000 (b) Ground/Air Transmission 10 Magnitude Response (in dB) 0 −10 −20 Response −30 0 500 1000 1500 2000 2500 Frequency (in Hz) 3000 3500 4000 (data block nr.9. 3) 0pi Phase Response (in radians) −8pi −16pi −24pi −32pi 0 500 1000 1500 2000 2500 Frequency (in Hz) 3000 3500 4000 Figure 9.10.

8 27.3 28.8 58. Doppler Shift Neg.: Filter and Noise Estimation Results Data Block Transm.10 0.3 22.2 296 n/a Off 0.92 0.1 0.3 28.6 35.Table 9.92 0.5 23.7 34.89 0.1 37.5 50.7 1467 166 On 4.5 36.2.7 40.2 1594 268 On 19.0 36.09 -0.98 -0.8 Recursive LS Number Back-to-Back Back-to-Back Back-to-Back Back-to-Back Pos.6 34.2 1342 173 On 5. Data Model and Parameter Estimation 1 2 3 4 5 6 7 8 9 10 11 12 13 2012056 2065388 1008000 4016698 2396236 2400075 2354363 2412064 2377571 2116191 2251911 4260506 3292995 2013560 2066892 1009504 4018202 2397740 2401579 2355867 2413568 2379075 2117695 2253415 4262010 3294499 A/G A/G G/A G/A A/G A/G A/G A/G A/G A/G A/G G/A A/G Off 0.9 22.99 39.92 0. Doppler Shift Pos.7 30.8 36. Doppler Shift Pos. Doppler Shift Zero Doppler Shift Med.0 297 n/a On 0.3 39.1 26.0 300 n/a On 0.1 1075 92 0.0 992 183 On 43.8 22.1 2.94 -0.2 1363 197 On 2.5 22.1 0.7 22.10 0.1 28.6 1032 3 On 7.5 35.0 36.9 30.3 29.9 41. Distance 140 .9 22.3 295 n/a On 5.3 29. Direction Distance (in km) Frame-ID (Start) Frame-ID (End) Engine On/Off Speed (in m/s) Altitude (in m) Experimental Conditions Rel.7 37.0 64.6 28.1 0. Distance Max.0 22.94 0.0 38. Azimuth (in ◦ ) f2 − f0 Difference (in Hz) SNRest (in dB) Windowed LS Comment Least Squares Chapter 9.8 28.4 27.7 22.96 0.1 58.3 30.6 26. Doppler Shift Neg.9 29.5 1343 331 On 5.9 61. Doppler Shift Neg.0 27.2 51.93 0.9 1224 26 On 3.4 29.1 27.

9.4. Experimental Results and Discussion
(a) Data Block Nr. 1 (A/G, engine off, back-to-back)
−40 Magnitude (in dB) −50 −60 −70 −80 −90 0 500 1000 1500 2000 2500 Frequency (in Hz) 3000 3500 4000

(b) Data Block Nr. 7 (A/G, engine on, flight)
−40 Magnitude (in dB) −50 −60 −70 −80 −90 0 500 1000 1500 2000 2500 Frequency (in Hz) 3000 3500 4000

(c) Data Block Nr. 13 (A/G, engine on, flight, max. distance)
−40 Magnitude (in dB) −50 −60 −70 −80 −90 0 500 1000 1500 2000 2500 Frequency (in Hz) 3000 3500 4000

Figure 9.11.: DFT magnitude spectrum of the noise estimate and estimation error ˆ signal w for different data blocks. (e.g., data block number 7) the error signal sounds similar to colored noise, and at the highest noise levels (e.g., data block number 13), the error signal is very similar to white noise. Figure 9.11 shows the corresponding power spectra of the error signals.

9.4.2. DC Offset
Figure 9.12 shows the time-variant DC offset c(n) in the recordings of the received signal, obtained by filtering data block number 1 with a lowpass filter of order 1000 and a cutoff frequency of 40 Hz. The clearly visible pulses with a frequency of 1 Hz result from interference between the voice recordings and the GPS timing signal within the measurement system, and are as such a measurement error. Even though the amplitude of this DC offset is small compared to the overall signal amplitude, the offset, if not

141

Chapter 9. Data Model and Parameter Estimation

0.01 Amplitude 0.005 0 −0.005 −0.01 0 1 2 3 Time (in s) 4 5 6

Figure 9.12.: DC component of the received signal recording. The pulses result from interference with the GPS timing signal. removed a-priori, would severely degrade the overall estimation performance.

9.4.3. Gain
In some data blocks we observed a sinusoidal amplitude modulation of the estimated ˆ gain g of the received signal. The DFT magnitude spectrum of the time-variant gain ˆ g(n) for three different data blocks is shown in Figure 9.13. In two of these cases the spectrum exhibits a distinct peak. For every data block, Table 9.3 provides the frequency location f sin of the most dominant spectral peak, as well as its amplitude Asin relative to the DC gain C, expressed in L Asin /C = 20 log10 Asin C ,
2π f sin fs n

modeling the gain modulation with g(n) = C + Asin sin quantity in terms of RMS values results in L ARMS /CRMS = 20 log10   Asin sin
2π f sin fs

. Expressing the same

CRMS

Table 9.3 also shows the overall power of the gain modulations relative to the DC gain, expressed in ˆ [ g(n) − C ]RMS L g−C/C = 20 log10 . CRMS The source of the sinusoidal gain modulations (GM) is not immediately clear. Given Figure 9.13 and Table 9.3, we observe the following facts: 1. GM is present if engine is running, only. 2. GM is present even if aircraft is not moving. 3. Frequency of GM is independent of the relative heading of the aircraft.

RMS 

≈ L Asin /C − 3 dB.

142

9.5. Conclusions 4. Frequency of GM depends on an unknown variable that has some connection with the aircraft speed. 5. Frequency of GM is significantly larger than the frequency predicted in Appendix B for gain modulations caused by multipath propagation and Doppler shift. We conclude from these facts that the observed gain modulation is not caused by the physical radio transmission channel, especially since it as also present in situations where the aircraft did not move. We believe that the gain modulation results from interference between the aircraft power system and the voice radio or the measurement system. In particular, the gain modulation might result from a ripple voltage on the direct current (DC) power supply of the aircraft radio. Such a ripple voltage is a commonly observed phenomenon and results from insufficient filtering or voltage regulation of the rectified alternating current (AC) output of the aircraft’s alternator or power generator [149]. The frequency of the ripple voltage is, as such, a linear function of the engine’s rotational speed, which is measured in rotations per minute (rpm). We conjecture that the unknown variable that the gain modulation frequency depends on, is the engine rpm. This would align well with the results of Table 9.3 with no gain modulation when the engine is off and a significant frequency difference between engine idle and engine full throttle. The magnitude of the ripple voltage as well as its impact on the communication system is expected to be system- and aircraft-specific and to be less of an issue in modern systems and large jet aircraft with an independent auxiliary power unit (APU). An alternative explanation for the gain modulation could be a periodic variation of the physical radio transmission channel induced by the rotation of the aircraft’s propeller. The rotational speed of the propeller is linked to the engine rpm either directly or via a fixed gear transmission ratio.

9.5. Conclusions
In this section, we presented an improved model to describe the data in the TUG-EECChannels database presented in Chapter 8 and a corresponding estimation method. This method allows to characterize the measured channel in terms of its linear filter response, time-variant gain, sampling offset, DC offset and additive noise, and also helps to identify measurement errors. In contrast, the channel estimates included in the database combine all effects into a single variable, namely a channel response estimated frame by frame. Given the low level of the estimation error signal and the spectral characteristics of the noise estimate, we conclude that the proposed estimation method works as expected, and that the proposed data model is a reasonable description for the measured data. As expected by theory, the analysis of the measurements confirms that the channel is not frequency-selective. The source of the observed flat fading could not be conclusively established. However, there are several factors that strongly suggest that the observed gain modulation does not result from radio channel fading through

143

Chapter 9. Data Model and Parameter Estimation

(a) Data Block Nr. 1 (A/G, engine off, back-to-back)
80 Magnitude (in dB) 60 40 20 0 −20 0 20 40 60 80 Frequency (in Hz) 100 120

(b) Data Block Nr. 4 (G/A, engine on, back-to-back)
80 Magnitude (in dB) 60 40 20 0 −20 0 20 40 60 80 Frequency (in Hz) 100 120

(c) Data Block Nr. 7 (A/G, engine on, flight)
80 Magnitude (in dB) 60 40 20 0 −20 0 20 40 60 80 Frequency (in Hz) 100 120

ˆ Figure 9.13.: DFT magnitude spectrum of the time-variant gain g of different data blocks.

144

6 dB L ARMS /CRMS -72. Frequency 1 2 3 4 5 6 7 8 9 10 11 12 13 79.0 dB -21.9 dB -25.6 dB -29.3 dB -36.4 dB -51.9 dB -44.2 dB -36.1 dB -34.2 Hz 27. which are worst-case estimations since all estimation errors accumulate in the noise estimate.4 dB -28.2 dB -33.5 dB -31.7 dB -34.7 dB -53.8 dB -27.5.9.6 dB L g−C/C -51.1 Hz 45.5 Hz 86. Conclusions Table 9.1 dB -36.3 dB -23.2 dB -42.3 dB -21.2 Hz 73.8 dB -41.3 dB -67.8 dB -36.8 Hz 86.8 dB -24.3 dB -30.2 dB -64.8 dB -56.8 Hz 80.8 dB -27.9 dB -22.1 Hz L Asin /C -69.: Dominant Frequency and Amplitude Level of the Gain Modulation Data Block Nr.3 dB -27.5 Hz 65.3. and the real channel noise level might in fact be smaller.1 dB -39.3 dB -39.1 dB -39.6 Hz 83.7 Hz 81.8 dB -31.6 dB -31.4 Hz 86.7 dB multipath propagation but from interference between the aircraft engine and the voice radio system.8 dB -37.6 dB -26. 145 . The observed noise levels are in a range from 40 dB to 23 dB in terms of SNR.0 Hz 86.8 dB -30.9 Hz 43.

Data Model and Parameter Estimation 146 .Chapter 9.

This chapter presents recent results that have not yet been published. We experimentally demonstrate the robustness of the method against filtering. desynchronization.Chapter 10 Experimental Watermark Robustness Evaluation In this chapter. we make use of the channel model derived in Chapter 9 to evaluate the robustness of the proposed watermarking method in the aeronautical application. Furthermore we show that pre-processing of the speech signal with a dynamic range controller can improve the watermark robustness as well as the intelligibility of the received speech. gain modulation and additive noise. 147 .

1 confirm the robustness of the watermarking method against this time-variant filtering attack. 8.3. flight’.3 demonstrated the robustness of the watermarking method against linear and time-invariant filtering using a bandpass filter.2. gnd/air’). 10. one with the aircraft being static on ground (‘Ch. we use the same experimental setup and settings as described in Section 4. This is a worst-case scenario. The evaluation is an explicit continuation of the experimental results of Section 4.1.4. Time-Variant TUG-EEC-Channels Filter We evaluated the robustness against the time-variant CPSD-based channel estimates of Section 8. ground’. Frame-ID 2396236 to 2399927).1 and indicate that the method is robust against both filters (denoted ‘Ch. both with transmission in air/ground direction.Chapter 10. we used the two distinct channel response filters for air/ground and ground/air transmissions as depicted in Figure 9.5 kHz to 2. Consequently. the ground/air channel has a non-flat frequency response and significant attenuation above 2.5. 9. as discussed in Chapter 9. A more suitable choice for the watermark embedding would be the configuration described in Section 4. Experimental Watermark Robustness Evaluation This chapter experimentally evaluates the robustness of the watermark system presented in Chapter 4 in light of the application to the air traffic control voice radio. As shown in Figure 9.2. the time-variation of these channel response estimates originates from an estimation error due to insufficient compensation of the sampling clock drift. 8. Estimated Static Prototype Filters We evaluated the filtering robustness using the measured and estimated aeronautical radio channel filters of Chapter 9.2. an IRS filter.10. 10. We used the estimates of two different measurement scenarios.1.5 kHz. 148 . 9. The increased BER in the ground/air transmission direction likely results from a mismatch between the used embedding band position (666 Hz–3333 Hz) and the frequency response of the ground/air channel.2. a randomly chosen estimated aeronautical channel response filter and an allpass filter.10. The filter taps obtained from the CPSD-based channel estimates (available every 7. air/gnd’ and ‘Ch. Frame-ID 2011556 to 2013968 in the TUGEEC-Channels database) and one with aircraft flying at high speed (‘Ch.1. 10. In particular. since.3 with an embedding band from 0. Filtering Robustness Section 4. These results are repeated and extended in Figure 10.1. The resulting BERs are shown in Figure 10.3.5 kHz.1. The results in Figure 10. now incorporating the obtained knowledge about the aeronautical voice radio channel.875 ms) were interpolated over time and continuously updated while filtering the watermarked speech signal.

2. The experimental conditions are the same as in Section 4. chan. 10. 9. 4) 0 149 . gnd/air) Aeron.02 Time−var. f GM = 73 Hz.2. the resulting BER is shown in comparison to an ideal channel with constant gain in Figure 10. chan. chan.2 shows the robustness of the principle embedding method against sinusoidal gain modulations of the form f ˆ sFB (n) = 1 + hGM sin 2πn GM fs sFB (n) with a modulation frequency f GM and a modulation depth hGM . 9. except that ideal frame synchronization is assumed since the implemented frame detection scheme works reliably only up to a BER of approximately 15 % (see Figure 5.8). hGM = −24. air/gnd) Figure 10. (Ch. Gain Modulation Robustness In this section. ground) Time−var.04 0.3.10. in this and all following experiments of this chapter the input speech is normalized to unit variance on a per utterance level to establish a coherent signal level across utterances. (Ch. Aeron. 8. 8. (Ch.062) and real frame detection. Using the measured worst-case modulation frequency and depth of Table 9. 11. Gain Modulation Robustness 0. flight) Bandpass filter IRS filter Ideal channel Aeron. (Ch. Sinusoidal Gain Modulation Robustness Figure 10. both systematically and in the measured aeronautical channel conditions. (Ch. In contrast to the previous experiments. the robustness of the watermarking method against sinusoidal gain modulations is evaluated. 10.2. chan.: Overall system robustness in the presence of various transmission channel filters at an average uncoded bit rate of 690 bit/s.06 0.3.1.2 dB = 0.08 Bit error ratio (BER) 0. chan.1.3 (Data Block Nr.

(Ch. frequency (in Hz) 0 0 0.3 0.6 0.1 0 300 200 100 Mod.: Overall system robustness in the presence of the measured gain modulation and a simulated automatic gain control at an average uncoded bit rate of 690 bit/s (and using the actual frame synchronization).Chapter 10.: Watermark embedding scheme robustness against sinusoidal gain modulation at an average uncoded bit rate of 690 bit/s (assuming ideal frame synchronization).4 0.8 1 Modulation depth Figure 10.4 Bit error ratio (BER) 0.01 0 Ideal channel Figure 10.02 0. 0.04 Bit error ratio (BER) 0.03 0.2. 9.3. 11) Automatic gain control . Experimental Watermark Robustness Evaluation 0.2 0. 150 Gain mod. Bk.2 0.

1 of Appendix C for given peak signal-to-noise ratios (PSNR). the transmission channel is non-linear.3. Due to the presence of gain controls in the transceivers. the aeronautical channel is in fact peak power constrained.3 with a segmental SNR of 30 dB. the automatic gain controls were in a steady state and the output signal power at some nominal level. Noise Robustness The robustness of the watermarking method against AWGN was evaluated in Section 4. perfect frame detection is assumed. and the actual peak power limit lies above this measured steady-state level.7 dB (see Table 9. In the remainder of this section. we simulated a channel with an AGC using the SoX implementation [151] of a dynamic range controller (or ‘compressor’. As a consequence. and a compression ratio of 1:4 above a threshold level of -24 dB [150]. We assume the peak power to be the signal’s maximum average power measured in windows of 20 ms. and care must be taken when defining or comparing signal-to-noise ratios. bit and frame synchronization is provided in Section 5. The SNRs of Table 9.4 are shown in Figure C.2 were measured with an input signal of constant instantaneous power.2.10. The BER curves corresponding to Figure 10.3. 150]. Desynchronization Robustness An experimental evaluation of the timing.4.2). since the worst-case measured SNR (including estimation errors that contribute to the noise level) was found to be 22. the robustness of the watermarking scheme against AWGN is difficult to specify in absolute terms in the aeronautical application. a look-ahead of 20 ms. since otherwise the BER would level off at around 15 % due to failing frame synchronization. The BER curve indicates that the method would sometimes not be robust against the measured aeronautical channel conditions.4 (‘Original signal’). To show the behavior of the principle method. However. Desynchronization Robustness 10. 151 . Simulated Automatic Gain Control Aeronautical radio transceivers typically contain an automatic gain control (AGC) both in the radio frequency (RF) and the audio domain [97. a release time of 500 ms. with the peak power level being given by the maximum allowed output power of the transmitter (integrated over a short time window). While SNR is a measure of the mean signal and noise power. The noise robustness for a wide range of non-segmental SNRs is shown in Figure 10.3. This topic is not further treated herein.3 demonstrates that the AGC does not decrease the watermark detection performance. Due to the reaction time of the gain controls. we briefly discuss and experimentally evaluate additional measures to increase the noise robustness of the watermarking method. 10. To evaluate the robustness against adaptive gain modulation. [152]) with an attack time of 20 ms. The resulting BER shown in Figure 10. a peak of a real speech signal may in practice far exceed this nominal level. 10.2.

original signal) 50 60 Original signal Original signal with silence removed [sr] sr+multiband compr. 470 bit/s (‘sr’) and 485 bit/s (‘sr+mbcl’). 10. Given the ten normalized utterances of the ATCOSIM corpus used throughout the tests of this chapter.5 0. floor (at −SNR−3dB) [wmfl] sr+mbcl + wmfl (at −SNR) + ug (+3dB) sr+mbcl + wmfl (at −SNR+3dB) sr+mbcl + wmfl (at −SNR+3dB) + ug (+3dB) Figure 10.: Watermark embedding scheme robustness against AWGN at average uncoded bit rates of 690 bit/s (‘original signal’). A simple measure to increase the noise robustness of the watermarking method is to not embed during pauses and constrain the embedding to regions where the speech signal has significant power. both calculated on the basis of the duration of the original signal.45 0. Inevitably. when assuming a fixed channel noise level. which also includes pauses and silent segments. which are defined by a 43ms-Kaiser-window-filtered intensity curve being more than 30 dB below its maximum for longer than 100ms.1. however.2 0.15 0. leading to high BERs.4 (‘silence removal’) demonstrates a significant improvement in terms of noise robustness.1 0. wrt.05 0 0 10 20 30 40 SNR (in dB. at the cost of a reduced 152 Bit error ratio . This reduces the average watermark bit rate from 690 bit/s to 470 bit/s.4 0. Experimental Watermark Robustness Evaluation 0.35 0. the local SNR in silent regions is very low and watermark detection fails. 14 % of the speech signal are detected as silence. this decreases the uncoded watermark bit rate. and limiting [mbcl] sr+mbcl + unvoiced gain (+3dB) [ug] sr+mbcl + unvoiced gain (+6dB) sr+mbcl + waterm.3 0.4. We simulate the non-embedding in pauses by removing all silent regions of the input signal.25 0. Selective Embedding The watermarking method as described in Chapter 4 embeds the watermark in all time segments that are not voiced. The resulting BER curve shown in Figure 10.Chapter 10.4. However.

: Improving watermark robustness and speech intelligibility by dynamic range compression. 10.5. Figure C. Given various channel noise levels. the compressed speech signal has much higher overall energy compared to the original signal and. The methods were compared subjectively using the modified rhyme test (MRT) and significantly improved the intelligibility of the received speech at high channel noise levels. it is of relevance if the signal to channel noise ratio is defined as SNR or PSNR. Further on.2 of Appendix C.4. The settings used for both units were chosen manually and are shown in Figure C. Given the peak power constrained transmission channel and. Since the speech signal has overall higher energy. Due to the presence of the compression system. consequently. As a consequence. thus. the resulting BER curve is shown in Figure 10. [152]) to improve the intelligibility.4. The compressor increases the SNR of the transmitted signal without increasing the peak signal-to-noise ratio (PSNR). also the watermark signal that replaces the speech signal is higher in energy relative to the channel noise if the compression precedes the watermark embedding as shown in Figure 10.2. but also the noise robustness of the watermark system. the SNR is different depending on whether the signal power is measured before the enhancement system (as the channel is peak-power constrained and both processed and unprocessed signal have the same peak power) or if it is measured at the channel input. Speech Enhancement and Watermarking A Master’s thesis [131] that was supervised by the author of this work studied and compared different algorithms to pre-process the speech signal before the transmission over the aeronautical radio channel in order to improve the intelligibility of the corrupted received speech.10.5.4 and exhibits a significant improvement.1 shows the BER curves corresponding to Figure 10. a larger perceptual distance from the constant-power channel noise. Noise Robustness Enhancing System Dynamic RangeC AT Cha Compression n n e l Watermark Embedding Enhancing System ATC A T C C Channel h a n n e l Figure 10. the dynamic range compression not only increases the intelligibility. The input speech signal without silent regions as used in the previous subsection was pre-processed by the compressor before the embedding of the watermark as shown in Figure 10.5.6 percentage points on average. bit rate. The isolated word recognition rate increased by 12. a fixed peak signal to noise level ratio. We simulate a dynamic rage controller using Audacity [153] with the Apple CoreAudio implementation [154] of a multiband compressor followed by the LASDPA implementation [155] of a limiter [152].4 but using alternative signal-to-noise ratio 153 . The study showed the high effectiveness of a dynamic range controller (or ‘compressor’. The compared systems are fully backward-compatible and operate only at the transmitting side.

Two simple measures are introduced and their effectiveness evaluated. Nevertheless it significantly improves the noise robustness of the watermark.1) to the watermarked signal sPB (n). however. the watermark floor becomes audible. At the receiving end all receivers (including legacy installations) would benefit from the intelligibility improvement. the watermark floor is imperceptible since it is masked by the channel noise.4 (‘waterm. The method allows the embedding of the watermark in a frequency band that matches the passband of the transmission channel.5. Given a watermark floor level -3 dB below the channel noise.4. at the cost of perceptual quality.1). In non-voiced regions. the additional gain does in most cases not increase the peak power of the watermarked signal. The equalizer in the watermark detector compensates for phase distortions and tracks eventual time-variations of the channel filters.3. A primitive way to increase the watermark power is to add the watermark data ˆ signal w(n) (using the notation of Section 4. Watermark Amplification In principle. Another simple measure to increase the watermark power is to increase the gain g(n) of the non-voiced speech components (cf.4 (‘unvoiced gain’) shows the positive effect on the BER curves given a gain g(n) that is increased by a factor of +3 dB and +6 dB relative to its initial measured value. Figure 10. a certain value has to assumed based on operational requirements. increase the intelligibility of the transmitted ATC speech. At higher levels. An equipped receiver is required in order to benefit also from the embedded watermark. The watermark signal is added as a watermark floor at a fixed power level relative to the channel noise power. Section 4. The high robustness against filtering attacks comes as no surprise since the watermark detector also has to cope with the rapidly time-varying LP synthesis (vocal tract) filter. The resulting BER curves shown in Figure 10. 10.Chapter 10. Conclusions We conclude that the proposed watermarking method is robust against any filtering that is expected to occur on the aeronautical radio channel. The increased noise robustness comes. If the channel noise level is not known a-priori. Experimental Watermark Robustness Evaluation measures. at the same time. We cannot overstate the fact that within a single system that combines watermark embedding and compression it is possible to robustly embed a watermark and. a comfort noise of equivalent power is added in order to maintain a constant background noise level. floor’) demonstrate the effectiveness of the watermark floor. The watermarking method is also highly robust against sinusoidal gain modulations 154 . The system could be fully backward-compatible. any measure that increases the power of the embedded watermark signal relative to the channel noise increases the likelihood that the embedded message will be successfully detected. and the additional gain in robustness comes at the cost of perceptual quality. 10. Since the non-voiced components are often low in power compared to voiced components.

but many alternative schemes exist and further optimization is possible. A number of further measures are available to improve the noise robustness. It is difficult to specify the SNR ranges of real world ATC channels.5. We evaluated the noise robustness over a wide range of signal-to-noise ratios. Given the presented BER curves. for an operational ATC implementation it might be beneficial to reduce the bit rate and. considering that the measured worst case modulation depth is well below -20 dB. 155 . The high gain modulation robustness is not surprising because the method must inherently be robust against the rapidly time-varying gain of the unvoiced speech excitation signal. The timing synchronization scheme based on the detector’s equalizer shows satisfactory performance. therefore. increase the noise robustness of the watermark. The most remarkable is a pre-processing of the speech signal by multiband dynamic range compression and limiting. This processing not only boosts the noise robustness of the watermark. A working implementation for frame synchronization was presented. The implementation presented herein far exceeds the operationally required watermark bit rate of approximately 100 bit/s [116]. but also increases the intelligibility of the received speech signal. Non-sinusoidal gain modulations in the form of an AGC do not degrade the detection performance either. Conclusions over a wide range of modulation frequencies and modulation depths.10. but there is room for further improvement by the application of fractionally spaced instead of sample-spaced equalizers.

Chapter 10. Experimental Watermark Robustness Evaluation 156 .

and ultimately increase safety. A review of related work reveals that most of the existing watermarking theory and most practical schemes are based on a non-removability requirement. security and efficiency in ATC.. e. Theoretical capacity derivations show that dropping this constraint allows for significantly larger watermark capacities. 9] Synchronization [Ch.Chapter 11 Conclusions Air traffic control (ATC) radio communication between pilots and controllers is subject to a number of limitations given by the legacy system currently in use. 4] ATC Channel [Ch. this thesis investigates the embedding of digital side information into ATC radio speech.1. perceptually irrelevant signal components are not limited to small modifications but can be replaced by arbitrary values. it poses an unnecessary constraint in the legacy enhancement application considered in this thesis. 8. 4. where the watermark should be robust against tampering by an informed and hostile attacker. transaction tracking applications.1 illustrates the connections among the different contributions in the context of the ATC application at hand. Ch. because watermarking can be performed in perceptually irrelevant instead of perceptually relevant host signal components. In contrast to perceptually relevant components. 157 . Figure 11. The structure of this chapter mostly mirrors the signal flow shown in the figure. ATC Speech [Ch. 10] Figure 11. To overcome these limitations.g. 7] Watermark Embedding [Ch. Ch.: Relationship among the different chapters of this thesis. It presents a number of contributions towards the ATC domain as well as the area of robust speech watermarking. 5] Watermark Detection [Ch. While this is a key requirement in.

We proposed a model for the measured channel data.Chapter 11. The corpus is useful in the validation of our watermarking scheme. Even though the practical scheme still operates far below the theoretical capacity limit. the speech corpus DVD with a download size of 2.1 Synchronization between watermark embedder and detector is necessary to detect the watermark in the received signal. We evaluated the noise robustness over a large range of channel SNRs and proposed a number of measures that. and a simulated automatic gain control. we created two evaluation resources. Due to the non-linearity of the ATC channel it is difficult to specify absolute channel noise levels. considering the type of channel attacks that are relevant in the aeronautical application. and was downloaded in full from our website by approximately 40 people (conservative estimate based on manually filtered server log-files). and presented a full implementation that covers the various layers of synchronization. Many proposed watermarking schemes assume perfect synchronization and do not treat this (often difficult) problem any further. we designed and implemented a practical speech watermarking method that replaces the imperceptible phase of non-voiced speech by a watermark data signal. but also constitutes a valuable resource for language research and spoken language technologies in ATC. Moreover. which is a distinct advantage compared to quantization-based watermarking. we evaluated the robustness of the proposed method in light of the aeronautical application. Since little information is publicly available. we designed and implemented a ground/airborne based audio channel measurement system and performed extensive in-flight channel measurements. To underline the validity of our theoretical results. The watermark detector is based on hidden training sequences and adaptive equalization. The collected data is now publicly available and can be used as a basis for further developments related to the air/ground voice radio channel. the proposed scheme outperforms current state-of-the-art methods on a scale of embedded bit rate per Hz of host signal bandwidth.4 GB was requested by mail from several institutions. The overall method is inherently robust against time-variant filtering and gain modulation. if necessary. good knowledge about the characteristics of the aeronautical voice radio channel is required. The method is robust against static and time-variant measured channel filter responses. Conclusions We derived the watermark capacity in speech based on perceptual masking and the auditory system’s insensitivity to the phase of unvoiced speech. At first. We carefully addressed this issue. To validate the robustness of our scheme in the aeronautical application. In addition. due to the particular nature of ATC speech and the limited availability of ATC language resources. we also created an ATC speech corpus. Using the data obtained in the channel measurements. it outperforms most current state-of-the-art methods and shows high robustness against a wide range of signal processing attacks. and the extracted model parameters are used in the validation of the watermarking scheme to simulate different effects of the aeronautical channel. and showed that this capacity far exceeds the capacity of the conventional ideal Costa scheme. can increase the noise robustness of the proposed 1 Within one year since its public release. transferred digital communication synchronization methods to the watermarking domain. 158 . the measured channel gain modulation.

the work presented in this thesis is subject to limitations. it is expected that an optimization of the existing parameters. such an analysis would likely lead to a better system design. For example. an SNR-adaptive watermark embedding. a real-time implementation was outside the scope of this thesis and is yet to be carried out. 159 . a theoretical analysis of the robustness of the phase modulation method against different transmission channel attacks could provide additional insight. We expect that a continuation of the theoretical work laid out in this thesis could lead to a deeper understanding of speech watermarking and. the capacity derivations for masking based watermarking show that in our system a lot of additional watermark capacity is still unused because perceptual masking is not incorporated. To improve the validity of the evaluation. Incorporating the particular characteristics of the host signal and the aeronautical channel. transceiver combinations and flight situations. the aeronautical channel measurements should be carried out on a large scale with a wide range of aircraft types. fractionally-spaced equalization. only. An evaluation of the capacity under tighter but more realistic assumptions would provide additional insight and likely result in a significantly lower watermark capacity. Naturally. our implementation can be considered as proof-of-concept implementation. While the method operates in principle on a per frame basis and is computationally not prohibitively expensive. to better performing system designs. we have not evaluated the real-time capability of our implementation. This not only significantly increases the noise robustness of the watermark but also improves the subjective intelligibility of the received speech. due to the idealistic assumptions made about the host signal and the transmission channel.method. The evaluation of the method’s robustness in the aeronautical application was limited by the relatively small scope of the available channel data. An increased intelligibility in air/ground voice communication would constitute a further substantial contribution to the safety in air traffic control. only. equalization and detection scheme could increase both robustness and data rate. or a joint synchronization. Further research in speech watermarking should aim to exploit this additional capacity by combining masking-based principles with the phase replacement approach introduced in this thesis. The large gap to the theoretical capacity shows that the performance can be further increased and numerous options for further optimizations exist. Our theoretical capacity estimation for phase modulation based watermarking is an upper bound. ultimately. Last but not least. Furthermore. Complementing our experimental robustness evaluation. and a number of options for further research arise. Since we proposed an entirely novel watermarking scheme. We recommend a pre-processing of the input speech signal with multiband dynamic range compression and limiting for the aeronautical application.

Conclusions 160 .Chapter 11.

Groups of words are embraced by opening and closing XML-style tags to mark off-talk (<OT> 1 The mono-spaced typewriter type represents words or items that are part of the corpus transcription. don’t). All standard text is written in lower-case. letters.g. all utterances are transcribed word-for-word in standard British English. stand-alone technical mark-up tags are written in upper case letters with enclosing squared brackets (e.2 contains content-specific amendments such as lists of non-standard vocabulary. In general. [HNOISE]). In terms of notation. Numbers. 161 .1 describes the basic rules of the transcription format. Regular lower-case letters and words are preceded or followed by special characters to mark truncations (=). Silent pauses both between and within words are not transcribed either. While Section A.g. Apostrophes are used only for possessives (e. A. navigational aids and radio call signs are transcribed as follows. individually pronounced letters (~) or unconfirmed airline names (@).g.1. Punctuation marks including periods. Transcription Format The orthographic transcription follows a strict set of rules which is presented hereafter. Technical noises as well as speech and noises in the background—produced by speakers other than the one recorded—are not transcribed. Section A. commas and hyphens are omitted. pilot’s radio)1 and for standard English contractions (e. it’s.Appendix A ATCOSIM Transcription Format Specification This appendix provides the formal specification of the transcription format of the ATCOSIM speech corpus presented in Chapter 7.

Notation: Standard dictionary spelling without hyphens. A. connected digits. It should be noted that controllers are supposed to use non-standard pronunciations for certain digits. one forty four. This is however applied inconsistently. A. Acronyms Definition: Acronyms that are pronounced as a sequence of separate letters in standard English. Numbers Definition: All digits.1.1. Notation: As defined in the reference.Appendix A.. ICAO Phonetic Spelling Definition: Letter sequences that are spelled phonetically. and in any case transcribed with the standard dictionary spelling of the digit. ‘thousand’ and ‘decimal’.2. and foreign language (<FL> </FL>). such as ‘tree’ instead of ‘three’.. Example: three hundred. Example: india sierra alfa report your heading Word List: alfa bravo charlie delta echo foxtrot golf hotel india juliett kilo lima mike november oscar papa quebec romeo sierra tango uniform victor whiskey xray yankee zulu Supplement: fox (short form for foxtrot) Reference: [113] A. opec and icao respectively. ICAO are transcribed as nato. and the keywords ‘hundred’. Exception Example: NATO.1.1. Notation: Individual lower-case letters with each letter being preceded by a tilde (~). for which currently no transcription is given. which is also transcribed. The tilde itself is preceded by a space. ‘niner’ instead of ‘nine’. ATCOSIM Transcription Format Specification . one oh nine Word List: zero oh one two three four five six seven eight nine ten hundred thousand decimal 162 .3. or ‘tousand’ instead of ‘thousand’ [113]. OPEC. Example: ~k ~l ~m Exception: Standard acronyms which are pronounced as a single word. four seven eight. These are transcribed without any special markup. </OT> ).

[159. Exceptions Examples: foxtrot sierra india.4. Non-verbal Articulations Definition: Non-verbal articulations such as confirmatory. The words used can be names of real places (ex. Notation: Geographical locations (navaids) are transcribed as given in the references using lower-case letters. Notation: [HNOISE] (in upper-case letters) Example: sabena [HNOISE] four one report your heading A. gotil). 160] A.7. the notation oh is used for the meaning ‘zero’. corna. contrast to ohh as an expression of surprise. surprise or hesitation sounds. Annex A: Maps of Simulation Airspace].6. Human Noises Definition: Human noises such as coughing. 163 . Notation: Limited set of expressions written in lower-case letters. Notation: Spelling exactly as given in the references. Navigational Aids and Airports Definition: Airports and navigational aids (navaids) corresponding to geographic locations. Also breathing noises that were considered by the transcriptionist as exceptionally loud were marked using this tag. as in one oh one.1. 157. 129] A. Example: malaysian ah four is identified Word List: ah hm ahm yeah aha nah ohh3 2 In 3 In some occasions less popular five-letter designators are also spelled out using ICAO phonetic spelling. Examples: contact rhein on one two seven alitalia two nine two turn left to gotil alitalia two nine two proceed direct to corna charlie oscar romeo november alfa References: [158. laughing and sighs produced by the speaker.2 Airports and control centers are transcribed directly as said and in lower-case spelling.1.1. Transcription Format A. hapag lloyd Exceptions: Airline designators given letter-by-letter using ICAO phonetic spelling as well as airline designators articulated as acronyms.A. ~k ~l ~m References: [156.1.1.5. britannia. hochwald) or artificial five-letter navaid or waypoint designators (ex. Airline Telephony Designators Definition: The official airline radio call sign. Examples: air berlin.

Truncated Words Definition: Words which are cut off either at the beginning or the end of the word due to stutters. sa= [HNOISE] =bena (interruption by cough) Exception: Words which are cut off either at the beginning or the end of the word due to fast speech or sloppy pronunciation are recorded according to standard spelling and not marked. A.8. This notation is used when the word-part is understandable. This also applies to words that are interrupted by human noises ([HNOISE]). and released the button again. Notation: The missing part of the word is replaced by an equals sign (=). Off-Talk Definition: Speech that is neither addressed to the pilot nor part of the air traffic control communication. said nothing at all. luf= lufthansa three two five (stutter).9. or where the controller pressed the push-to-talk (PTT) button too late. =bena four one (PTT pressed too late).1. Word Fragments Definition: Fragments of words that are too short so that no clear spelling of the fragment can be determined. Notation: [FRAGMENT] Example: [FRAGMENT] A. Examples: good mor= good afternoon (correction)..1. another notation is used (see below). Empty pauses within words are not marked. Notation: [EMPTY] Exception: If the utterance contains human noises produced by the speaker. ATCOSIM Transcription Format Specification A. full stops. If the word-part is too short to be identified. Exception Example: “goo’day” is transcribed as good day. the [HNOISE] tag is used.1. Empty Utterances Definition: Instances where the controller pressed the PTT button..11. Notation: Off-talk speech is transcribed and marked with opening and closing XMLstyle tags: <OT> .10. </OT> Example: speedbird five nine zero <OT> ohh we are finished now </OT> 164 .Appendix A. A.1.

13. A. Notation: [UNKNOWN] Example: [UNKNOWN] five zero one bonjour cleared st prex A.2.2.1. ciao Exception Word List: See Section A. Exception Examples bonjour. and are not tagged in any special way. but are nonetheless clearly identified. A.14. Notation: [NONSENSE] Example: [NONSENSE] futura nine three three identified A. Military Radio Call Signs There was no special list for military aircraft call signs available.1. Foreign Language Definition: Complete utterances. Notation: The foreign language part is not transcribed but is in its entirety replaced by adjacent XML-style tags: <FL> </FL> Example: <FL> </FL> break alitalia three seven zero report mach number Exception: Certain foreign language terms.1. Unknown Words Definition: Word or group of words that could not be understood or identified.1.1. Amendments to the Transcription Format The actual language use in the recordings required the following additions to the above transcription format definitions. tag. A.A. Amendments to the Transcription Format A.4. This is usually a slip of the tongue and the speaker corrects the mistake.12. Airline Telephony Designators The following airline telephony designators cannot be found in the references cited above. The following call signs were confirmed by an operational controller: • ~i ~f ~o • mission 165 . are transcribed according to the spelling of that language. Nonsensical Words Definition: Clearly articulated word or word part that is not part of a dictionary and that also does not make any sense.2.1. given in a foreign language.2. such as greetings. A full list is given below. or parts thereof.2.

fokker twenty eight). The following uses occurred: • • • • • • • • • • • • • • • • • • bafair (Short form for belgian airforce.2.Appendix A. The following manufacturer names occurred: • fokker • ~b ~a (Short form for British Aerospace.4.) • malta (Short form for air malta.) • tunis air (This is the airline name.) hapag (Short form for hapag lloyd.1.) meridiana (This is the airline name. General Aviation Call Signs In certain cases general aviation aircraft are addressed using the aircraft manufacturer and type number (e.2. Additional Verified Call Signs The following call sign occurred and is also verified: • london airtours (This call sign is listed only in the simulation manual [161]. ATCOSIM Transcription Format Specification • nato • spar • steel A.) british midland (This is the airline name.3.) berlin (Short form for air berlin.) netherlands air (Short form for netherlands air force. which also used to be the radio call sign.) lufty (Short form for lufthansa.2.) france (Short form for airfrans.2.) german air (Short form for german air force. Deviated Call Signs In certain cases the controller uses a deviated or truncated version of the official call sign.) luha (Short form for lufthansa. The radio call sign is midland.g.) hansa (Short form for lufthansa.) A.) french line (Short form for french lines.1.) netherlands (Short form for netherlands air force.1.) french air force (The official radio call sign is france air force.) turkish (Short form for turkish airforce. The official call sign was changed to merair at some point in the past.) algerie (Short form for air algerie. The radio call sign is tunair.) • lauda (Short form for lauda air.) A.) 166 .) israeli air (Short form for israeli air force.) israeli (Short form for israeli air force.

Additional Verified Navaids The following additional navaids occurred and are verified as they were occasionally spelled out by the controllers: • corna • gotil A. The following uses occurred: • milano (Local Italian version for milan. Additional Unverified Radio Call Signs The following airline telephony designators could not be verified through any of the available resources.2.2.2. Special Vocabulary and Abbreviations The following ATC specific vocabulary and abbreviations occurred. @aerovic @alpha @aviva @bama @cheena @cheeseburger @color @devec @foxy @hanseli @indialook @ingishire @jose @metavec @nafamens @period @roystar @sunwing @taitian @tele A. They are transcribed as understood by the transcriptionist on a best-guess basis and preceded by an at symbol (@).2. Amendments to the Transcription Format A.3.) A. using a simplified spelling which avoids special characters: 167 . Deviated Navaids In certain cases the controller uses a deviated version of the official navaid name.2. pronounced as one word) • ~r ~v ~s ~m (Reduced Vertical Separation Minimum) • ~c ~v ~s ~m (Conventional Vertical Separation Minimum) • ~i ~f runway (Initial Fix runway) • sec (sector) • freq (frequency) A.1.2.5. Foreign Language Greetings and Words Due to their frequent occurrence the following foreign language greetings and words were transcribed.2.2. This listing is most likely incomplete.) • trasa (Short form for trasadingen. Navigational Aids A.2.4.2.A. • masp (Minimum Aviation System Performance standards.1.2.

Appendix A. ATCOSIM Transcription Format Specification • • • • • • • • • • • • • • • • • • • hallo (German for ‘hello’) auf wiederhoren (German for ‘goodbye’) gruss gott (German for ‘hello’) servus (German for ‘hi’) guten morgen (German for ‘good morning’) guten tag (German for ‘hello’) adieu (German for ‘goodbye’) tschuss (German for ‘goodbye’) tschu (German for ‘goodbye’) danke (German for ‘thank you’) bonjour (French for ‘hello’) au revoir (French for ‘goodbye’) merci (French for ‘thank you’) hoi (Dutch for ‘hello’) dag (Dutch for ‘goodbye’) buongiorno (Italian for ‘hello’) arrivederci (Italian for ‘goodbye’) hejda (Swedish for ‘goodbye’) adios (Spanish for ‘goodbye’) 168 .

Serbia. Hofbauer and G. 169 . The realization of these reference models as filtered Gaussian processes leads to practical implementations of frequency selective and frequency nonselective channel models.Appendix B Aeronautical Radio Channel Modeling and Simulation In this appendix. the basic concepts in the modeling and simulation of the mobile radio channel are reviewed. Kubin. Three different smallscale area simulations of the aeronautical voice radio channel are presented and we demonstrate the practical implementation of a frequency flat fading channel. Based on a scenario in air/ground communication the parameters for readily available simulators are derived. Doppler effect. We conclude that the aeronautical voice radio channel is a frequency nonselective flat fading channel and that in most situations the frequency of the amplitude fading is band-limited to the maximum Doppler frequency. Jul. Parts of this chapter have been published in K.” in Proceedings of the International Conference on Research in Air Transportation (ICRAT). 2006. Belgrade. path loss and additive noise. “Aeronautical voice radio channel modelling and simulation—a tutorial review. The propagation channel is timevariant and dominated by multipath propagation. The resulting outputs give insight into the characteristics of the channel and can serve as a basis for the design of digital transmission and measurement techniques. Stochastic reference models in the equivalent complex baseband facilitate a compact mathematical description of the channel’s input-output relationship.

2.Appendix B. the results are not always directly transferable to the aeronautical domain. 1 x (t) = ([| x AM (t)|] LP − A0 ) k Figure B. An analog baseband voice signal x (t) which is band-limited to a bandwidth f m modulates the amplitude of a sinusoidal carrier with amplitude A0 . carrier frequency f c and initial phase ϕ0 .2.1) Under the assumption that f c f m the HF signal can be demodulated and the input signal x (t) reconstructed by detecting the envelope of the modulated sine wave. A3E or simply AM) of a sinusoidal. Amplitude Modulation and Complex Baseband The aeronautical voice radio is based on the double-sideband amplitude modulation (DSB-AM. low-pass filtered. However. It is highly recommended as a pointer for further reading and its content is not repeated herein.1. B. Since the baseband signal is. Introduction Radio channel modeling has a long history and is still a very active area of research. A comprehensive up-to-date literature review on channel modeling and simulation with the aeronautical radio in mind is provided in [134]. In this appendix.1. the modeling. we review the general concepts of radio channel modeling and demonstrate the application of three readily available simulators to the aeronautical voice channel. unsuppressed carrier [65]. Basic Concepts This and the following section are based on the work of Pätzold [132] and provide a summary of the basic characteristics. This is especially the case with respect to terrestrial mobile radio communications and wideband data communications due to commercial interest. The modulated high frequency (HF) signal x AM (t) is defined as x AM (t) = ( A0 + kx (t)) cos(2π f c t + ϕ0 ) with the modulation depth m= |kx (t)|max ≤1 A0 The real-valued HF signal can be equivalently written using complex notation and ωc = 2π f c as x AM (t) = Re ( A0 + kx (t))e jωc t e jϕ0 (B. the HF signal is a 170 . and the simulation of the mobile radio channel. Aeronautical Radio Channel Modeling and Simulation B. by definition. B. Another comprehensive treatment on this extensive topic is given in [133]. The absolute value is low-pass filtered and the original amplitude of the carrier is subtracted.1 shows the spectra of the baseband signal and the corresponding HF signal.

The Hilbert transform removes the negative frequency component of s(t) before downconversion [86]. modified). The rotating plane can be seen as a coordinate system for the vector.2. any real bandpass signal s(t) can be represented as the real part of a modulated complex signal. which rotates with the angular velocity ωc .2) reveals that the complex envelope g AM (t) of the amplitude modulated HF signal x AM (t) simplifies to g AM (t) = ( A0 + kx (t))e jϕ0 The signal x (t) is reconstructed from the equivalent complex baseband of the HF signal by demodulation with 1 x (t) = (| g AM (t)| − A0 ) (B. Basic Concepts Figure B.1. In general. s(t) = Re g(t)e jωc t (B.2) where g(t) is called the equivalent complex baseband or complex envelope of s(t) [139]. 171 .3) k The complex baseband signal can be pictured as a time-varying phasor or vector in a rotating complex plane.: Signal spectra of (a) the baseband signal x (t) and (b) the HF signal x AM (t) with a carrier at f c and symmetric upper and lower sidebands (Source: [139]. A comparison of (B.1) and (B. bandpass signal and contains energy only around the carrier frequency and the lower and upper sidebands LSB and USB. The complex envelope g(t) is obtained by downconversion of the real passband signal s(t).B. namely ˆ g(t) = (s(t) + js(t))e− jωC t ˆ with s(t) being the Hilbert transform of s(t).

This leads to a large number of samples and thus makes numerical simulation difficult even for very short signal durations. The time delays correspond to phase shifts in between the superimposed 172 .2. Aeronautical Radio Channel Modeling and Simulation Figure B. reflected waves have to travel a longer distance to the aircraft and therefore arrive with a time-delay compared to the line-ofsight signal. it must be sampled with a frequency of more than twice the carrier frequency. f m ] as the baseband signal x (t). From hereon we assume.Appendix B. Most of the existing channel simulations are based on the complex baseband signal representation. The received signal is spread in time and the channel is said to be time dispersive. diffracted or absorbed on its way to the receiver. The received electromagnetic signal can be a superposition of a line-of-sight path signal and multiple waves coming from different directions.2.2. without loss of generality. Multipath Propagation The transmitting medium in radio communications is the atmosphere or free space.2. Depending on the geometric dimensions and the properties of the objects in a scene. As a consequence it can be sampled with a much lower sampling frequency. Together with the carrier frequency ωc the complex envelope g AM (t) fully describes the HF signal x AM (t). which facilitates efficient numerical simulation without loss of generality. The complex envelope g AM (t) has the same bandwidth [− f m .1. the physical medium that is used to send the signal from the transmitter to the receiver.” Radio channel modeling usually also includes the transmitting and receiving antennas in the channel model. The effects treated in this paper are identical for both directions. that the ground station transmits and the aircraft receives the radio signal.: Multipath propagation in an aeronautical radio scenario (Source: [162]). B.2. . As illustrated in Figure B. scattered. In order to represent the HF signal as a discrete-time signal.2. This effect is known as multipath propagation. . an electromagnetic wave can be reflected. B. Mobile Radio Propagation Channel Proakis [65] defines the communication channel as “. into which the signal is coupled as electromagnetic energy by an antenna.

if W < BCB .2. the channel is said to be frequency selective. 173 . Additive Noise During transmission additive noise is imposed onto the signal. It is proportional to d1p . but as well from different directions relative to the aircraft heading (Figure B.2). f c the carrier frequency and c = 3 · 108 m the speed of s light. In the optimal case of line-of-sight free space propagation p = 2. from atmospheric noise or radio channel interference. which is the largest possible Doppler shift. they undergo different Doppler shifts. This means that all the frequency components of the received signal are affected by the channel in the same way [65]. among others. If the frequency bandwidth W of the transmitted signal is larger than the coherence bandwidth (W > BCB ). is dependent on the angle of arrival α of the electromagnetic wave relative to the heading of the aircraft. Channel Attenuation The signal undergoes significant attenuation during transmission. As both the position and the phase shifts change constantly due to the movement of the aircraft. which is the difference between the transmitted and the received frequency.4.max = f c (B.B. The multipath spread Tm is the time delay between the arrival of the line-of-sight component and the arrival of the latest scattered component. The path loss is dependent on the distance d and the obstacles between transmitter and receiver.2. Basic Concepts waves and lead to constructive or destructive interference depending on the position of the aircraft. from thermal noise in electronic components. B. B. f D = f D. is given by v f D. the signal undergoes strong pseudo-random amplitude fluctuations and the channel becomes a fading channel.2. This results in a continuous distribution of frequencies in the spectrum of the signal and leads to the so-called Doppler power spectral density or simply Doppler spectrum.2.2. As a consequence. The noise results.2. B.2. Doppler Effect The so-called Doppler effect shifts the frequency content of the received signal due to the movement of the aircraft relative to the transmitter. the channel is frequency nonselective or flat fading. or from man-made noise such as engine ignition noise.max . The reflected waves arrive not only with different time-delays compared to the line-of-sight signal. with the pathloss exponent p in the range of 2 ≤ p < 4. The Doppler frequency f D .2. Otherwise.4) c where v is the aircraft speed. Its inverse BCB = T1 M is the coherence bandwidth of the channel.3.max cos(α) The maximum Doppler frequency f D.

40

RAYLEIGH AND RICE PROCESSES AS REFERENCE MODELS

40

RAYLEIGH AND RICE PROCESSES AS REFERENCE MODELS

cR :=

ρ2 2 . 2σ0

(3.18)

cR :=

ρ2 2 . 2σ0

(3.18)

From the limit ρ → 0, i.e., cR → 0, the Rice process ξ(t) results in the Rayleigh process ζ(t), whose statistical amplitude variations are described by the Rayleigh distribution [Pap91]   x e− 2  0 0,
x2 2 σ0

From the limit ρ → 0, i.e., cR → 0, the Rice process ξ(t) results in the Rayleigh process ζ(t), whose statistical amplitude variations are described by the Rayleigh distribution [Pap91]   x e− 2  0 0,
x2 2 σ0

pζ (x) =

, x ≥ 0, x < 0.

(3.19) Appendix B. Aeronautical Radio Channel0 .Modeling and Simulation x<

(3.19)

pζ (x) =

, x ≥ 0,

The probability density functions pξ (x) and pζ (x) according to (3.17) and (3.19) are shown in the Figures 3.3(a) and 3.3(b), respectively. (b)
0.9 0.8 0.7 0.6 p ζ (x) 0.5 0.4 0.3 0.2 0.1 0 0 1 2 x 3 4
2 σo= 2 2 σ o =1/2

(a)

(a) Rayleigh PDF

The probability density functions pξ (x) and pζ (x) according to (3.17) and (3.19) are shown in the Figures 3.3(a) and 3.3(b), respectively. ELEMENTARY PROPERTIES OF RICE AND RAYLEIGH PROCESSES 37 38 RAYLEIGH AND RICE PROCESSES AS REFERENCE MODE (b) (b) (a) (b)

(b) Rice PDF

(c) Jakes PSD

(d) Gaussian PSD
-3

0.7 0.6 0.5 p ξ (x)
ρ= 0 (Rayleigh)
(f) 0.04 0.035 0.03 0.025 0.02 0.015
2 σ o =1

7 6
τ

x 10

5 (f) Sµ
iµi

0.3 0.2 0.1 0 0 1 2 x 3 4
ρ= 1

2 σo=

1

0.4

ρ= 1/2

iµi

4 3 2 1

0.01 0.005 0 -100 -50 0 f (Hz) 50 100

0

-200

-100

Figure 3.3:

The probability density function of (a) Rice and (b) Rayleigh processes.

Figure 3.3:

As mentioned before, the exact shape of the Doppler power spectral density Sµµ (f ) has no effect on the probability density of the absolute value of the complex Gaussian D,max random process, i.e., ξ(t) = |µρ (t)|. Analogously, this statement is also valid for the probability density function of the phase ϑ(t) = arg{µρ (t)}, where ϑ(t) can be expressed with (3.1), (3.2), and (3.4) as follows ϑ(t) = arctan µ2 (t) + ρ sin (2πfρ t + θρ ) µ1 (t) + ρ cos (2πfρ t + θρ )

will play following, where denotes the 3-dB-cut-off Figure B.3.: Probability density functions (PDF) an important role in the spectral fdensities during transmission. The average Doppler shift B (PSD, frequency. and power a carrier signal experiences As mentioned before, the exact shape of the Doppler power spectral density S (f ) 2 the probability density of the absolute valueinvestigations inGaussian have shown (Source: S[132]).the Doppler spread B is the square root of t the first moment of has no effect on Theoretical of the [Bel73] that f = 91 Hz,process, = ξ(t) =for(t)|. Analogously, thisofand Rice validhas a Gaussiansecond central moment of(f )spectralConsequently, B and B are defined by σ0 i.e., 1) |µ Rayleigh statementcomplex channels shape. the Doppler power andthe S (f random is channels for density aeronautical also Further information on ).
c
µµ
µi µi (2) µi µi

Figure 3.1: (a) Jakes power spectral density Sµi µi (f ) and (b) the corresponding 2 autocorrelation function rµi µi (τ ) (fFigure91 Hz, σ0 = 1).Gaussian power spectral density Sµi µi (f ) and (b) correspond max = 3.2: The probability density function of (a) Rice and (b) Rayleigh processes. (a) √ 2 autocorrelation function rµi µi (τ ) (fc = ln 2fmax , fmax = 91 Hz, σ0 =

0 f (Hz)

100

200

(1) µi µi

(3.20) . B.2.2.5. Time Dependency

In order to confirm this statement, we study the probability density function pϑ (θ; t) of the phase ϑ(t) given by the following relation [Pae98d]

Most of the parameters described in this section vary over time due to the movement of the aircraft. As a consequence the response of the channel to a transmitted signal also varies, and the channel is said to be time-variant.

B.2.3. Stochastic Terms and Definitions

the probability density function of the phase ϑ(t) = arg{µρ (t)}, where ϑ(t) can be measurements concerning the propagation characteristics of aeronautical satellite ∞ Sµi µi (f )df expressed with (3.1), (3.2), and (3.4) as follows channels can be found, for example, in [Neu87]. Although no := −∞ fcorrespondence (1) (3.13 Bµi µi absolute ∞ S to the obtained measurements could be proved, (3.11) can in most casesi µi (f )df be −∞ µ very well µ2 (t) + ρ sin (2πfρ t + θρ ) used as a sufficiently good approximation [Neu89]. For signal bandwidths up to some (3.20) ϑ(t) = arctan . and µ1 (t) + ρ cos (2πfρ t + θρ ) 10 kHz, the aeronautical satellite channel belongs to the class of frequency-nonselective mobile radio channels [Neu89]. ∞ (1) (f − Bµi µi )2 Sµi µi (f )df −∞ (2) In order to confirm this statement, we study the probability density function pϑ (θ; t) , (3.13 Bµi µi := ∞ [Cox73] Especially of the phase ϑ(t) given by the following relation [Pae98d] for frequency-selective mobile radio channels, it has been shown Sµi µi (f )df −∞ that the Doppler power spectral density of the far echoes deviates strongly from the shape of the Jakes power spectral density. for i = 1, 2, respectively. Equivalent — but often easier to calculate — expressions Hence, the Doppler power spectral density is approximately Gaussian shaped and is(3.13a) and (3.13b) can theobtained by using the autocorrelation function rµi µi (τ ) generally shifted from be origin of the frequency plane, because the far echoes mostly dominate from a second time derivative at the origin, i.e., well as its first and certain direction of preference. Specifications for frequency-shifted Gaussian power spectral densities for the pan-European, terrestrial, cellular GSM system can be found in [COS86]. 2 rµ µ (0) ¨ rµi µi (0) ˙ 1 rµi µi (0) ˙ 1 (2) (1) , 3.14a, ( − i i and The inverse Fourier transform results for the GaussianBµi µi := 2πj · r density (3.11) Bµi µi = 2π power spectral rµi µi (0) rµi µi (0) µi µi (0) in the autocorrelation function
2 rµi µi (τ ) = σ0 e − π √fc τ
ln 2 2

ρ

µi µi

(1) µi µi

(2) µi µi

The following section recapitulates some basic stochastic terms in order to clarify the nomenclature and notation used herein. The reader is encouraged to refer to [132] for exact definitions. Let the event A be a collection of a number of possible outcomes s of a random experiment, with the real number P( A) being its probability measure. A random variable µ is a mapping that assigns a real number µ(s) to every outcome s. The cumulative distribution function Fµ ( x ) = P(µ ≤ x ) = P({s|µ(s) ≤ x })

(3.12) For the important special case where the Doppler power spectral densities Sµ1 µ1 and (3.11) are identical with the ˙ In Figure 3.2, the Gaussian power spectral densitySµ2 µ2 (f ) is illustrated and symmetrical, rµi µi (0) = 0 (i = 1, 2) holds. Hence, corresponding autocorrelation function (3.12). using (3.7), we obtain the following expressions for the corresponding characteris quantities of the Doppler power spectral density Sµµ (f ) Characteristic quantities for the Doppler power spectral density Sµi µi (f ) are the (1) (2) √ average Doppler shift Bµi µi and the Doppler spread Bµi µi [Bel63]. The average Doppler β (1) (1) (2) (2) shift (Doppler spread) describes the average frequency shift (frequency spread)Bthat Bµi µi = , (3.15a, Bµµ = Bµi µi = 0 and µµ = 2πσ0 .

for i = 1, 2, respectively.

is the probability that the random variable µ is less or equal to x. The probability density function (PDF, or simply density) pµ ( x ) is the derivative of the cumulative distribution function, dFµ ( x ) pµ ( x ) = dx The most common probability density functions are the uniform distribution, where the density is constant over a certain interval and is zero outside, and the Gaussian dis2 tribution or normal distribution N (mµ , σµ ), which is determined by the two parameters 2. expected value mµ and variance σµ With µ1 and µ2 being two statistically independent normally distributed random
2 variables with identical variance σ0 , the new random variable ζ = µ2 + µ2 represents 2 1 a Rayleigh distributed random variable (Figure B.3(a)). Given an additional real

parameter ρ, the new random variable ξ = (µ1 + ρ)2 + µ2 is Rice or Rician distributed 2 (Figure B.3(b)). A random variable λ = eµ is said to be lognormally distributed. A multiplication of a Rayleigh and a lognormally distributed random variable η = ζλ leads to the so-called Suzuki distribution.

174

B.3. Radio Channel Modeling A stochastic process µ(t, s) is a collection of random variables, which is indexed by a time index t. At a fixed time instant t = t0 , the value of a random process, µ(t0 , s), is a random variable. On the other hand, in the case of a fixed outcome s = s0 of a random experiment, the value of the stochastic process µ(t, s0 ) is a time function, or signal, that corresponds to the outcome s0 . As is common practice, the variable s is dropped in the notation for a stochastic process and µ(t) written instead. With µ1 (t) and µ2 (t) being two real-valued stochastic processes, a complex-valued stochastic process is defined by µ(t) = µ1 (t) + jµ2 (t). A stochastic process is called stationary if its statistical properties are invariant to a shift in time. The Fourier transform of the autocorrelation function of such a stationary process defines the power spectral density or power density spectrum of the stochastic process.

B.3. Radio Channel Modeling
Section B.2.2.1 illustrated multipath propagation from a geometrical point of view. However, geometrical modeling of the multipath propagation is possible only to a very limited extent. It requires detailed knowledge of the geometry of all objects in the scene and their electromagnetic properties. The resulting simulations are time consuming to set up and computationally expensive, and a number of simplifications have to be made. Furthermore the results are valid for the specific situation only and cannot always be generalized. As a consequence, a stochastic description of the channel and its properties is widely used. It focuses on the distribution of parameters over time instead of trying to predict single values. This class of stochastic channel models is the subject of the following investigations. In large-scale areas with dimensions larger than tens of wavelengths of the carrier frequency f c , the local mean of the signal envelope fluctuates mainly due to shadowing and is found to be approximately lognormally distributed. This slow fading is important for channel availability, handover, and mobile radio network planning. More important for the design of a digital transmission technique is the fast signal fluctuation, the fast fading, which occurs within small areas. As a consequence, we focus on models that are valid for small-scale areas, where we can assume the path loss and the local mean of the signal envelope due to shading, etc. , to be constant. Furthermore we assume for the moment a frequency nonselective channel and, for mathematical simplicity, the transmission of an unmodulated carrier. A more extensive treatment can be found in [132], on which this section is based on.

B.3.1. Stochastic Mutlipath Reference Models
The sum µ(t) of all scattered components of the received signals can be assumed to be normally distributed. If we let µ1 (t) and µ2 (t) be zero-mean statistically independent 2 Gaussian processes with variance σ0 , then the sum of the scattered components is given in complex baseband representation as a zero-mean complex Gaussian process µ(t) and is defined by Scatter: µ(t) = µ1 (t) + jµ2 (t) (B.5)

175

Appendix B. Aeronautical Radio Channel Modeling and Simulation The line-of-sight (LOS) signal component m(t) is given by LOS: m(t) = A0 e j(2π f D + ϕ0 ) , (B.6)

again in complex baseband representation. The superposition µm (t) of both signals is LOS+Scatter: µm (t) = m(t) + µ(t) (B.7)

Depending on the surroundings of the transmitter and the receiver, the received signal consists of either the scatter components only or a superposition of LOS and scatter components. In the first case (i.e. (B.5)) the magnitude of the complex baseband signal |µ(t)| is Rayleigh distributed. Its phase ∠(µ(t)) is uniformly distributed over the interval [−π; π ). This type of a Rayleigh fading channel is predominant in regions where the LOS component is blocked by obstacles, such as in urban areas with high buildings, etc. In the second case where a LOS component and scatter components are present (i.e. (B.7)), the magnitude of the complex baseband signal |µ(t) + m(t)| is Rice distributed. The Rice factor k is determined by the ratio of the power of the LOS and the scatter components, where k = aeronautical radio channel. One can derive the probability density of amplitude and phase of the received signal based on the Rice or Rayleigh distributions. As a further step, it is possible to compute the level crossing rate and the average duration of fades, which are important measures required for the optimization of coding systems in order to address burst errors. The exact formulas can be found in [132] and are not reproduced herein. The power spectral density of the complex Gaussian random process in (B.7) corresponds to the Doppler power spectral density when considering the power of all components, their angle of arrival and the directivity of the receiving antenna. Assuming a Rayleigh channel with no LOS component, propagation in a two-dimensional plane and uniformly distributed angles of arrival, one obtains the so-called Jakes power spectral density as the resulting Doppler spectrum. Its shape is shown in Figure B.3(c). However, both theoretical investigations and measurements have shown that the assumption that the angle of arrival of the scattered components is uniformly distributed does in practice not hold for aeronautical channels. This results in a Doppler spectrum which is significantly different from the Jakes spectrum [163]. The Doppler power spectral density is therefore better approximated by a Gaussian power spectral density, which is plotted in Figure B.3(d). For nonuniformly distributed angles of arrival, as with explicit directional echos, the Gaussian Doppler PSD is unsymmetrical and shifted away from the origin. The characteristic parameters describing this spectrum are the average Doppler shift (the statistic mean) and the Doppler spread (the square root of the second central moment) of the Doppler PSD .
A2 0 2. 2σ0

This Rice fading channel dominates the

B.3.2. Realization of the Reference Models
The above reference models are based on colored Gaussian random processes. The realization of these processes is not trivial and leads to the theory of deterministic

176

B.3. Radio Channel Modeling processes. Mostly two fundamental methods are applied in the literature in order to generate colored Gaussian processes. In the filter method, white Gaussian noise is filtered by an ideal linear time-invariant filter with the desired power spectrum. In the Rice method an infinite number of weighted sinusoids with equidistant frequencies and random phases are superimposed. In practice both methods can only approximate the colored Gaussian process. Neither an ideal filter nor an infinite number of sinusoids can be realized. A large number of algorithms used to determine the actual parameters of the sinusoids in the Rice method exist. The methods approximate the Gaussian processes with a sum of a limited number of sinusoids, thus considering the computational expense [132]. For the filter method on the other hand, the problem boils down to filter design with its well-understood limitations.

B.3.3. Frequency Nonselective Channel Models
In frequency nonselective flat fading channels, all frequency components of the received signal are affected by the channel in the same way. The channel is modeled by a multiplication of the transmitted signal with a suitable stochastic model process. The Rice and Rayleigh processes described in Section B.3.1 can serve as statistical model processes. However it has been shown that the Rice and Rayleigh processes often do not provide enough flexibility to adapt to the statistics of real world channels. This has led to the development of more versatile stochastic model processes such as the Suzuki process and its variations (a product of a lognormal distribution for the slow fading and a Rayleigh distribution for the fast fading), the Loo Model with its variations, and the generalized Rice process.

B.3.4. Frequency Selective Channel Models
Where channel bandwidth and data rate increase, the propagation delays can no longer be ignored as compared to the symbol interval. The channel is then said to be frequency selective and (over time) the different frequency components of a signal are affected differently by the channel. B.3.4.1. Tapped Delay Line Structure For the modeling of a frequency selective channel, a tapped delay line structure is typically applied as reference model (Figure B.4). The ellipses model of Parsons and Bajwa [133] shows that all reflections and scatterings from objects located on an ellipse, with the transmitter and receiver in the focal points, undergo the same time delay. This leads to a complex Gaussian distribution of the received signal components for a given time delay, assuming a large number of objects with different reflection properties in the scene and applying the central limit theorem. As a consequence, the tap weights ci (t) of the single paths are assumed to be uncorrelated complex Gaussian processes. It is shown in Section B.2.3 that the amplitudes of the complex tap weights are then

177

B. either Rayleigh or Rice distributed. τ ) is wide sense stationary1 (WSS) in its temporal dependence on t. etc.3. Bello proposed in 1963 the class of WSSUS models. from a given scattering function. power spectral densities and properties such as Doppler and delay spread. the European working group ‘COST 207’ published scattering functions in terms of delay power spectral densities and Doppler power spectral densities for four propagation environments which are claimed to be typical for mobile cellular communication. They are based on the tapped delay line structure and allow the computation of all correlation functions. depending on the mean of the Gaussian processes. with h(t. They are nowadays widely used and are of great importance in channel modeling. and assuming that scattering components with different propagation delays τ are statistically uncorrelated (uncorrelated scattering (US)). τ ). The system can be fully described by its time-variant impulse response h(t.2. Aeronautical Radio Channel Modeling and Simulation Figure B. Based on these two assumptions. The scattering function may be obtained by the measurement of real channels. ci (t) are uncorrelated complex Gaussian processes.4. For example.4. with input and output signals in the complex baseband representation. W is the bandwidth of the transmitted signal. These input/output relations of the stochastic channel can be significantly simplified assuming that the impulse response h(t. In order to establish a statistical description of the input/output relation of the above system. An analytic expression for the phases of the tap weights can be found in [132]. Linear Time-Variant System Description The radio channel can be modeled as a linear time-variant system.. the channel is further considered as a stochastic system. 1 Measurements have shown that this assumption is valid for areas smaller than tens of wavelengths of the carrier frequency f c . by specification. or both.: Tapped delay line structure as a frequency selective and time variant channel model (Source: [132]). τ ) as its stochastic system function.Appendix B. 178 .

AWGN Channel Model The noise that is added to the transmitted signal during transmission is typically represented as an additive white Gaussian noise (AWGN) process.4. which together with the signal power defines the signal-to-noise ratio (SNR) of the output signal [65].8 and an input signal which is band-limited to f l = 300 Hz to f m = 2. Figure B. Aeronautical Voice Channel Simulation B. and the same Matlab-based pre. The white circle represents the unmodulated carrier signal. As mentioned earlier. and amplitude modulation and demodulation. | g AM (t)|.4. the ‘very high frequency’ (VHF) band. We use as example the aeronautical VHF voice radio channel between a fixed ground station and a general aviation aircraft which is flying at its maximum speed. Aeronautical Voice Channel Simulation This section aims to present three different simulators which implement the above radio channel models. A short segment of the magnitude of the signal. this results in a maximum Doppler frequency of f D.33 kHz. The main parameter 2 of the model is the variance σ0 of the Gaussian process. conversion to and from complex baseband.5 kHz with a sharp cut-off below and above this frequency range [97].1. the frequency response of the transmitter is required to be flat between 0. According to specification. Using (B. The processing consists of bandpass pre. B.5 shows all values that the amplitude modulated signal x AM (t) takes on during the observation interval in the equivalent complex baseband representation g AM (t).2 kHz. which is a single point in the equivalent complex baseband. The input and output signals of all three simulators are equivalent complex baseband signals. we assume a carrier with amplitude A0 = 1. In the propagation model a general aviation aircraft with a speed of v = 60 m is s assumed. which is sampled with a frequency of f sa = 8000 Hz and bandpass filtered according to the above specification.and post-filtering. Simulation Scenario and Parameters For air-ground voice communication in civil air traffic control. B.4. the carrier frequency f c is within a range from 118 MHz to 137 MHz.max = 24 Hz.5. resulting in an analog channel bandwidth W = 2. the models are based on a small-scale area assumption where path loss and shadowing are assumed to be constant. frequency f c = 120 MHz and initial phase ϕ0 = π .B. a channel spacing of 8. The AWGN channel is usually included as an additional block after the channel models described earlier.33 kHz in specific regions of Europe in order to increase the number of available channels to a theoretical maximum of 2280. For the illustrations we use a purely sinusoidal baseband input signal x (t) = sin(2π f a t) with f a = 500 Hz. The channel spacing is reduced to 8.5 kHz. We first define a simulation scenario based on which we show the simulators’ input parameters and the resulting channel output. a modulation depth 4 m = 0. The 760 channels are spaced 25 kHz apart.3. is also shown in the figure.3 kHz to 2.4). For the simulations. 179 .and post-processing of the signals is used for all simulators.

The 180 . In these rare cases. but only with a reflection on far-away steep slopes. In contrast. large path delays only occur in situations with large h1 and h2 and small d such as in air-to-air communication. the resulting coherence bandwidth is in the same order of magnitude as the channel bandwidth.5. the wavelength is λ = fcc = 2. Mathworks Communications Toolbox Implementation The Mathworks Communications Toolbox for Matlab [164] implements a multipath fading channel model.01 Figure B. which corresponds to a fairly strong line-of-sight signal [163]. With BCB W.5 m which corresponds to a path delay of ∆τ = 38.5 m.3 ns.5 0 0 0 0. The ground antenna is considered to be mounted at a height of h1 = 20 m.2.7 Magnitude: |g(t)| 1. according to Section B. Right: The magnitude of g AM (t). which is not considered herein.1 the channel is surely frequency nonselective. The geometric path length difference ∆l between the line-of-sight path and the dominant reflection along the vertical plane on the horizontal flat ground evaluates to ∆l = h2 + 1 h1 d h1 + h2 2 + h2 + d − 2 h1 d h1 + h2 2 − d2 + (h2 − h1 )2 = 11. h2 = 10000 m and d = 100 km results in a path difference of ∆l = 6. Worst-case multipath spreads of Tm = 200 µs as reported in [163] cannot be explained with a reflection in the vertical plane.2. Given the carrier frequency f c . The Rician factor k is assumed to be k = 12 dB.7 1 In−Phase: Re[g(t)] 1 1. we assume that the aircraft flies at a height of h2 = 3000 m and at a distance of d = 10 km from the ground station.5 1 0.0 m. the coherence bandwidth is still BCB = 2.0417 s.005 Time in s 1. For example. This distance λ is covered in tλ = 0. In the special case of a non-elevated ground antenna with h1 ≈ 0 the path delay vanishes.: Sinusoidal AM signal in equivalent complex baseband. We cannot confirm the rule of thumb given in [163] where ∆l ≈ h2 given d h2 . Furthermore. Aeronautical Radio Channel Modeling and Simulation 2 Quadrature: Im[g(t)] 1 0. In a worst case scenario with a multipath spread of Tm = 10∆τ.4.6 MHz. B.Appendix B. Left: In-phase and quadrature components. a typical case for commercial aviation where h1 = 30 m. The simulator supports multiple fading paths. The white circle indicates an unmodulated carrier.2. of which the first is Rice or Rayleigh distributed and the subsequent paths are Rayleigh distributed.

and weighted by a random process hk (n). One can specify the delay and the average power of each path.7 shows the time-variation of the power of the two components.3 with one Rician path is appropriate. It is worthwhile noticing that the distance between two maxima is roughly one wavelength λ. Its tap-weights g(m) are given by a sampled and truncated sum of shifted sinc functions. The 181 .4. The Generic Channel Simulator Implementation The Generic Channel Simulator (GCS) is a radio channel simulator which was developed between 1994 and 1998 under contract of the American Federal Aviation Administration (FAA). B. The only necessary input parameters for the channel model are f sa .3.6(a).4.1.4. This means that it contains only the line of sight signal and no scattering. the sinc terms coincide. the results are equivalent to the first scenario.3 ns. As shown in Section B. We define one Rician path with a Rician factor of k = 200.3. We furthermore define one Rayleigh path with a relative power of -12 dB and a time delay of ∆τ = 38. the gain and phasor of the multipath components. bandpass filtering the received signal according to the specified channel bandwidth does not remove the amplitude fading. and a model according to Section B. This fast fading results from the superposition of the line-of-sight component and the multitude of scattered components with Gaussian distribution. Aeronautical Voice Channel Simulation Doppler spectrum is approximated by the Jakes spectrum. They are shifted by the path delays τk of the kth path. weighted by the average power gain pk of the corresponding path. the toolbox models the channel as a time-variant linear FIR filter.3) and shown in Figure B. In our scenario the channel is frequency-flat. A scenario similar to the first one with two distinct paths is shown for comparison. The uncorrelated random processes hk (n) are filtered Gaussian random processes with a Jakes power spectral density.B. This results in g(m) = ∑ sinc k τk − m hk (n) pk . the Jakes Doppler spectrum is not suitable for the aeronautical channel. The preferable Gaussian Doppler spectrum is unfortunately not supported by the simulator. This results in a filter with only one tap and consequently in a frequency nonselective channel. Due to the small bandwidth of our channel. 1/ f sa The equation shows once again that when all path delays are small as compared to the sample period. The output of the channel for the sinusoidal input signal as defined above is shown in Figure B. The demodulated signal (using (B. The toolbox provides a convenient tool for the visualization of the impulse and frequency response. and the evolution of these quantities over time. The toolbox also allows a model structure with several discrete paths similar to Figure B. both relative to the LOS path. As shown in Figure B. In terms of implementation.max and k.6(c).6(b)) reveals the amplitude fading of the channel due to the Rician distribution of the signal amplitude. Its source code and documentation is provided in [165].3. with the total power being normalized to 0 dB. f D. Figure B.

5 1 Amplitude 0 1 2 3 4 5 6 7 8 Time in units of tλ=0.1 0.2 Time in s 0.3 1 0.5 0 −0.5 Figure B. 182 .5 Amplitude (b) Demodulation 1.Appendix B.5 0 1 In−Phase: Re[g(t)] 2 −1.5 −1 −1.2 Time in s 0.: Received signal at different processing stages.5 0 −0. (b) after demodulation. Aeronautical Radio Channel Modeling and Simulation (a) Channel Outpout 2 Quadrature: Im[g(t)] 1.6 Figure B. Received signal (channel output of Mathworks Communication Toolbox and an observation interval of 2 s) (a) in equivalent complex baseband.7. Components Power in dB 0 −5 −10 −15 −20 −25 −30 0 0.5 0 0.4 0.5 −1 (c) Bandpass Filter 1.: Power of the line-of-sight component (top) and the Rayleigh distributed scattered components (bottom).5 1 0.5 0 −0.6.042s 0. and (c) after bandpass filtering.

frequency. the GCS software confirms our computed time delay of ∆τ = 38. the VHF air/ground channel among others.6(c).2. We believe that the out-dated source code of the GCS has legacy issues which lead to problems with current operating system and compiler versions.: Generic Channel Simulator: Received signal in an observation interval of 300 s (channel output) (a) in equivalent complex baseband and (b) after demodulation. This version requires a fairly complex installation procedure and a number of adjustments in order to enable compiling of the source code on current operating systems. geometry.8. Using the same parameters for speed. nor does it coincide with first informal channel measurements that we pursued. The time axis in the plot of the demodulated signal is very different as compared to the one in Figure B. as in the scenarios described above.4. we obtain a channel output as shown in Figure B. a discrete line-of-sight signal and a randomly distributed scatter path with a power of -12 dB. We provide some advice on how to install the software on the Mac OS X operating system and how to interface the simulator with Matlab [166]. Aeronautical Voice Channel Simulation (a) Channel Outpout 2 1 Amplitude 0 −1 −2 (b) Demodulation 0 100 200 Time in s 300 Figure B.B. the Doppler power spectrum is assumed to have a Gaussian shape.4. 183 .3 ns. The GCS allows the simulation of various types of mobile radio channels. The amplitude fading of the signal has a similar shape as before but is in the order of three magnitudes slower than observed in Section B. With the same geometric configuration as above.8. In the following example we use a similar setup as in the second scenario in Section B. the GCS simulates the radio channel by a time-variant IIR filter. Similar to the Mathworks toolbox. This contradicting result cannot be explained by the differing model assumptions.. GCS written in ANSI C and provides a graphical MOTIF-based interface and a command line interface to enter the model parameters. etc.2.4. The last publicly available version of the software dates back to 1998. Data input and output files are in a binary floating point format and contain the signal in equivalent complex baseband representation. The scatter path delay power spectrum shape is approximated by a decaying exponential multiplying a zeroth-order modified Bessel function.

Appendix B. Aeronautical Radio Channel Modeling and Simulation

(a) Channel Outpout
2 Quadrature: Im[g(t)] 1.5 1 0.5 0

(b) Demodulation and Filtering
1.5 1 Amplitude 0.5 0 −0.5 −1

−0.5

0 1 In−Phase: Re[g(t)]

2

−1.5

0

0.1 0.2 Time in s

0.3

Figure B.9.: Reference channel model with Jakes PSD: Received signal (channel output) (a) in equivalent complex baseband and (b) after demodulation and bandpass filtering.

B.4.4. Direct Reference Model Implementation
The third channel simulation is based on a direct implementation of the concept described in Section B.3.3 with the Matlab routines provided in [132]. We model the channel by multiplying the input signal with the complex random process µm (t) as given in (B.7). The random processes are generated with the sum of sinusoids approach as discussed in Section B.3.2. Figure B.9 shows the channel output using a Jakes Doppler PSD and a Rician reference model. The result is very similar to the one obtained with the Mathworks toolbox. In a second example, a Gaussian distribution is used for the Doppler power spectral density instead of the Jakes model. The additional parameter f D,co describes the width of the Gaussian PSD by its 3 dB cut-off frequency. The value is arbitrarily set to f D,co = 0.3 f D,max . This corresponds to a fairly small Doppler spread of B = 6.11 Hz, compared to the Jakes PSD with B = 17 Hz. Figure B.10 shows the resulting discrete Gaussian PSD. The channel output in Figure B.11 confirms the smaller Doppler spread by a narrower lobe in the scatter plot. The amplitude fading is by a factor of two slower than with the Jakes PSD.

B.5. Discussion
We further examine the path gain of the frequency nonselective flat fading channel simulated in Section B.4.2. This path gain is plotted in Figure B.12(a) and corresponds to the amplitude fading of the received signal shown in Figure B.6(b). Considering the time-variant path gain as a time series, its magnitude spectrum (shown in Figure B.12(b)) reveals that the maximum frequency with which the path gain changes approximately coincides with the maximum Doppler frequency f D,max = v f c of the c channel. The amplitude fading is band-limited to the maximum Doppler frequency. This fact can be explained with the concept of beat as known in acoustics [167].

184

B.5. Discussion

x 10 Doppler PSD 1 0.5 0

−3

−20

−10 0 10 Doppler frequency in Hz

20

Figure B.10.: Discrete Gaussian Doppler PSD with f D,max = 24 Hz and a 3-dB-cut-off frequency of f D,co = 7.2 Hz.
(a) Channel Outpout
2 Quadrature: Im[g(t)] 1.5 1 0.5 0 −1 −0.5 0 1 In−Phase: Re[g(t)] 2 −1.5 0 0.2 0.4 Time in s 0.6 Amplitude

(b) Demodulation and Filtering
1.5 1 0.5 0 −0.5

Figure B.11.: Reference channel model with Gaussian PSD: Received signal (channel output) (a) in equivalent complex baseband and (b) after demodulation and bandpass filtering.
(a) Path Gain
5

(b) Path Gain Spectrum
0 DFT Magnitude in dB −10 −20 −30 −40 −50 −60 0 10 24 40 60 Frequency in Hz

Path Gain in dB

0 −5 −10 −15

0

0.2 0.4 Time in s

0.6

Figure B.12.: Time-variant path gain and its magnitude spectrum for a flat fading channel with f D,max = 24 Hz.

185

Appendix B. Aeronautical Radio Channel Modeling and Simulation The superposition of two sinusoidal waves with slightly different frequencies f 1 and f 2 leads to a special type of interference, where the envelope of the resulting wave modulates with a frequency of f 1 − f 2 . The maximum frequency difference f 1 − f 2 between a scatter component and the carrier frequency f c with which we demodulate in the simulation is given by the maximum Doppler frequency. This explains the band-limitation of the amplitude fading. In a real world system a potentially coherent receiver may demodulate the HF signal with a reconstructed carrier frequency fˆc which is already Doppler shifted. In this case, the maximum frequency difference between Doppler shifted carrier and Doppler shifted scatter component is 2 f D,max . This maximum occurs when the aircraft points towards the ground station so that the LOS signal arrives from in front of the aircraft, and when at the same time a scatter component arrives from the back of the aircraft [163]. We can thereof conclude that the amplitude fading of the frequency nonselective aeronautical radio channel is band-limited to twice the maximum Doppler frequency f D,max . For a channel measurement system as presented in Chapter 8 the maximum frequency of the amplitude fading entails that the fading of the channel has to be sampled with at least double the frequency, that means with a sampling frequency of f ms ≥ 4 f D,max , to avoid aliasing. With the parameters for a general aviation aircraft used throughout this appendix, this means that the amplitude scaling has to be sampled with a frequency of f ms > 96 Hz or, in terms of an audio sample rate f sa = 8 kHz, at least every 83 samples. For a measurement system based on maximum length sequences (MLS, see Chapter 8) this means that the MLS length should be no longer than L = 2n − 1 = 63 samples. A factor that is not considered in the presented channel models is the likely presence of automatic gain controls (AGC) in the radio transmitter and receiver. Since the presented models are propagation channel models, transceiver components are by definition not part of these models. Transceiver modeling is in fact a discipline by itself and is outside the scope of this work. However, it should be noted that an AGC in the receiver might partially compensate for the amplitude modulation induced by the fading channel. Figure B.13(a) and B.13(b) show the demodulated signal of Section B.4.2 after the bandpass filtering and after the application of a simulated automatic gain control with a cut-off frequency of 300 Hz. In theory, carrier wave amplitude modulations with a frequency of less than f l = 300 Hz should not be caused by the band-limited input signal but by the fading of the channel. An automatic gain control could therefore detect all low-frequency modulations and compensate for all fading with a frequency up to f l = 300 Hz. The actual implementations for automatic gain controls and their parameters are however highly manufacturer-specific and very little information is publicly available. We considered throughout this appendix the case of a small general aviation aircraft as used in the channel measurements presented in Chapter 8. It is also of interest to briefly discuss the case of large commercial aviation passenger aircraft. The relevant parameters are a typical cruising speed of around 250 m/s and a cruising altitude of around 10000 m. In combination with the parameters of Section B.4.1 this results in a maximum Doppler frequency of f D,max = 100 Hz. The upper band limit of the

186

B.6. Conclusions

(a) Bandpass Filter
1.5 1 Amplitude

(b) Automatic Gain Control
1.5 1 Amplitude 0.5 0 −0.5 −1

0.5 0 −0.5 −1 −1.5 0 0.1 0.2 Time in s 0.3

−1.5

0

0.1 0.2 Time in s

0.3

Figure B.13.: Received signal (a) before and (b) after an automatic gain control. amplitude fading, equaling 2 f D,max = 200 Hz, approaches the lower band limit of the audio signal. The path delay of ∆l = 6.0 m or ∆τ = 20 ns results in a coherence bandwidth of BCB = 5 MHz, and the channel is thus again frequency nonselective and flat fading. In this appendix, we dealt with the Doppler shift of the modulated carrier wave, which leads in combination with the multipath propagation to signal fading. It is worthwhile to note that also the modulating signal, that means the audio signal that is represented in the amplitude of the carrier wave, undergoes a Doppler shift. However, this Doppler shift is so small that it can be neglected in practice. For example, for a sinusoidal audio signal with a frequency of 1000 Hz and an aircraft speed of 250 m/s the maximum Doppler frequency using (B.4) is 8.3 · 10−4 Hz.

B.6. Conclusions
We reviewed the most common radio propagation channel models and performed simulations for the aeronautical voice radio channel. We conclude that due to its narrow bandwidth the aeronautical voice radio channel is a frequency nonselective flat fading channel. In most situations the frequency of the amplitude fading is band-limited to the maximum Doppler frequency, which is a simple function of the aircraft speed and the carrier frequency. The amplitude fading of the channel might in parts be mitigated by an automatic gain control in the radio receiver.

187

Appendix B. Aeronautical Radio Channel Modeling and Simulation 188 .

Appendix C Complementary Figures This appendix presents complementary figures for Section 10. 189 .4. which are provided for reference.

05 0 0 10 20 30 40 SNR (in dB.25 0. floor (at −SNR−3dB) [wmfl] sr+mbcl + wmfl (at −SNR) + ug (+3dB) sr+mbcl + wmfl (at −SNR+3dB) sr+mbcl + wmfl (at −SNR+3dB) + ug (+3dB) Figure C.3 0.15 0. processed signal) 50 60 Original signal Original signal with silence removed [sr] sr+multiband compr.25 0.35 0. wrt. measured in 20ms window) 70 Original signal Original signal with silence removed [sr] sr+multiband compr.35 0.1.2 0.45 0. and limiting [mbcl] sr+mbcl + unvoiced gain (+3dB) [ug] sr+mbcl + unvoiced gain (+6dB) sr+mbcl + waterm. but plotted against different SNR measures.45 0.1 0.: Noise robustness curves identical to Figure 10.5 0.1 0. floor (at −SNR−3dB) [wmfl] sr+mbcl + wmfl (at −SNR) + ug (+3dB) sr+mbcl + wmfl (at −SNR+3dB) sr+mbcl + wmfl (at −SNR+3dB) + ug (+3dB) 30 40 50 60 PSNR (in dB. wrt. Complementary Figures 0.15 0.4.5 0.3 0.4 0. processed signal.2 0.05 0 20 0. 190 Bit error ratio Bit error ratio .4 0. and limiting [mbcl] sr+mbcl + unvoiced gain (+3dB) [ug] sr+mbcl + unvoiced gain (+6dB) sr+mbcl + waterm.Appendix C.

191 .Figure C.: Multiband compressor and limiter settings for transmitter-side speech enhancement.2.

Appendix C. Complementary Figures 192 .

Marrakech.” in Proceedings of the International Conference on Language Resources and Evaluation (LREC). Hering. May 2008. “A measurement system and the TUGEEC-Channels database for the aeronautical voice radio. VA. Hofbauer and G. Hofbauer and H. [6] K. Brétigny-sur-Orge. Sep. 241–244. Speech. 2009. B.Bibliography [1] K. “Digital signatures for the analogue radio. Dec. “The ATCOSIM corpus of non-prompted clean air traffic control speech. S. vol. “Noise robust speech watermarking with bit synchronisation for the aeronautical radio. Jul. “Speech watermarking for the VHF radio channel. [7] K. Kubin. PY. Fairfax. Canada. Hofbauer. [3] K. Kubin.” in Proceedings of the NASA Integrated Communications Navigation and Surveillance Conference (ICNS). [4] K. Hofbauer.” in Proceedings of the IEEE Vehicular Technology Conference (VTC). and G. and H. [5] K. 2005. Kubin. 252–266. Hering. Springer. 2006.” in Information Hiding. revised and resubmitted. 193 . G. “Aeronautical voice radio channel modelling and simulation—a tutorial review. H. Petrik. Montreal.” in Proceedings of the EUROCONTROL Innovative Research Workshop (INO). [2] K. 1–5. Hofbauer. Hering. Hofbauer and H. Hering. and G. [8] K.” in Proceedings of the International Conference on Research in Air Transportation (ICRAT). pp. H. 4567/2008. Kubin. USA. Serbia. Lecture Notes in Computer Science. “High-rate data embedding in unvoiced speech. Hofbauer and G. 2005. 2006. Morocco. and W. Hering. Belgrade.” IEEE Transactions on Audio. Pittsburgh. Kubin. Kleijn. pp. pp. Hofbauer. Sep. USA. 2007.” in Proceedings of the International Conference on Spoken Language Processing (INTERSPEECH). pp. and Language Processing. ser. France. 215–220. 2006. “Speech watermarking for analog flat-fading bandpass channels.

Malvar. “Quantization index modulation: A class of provably good methods for digital watermarking and information embedding. Jul. pp. Lazic and P. 2008. and Operations Conference (ATIO). 2006. Oct. Sep. 1999.Bibliography [9] K. Digital Audio Watermarking Techniques and Technologies: Applications and Benchmarks. “Secure spread spectrum watermarking for multimedia. [20] M. 12. vol. 29. [10] M. T. 35. Berlin.” in Proceedings of the AIAA Aviation Technology. vol. Leighton. F. 8. vol. McKellips. Digital Watermarking and Steganography. “Writing on dirty paper. 5. [16] I. S. France. Hofbauer. 439–441. S. 2007. no. 2008. no. Alaska. Northern Ireland. [14] W. 1. Anchorage. L. 3. Bloom. and Operations Conference (ATIO). [22] B. [18] N. Integration. 8. [17] D. Miller. pp. 47. Morimoto.” in Proceedings of the CEAS European Air and Space Conference (Deutscher Luft. and A. [11] H. Kirovski and H. Chen and G. “From analogue broadcast radio towards end-to-end communication. Hering and K. no. Cox. Costa. He and M. 918–924. “Advanced speech watermarking for secure aircraft identification. [15] N.” IEEE Transactions on Information Theory. J. 313–336. Fridrich. vol.” IEEE Transactions on Multimedia. Dec. 2007. Sep. Kilian. 1127–1141. 1423–1443. Bender. Integration. W.” IBM Systems Journal. 6.” Research Letters in Signal Processing. H.” IEEE Transactions on Image Processing. Scordilis. Dec. [13] I. Morgan Kaufmann. J. “Communication over an acoustic channel using data hiding techniques. Sep. 1673–1687.” IEEE Transactions on Signal Processing. no. M. no. Seppanen. 2007. Wornell. Apr. pp. 4. vol. Jan. May 1983. Hofbauer. IGI Global. “Watermarking as communications with side information. 51. Aarabi. N.” in Proceedings of the EUROCONTROL Innovative Research Workshop (INO). no. vol. M. vol. J. Cvejic and T.” Proceedings of the IEEE. D. Shamoon. pp. Gruhl. Belfast. Lu. Germany. [19] X. and A. “Spread-spectrum watermarking of audio signals. 1997. Gruber and K. 2nd ed. no. and T.” in Proceedings of the AIAA Aviation Technology. pp. Kalker. Cox. 2007. 1996. 4. [21] I.” IEEE Transactions on Information Theory. L. USA. J. pp. Miller. Hofbauer. May 2001. 1–5. no. “Towards selective addressing of aircraft with voice radio watermarks. and T.und Raumfahrtkongress). Hofbauer. “Efficiently synchronized spread-spectrum audio watermarking with improved psychoacoustic model. Hering and K. [12] H. “Techniques for data hiding. 3/4. vol. 1020–1033. 194 . “A comparison of estimation methods for the VHF voice radio channel. Brétignysur-Orge. 2004. 87. J. 7. 2003. Cox. M. pp. pp.

“Bandwidth extension of telephone speech aided by data embedding. 64921. USA.Bibliography [23] J. pp. PA. 2007. 2005. Speech and Signal Processing (ICASSP). Leung. Bäuml. 2005. Proceedings of SPIE. no. R. Hawaii. 2007. Hagmüller. no. 4. P. vol. Shterev. D. pp. pp. Geiser. Sagi and D. pp. Sorenson. 1653–1656. Mar. pp. pp.” in Proceedings of the IEEE International Conference on Acoustics. 2007. Sep. T. Salt Lake City. A.” in Proceedings of the 195 . 2005. Wang. 2003. pp. and Y. “Pitch and duration modification for speech watermarking. [26] X. Malah. Tzschoppe. pp. and linear filtering invariance. 2004. Abrardo. 8. vol. Sakai. vol. Sharma. J. vol. Delft University of Technology.” EURASIP Journal on Advances in Signal Processing. 17–20. Girod.” in Proceedings of the European Signal Processing Conference (EUSIPCO). UT. “The effect of polarity inversion of speech on human perception and data hiding as an application.” in Proceedings of the IEEE International Conference on Acoustics. Speech and Signal Processing (ICASSP). and Language Processing. Sakaguchi. “Rational dither modulation: A high-rate data-hiding method invariant to gain attacks. 4. “Speech bandwidth extension by data hiding and phonetic classification. [33] B. [30] S. 2270–2277. vol. no. C. Jax. Lisbon. 1003–1019. Speech and Signal Processing (ICASSP).” in Multimedia Systems and Applications V. security. “Scalar Costa scheme for information embedding. 3960–3975. 258–267. W. “Digital watermarking based on process of speech production.” IEEE Transactions on Audio.” in Proceedings of the IEEE International Conference on Acoustics. ser. Qi. Arai. 1497–1500. 3. “Spread spectrum signaling for speech watermarking. 10. G. Philadelphia. and G.D. “A new adaptive digital audio watermarking based on support vector regression. and P. 51. 2007. Niu. Hatada. 2. and P.” in Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH). and A. vol. Murahara. Celik. May 2001. M. H. Austria. Barni. 2002. [25] I. “Quantization-based watermarking: Methods for amplitude scale estimation. 1337–1340. Vary. and A. USA. Kröpfl.” IEEE Transactions on Signal Processing. [34] S. Apr. Honolulu. [28] M. T. R. 53. vol. 4861.” IEEE Transactions on Signal Processing. Oct. Portugal. 2007. dissertation. [29] M. [31] M. Perez-Gonzalez. and Y. Eggers. Tekalp. “Speech watermarking for air traffic control. USA. N. Speech. Vienna. no. Chen and H. “Artificial bandwidth extension of speech supported by watermark-transmitted side information. Apr. Yamazaki. Hering. [24] F.” Ph. 593–596. vol. Komatsu. Cheng and J. Sep. [27] Q. pp. and B. 15. M. Nov. Kubin. Mosquera. [32] A.

63–74. J. vol. Liu and J. Canada. pp. 2. [43] D. Gray. Springer. Leung. Schulz. 2007. A.S. 196 . “Distortion measures for speech processing. no. Chen and H. Gurijala and J. B. [37] A. 2. pp. Malik. 633–636. Smith.” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS). 2008. Nishimura. 2007. S. pp. no. and Y. “Concurrent data transmission through PSTN by CDMA. “On the robustness of parametric watermarking of speech. pp. Chen. no. 15. pp. A.” IEEE Transactions on Acoustics. pp. Island of Kos. Matsuyama. and H. “Audio watermarking through deterministic plus stochastic signal decomposition. 6. May 2004. ser. Speech and Signal Processing (ICASSP). 2005. 2007. O. 501–510. Y. B. 1.” in Advances in Multimedia Modeling. vol. pp. Marchand. pp. Kubin. 4577/2007. [42] G. 498–507. [38] R. Istanbul. [39] S.” EURASIP Journal on Information Security. 593–598. and W. S. vol. A.” U. [36] Y. R. “Performance of noise excitation for unvoiced speech. Greece. Suzuki. Canada. “Multiple watermarks for stereo audio signals using phase-modulation techniques. 7/8. “Improving audio codecs by noise substitution. 4. T. Khokhar. Leung. A. 1–12. 1. vol. [35] L. Buzo. Atal. Springer. Brittan. Matsuoka. vol. May 2007. 53. Nakashima. and Signal Processing. Lecture Notes in Computer Science. [45] H. Speech. and T. “Watermarking of speech signals using the sinusoidal model and frequency modulation of the partials. 1996. Gray.Bibliography IEEE International Conference on Acoustics. R. Jul. 1296–1304.. 1. 2007. Speech and Signal Processing (ICASSP). 917–920. Feb. Tucker and P. “Telephony speech enhancement by data hiding. Aug. Speech. Patent US 2003/0 028 381 A1. pp. “Robust data hiding in audio using allpass filters. Feb. Montreal. Girin and S. 28. and Y. Jun. May 2006.” IEEE Transactions on Signal Processing. 806–815. Saint-Adele. F. and A. Turkey.” in Proceedings of the IEEE Workshop on Speech Coding for Telecommunications. no. Kawahara. 2003. 0302-9743. vol. C. [44] A. Lecture Notes in Computer Science. vol. Oct.” in Multimedia Content Analysis and Mining.” in Proceedings of the IEEE International Conference on Acoustics. 367–376. Kleijn. 3001–3004. ser. vol. Deller. [46] H. 1993. [40] S. [41] R.” Journal of the Audio Engineering Society. 35–36. “Acoustic OFDM: Embedding high bit-rate data in audio. pp. Ding. 44. Ansari. H. vol. Feb. no. 2000. no.-W. vol. 4. Jr. Yoshimura. Takahashi. M.” IEEE Transactions on Audio. 56. “Method for watermarking data. pp. pp.” IEEE Transactions on Instrumentation and Measurement. 1980. and Language Processing.

Thomas.” Ph. Spoken Language Processing: A Guide to Theory. A. M. Swanson. 1999. D. [58] W. [55] M. 3. “Data embedding in speech signals using perceptual masking. H. Wiley. A. no. pp. Eds. “Perceptual and squared error aspects in speech and audio coding.Bibliography [47] X. vol. Y. 88. pp. [50] H. no. M. Speech and Signal Processing (ICASSP). Spanias. Pearson. “Robust audio watermarking using perceptual masking. Elements of Information Theory.” in Proceedings of the IEEE International Conference on Acoustics. [51] S. Zwicker. 357–377. Sep. “Data hiding via phase manipulation of audio signals. Kleijn. Elsevier. 2006. A.” Proceedings of the IEEE. Phoenix. AZ. 4th ed. Montreal. Canada.-W. 29–32. [52] T. 5. Painter and A. “On phase perception in speech. Jensen. Paliwal. Bocko. [48] P. [61] R. no. Dec. Algorithm and System Development. 581–584. 66. Malah.” in Proceedings of the 12th European Signal Processing Conference (EUSIPCO’04). no. 2005. J.. USA. 35. pp. pp. Haykin. Mar. 9. dissertation.” Journal of Auditory Research.” Multimedia Tools and Applications. 2007. Springer. Pobloth. 451–515. Dong. Armand. 2001. 2004. A. Stockholm. Zhu. Liew and M. Apr. May 2004. Heusdens. Psychoacoustics. vol. A. Boney. Tewfik. pp. Sagi and D. Leung. 1982. van de Par. 1. Vienna. Royal Institute of Technology (KTH). Ignjatovic. B. K. vol.” in Proceedings of the IEEE International Conference on Acoustics. “Effects of white noise on the most comfortable level for speech with normal listeners. 1998. 2004. 4. vol. Communication Systems. 337–355. Zentil. Kohlrausch. 1. 2000. 8. vol. 71–76. 3rd ed. 2005. 377–380. no. A. [54] S. “Perceptual coding of digital audio. vol. R. 1292–1304.” IEEE Signal Processing Letters. “Inaudible watermarking via phase manipulation of random frequencies. Beattie. Speech and Signal Processing (ICASSP).” EURASIP Journal on Applied Signal Processing. A. vol. pp. Speech Coding and Synthesis. and Z. Chen and H. C. Svihovec. H. 3. 1995. [59] T. Acero. [60] X. Pobloth and W. Wiley. 2nd ed. 22. “Concurrent data transmission through analog speech channel using data hiding. [49] H.D. 12. “A perceptual model for sinusoidal audio coding based on spectral integration. pp. vol.. Hon. Austria. 2001. and H. Eds. 2005. B. and D. pp. Sweden. [53] H.” Signal Processing. Huang. Cover and J. Kleijn and K. Jensen. [56] A. [57] S. and L. Nov. no. and S. 2006. 197 . Fastl and E.

“A Kaiser window approach for the design of prototype filters of cosine modulated filterbanks. Malvar. Laroche. University of Massachusetts (Amherst). [66] J.D. “The channel capacity of discrete time phase modulation in AWGN. 6. no. Salehi. Communication Systems Engineering. 2544–2547. [64] R. [68] R. 2003. [74] H. 175–205. “Usefulness of phase spectrum in human speech perception. Malvar. vol.” IEEE Transactions on Speech and Audio Processing. Larson and B. “Lapped transforms for efficient transform/subband coding. S.” Ph. 39. “Non-parametric techniques for pitch-scale and time-scale modification of speech.. 2002. Shubert. [70] K. dissertation. pp. Proakis and M. [76] Y. vol. Jun. 2nd ed. 1985. [67] G. no. 355–364. Burr. and Signal Processing.” IEEE Transactions on Information Theory. no. its time-varying forms.” IEEE Transactions on Signal Processing.” IEEE Transactions on Information Theory. G. 1982. Artech House. 1990.” IEEE Transactions on Acoustics.Bibliography [62] H. pp. E. no.” IEEE Transactions on Signal Processing. 39. Principles and Practice of Information Theory. [75] S.” IEEE Transactions on Speec and Audio Processing. 132–134. 40. 11. [69] D. pp. “Projection of a spherical distribution and its inversion. Signal Processing with Lapped Transforms.” IEEE Signal Processing Letters. 1. 1. 2117–2119. Speech. vol. vol. pp. vol. Addison-Wesley. 1992. pp. Switzerland. 184–185. [73] H. 2703–2714. Nov. Jul. 969–978. S.” in Proceedings of the International Conference on Spoken Language Processing (INTERSPEECH). Blachman. [65] J. 1. applications. pp. 1987. Shlien. “Signal space channel coding: Codes for multilevel/phase/frequency signals.” Speech Communication. 1992. Probabilistic models in engineering sciences. Lin and P. Geneva. “The modulated lapped transform. M. July 1997.-P. pp. 1993. 11. 198 . no. Wiley. Nov. vol. 1991. Blahut. vol. and fast algorithms. February 1995. [71] E. 28. 1979. Paliwal and L. “Channel coding with multilevel/phase signals. “Extended lapped transforms: properties. vol. Alsteris. 4. [72] H. Kim. Jan. J. pp. 1998. Padovani. 55–67. Jan. 5. Malvar. [63] N. no. 38. 5. Jun. and its applications to audio coding standards. no. K. O. vol. Moulines and J. Ungerboeck. pp. no. “Perceptual phase quantization of speech. 6. 2. Sep. 11. S. 16. Aldis and A. no. vol. 2003. Vaidyanathan. PrenticeHall.-S. 4.

[86] J.praat. Messerschmitt.” IEEE Transactions on Communications. [83] Y. 1432–1436. [89] ITU-T Recommendation P. Lee. pp. 1956. E. [84] E. Mazo. L. [80] P. 251–257. Wu. White. Weenink. Prentice-Hall. “The Shannon sampling theorem—its various extensions and applications: A tutorial review. Martin. International Telecommunication Union. [88] A. Lamel. Boersma and D. J. vol. Golden. 65. “A proof on the minimum and permissible sampling rates for the firstorder sampling of bandpass signals. [Online]. “Exact interpolation of band-limited functions. 17. Scott. W. Wiley. pp. Kohlenberg. D. 2005. Scott. 97–110. 9. pp. 199 .jsp?catalogId=LDC93S1 [79] P. Fisher. G. M. 4th ed. 2004. CD-ROM. [90] ITU-T Recommendation G. 1973–1984.ldc. Digital Speech Transmission. R. Adaptive Filter Theory. pp.” IEEE Transactions on Signal Processing. 1991. 12. [85] J. 1147–1157. “On nonuniform sampling of bandwidth-limited signals. no. Pallett. N. Boersma. pp.edu/ Catalog/CatalogEntry. no. 1977. vol. 4. 24. vol.48: Specification for an Intermediate Reference System. 17. Jul. 2002. 763–774. Dec. G.0. [Online]. S. 1998. and D. L. no. Jun. 4. Available: http://www.191: Software Tools for Speech and Audio Coding Standardization. “An equalizer design technique for the PCM modem: A new modem for the digital public switched network. “Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound.” Proceedings of the IEEE. 11. vol. White. R.20). D. 2007. G. R. (2008) PRAAT: doing phonetics by computer (version 5.” Digital Signal Processing.Bibliography [77] P.upenn. 46. Yen. no. Vaughan.” Journal of Applied Physics. 848–854. Jerri. Sep. no. S. Digital Communication. N. 39. E. 6. vol. pp. Ayanoglu. [81] A. F. Vary and R. no. 1953. 1565–1596. Garofolo. N. vol. pp. L. Dec. 2006. A. vol. Dahlgren. International Telecommunication Union. Nov. Nov. [87] S. G. [91] S. 4. Han. Barry. (1993) TIMIT acoustic-phonetic continuous speech corpus. Mar. N. 1993. “A general formula for channel capacity. Dagdeviren. Sep. Zue. Fiscus. 1994. 3. and V. Linguistic Data Consortium. [82] R.org/ [78] J. R. “The theory of bandpass sampling. and J. 1988. Springer. S.” IRE Transactions on Circuit Theory. no. D. pp. and D. Haykin.” Proceedings of the Institute of Phonetic Sciences. 3rd ed. 40. J. L. Available: http://www. Verdú and T. vol.” IEEE Transactions on Information Theory.

1204–1213.” in Proceedings of the IEEE Conference on Information Sciences and Systems. 2003. Springer. 3rd ed. D. Müller. Schlauweg. design and applications. Mar. pp. [98] L. Princeton. (ARINC) Characteristic 716-11. Jun. Apr. vol.” IEEE Transactions on Acoustics.862. Toronto. “Watermark synchronization for feature-based embedding: Application to speech. Aug. “Watermark synchronization: Perspectives and a new paradigm. Lecture Notes in Computer Science. “Data-hiding codes. [105] F. [106] Y. 136. 12. Shayan and T. Available: http://www. pp. Hofbauer. 1105–1106. 2006. Sharma. ser. no. pp. no. Aug.” IEEE Transactions on Communications. and Signal Processing. Honolulu.” in Proceedings of the IEEE International Conference on Acoustics. 8. 8. Kim. pp. Pröfrock.tugraz. vol.x: Perceptual Evaluation of Speech Quality (PESQ). 2005. Speech. no. Canada. pp. and E. “Soft feature-based watermark decoding with insertion/deletion correction. 2007. “Frame synchronization techniques. A. Oct. Aug. 1. Coumou and G.Bibliography [92] ITU-T Recommendation P.” IEEE Transactions on Communications. vol. Kleijn. 2007.-Y. International Telecommunication Union. [97] Airborne VHF Communications Transceiver. Lindsey. Jun.” IEE Proceedings F: Radar and Signal Processing. [102] M. 2005. vol. Speech and Signal Processing (ICASSP). E. 2083–2126. (2008) Demonstration files: Original and watermarked ATC speech. “Spectral linear prediction: Properties and applications. “A canonical representation of speech. [96] F. Phaselock Techniques. J. Ascheid. 8. 1980. USA. M. 3. J.” in Information Hiding. Nilsson. Feb. 4. 53–56. Le-Ngoc. pp. Aeronautical Radio Inc. 849–852. Synchronization in Digital Communications. Scholtz. 2007. Resch. 1980. 2006. 4567/2008. M. B. 1975. 1182–1187.spsc.at/people/hofbauer/wmdemo1/ [94] J. [93] K. 1989. R.” Proceedings of the IEEE. Wiley. Gardner. Makhoul. Wiley.” in Proceedings of the IEEE International Conference on Multimedia and Expo. pp. [Online]. Jul. Dec. 1107–1121. NJ. B. pp. Koetter. “Guest editorial: Special issue on synchronization. 93. 23. “All digital phase-locked loop: concepts. vol. 1. USA. 28. vol. no. [99] P. 28. 849–852. Hawaii.” IEEE Transactions on Communications. vol. no. [101] D. Moulin and R. no. [104] H. Meyr and G. [100] G. “Carrier and bit synchronization in data communication–a tutorial review. 1990. 200 . 237–251. pp. vol. Franks. vol. Coumou. 28. Sharma and D. pp. 283–296. [103] R. Gardner and W. and W. [95] M. 1980.

Mar. May 2008. [114] H. “ATCOSIM air traffic control simulation speech corpus. “Approximate string matching. “Fractional tap-spacing equalizer and consequences for clock recovery in data modems. Rep. Metzger. 6th ed. Kubin. S. 1–13. IN. [113] Manual of Radiotelephony. A. Pacific Grove. Gilbert.” in Proceedings of the Digital Avionics Systems Conference (DASC).Volume 2 . and R. 1992. Prinz.” Eurocontrol Experimental Centre. 44. Hofbauer. 2006. J. USA. pp. Hagmüller and G. EEC Note 04/05. 2008. Sajatovic. 1976. no. [111] D. Tech. [Online]. and G. EEC Note 05/05. USA. vol. Indianapolis. [108] P. International Civil Aviation Organization Doc 9432 (AN/925). EUROCONTROL EATMP Infocentre. 12. Petre. rev. Indianapolis. V. pp.” in Proceedings of the Asilomar Conference on Signals. EATMP Infocentre. 8. no. Kim. [115] H. no. “Joint equalization and timing recovery in a fractionally-spaced equalizer. M. 7. 2006. J.” IEEE Transactions on Circuits and Systems—Part II: Analog and Digital Signal Processing. Oct. Ungerboeck.” ACM Computing Surveys.” IEEE Transactions on Communications. [116] M. Kauffman. vol.Technical Description. Kerczewski. 4. Stutz. Hall and G. Apr.1–8. Hagmüller. Jul. “Increasing the safety of the ATC voice communications by using in-band messaging. Hofbauer and S. ARINC Voice Services Operating Procedures Handbook. 1997. [112] R. pp. vol. (ARINC). USA. MT. 1980. Aug. “Dual-loop DPLL gear-shifting algorithm for fast synchronization. Rev. [109] G.spsc.at/ATCOSIM 201 . EATMP Communications Strategy . van Roosbroek. and T. 856–864. and A. 2006. 25–29. Chari. Kroepfl. 1. Dowling. A. 4. “Safety and security increase for air traffic management through unnoticeable watermark aircraft identification tag transmitted with the VHF voice communication.” Eurocontrol Experimental Centre. Aeronautical Radio Inc. Gooch. and B. [120] K. B. 24. CA. 2005. vol. pp. “System architecture of the onboard aircraft identification tag (AIT) system. 577–586. “Technology assessment results of the Eurocontrol/FAA future communications study.” in Proceedings of the IEEE Aerospace Conference. 2005. Hering and K. AIT Initial Feasibility Study (D1–D5). 1. [110] D. [119] M. [118] M.E. USA. R. Dec. 2007. “Speech watermarking for air traffic control. 2003. Petrik. TUG-SPSC-2007-11. Budinger. Available: http://www.” Graz University of Technology. S. Artman. Big Sky. 381–402.tugraz. Kubin.” in Proceedings of the 22nd Digital Avionics Systems Conference (DASC 2003). 2003. Oct.Bibliography [107] B. EUROCONTROL [117] J. vol. 3. Celiktin and E. pp. pp. Systems and Computers. Hering.

W. Matassoni. Available: http://www. “Creating annotation tools with the annotation graph toolkit. Parsons.” in Proceedings of the International Conference on Language Resources and Evaluation (LREC). and P. Production and Validation of Speech Corpora. no. V. Royal Institute of Technology (KTH). 2000. Stumpf. Godfrey. RTO-MP-066. (1994) Air traffic control complete.C. Ehrette. Stockholm. [Online].the today use of VHF as a media for pilots/controllers communications. (2005) VOCALISE . L. 2002. 1994. Belgium. T. (2007) The HIWIRE database. Arnoux. S. Wiley. autohotkey. Shen. Benarousse. 2002. Available: http://www. van Leeuwen. [128] C. USA. Maeda. Grieco. Ma. P. jsp?catalogId=LDC94S14A [122] J.” in Proceedings of the Digital Avionics Systems Conference (DASC). H. Draxler. Available: http://www.free mouse and keyboard macro program with hotkeys and autotext. D. and A. H. Paris. Clot. and D. Bastard [131] A. pp. [130] F. 2009.edu/Catalog/CatalogEntry.Bibliography [121] J. 2005.. Antwerp. and D. J. Pigeon. Favennec. [127] K. cena.3. Sweden. Mobile Fading Channels. Breton. J. Swail.” in Proceedings of the RTO Workshop on Multilingual Speech and Language Processing. Computer program.hiwire. Maragos. Potamianos. Aeronautical Authorities and Services. Analysis and Simulation. 2001. unpublished. D. Washington D. 2003.com/ [129] Designators for Aircraft Operating Agencies. Verlag. Wiley. Arnoux. 202 . a noisy and non-native english speech corpus for cockpit communication. Graglia. Aalborg. B. C. I. E.ldc. Sep. [Online]. Linguistic Data Consortium. The Mobile Radio Propagation Channel. Graglia. [124] L. Illina.html [126] L. [133] J. Pätzold. “Vocalise: Assessing the impact of data link technology on the R/T channel. Steeneken. and D. R. “Design and characterization of the nonnative military air traffic communications database (nnMATC). and H. DVD-ROM.upenn. [132] M. [Online]. Denmark.gouv. CD-ROM. (2008) Autohotkey . X. 8585/93.fr/divisions/ICS/projets/vocalise/index_en.” in Proceedings of the International Conference on Spoken Language Processing (INTERSPEECH). A. Gemello. Segura. Pavet. France.org/ [123] S. R. “The NATO native and non-native (N4) speech corpus. Geoffrois.aviation-civile.-A. Lee. 2007. M. “Pre-processing of speech signals for noisy and bandlimited channels. Mallett. Fohr. Campos Domínguez. 1. International Civil Aviation Organization Nr. Schiel and C. Available: http://www.1–1. Bird.” Master’s thesis. [Online]. [125] C. Thiel. Modelling. Series. Mar.

com/gpx. 2000. “Literature review on terrestrial broadband VHF radio channel models. and J. Digital Communications. vol. Available: http:// www. 2007.garmin. [Online]. Evans. 37. 7. “Efficient dual-tone multifrequency detection using the nonuniform discrete Fourier transform. 403. 1999. Discrete-Time Signal Processing. Prentice Hall.sacskyranch. 2005. 50–56. D.” IEEE Signal Processing Letters. 5. 7. Graz.University of Applied Sciences. O. [149] J. V.topografix. [Online]. 83. Allan. 2005. Felder. 6. 2nd ed. Garmin. [144] MATLAB Version 7. [142] M. Sklar. Schafer. no. Buck. Schwaner. “Channel estimation for the voice radio – Basics for a measurementsbased aeronautical voice radio channel model. D. no. Cleveland and S. R. [140] D. 2005. Available: http://www. pp. 1960.” Journal of the Audio Engineering Society. N. 2007. vol. “Optimum estimation of impulse response in the presence of noise. W. Smith. R.” B-VHF.287 (R2007a). Deliverable D-15. 2007. [Online]. (1995) Alternator induced radio noise. Available: http://www. “Transfer-function measurement with maximumlength sequences. no.com [137] GPS 35-LVS Technical Specification. 2001. Ashby. Garmin. Application Note AN 1289. Available: http://www. Rife and J. Mathematics of the Discrete Fourier Transform (DFT) with Audio Applications. [136] MicroTrack 24/96 User Guide. (2006) GPX: the GPS exchange format.” Master’s thesis.4. M-Audio. 1989. [Online]. Levin.garmin. 419–444. “Locally-weighted regression: An approach to regression analysis by local fitting. Sacramento Sky Ranch Inc. [146] A. [148] D. Devlin. [145] M. J.asp [141] W. Available: http:///www.com [139] B. Available: http:// www. vol.” Agilent Technologies. Hodge. FH JOANNEUM . 2nd ed. 2nd ed. The MathWorks. Foster. “The science of timekeeping. Mason. pp.” IRE Transactions on Circuit Theory. [Online].org [135] D. S.com/altnoise. 2000. 1998. C. W.B-VHF.” Journal of the American Statistical Association. Vanderkooy. pp. L. vol. Oppenheim. and C. [143] J.Bibliography [134] BAE Systems Operations. J. 596–610.0. Prentice-Hall. 1988. W3K Publishing. J. [147] M.htm 203 . C. and B. m-audio. Gruber. Austria. [Online].com [138] eTrex Legend Owner’s Manual and Reference Guide.

[161] D.eecs.de/ People/Haas/ [163] E. Hofbauer.4.Bibliography [150] Series 200 Single-Channel Communication System. pp. Munich. Handbuch der Tonstudiotechnik. “S08 ANT-RVSM 3rd continental real-time simulation pilot handbook. Haas. (2006) Steve Harris’ LADSPA plugins (swh-plugins-0. (2008) Audacity: The free. 2007. Communications systems. Germany. Apple. [Online].3. 2002. Metzger. Zölzer. Available: http://audacity. 1. no. 51.Sound eXchange (version 13. International Civil Aviation Organization Nr.org. 6th ed. vol. [159] Digital Aeronautical Flight Information File (DAFIF).” IEEE Transactions on Vehicular Technology. [Online]. [Online]. Seeger. 1998.uk/ [156] Designators for Aircraft Operating Agencies. no.net/ [154] Core Audio Overview. 204 . Jan. 2004. 2. [165] K. Aeronautical Authorities and Services. Available: http://www.6).” Dec.” Eurocontrol Experimental Centre.0). [151] C.0. 8585/107. Seeger and H.edu/genchansim/ [166] K. [153] D.15).dlr. Aeronautical Authorities and Services. “3rd continental RVSM real-time simulation. Available: http://plugin. 2002. Available: http://www. The generic channel simulator. 3rd ed. Haas.kn-s. (2008) SoX . Dannenberg.tugraz. data sheet. Deransy. spsc. [162] E. Available: http://sox.. Dickreiter. 8585/138. O’Connor. cross-platform sound editor (version1. Available: http://developer. Available: http://www.net/ [152] U. [Online]. Rohde & Schwarz.com/documentation/MusicAudio/Conceptual/CoreAudioOverview/ [155] S.at/people/hofbauer/gcs/ [167] M. 2006. Bagwell. The MathWorks. “Aeronautical channel modeling. [164] MATLAB Communications Toolbox. Wiley. [Online]. [158] R. The generic channel simulator. International Civil Aviation Organization Nr. KG Saur.sourceforge. [Online]. 0610. apple. National GeospatialIntelligence Agency. umich. [157] Designators for Aircraft Operating Agencies. vol. 7910/122. Oct. 2006. 1996. [160] Location Indicators. Mazzoni and R. and D.sourceforge. EEC Report 315. 2006. Eurocontrol internal document. 254–264. 1997. [Online]. R. 1997. electronic database. DAFX: Digital Audio Effects. Harris. International Civil Aviation Organization Nr. Lane.

You're Reading a Free Preview

Download