
AUDIO STEGANOGRAPHY FOR COVERT DATA TRANSMISSION BY IMPERCEPTIBLE TONE INSERTION


Kaliappan Gopalan (1) and Stanley Wenndt (2)
(1) Department of Engineering, Purdue University Calumet, Hammond, IN 46323
gopalan@calumet.purdue.edu
(2) Multi-Sensor Exploitation Branch, AFRL/IFEC, Rome, NY 13441
Stanley.Wenndt@rl.af.mil

ABSTRACT

This paper presents the technique of embedding data in an audio signal by inserting low power tones and its robustness to noise and cropping of embedded speech samples. Experiments on the embedding procedure applied to cover audio utterances from the noise-free TIMIT database and a noisy database demonstrate the feasibility of the technique in terms of imperceptible embedding, high data rate and accurate data recovery. The low power levels ensure that the tones are inaudible in the message-embedded stego signal. Besides imperceptibility in hearing, the spectrogram of the stego signal also conceals the existence of embedded information. Both of these features render the detection of embedding in the stego signal difficult to accomplish. Oblivious detection of the stego signal, instead of escrow detection, yields the embedded information accurately. In addition, results of two cases of attacks on the data-embedded stego audio, namely, additive noise and random cropping, show the technique is robust for covert communication and steganography.

Keywords: Audio Steganography, Imperceptible tone insertion

1. INTRODUCTION

Covert communication by embedding a message or data file in a cover medium has been increasingly gaining importance in the all-encompassing field of information technology. Audio steganography is concerned with embedding information in an innocuous cover speech in a secure and robust manner. Communication and transmission security and robustness are essential for transmitting vital information to intended recipients while denying access to unauthorized persons. By hiding the information using a cover or host audio as a wrapper, the existence of the information is concealed during transmission. This is critical in applications such as battlefield communications and bank transactions, for example.

Steganography, in general, relies on the imperfection of the human auditory and visual systems. Audio steganography takes advantage of the psychoacoustical masking phenomenon of the human auditory system (HAS). The psychoacoustical, or auditory, masking property renders a weak tone imperceptible in the presence of a strong tone in its temporal or spectral neighborhood. This property arises because of the low differential range of the HAS even though the dynamic range covers 80 dB below ambient level [1, 2]. Frequency masking occurs when the human ear cannot perceive frequencies at a lower power level if these frequencies are present in the vicinity of tone- or noise-like frequencies at a higher level. Additionally, a weak pure tone is masked by wide-band noise if the tone occurs within a critical band. This property of inaudibility of weaker sounds is used in different ways for embedding information. Embedding of data by inserting inaudible tones in a cover audio signal has been presented recently [3, 4]. The following sections describe the tone insertion technique and its robustness in retrieving the embedded information in the presence of noise and in cropped frames of received speech.

2. EMBEDDING BY TONE INSERTION

The tone insertion method relies on the inaudibility of low power tones in the presence of significantly higher spectral components, an indirect exploitation of the psychoacoustic masking phenomenon in the spectral domain. Experiments were conducted using utterances from (a) the TIMIT database, and (b) the Greenflag database consisting of noisy recordings of air traffic controllers, as host or cover audio samples.

In the first experiment, two tones at frequencies f0 and f1 are generated for embedding bit 0 and bit 1, respectively. As seen in Fig. 1, the host audio is divided into non-overlapping segments of 16 ms in duration. For the host utterances used, f0 is set at 1875 Hz and f1 at 2625 Hz arbitrarily. For every frame of host audio, the frame power, fe, is computed and only one bit of data is embedded into the host audio frame. If the bit to be embedded is a 0, then the power of f0 is set at 0.25 percent of fe and the power of f1 is set at 0.001 of that of f0. To embed a bit of 1, the power of f1 is set at 0.25 percent of fe and the power of f0 is set at 0.001 of the power of f1. The simultaneous setting of significant and extremely low powers to the tones facilitates concealed embedding and correct detection of data. The low and relatively high power ratios avoid one or both of the tones being detected in hearing or spectrogram: if only one of the tones is set to a fixed power ratio relative to the frame power, the other tone may be noticeable in cases where the host frame inherently has a substantial component at the tone frequency. The second advantage is that a known high/low ratio of power between the tones facilitates the detection of the embedded bit even when the embedded amplitudes are scaled or quantized. The frames with their spectral components at the tone frequencies set in accordance with the data bits constituted the stego signal. For transmission, the embedded frame is quantized to 16 bits, which is the same as the original host audio sample size.

Fig. 1 Tone insertion algorithm. Flowchart steps: host audio; segment into 16 ms frames; compute frame power, fe; take the next bit of the information to embed; if embedding a 0 (Y branch), power of f0 → 0.25% of fe and power of f1 → 0.001 of f0, otherwise (N branch) power of f1 → 0.25% of fe and power of f0 → 0.001 of f1; quantize to 16 bits; transmit frame.

For recovering the covert information from every received frame of audio, the frame power fe is computed along with the powers p0 and p1 at f0 and f1. If the ratios satisfy (fe/p0) < (fe/p1), that is, if the tone at f0 is the stronger of the two, then the covert bit is declared a 0. Otherwise, the covert bit embedded in the frame is considered a 1. It must be noted that the power at each of the two frequency indices is computed with zero bandwidth. While this can be done effectively in offline processing by software, notch filters are required for hardware implementation.

Embedding Two Bits Per Audio Frame

The second experiment extended the technique to double the payload capacity by using four tones. For this experiment, one tone out of a selection of four tones, namely, 750 Hz, 1250 Hz, 1875 Hz, and 2625 Hz, was set to 0.25 percent of the average power of each frame while the other tones were set to negligible values. For detection, the ratio of frame power to power at each tone was used. This ratio, clearly, is a minimum for the tone that was set at 0.25 percent of the frame power.

To add further security in transmission, a 4-bit key for each frame was used at the transmitter to determine which one of the four tones would be set at high power relative to the other three tones. The embedded two-bit combination from each frame was recovered by using the same key at the receiver and detecting the relative power ratio of each tone and the frame. The results of these experiments with noise and cropping are presented in the next section.

3. EXPERIMENTAL RESULTS

In the first experiment using the TIMIT utterance, “She had your dark suit in greasy wash water all year,” which is available as 16-bit samples at the rate of 16,000 per second, nonoverlapped frames of 256 samples were used to embed one bit in each. With 208 frames, 208 bits of random data were embedded by inserting tones at 1875 Hz and 2625 Hz with appropriate powers, and the tone-inserted stego signal was quantized to 16 bits for transmission. From the stego, all 208 bits were successfully recovered from the ratios of frame power to power at the tone frequencies. From informal listening tests and from the spectrograms, the stego signal was found indistinguishable from the unembedded host audio signal.

In the second experiment, four tones were used to embed two bits in each frame. In addition, successive frames for embedding were overlapped by 50 percent to further increase the payload capacity. After verifying the imperceptibility of and the data recovery from the stego signal, the technique was extended for use in covert battlefield communication in which the hidden information can be another utterance. For initial studies, the utterance “seven one,” spoken by a male speaker, was used as the covert message. This utterance was represented in the Global System for Mobile communication half-rate (GSM 06.20) coding scheme, resulting in a compact form of 2800 bits. Two TIMIT
utterances, “Thus technical efficiency is achieved at the expense of actual experience” and “His captain was thin and haggard and his beautiful boots were worn and shabby” (each with 16-bit samples at 16,000 samples/s), were concatenated to accommodate the large covert information size. With two bits inserted in each host frame of 256 samples, only the first 1400 overlapped frames out of a total of 1542 were used for embedding all the covert message bits. This gives an embedding capacity of 2800 bits in 11.208 s, or 249.82 bits/s.

Tones for insertion were selected at frequencies of 687.5 Hz, 1187.5 Hz, 1812.5 Hz, and 2562.5 Hz. These frequencies were either absent or weak in the host frames. With four tones, however, an additional step was necessitated to prevent the detection of embedding. Presence of a continuous stream of 0’s or 1’s in the covert data, for instance, results in the same tone being set at 0.25 percent of the corresponding frame power. Although a listener may not be able to perceive the tone because of its low power, the spectrogram is likely to show continuous spectral nulls or ‘holes’ at the remaining three tone frequencies. To a malicious attacker, these artifacts are indicative of host manipulation even without the knowledge of the host spectrogram. Use of a 4-bit key that sets the order of the tones for each frame by frequency hopping avoids such an obvious detection of embedding [3].

Figs. 2 and 3 show the host and the stego signals and their spectrograms using the frequency-hopped four-tone insertion for embedding the covert message. No perceptual or otherwise detectable difference was noticed between the host and the stego signals and all the embedded data were correctly recovered from the stego signal.

Fig. 2 Host (top) and stego with 2800 bits embedded (bottom). [Waveform plots titled "Host - TIMIT" and "Stego - 2 bits/frame"; horizontal axis: sample index.]

To study the effect of noise on the recovery of embedded information, Gaussian noise with zero mean and average power proportional to the frame power of the embedded stego was added to each frame. Random noise at low power is unlikely to increase the power of any of the three tones to a level that exceeds the power level of the significant tone. Hence, the pair of the embedded bits from each of the noise-added stego frames was successfully recovered up to noise power set to 25 percent of frame power. Bit errors started showing up at higher noise power levels. However, speech decoded from GSM-coded speech with up to 10 percent bit errors results in sufficient quality to convey the message albeit with noise. Hence, noise power levels as high as 5 percent of frame power, which resulted in only 70 to 80 bits in error out of a total of 2800 bits, or a bit error rate (BER) of about 2.9 percent, can be used to transmit covert messages in encoded form with the tone embedding technique.

Fig. 3 Spectrograms of host (top) and stego with 2800 bits embedded (bottom). [Panels titled "Host - TIMIT" and "Stego - 2 bits/frame"; axes: time and frequency, 0 to 8000.]

A more serious attack on the stego during transmission than additive noise is the random deletion of a few samples. Since removal of up to one in 50 samples has been shown to cause no perceptible difference [5], we studied its effect on data recovery. With the removal of five samples from each stego frame of 256 samples and replacing them with zeros, bit errors of 4 to 10 out of 2800, or a BER of 0.14 to 0.36 percent, were observed. When 10 samples in each stego frame were replaced by zeros, the bit error count increased to 52 to 62. When the removed samples were replaced by their neighbors, the error count increased to about 30 for five-sample replacement and to about 80 for 10-sample replacement. Still, as noted previously, a malicious attack on the stego may render the host message noisy while still carrying the coded covert audio message in a perceivable form.

In addition to the clean host from the TIMIT database, the frequency-hopped tone insertion was also used in an experiment with a noisy host from the Greenflag database. Obtained as 16-bit PCM data at a rate of 8000 samples per second, the Greenflag database consists of utterances from the cockpit of fighter aircraft. Because of the high level of noise inherent in the host, externally introduced tone or noise arising from embedding is generally not noticeable. Figs. 4 and 5 show the result of embedding the GSM-coded covert speech, ‘seven one’, on a host consisting of two utterances from the Greenflag database. Using 128 samples per frame, the host of 80128 samples has 1251 frames, which can embed only 2502 bits out of the 2800 bits of the coded covert speech. This gives an embedding capacity of 249.8 bits/s.

Because of the high level of intrinsic noise in the host, the dominant tone power was raised to more than 10 percent of the frame power. Although the stego signal did not show any perceptual difference from the host, the higher tone power started showing up in the spectrogram. To mask the dominant tone in the spectrogram, the tones were set to frequencies in the range where the host has significant energy. In the 400 Hz to 1000 Hz range, for example, the host has relatively high spectral energy over almost the entire duration. Hence, inserting tones at frequencies of 625 Hz, 750 Hz, 875 Hz and 1062.5 Hz, with as much as 25 percent of frame power in the dominant tone, did not result in any noticeable difference in speech quality or spectrogram; the inserted tones, randomized because of the hopping key, were clearly masked by the already significant spectral components in the host. Fig. 5 shows the spectrograms of the host and stego for comparison.

With additive noise power raised up to the power of each stego frame, all 2502 bits were correctly recovered. At higher noise levels of up to 2.5 times the frame power, the bit error count was below 80 out of 2502, or a BER below 3.2 percent.

Cropping by zeroing or replacing from 3 to 50 samples in each stego frame caused no bit error due to the relatively higher power of the inserted tones. At a higher number of destroyed samples, the stego became highly noisy; still, the bit error was negligible.
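
The two-tone procedure of Section 2 and the additive-noise test described in this section can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a synthetic Gaussian signal stands in for the TIMIT host, the tone power is set by rescaling a single rFFT bin (one reading of "zero bandwidth"), and the helper names `tone_power`, `set_tone_power`, `embed`, `detect` and `add_noise`, as well as the 0.1 percent noise level, are hypothetical choices made for the example. With 256-sample frames at 16,000 samples/s the bin spacing is 62.5 Hz, so 1875 Hz and 2625 Hz fall exactly on bins 30 and 42. A bit is declared 0 when the power at f0 exceeds that at f1, that is, when fe/p0 is the smaller of the two ratios.

```python
import numpy as np

FS = 16_000          # sampling rate in Hz (as for the TIMIT host)
N = 256              # samples per 16 ms frame
K0, K1 = 30, 42      # rFFT bins of f0 = 1875 Hz and f1 = 2625 Hz (FS / N = 62.5 Hz)
HIGH = 0.0025        # dominant tone power: 0.25 percent of frame power
LOW = 0.001          # weak tone power as a fraction of the dominant tone power


def tone_power(spectrum, k, n=N):
    """Power of the sinusoid represented by rFFT bin k of an n-sample frame."""
    return 2.0 * np.abs(spectrum[k]) ** 2 / n ** 2


def set_tone_power(spectrum, k, power, n=N):
    """Rescale bin k in place (keeping its phase) so the tone has the given power."""
    phase = np.angle(spectrum[k])
    spectrum[k] = n * np.sqrt(power / 2.0) * np.exp(1j * phase)


def embed(host, bits):
    """Embed one bit per non-overlapping frame, then quantize to 16 bits."""
    stego = host.astype(np.float64)
    for i, bit in enumerate(bits):
        sl = slice(i * N, (i + 1) * N)
        fe = np.mean(stego[sl] ** 2)              # frame power
        spec = np.fft.rfft(stego[sl])
        strong, weak = (K0, K1) if bit == 0 else (K1, K0)
        set_tone_power(spec, strong, HIGH * fe)
        set_tone_power(spec, weak, LOW * HIGH * fe)
        stego[sl] = np.fft.irfft(spec, n=N)
    return np.clip(np.round(stego), -32768, 32767).astype(np.int16)


def detect(signal, n_bits):
    """Recover one bit per frame from the ratios fe/p0 and fe/p1."""
    bits = []
    for i in range(n_bits):
        frame = np.asarray(signal[i * N:(i + 1) * N], dtype=np.float64)
        fe = np.mean(frame ** 2)
        spec = np.fft.rfft(frame)
        r0 = fe / (tone_power(spec, K0) + 1e-12)  # small ratio means strong tone
        r1 = fe / (tone_power(spec, K1) + 1e-12)
        bits.append(0 if r0 < r1 else 1)
    return np.array(bits)


def add_noise(stego, fraction, rng):
    """Add zero-mean Gaussian noise whose power is `fraction` of each frame's power."""
    noisy = stego.astype(np.float64)
    for i in range(len(noisy) // N):
        sl = slice(i * N, (i + 1) * N)
        sigma = np.sqrt(fraction * np.mean(noisy[sl] ** 2))
        noisy[sl] += rng.normal(0.0, sigma, N)
    return noisy


rng = np.random.default_rng(0)
n_frames = 208                                    # same frame count as the first experiment
# Assumption: synthetic 16-bit host used only because TIMIT audio is not bundled here.
host = np.round(3000 * rng.standard_normal(n_frames * N)).astype(np.int16)
bits = rng.integers(0, 2, n_frames)

stego = embed(host, bits)
errors_clean = int(np.sum(detect(stego, n_frames) != bits))
errors_noisy = int(np.sum(detect(add_noise(stego, 0.001, rng), n_frames) != bits))
print("bit errors without noise:", errors_clean)
print("bit errors with noise at 0.1 percent of frame power:", errors_noisy)
```

Raising the `fraction` argument of `add_noise` gives a rough feel for how bit errors appear as the noise power approaches the dominant tone power; the numbers obtained with a synthetic host should not be expected to match those reported for the TIMIT and Greenflag hosts.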

Fig. 4 Greenflag utterance host (top) and GSM-coded bits embedded stego. [Waveform plots titled "Host - GF" and "Stego - 2 bits/frame"; horizontal axis: sample index.]

Fig. 5 Spectrograms of Greenflag host (top) and stego with 2502 bits (bottom). [Panels titled "Host - GF" and "Stego - 2 bits/frame"; axes: time and frequency, 0 to 4000.]

4. DISCUSSION AND CONCLUSION

A method of embedding covert data in a cover audio signal by insertion of low power tones has been presented. The results of the present experiments demonstrate the feasibility of the proposed technique for audio steganography in terms of imperceptibility, payload and data recovery with zero BER.

At the embedding rate of approximately 250 bits/s, the technique has a high payload capacity. Any attempt to increase the capacity further must use more than four tones. However, use of eight tones for embedding 3 bits/frame, for example, may lead to audible and/or visible artifacts unless the selected tones are absent in the host audio. Also, noise, intentional or unintentional, may cause high bit errors at high capacity. Noisy host signals, on the other hand, can have larger payload and use higher power without significant loss of data. In general, tones selected from high energy regions of the host can be masked in hearing and spectrogram by their low power levels.

Malicious attacks involving replacement of embedded samples with zeros or neighboring values appear to cause less loss of data when fewer samples are affected. Cropping of a large number of samples destroys the cover audio, however.

For further imperceptibility at more than two bits/frame of embedding, which increases the number of tones, the inserted tones may be selected from a set of psychoacoustically masked spectral points. While it is preferable to use the same set of tone frequencies in all the frames, it may not be possible to do so for a general cover utterance. Instead, a set of tones from the most commonly occurring masked frequencies may be chosen for embedding. Alternatively, each frame may have its own tone set selected from the perceptually masked set of
the frame. The collection of the frame tone frequencies may then form an additional key for covert data embedding and recovery. An issue that may arise with the use of the masked frequency set is that coding and compression of the embedded (stego) audio may alter the signal waveform in the masked regions (and hence the power at the inserted tone frequencies). The extent of loss of embedded data due to MPEG and other such coding schemes needs to be studied.

The proposed technique shows promise as a robust method for audio steganography under noisy and cropped conditions.

REFERENCES

[1] W. Bender, D. Gruhl, N. Morimoto and A. Lu, “Techniques for data hiding,” IBM Systems Journal, Vol. 35, Nos. 3 & 4, pp. 313-336, 1996.

[2] M.D. Swanson, M. Kobayashi, and A.H. Tewfik, “Multimedia data-embedding and watermarking technologies,” Proc. IEEE, Vol. 86, pp. 1064-1087, June 1998.

[3] K. Gopalan, S. Wenndt, A. Noga, D. Haddad, and S. Adams, “Covert Speech Communication Via Cover Speech By Tone Insertion,” Proc. of the 2003 IEEE Aerospace Conference, Big Sky, MT, Mar. 2003 (on CD).

[4] K. Gopalan, et al., “Covert Speech Communication Via Cover Speech By Tone Insertion,” U.S. Patent applied for, Oct. 2003.

[5] R.J. Anderson and F.A.P. Petitcolas, “On the limits of steganography,” IEEE J. Selected Areas in Communications, Vol. 16, No. 4, pp. 474-481, May 1998.
