
DIGITAL AUDIO WATERMARKING WITH SEMI-BLIND DETECTION FOR IN-CAR MUSIC CONTENT IDENTIFICATION.

RON HEALY & JOE TIMONEY

DIGITAL AUDIO WATERMARKING WITH SEMI-BLIND DETECTION FOR IN-CAR MUSIC CONTENT IDENTIFICATION
Ron Healy and Joe Timoney
Department of Computer Science, National University of Ireland, Maynooth, Co. Kildare, Ireland.
rhealy@cs.nuim.ie, jtimoney@cs.nuim.ie

Recent developments in audio watermarking techniques have gone some way towards promoting an industry-wide
acceptance of digital audio watermarking as a process that will eventually be used in all audio (and video) production.
The predominant focus of such watermarking research has been in the area of content protection, as preventing illegal
copying is an area of concern for content owners. However, digital audio watermarking may also be used for other
purposes, such as the added-value concept of real-time content identification of music. While computer-based users of
music enjoy the opportunity to identify unknown audio using online tools, identification of audio in a domestic or in-car
scenario is not so easily achieved. This paper deals with an area of digital audio watermarking that would facilitate
real-time in-car identification of the artist, title and/or other meta-data relating to music being broadcast by radio.

INTRODUCTION

When a track that a listener is unfamiliar with is played on radio, he or she is at the mercy of the presenter or producer of the radio show for an identification of the track. However, in many cases no identification is given. This occurs when a radio presenter plays a number of tracks in sequence and does not identify any or all of them, or when a radio station is using a computerised delivery platform, particularly in the so-called dead hours where there is no human involvement in the broadcast. It can be frustrating when a listener hears a track they are very interested in and wants to know more about it but is given no information to identify the performer or track title. Assuming the listener wants to find out more about the artist, and may even want to buy the material they are listening to, there is currently no way of facilitating this discovery. Sales opportunities may well be lost.

Digital audio watermarking at source will overcome this limitation and not only offer listeners the information needed to research or buy the material, but also offer an ‘added value’ technology for manufacturers of both domestic and in-car entertainment devices as a user-centred selling point. An audio watermark is embedded into songs at the time of production, which will allow the real-time identification of the performer, track title and/or other content such as publisher details, ownership details etc. This information can be extracted and displayed on existing screens in both domestic and in-car audio systems, in a manner similar to the way radio station data is currently displayed. The information will be part of the actual audio content, rather than meta-data such as MP3 headers, so would be transmitted with the audio, even over an analogue transmission channel. This paper examines how this may be achieved and discusses initial steps towards a successful prototype.

The techniques used in the processes discussed in this paper make use of well-known DSP algorithms such as the Fast Fourier Transform and the Goertzel algorithm. The conceptual direction of the work is inspired by ‘Covert Speech Communication Via Cover Speech By Tone Insertion’ by Gopalan, K., et al., Proc. of the 2003 IEEE Aerospace Conference, Big Sky, MT, March 2003. It is a novel use of existing, well-understood, mature DSP techniques.

The process is described as semi-blind, rather than blind, since the decode system does require knowledge of some value or parameter used in the encoding process. However, it does not need access to the original unwatermarked host audio.

IMPLEMENTATION

A representation of the information to be embedded in the intended host audio was created. The representation chosen for this scenario was the track title and artist details. However, the representation selected may be any information that the broadcaster wants to transmit. Once the information to be embedded was decided on, a sinusoidal waveform was generated to represent the information. This was then added to the host audio as a basic initial step towards a digital watermark. The intention was to create a digital audio watermarking system which is simple in terms of computational complexity.

AES 36th International Conference, Dearborn, Michigan, USA, 2009 June 2–4 1
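The tone-based approach described in this paper can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each watermark bit is sent as a short sinusoid added to the host at one of two frequencies derived from a base frequency F, and that the decoder measures the energy at those two frequencies with the Goertzel algorithm. The 48 kHz sampling rate, the 7000 Hz base frequency and the 25 ms tone length are values quoted in the paper. The bit offsets (200 Hz and 600 Hz), the tone amplitude, the synthetic 440 Hz host and all function names are hypothetical choices made for illustration only.

```python
import math

SAMPLE_RATE = 48_000      # samples per second, as quoted in the paper
BASE_FREQ = 7_000         # base frequency F in Hz (one of the values tested)
BIT_OFFSETS = (200, 600)  # ASSUMPTION: Hz added to F for bit 0 and bit 1
TONE_MS = 25              # tone length L in ms per bit
TONE_AMPLITUDE = 0.01     # ASSUMPTION: low level relative to the host


def samples_per_tone():
    """Number of samples occupied by one bit tone."""
    return SAMPLE_RATE * TONE_MS // 1000


def embed(host, bits):
    """Add one sinusoidal tone per bit to a copy of the host samples.

    The host must hold at least len(bits) * samples_per_tone() samples.
    """
    n = samples_per_tone()
    marked = list(host)
    for i, bit in enumerate(bits):
        freq = BASE_FREQ + BIT_OFFSETS[bit]
        start = i * n
        for j in range(n):
            marked[start + j] += TONE_AMPLITUDE * math.sin(
                2 * math.pi * freq * j / SAMPLE_RATE)
    return marked


def goertzel_power(frame, freq):
    """Squared magnitude of the frame at a single frequency (Goertzel)."""
    coeff = 2 * math.cos(2 * math.pi * freq / SAMPLE_RATE)
    s1 = s2 = 0.0
    for x in frame:
        s1, s2 = x + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def decode(audio, n_bits):
    """Semi-blind decode: needs F, the offsets and L, but not the host."""
    n = samples_per_tone()
    bits = []
    for i in range(n_bits):
        frame = audio[i * n:(i + 1) * n]
        p0 = goertzel_power(frame, BASE_FREQ + BIT_OFFSETS[0])
        p1 = goertzel_power(frame, BASE_FREQ + BIT_OFFSETS[1])
        bits.append(1 if p1 > p0 else 0)
    return bits


# Hypothetical host: a 440 Hz sine standing in for music.
payload = [1, 0, 1, 1, 0, 0, 1, 0]
host = [0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
        for t in range(len(payload) * samples_per_tone())]
marked = embed(host, payload)
recovered = decode(marked, len(payload))
```

The decoder only ever sees `marked`, which is the sense in which the scheme is semi-blind. Real music has energy near the chosen frequencies, which is why the paper goes on to discuss notch filtering and masking; this sketch ignores those steps.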

Once the sinusoid representing the watermark was added to the host, obstacles that arose were identified, and attempts to overcome them were suggested and implemented. The use of techniques such as notch filtering, psychoacoustic/perceptual masking, amplitude masking and other well-established DSP techniques is discussed in the context of the proposed use of the watermarking system.

Sample values for the three parameters involved in embedding and then decoding the watermark were used to perform the full encode/decode cycle on a set of host audio files, and the results were compared to provide a set of ‘optimal’ values for this particular set of host audio files. The process is described as follows:

\[ \min_{S} \; d(S, L, F) = \{\, P : P \geq 98\% \,\} \quad \text{subject to} \quad 6\ \text{kHz} < F < 10\ \text{kHz} \]

S = number of samples decoded; L = length of watermark; F = frequency manipulated; P = percentage accuracy

This is explained as ‘the lowest value of S for the function d that returns a value of P greater than or equal to 98% using the inputs S, L and F, all of which are integer values, with the constraint that F (frequency manipulated) must be in the range 6 kHz to 10 kHz’. This constraint on F is chosen to maximise the potential of the watermark to withstand common attacks.

The value S represents the number of samples used for decoding to provide the result for P, and the computational cost of decoding increases with S. However, the accuracy P does not improve continuously or consistently as S increases, so it is useful to find the value of S beyond which it ceases to have a beneficial effect on P, so as to minimise unnecessary computational cost in decoding. Since the decoding process is intended to be performed in real time or better, the number of samples required for decoding, while maintaining an acceptable level of accuracy, should be as low as possible.

Minimising S is preferred over minimising L or F since the encoding process (using L) is only performed once, so L has a comparatively low effect on decoding complexity. However, as L increases, the length of the watermark increases, meaning there will be fewer watermarks in the same number of samples decoded. The number of watermarks decoded can affect the accuracy of the decoding process, so while S carries the greater cost, L also has some cost in the decoding phase, albeit less.

The choice of frequencies F to manipulate has no effect on the computational cost of decoding. Also, frequency choice should be left to the user, within reason, so as to maximise the potential for covert/private watermarking.

The results of the full encode/decode cycle performed on a sample set of host audio, using selected values for S, F and L, are displayed in Table 1, where P is the accuracy of recovered watermarks. Note that watermarks are only considered to be accurately recovered if the system returns the complete identifier originally embedded in the host; partially correct watermarks are discounted.

Table 1 shows the results of the encoding/decoding process on 50 audio (WAV) files, using the input parameters ‘F’ and ‘L’. The parameter ‘F’ represents the base frequency that was chosen, and two values are added to ‘F’ to define the two frequencies that were modified to represent the watermark 1/0 bits. The value for ‘L’ is the length in ms of the tone for each binary bit. The length of the watermark is a multiple of ‘L’ (watermark length in ms = L x 193), and this affects the number of loops of the full watermark (WM) that are present in a given length of host audio.

The decoding process has a set number of samples (S) from which to decode the watermark. The number of loops of the watermark that could theoretically be embedded in S is S divided by the sampling rate (in this case 48,000 samples per second), then divided by the length of the watermark in seconds.

The optimal set of parameters is defined as the lowest value of S that returns P equal to or above 98%, with constraints setting 5000 Hz < F < 10000 Hz and L > 5 ms but minimised without lowering P. Other values of F and L are included for illustration.

Table 1.

Line  S (million samples)  F (Hz)  L (ms per tone)  WM length (s)  P (%)
1     1                    12000   3                0.579          12
2     1                    12000   10               1.93           90
3     1                    12000   20               3.86           94
4     1                    12000   25               4.825          92
5
6     1                    5000    16               3.088          50
7
8     1                    7000    3                0.579          0
9     1                    7000    10               1.93           64
10
11    1                    10000   25               4.825          84
12
13    3                    6000    25               4.825          95
14
15    3                    7000    25               4.825          98
16    3                    7000    20               3.86           95
17
18    4                    5000    16               3.088          76
19
20    4                    7000    3                0.579          6
21    4                    7000    5                0.965          72
22    4                    7000    10               1.93           88
23    4                    7000    16               3.088          92
24    4                    7000    20               3.86           96
25    4                    7000    25               4.825          98
26
27    6                    5000    16               3.088          76

FIG.1: Alternative representation of the data included in Table 1
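The arithmetic behind Table 1 can be checked with a short sketch. The figures are copied from Table 1, and the 193 tones per watermark and the 48,000 samples per second come from the text. The function and variable names are hypothetical, and the selection rule is only one reading of the paper's definition of ‘optimal’: the lowest S, then the lowest L, among rows with P of at least 98% and F strictly between 6 kHz and 10 kHz.

```python
SAMPLE_RATE = 48_000       # samples per second, from the paper
TONES_PER_WATERMARK = 193  # from "watermark length in ms = L x 193"

# Populated rows of Table 1:
# (line, S in million samples, F in Hz, L in ms per tone, P in %)
TABLE_1 = [
    (1, 1, 12000, 3, 12), (2, 1, 12000, 10, 90), (3, 1, 12000, 20, 94),
    (4, 1, 12000, 25, 92), (6, 1, 5000, 16, 50), (8, 1, 7000, 3, 0),
    (9, 1, 7000, 10, 64), (11, 1, 10000, 25, 84), (13, 3, 6000, 25, 95),
    (15, 3, 7000, 25, 98), (16, 3, 7000, 20, 95), (18, 4, 5000, 16, 76),
    (20, 4, 7000, 3, 6), (21, 4, 7000, 5, 72), (22, 4, 7000, 10, 88),
    (23, 4, 7000, 16, 92), (24, 4, 7000, 20, 96), (25, 4, 7000, 25, 98),
    (27, 6, 5000, 16, 76),
]


def watermark_seconds(tone_ms):
    """Duration of one full watermark in seconds."""
    return tone_ms * TONES_PER_WATERMARK / 1000


def complete_loops(samples, tone_ms):
    """Whole watermarks that fit in `samples` (integer arithmetic)."""
    samples_per_watermark_x1000 = SAMPLE_RATE * tone_ms * TONES_PER_WATERMARK
    return samples * 1000 // samples_per_watermark_x1000


def optimal_row(rows, f_min=6000, f_max=10000, target=98):
    """Lowest S (then lowest L) meeting the accuracy target within the F band."""
    candidates = [r for r in rows if f_min < r[2] < f_max and r[4] >= target]
    return min(candidates, key=lambda r: (r[1], r[3]))


best = optimal_row(TABLE_1)
loops_3m = complete_loops(3_000_000, 25)
loops_4m = complete_loops(4_000_000, 25)
print(best, loops_3m, loops_4m)
```

Under these assumptions the selected row is line 15 (S = 3 million samples, F = 7000 Hz, L = 25 ms). The loop counts for 3 and 4 million samples come out as 12 and 17 complete watermarks.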

Initial observations suggest that all three factors can affect the accuracy of the decoding process, apparently independently. However, some have more impact than others. It has been observed that a watermark with a tone length (L) less than or equal to 5 ms produces low accuracy, while a watermark with longer tone lengths generally produces better accuracy, up to a point where increasing the watermark length further has no benefit.

The number of samples available for decoding the watermark is a major factor in producing accurate decoding results. However, again, this holds only up to a point. Not only is it incorrect to say that more samples always means higher accuracy, but a larger value for S equates to a much more computationally costly decode process, so S must be minimised for an acceptable result of P. The above results would seem to indicate that 3 million samples, equating to 12 complete loops of a watermark with 193 tones of 25 ms each, produce the same accuracy over 50 files as do 4 million samples, or 17 loops of the same watermark. Decoding more samples does not always result in better accuracy, as evidenced by the results in line 15 compared to line 25, and in line 18 compared to line 27, but it does equate to an increase in the computation required to produce the same results.

It would appear that a watermark with a 25 ms tone length is optimal for accurate extraction of the content of the watermark at an acceptable level of 98%. However, increasing the frequencies manipulated does not necessarily result in better accuracy, as a watermark with 25 ms tone lengths embedded in the 10,000 Hz range returns comparatively low accuracy, with a smaller number of samples to decode from.

The subplots in FIG.1 show the results in Table 1 in an alternative format, illustrating the impact of each of the parameters S, F and L on the decoding accuracy P, shown on the x axis.

It is apparent from initial observations, as mentioned above, that each of the factors used in embedding the watermark (embedding frequency, length of watermark tone) as well as the factor used to decode the watermark (length of candidate audio / loops of the watermark to decode) can affect the overall success of the process. Detailed analysis of a large range of audio, using various values for these factors, will enable optimal values to be selected with greater confidence.

CONCLUSION

The technique discussed provides a platform for a computationally simple watermark encoding/decoding system that would use a set of standard parameter values to enable semi-blind decoding, much as a public-key decryption system allows holders of the public key to decipher the encrypted message. Adding a mechanism to domestic or in-car entertainment systems to decode the content using the hard-coded parameter values will allow the identification and on-screen display of track and artist information.

Radio broadcasters could also utilise the same techniques to add generic, content-dependent or location-dependent advertising information to the broadcast transmission, which can then be decoded by the same mechanisms and displayed on-screen in the same manner. This may add an ancillary revenue stream for radio broadcasters, or may be used for ad-hoc broadcasting of other information to a wide audience without affecting the actual presentation of the radio shows in which it is embedded.
