Audio Splicing Detection and Localization Using Environmental Signature

Hong Zhao, Yifan Chen, Member, IEEE, Rui Wang, Member, IEEE, and Hafiz Malik, Member, IEEE

arXiv:1411.7084v1 [cs.CR] 26 Nov 2014

Abstract—Audio splicing is one of the most common manipulation techniques in audio forensics. In this paper, the magnitudes of the acoustic channel impulse response and the ambient noise are proposed as an environmental signature. Specifically, spliced audio segments are detected according to the magnitude correlation between the query frames and reference frames via a statistically optimal threshold. The detection accuracy is further refined by comparing adjacent frames. The effectiveness of the proposed method is tested on two data sets: one generated from the TIMIT database, and the other recorded in four acoustic environments using commercial-grade microphones. Experimental results show that the proposed method not only detects the presence of spliced frames, but also localizes the forged segments with near-perfect accuracy. Comparison results illustrate that the identification accuracy of the proposed scheme is higher than that of previous schemes. In addition, experimental results also show that the proposed scheme is robust to MP3 compression attacks, which is likewise superior to previous works.

Index Terms—Audio Forensics, Acoustic Environmental Signature, Splicing Detection

I. INTRODUCTION

DIGITAL media (audio, video, and image) has become dominant evidence in litigation and criminal justice, as it can easily be obtained with smartphones and carry-on cameras. Before digital media can be admitted as evidence in a court of law, its authenticity and integrity must be verified. However, the availability of powerful, sophisticated, low-cost and easy-to-use digital media manipulation tools has made the integrity authentication of digital media a challenging issue. For instance, audio splicing is one of the most popular and easiest attacks, where the target audio is assembled by splicing together segments from multiple audio recordings. Thus, in this context, we are trying to provide authoritative answers to questions such as the following [1]:

• Was the evidentiary recording captured at the location, as claimed?
• Is an evidentiary recording “original” or was it created by splicing multiple recordings together?

In our previous works [1]–[4], we have shown that acoustic reverberation and ambient noise can be used for acoustic environment identification. One limitation of these methods is that they cannot be used for splicing location identification. Researchers have also proposed various methods for audio source identification [5], acquisition device identification [6], double compression detection [7], etc. However, the applicability of these methods to audio splicing detection is missing.

This paper presents a novel approach to provide evidence of the location information of the captured audio. The magnitudes of the acoustic channel impulse response and the ambient noise are used as an intrinsic acoustic environment signature for splicing detection and splicing location identification. The proposed scheme first extracts the magnitudes of the channel impulse response and ambient noise by applying a spectrum classification technique to each frame. Then, the correlation between the magnitudes of the query frame and the reference frame is calculated. Spliced frames are detected by comparing the correlation coefficients to a pre-determined threshold. A refining step using the relationship between adjacent frames is adopted to reduce the detection errors. One of the advantages of the proposed approach is that it does not depend on any assumptions as in [1]. In addition, the proposed method is robust to lossy compression attacks.

Compared with the existing literature, the contributions of this paper include: 1) environment-dependent signatures are exploited for audio splicing detection; 2) similarity between adjacent frames is proposed to reduce the detection and localization errors.

The rest of the paper is organized as follows. A brief overview of state-of-the-art audio forensics is provided in Section II. The audio signal model and the environmental signature estimation algorithm are introduced in Section III. Applications of the environmental signature to audio authentication and splicing detection are elaborated in Section IV. Experimental setup, results, and performance analysis are provided in Section V. Finally, the conclusion is drawn in Section VI along with a discussion of future research directions.

Copyright (c) 2014 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org

Hong Zhao, Yifan Chen and Rui Wang are with the Department of Electrical and Electronic Engineering, South University of Science and Technology of P.R. China. email: zhao.h@sustc.edu.cn, chen.yf@sustc.edu.cn, wang.r@sustc.edu.cn

Hafiz Malik is with the Department of Electrical and Computer Engineering, The University of Michigan – Dearborn, MI 48128; ph: +1-313-593-5677; email: hafiz@umich.edu

II. RELATED WORKS

The research on audio/speech forensics dates back to the 1960s, when the U.S. Federal Bureau of Investigation conducted examinations of audio recordings for speech intelligibility enhancement and authentication [8]. In those days, most forensic experts worked extensively with analog magnetic tape recordings to authenticate the integrity
of the recording [9], [10] by assessing analog recorder fingerprints, such as the length of the tape, recording continuity, head switching transients, mechanical splices, overdubbing signatures, etc. Authentication becomes more complicated and challenging for digital recordings, as traces of tampering are much more difficult to detect than mechanical splices or overdubbing signatures on analog tape. Over the past few decades, several efforts have been initiated to fill the rapidly growing gap between digital media manipulation technologies and digital media authentication tools.

For instance, the electric network frequency (ENF)-based methods [5], [11]–[15] utilize the random fluctuations in the power-line frequency caused by mismatches between the electrical system load and generation. This approach can be extended to recaptured audio recording detection [16] and geolocation estimation [17]. However, the ENF-based approaches may not be applicable if well-designed audio equipment (e.g., professional microphones) or battery-operated devices (e.g., smartphones) are used to capture the recordings. Although research on the source of ENF in battery-powered digital recordings has been initiated [18], its application in audio forensics is still untouched.

On the other hand, statistical pattern recognition techniques have been proposed to identify recording locations [19]–[24] and acquisition devices [6], [25]–[30]. For instance, techniques based on time/frequency-domain analysis (or mixed-domain analysis) [7], [31]–[33] have been proposed to identify traces of double compression attacks in MP3 files. A framework based on frequency-domain statistical analysis has also been proposed by Grigoras [34] to detect traces of audio (re)compression and to discriminate among different audio compression algorithms. Similarly, Liu et al. [35] have proposed a statistical-learning-based method to detect traces of double compression. The performance of their method deteriorates for low-to-high bit rate transcoding. Qiao et al. [36] address this issue by considering non-zero de-quantized Modified Discrete Cosine Transform (MDCT) coefficients for their statistical machine learning method, and an improved scheme has been proposed recently [37]. Brixen [38] has proposed a time-domain method based on acoustic reverberation estimated from digital audio recordings of mobile phone calls for crime scene identification. Similar schemes using reverberation time for authentication have been studied by Malik [2], [39]. However, these methods are limited by their low accuracy and high complexity in model training and testing. Additionally, their robustness against lossy compression is unknown. In [1]–[4], [30], [40], model-driven approaches were proposed to estimate acoustic reverberation signatures for automatic acoustic environment identification and forgery detection, where the detection accuracy and robustness against MP3 compression attacks are improved.

Recently, audio splicing detection has attracted considerable research interest. However, all the above methods were not primarily designed for audio splicing detection, and their applicability to it is unclear. A method based on higher-order time-differences and correlation analysis has been proposed by Cooper [41] to detect traces of “butt-splicing” in digital recordings. Another audio splicing detection scheme is based on the analysis of high-order singularities of wavelet coefficients [42], where singular points detected by wavelet analysis are classified as forged. Experimental results show that the best detection rate of this scheme on WAV-format audio is less than 94% and decreases significantly with the reduction of the sampling rate. Pan et al. [43] have proposed a time-domain method based on higher-order statistics to detect traces of splicing. The method uses differences in the local noise levels of an audio signal for splice detection. It works well on simulated spliced audio, generated by assembling a pure clean audio and a noisy audio; however, its performance in practical applications is unknown. All the aforementioned schemes assume that the raw audio recording is available, and their performance might degrade significantly when a lossy audio compression algorithm is used. In this paper, a robust feature, namely the magnitude of the acoustic impulse response, is proposed for audio splicing detection and localization. Experimental results demonstrate that the proposed scheme outperforms the state-of-the-art works described above.

III. AUDIO SIGNAL MODELING AND SIGNATURE ESTIMATION

A. Audio Signal Model

In audio recording, the observed audio/speech is contaminated by propagation through the acoustic channel between the speaker and the microphone. The acoustic channel can be defined as the combined effect of the acoustic environment, the locations of the microphone and the sound source, and the characteristics of the microphone and associated audio capturing equipment. In this paper, we consider the model of the audio recording system in [1]. To begin, let s(t), y(t), and η(t) be the direct speech signal, the digital recording, and the background noise, respectively. The general model of the recording system is illustrated in Fig. 1. In this paper, we assume that the microphone noise, ηMic(t), and the transcoding distortion, ηTC(t), are negligible, because a high-quality commercial microphone is used and the recordings are saved in raw format. One kind of transcoding distortion (lossy compression distortion) will be considered in the experiments later. With these assumptions, the model in Fig. 1 can be simplified as,

Fig. 1. A simplified digital audio recording system diagram.
y(t) ≈ hRIR(t) ∗ hMic(t) ∗ s(t) + η(t)
     ≈ ḣRIR(t) ∗ s(t) + η(t)                                        (1)

where ∗ represents the convolution operation; hRIR denotes the Room Impulse Response (RIR) (or Acoustic Channel Response (ACR)) and ḣRIR(t) = hMic(t) ∗ hRIR(t).

As a remark, note that hRIR(t) is uniquely determined by the acoustic propagation channel, while hMic(t) characterizes the microphone used to capture the audio. Compared to hRIR(t), hMic(t) is flatter, less dynamic and weaker for a high-quality microphone. It has been shown that hMic(t) is also useful for authenticating the integrity of audio [30]. Therefore, we consider the combined effect of hMic(t) and hRIR(t), and estimate them simultaneously. In the remainder of this paper, we shall refer to ḣRIR(t) as the environmental signature, as hRIR(t) is predominant. Since the environmental signature uniquely specifies the acoustic recording system, it can be used to authenticate the integrity of captured audio. The blind estimation algorithm for the environmental signature in the existing literature is elaborated in Section III-B, and our proposed authentication and detection algorithm is introduced in Section IV.

B. Environmental Signature Estimation

Blind dereverberation is the process of separating the reverberant and dry signals from a single-channel audio recording without exploiting knowledge of ḣRIR(t). Dereverberation methods can be divided into the following categories: (i) frequency-domain methods [44]–[47], (ii) time-domain methods [48], [49], and (iii) joint time-frequency-domain methods [50]. Blind dereverberation has found a wide range of applications, ranging from speech enhancement [45], [46] to distant speech recognition [47], acoustic environment identification [1]–[4], audio forensics [2], [51], etc. In our previous works [1]–[4], it was also experimentally verified that reverberation is quite effective for environment identification [1], [3], [4] and audio forensics [2], [51] applications. In this paper, we shall use the frequency-domain method to estimate the RIR. Specifically, we first introduce the environmental signature, which represents the log-spectrum magnitude of the RIR contaminated by ambient noise, and then elaborate on the corresponding estimation algorithm [44]. It can be achieved in the following three steps.

Step 1: Signature Expression

The Short Time Fourier Transform (STFT) representation of (1) can be expressed as:

Y(k, l) = S(k, l) × H(k, l) + V(k, l)                               (2)

where Y(k, l), S(k, l), H(k, l) and V(k, l) are the STFT coefficients (in the kth frequency bin and the lth frame) of y(t), s(t), ḣRIR(t) and η(t), respectively. It is important to mention that (2) is valid if and only if the frame length of the STFT is larger than the duration of the RIR, hRIR(t). Generally, the acoustic channel varies much more slowly than the speech and, therefore, it is reasonable to assume that |H(k, l)| does not vary significantly with l. The power spectrum of the noisy observation y(t) is expressed as

|Y(k, l)|² = |S(k, l)|² × |H(k, l)|² + |V(k, l)|²
            + 2 × |S(k, l)| × |H(k, l)| × |V(k, l)| × cos θ          (3)

where θ = ∠S(k, l) + ∠H(k, l) − ∠V(k, l). Dividing both sides of (3) by |S(k, l)|² × |H(k, l)|² and taking the logarithm of the resulting expression yields:

log |Y(k, l)| − log |S(k, l)| = log |H(k, l)| + log ε(k, l)          (4)

where

ε = √(ξ⁻¹ + 2ξ⁻⁰·⁵ cos θ + 1),                                      (5)

and

ξ = |S(k, l)|² × |H(k, l)|² / |V(k, l)|².                            (6)

Gaubitch et al. [44] have shown how to extract log |H(k, l)| and log ε(k, l) from the noisy observation |Y(k, l)|. However, in audio forensics applications, the background noise provides useful information, and its effectiveness has been experimentally verified in [4], [51] and [1]. Thus, we define the environmental signature H as follows:

H(k) = (1/L) Σ_{l=1}^{L} ( log |Y(k, l)| − log |S(k, l)| )           (7)

Note that in the definition of the environmental signature (7), Y(k, l) is the spectrum of the observed audio signal, and S(k, l) is the spectrum of the clean audio signal, which can be estimated as follows.

Step 2: Clean Log-Spectrum Estimation

In practice, S(k, l) is not available and can be approximated via a Gaussian Mixture Model (GMM) [52]. Given a data set consisting of clean speech with various speakers and contents, each sample, s(n), is divided into overlapping windowed frames and processed by the STFT to obtain S(k, l). Mean subtraction is used to smooth the log-spectrum S(k, l),

Slog(k, l) = log |S(k, l)| − (1/K) Σ_{k=1}^{K} log |S(k, l)|         (8)

where K is the total number of frequency bins and Slog(k, l) represents the smoothed log-spectrum.

For each frame, the RASTA-filtered Mel-Frequency Cepstral Coefficients (MFCC) [53] (more robust than pure MFCC) are estimated, which results in an N-dimensional feature vector mfccs(l) = [mfccs(1, l), ..., mfccs(N, l)]ᵀ, where l denotes the lth audio frame. The target M-mixture GMM, defined by a mean vector, µm, a diagonal covariance matrix, Σm, and a weight, πm, for each mixture, is trained using the MFCCs. The mixture probabilities are expressed as

γ_{l,m} = πm N(mfccs(l) | µm, Σm) / Σ_{i=1}^{M} πi N(mfccs(l) | µi, Σi)   (9)

where N(mfccs(l) | µm, Σm) represents the multivariate Gaussian distribution.
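Before turning to Step 3, the signature definition (7) can be sanity-checked in the noise-free case: with V(k, l) = 0 and |Y(k, l)| = |S(k, l)||H(k)|, the frame average of log |Y| − log |S| recovers log |H(k)| exactly. The array shapes and variable names below are illustrative stand-ins, not the paper's data.

```python
# Noise-free sanity check of the environmental signature definition (7):
# with V(k,l) = 0 and Y(k,l) = S(k,l) H(k), averaging log|Y| - log|S|
# over frames recovers log|H(k)| exactly. Shapes/names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
K, L = 257, 40                       # frequency bins, frames
S = rng.uniform(0.1, 1.0, (K, L))    # stand-in clean magnitude spectra |S(k,l)|
H = rng.uniform(0.5, 2.0, K)         # stand-in channel magnitude |H(k)|
Y = S * H[:, None]                   # |Y(k,l)| = |S(k,l)| |H(k)|, no noise term

# Eq. (7): H(k) = (1/L) sum_l (log|Y(k,l)| - log|S(k,l)|)
H_hat = np.mean(np.log(Y) - np.log(S), axis=1)
```

In practice S(k, l) is unknown, which is exactly why Step 2 approximates it with the GMM-based clean-speech model; the estimate Ĥ(k) in (12) replaces log |S| with the model-weighted clean log-spectrum.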
The log-spectrum of clean speech can be obtained by weighted averaging over all frames as:

S̄log(k) = Σ_{l=1}^{L} γ_{l,m} × Slog(k, l) / Σ_{l=1}^{L} γ_{l,m}     (10)

Step 3: Signature Estimation

The clean speech model of (10) is now used to estimate the acoustic distortion in (7). More specifically, the following steps are involved in estimating the acoustic distortion given in (7): (i) a feature vector, mfccy(l) = [mfccy(1, l), ..., mfccy(N, l)]ᵀ, of the observation y(t) is obtained by applying RASTA filtering to the estimated MFCC feature vector; (ii) the posterior probability of mfccy(l) is calculated by substituting the filtered feature vector into (9), which results in γ̄_{l,m} for each mixture m = 1, 2, ..., M. With the availability of the posterior probability, the spectrum of the direct signal of the lth frame can be expressed as:

Ŝ(k, l) = exp [ Σ_{m=1}^{M} γ̄_{l,m} × S̄log(k) ]                     (11)

Substituting (11) into (7), the estimated signature can be expressed as:

Ĥ(k) = (1/L) Σ_{l=1}^{L} [ log |Y(k, l)| − log e^{Σ_{m=1}^{M} γ̄_{l,m} × S̄log(k)} ]   (12)

As a remark, note that the accuracy of the estimated RIR depends on the error between the estimated spectrum, Ŝ(k, l), and the true spectrum, S(k, l).

IV. APPLICATIONS TO AUDIO FORENSICS

A novel audio forensics framework based on the acoustic impulse response is proposed in this section. Specifically, we consider two application scenarios of audio forensics: (i) audio source authentication, which determines whether the query audio/speech was captured in a specific environment or location as claimed; and (ii) splicing detection, which determines whether the query audio/speech is original or assembled from multiple samples recorded in different environments, and detects the splicing locations (in the case of forged audio).

A. Audio Source Authentication

In practice, the crime scene is usually secured and kept intact for possible further investigation. Thus, it is possible to rebuild the acoustic environmental setting, including the locations of the microphone and sound source, furniture arrangement, etc. Hence, it is assumed that the forensic expert can capture a reference acoustic environment signature, denoted as Ĥr, and compare it to the signature Ĥq extracted from the query audio recording. As illustrated in Fig. 2, the proposed authentication scheme is elaborated below:

1) Estimate the RIRs, Ĥq and Ĥr, from the query audio and the reference audio, respectively, according to the method described in Section III-B,
2) Calculate the normalized cross-correlation coefficient ρ between Ĥq and Ĥr according to (13),

ρ = NCC(Ĥq, Ĥr)
  = Σ_{k=1}^{K} (Ĥq(k) − µq)(Ĥr(k) − µr) / [ √(Σ_{k=1}^{K} (Ĥq(k) − µq)²) × √(Σ_{k=1}^{K} (Ĥr(k) − µr)²) ]   (13)

where µq = (1/K) Σ_{k=1}^{K} Ĥq(k) and µr = (1/K) Σ_{k=1}^{K} Ĥr(k).

3) Compare the cross-correlation coefficient with a threshold T to determine the authenticity of the query audio. Specifically, the following hypothesis testing rule is used:

ρ < T : Forged
ρ ≥ T : As Claimed                                                  (14)

Fig. 2. Block diagram of the proposed audio source authentication method.

The performance of the above authentication scheme depends on the threshold T. In the following, a criterion is proposed to derive the optimal threshold.

1) Formulation of Optimal Threshold: Suppose that fe(ρ) represents the distribution of ρ, the correlation coefficient between the reference audio and a query audio captured in an identical environment. In this case, Ĥq and Ĥr are highly positively correlated. Similarly, fg(ρ) represents the distribution of ρ, the correlation coefficient between the reference audio and a query audio captured in a different environment. In this case, Ĥq and Ĥr are expected to be independent. Due to the inaccurate estimation of the signatures and the potential noise, the resulting correlation coefficients will diverge from these ideal values. The statistical distributions of the ρi's will be used to determine the optimal threshold. The binary hypothesis testing based on an optimal threshold, T in (14), contributes to the following two types of errors:

• Type I Error: Labeling an authentic audio as forged. The probability of a Type I Error is called the False Positive Rate (FPR), defined as

FPR(T) = ∫_{−∞}^{T} fe(ρ) dρ = Fe(T)                                 (15)

where Fe is the Cumulative Distribution Function (CDF) of fe.
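As a concrete sketch of steps 2) and 3) above, the normalized cross-correlation (13) and the decision rule (14) can be written in a few lines. The function names and the sinusoidal stand-in signatures are my own illustrative choices, and the threshold value is arbitrary rather than the optimized T derived from the error analysis.

```python
# Minimal sketch of the similarity measure (13) and decision rule (14).
# Function names, toy signatures, and threshold are illustrative only.
import numpy as np

def ncc(hq, hr):
    """Normalized cross-correlation of two signature vectors, Eq. (13)."""
    dq, dr = hq - hq.mean(), hr - hr.mean()
    return float(np.sum(dq * dr) / np.sqrt(np.sum(dq**2) * np.sum(dr**2)))

def authenticate(hq, hr, threshold):
    """Decision rule (14): 'As Claimed' iff rho >= T, else 'Forged'."""
    return "As Claimed" if ncc(hq, hr) >= threshold else "Forged"

k = np.arange(256)
sig_a = np.sin(2 * np.pi * k / 256)   # stand-in signature, environment A
sig_b = np.cos(2 * np.pi * k / 256)   # stand-in signature, environment B

rho_same = ncc(sig_a, sig_a)          # identical signatures -> 1
rho_diff = ncc(sig_a, sig_b)          # orthogonal test vectors -> ~0
```

The two stand-in signatures are deliberately orthogonal, mimicking the independent-environment case that fg(ρ) models, while identical signatures give the ρ → 1 behavior that fe(ρ) models.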
• Type II Error: Labeling a forged audio as authentic. The probability of a Type II Error is called the False Negative Rate (FNR), defined as

FNR(T) = ∫_{T}^{∞} fg(ρ) dρ = 1 − Fg(T)                              (16)

where Fg is the CDF of fg.

Then, the optimal decision boundary T can be determined by minimizing the combined effect of these errors as follows:

T := arg min_{T ∈ [−1, 1]} [ λ × FPR(T) + (1 − λ) × FNR(T) ]         (17)

where λ ∈ [0, 1] is a control factor chosen by the forensic expert according to the practical application.

Deriving the optimal threshold in (17) requires prior knowledge of fe(ρ) and fg(ρ). One intuitive solution is to use the empirical distributions (histograms) of ρ. However, this strategy results in a high computational complexity for determining the optimal threshold in (17). Another disadvantage is that a closed form of the threshold cannot be determined for a specified FPR or FNR. In the next section, we fit the distributions using statistical models and elaborate on the estimation of the model parameters, which can then be used to calculate the threshold.

2) Distribution Modeling and Parameter Estimation: To model the distribution of ρ, we consider the same-acoustic-environment case first. For identical acoustic environments, the correlation coefficient, ρ, is very close to the extreme (maximum) value of the interval [−1, 1], i.e., ρ → 1. We experimentally find that the extreme value distribution is a good model in this case due to its simplicity and low fitting error (shown in Fig. 3 and Section V-B1). The Probability Density Function (PDF) of the extreme value distribution is expressed as [54]

fe(ρ | µe, δe) = δe⁻¹ exp[ (ρ − µe)/δe − exp( (ρ − µe)/δe ) ]        (18)

where µe and δe represent the location and scale parameters, respectively. The parameters µe and δe can be estimated via maximum likelihood estimation [55]. The likelihood of drawing N samples ρi from an extreme value distribution with parameters µe and δe can be expressed as

P(ρ1, ..., ρN | µe, δe) = Π_{i=1}^{N} δe⁻¹ exp[ ρ̄i − exp(ρ̄i) ]       (19)

where

ρ̄i = (ρi − µe)/δe                                                   (20)

The maximum likelihood estimates µ̂e and δ̂e satisfy:

µ̂e = δ̂e log [ (1/N) Σ_{i=1}^{N} exp(ρi/δ̂e) ]                        (21)

− δ̂e − (1/N) Σ_{i=1}^{N} ρi + [ Σ_{i=1}^{N} ρi exp(ρi/δ̂e) ] / [ Σ_{i=1}^{N} exp(ρi/δ̂e) ] = 0   (22)

Numerical methods such as gradient descent can be used to solve (22). Fig. 3 shows the true distribution of ρ for the case where the query audio and the recaptured audio are from the same environment, together with the extreme value distribution with estimated parameters µ̂e = 0.72 and δ̂e = 0.11. It can be observed from Fig. 3 that the estimated distribution is skewed left, and therefore the extreme value distribution models the true distribution reasonably well.

Fig. 3. Plots of the true distribution of ρ and the extreme value distribution with estimated parameters µ̂e = 0.72 and δ̂e = 0.11.

Similarly, the correlation coefficient ρ between Ĥq(k) and Ĥr(k) estimated from a query audio and a recaptured audio made in two different environments is expected to be close to zero. In this case, the correlation coefficient ρ is expected to obey a heavy-tailed distribution. To this end, the distribution of ρ(i), i = 1, ..., M, is modeled using the Generalized Gaussian distribution [56]. The motivation for considering the generalized Gaussian model here is that thicker tails will lead to more conservative error estimates, as demonstrated in Fig. 4. The PDF of the Generalized Gaussian distribution is given as follows:

fg(ρ | αg, βg, µg) = [ 1 / (2 αg Γ(1/βg)) ] exp[ −( |ρ − µg| / αg )^{βg} ]   (23)

where αg, βg, and µg represent the scale, shape and mean parameters, respectively.

The parameters can be estimated using the method of moments [56] as follows:

µ̂g = (1/N) Σ_{i=1}^{N} ρi                                            (24)

β̂g = G⁻¹(m̂1²/m̂2)                                                    (25)

α̂g = m̂1 Γ(1/β̂g) / Γ(2/β̂g)                                           (26)
where

m̂1 = (1/N) Σ_{i=1}^{N} |ρi − µ̂g|                                     (27)

m̂2 = (1/N) Σ_{i=1}^{N} |ρi − µ̂g|²                                    (28)

G(x) = [Γ(2/x)]² / ( Γ(1/x) Γ(3/x) )                                 (29)

and Γ(x) is the Gamma function.

Fig. 4 shows the distribution of the normalized correlation coefficients between Ĥq(k) and Ĥr(k) estimated from audio samples captured in different environments. It can be observed from Fig. 4 that the distribution of ρ fits the generalized Gaussian PDF reasonably well. The goodness of fit is evaluated using the Kullback-Leibler divergence (KLD). The KLD between the true distribution and the fitted generalized Gaussian distribution is 0.24, which indicates that the generalized Gaussian model fits the data reasonably well. It can be observed from Figs. 3 and 4 that both distributions are heavy-tailed and will therefore overlap.

Fig. 4. Plot of the true distribution of ρ and the generalized Gaussian distribution with estimated parameters µ̂g = 0.077, α̂g = 0.14 and β̂g = 1.74.

In some practical applications, the threshold should be selected to satisfy a maximum allowed FPR (τ) requirement. The optimal threshold for the specified FPR τ can be obtained via

T = Fe⁻¹(τ)                                                          (30)

where Fe⁻¹ represents the inverse function of Fe. Subsequently, substituting (30) into (16) gives the FNR as

FNR_T = 1 − Fg(Fe⁻¹(τ))                                              (31)

With the fitted statistical model (18), (30) reduces to

Tmodel = δ̂e ln( ln( 1/(1 − τ) ) ) + µ̂e                               (32)

B. Audio Splicing Detection and Localization

In this section, we extend the method proposed in Section IV-A to audio splicing detection. To illustrate the audio splicing process, an audio clip is assembled from three segments, Seg1, Seg2, and Seg3, consisting of M1, M2, and M3 frames, respectively. Without loss of generality, it is assumed that audio segments Seg1 and Seg3 are recorded in acoustic environment A and audio segment Seg2 is recorded in environment B. Fig. 5 illustrates the temporal plot of the resulting audio. It can be observed from Fig. 5 that splicing does not introduce any visual artifacts in the resulting audio.

Fig. 5. Plot of the spliced audio assembled from three segments, Seg1, Seg2, and Seg3, consisting of M1, M2, and M3 frames, respectively. Here, audio segments Seg1 and Seg3 are recorded in acoustic environment A and audio segment Seg2 is recorded in environment B.

The difference between splicing detection and source authentication is that the reference audio is usually not available. Hence, in our proposed scheme, we use the RIRs extracted from the previous frames as the reference and compare them with the incoming frames. Specifically, the proposed scheme can be divided into two phases: raw detection and refinement. The procedure of raw detection is elaborated below:

1) Divide the query audio into overlapping frames and estimate the RIR, Ĥq^(i), of the ith frame according to the method in Section III-B,
2) Calculate the cross-correlation coefficient between the RIRs of the ith frame and the previous frames according to (33),

ρ̇i = (1/i) Σ_{j=1}^{i} NCC(Ĥq^(j), Ĥq^(i)),        if ρ̇_{i−1} > T and i ≤ MT
ρ̇i = (1/MT) Σ_{j=1}^{MT} NCC(Ĥq^(j), Ĥq^(i)),      otherwise         (33)

MT = min_{i ∈ [1, M]} { i | ρ̇i < T }  and  ρ̇1 = 1                    (34)

where Ĥq^(i) is the estimated channel response of the ith frame of the query audio, and M is the length (in frames) of the query audio. MT is a rough estimate of M1. Figs. 6(a) and (b) show the correlation calculation methods expressed in (33) for i ≤ MT and i > MT, respectively.

3) Identify the spliced frames of the query audio according to (35):

ρ̇i < T : Spliced Audio Frame
ρ̇i ≥ T : Original Audio Frame                                        (35)

It should be noted that this framework for localizing forgery locations in audio requires an accurate estimate of the impulse
response, Ĥq(k). Any distortions in Ĥq(k) will result in a number of false alarms. However, due to the lack of substantial reference frames, the average-based smoothing technique of (33) is adopted to reduce the detection errors. Eq. (33) first calculates the cross-correlation coefficient between Ĥq^(1) and Ĥq^(2). If ρ̇2 > T, the second frame is authentic (it comes from the same location as the first frame). Then, it calculates the cross-correlation coefficients between Ĥq^(1) and Ĥq^(3), and between Ĥq^(2) and Ĥq^(3), updates the coefficients according to the smoothing method in (33), moves forward to the next frame, and so on. The average-based smoothing technique used in (33) results in a stable estimation of the RIRs, which reduces the decision error of (35).

Fig. 6. Flow chart of the correlation calculation proposed for audio splicing detection: (a) all the previous frames are used as reference; (b) the first MT frames are used as reference. The two operators shown represent the correlation calculation and the arithmetical average, respectively.

Our experimental results show that the detection performance can be further improved by exploiting the correlation between adjacent frames. The motivation is that tampering usually occurs over many consecutive frames rather than in a few scattered frames. That means that if the neighbors of the current frame are forged, the current frame has a significantly higher probability of being forged as well. Based on this reasonable assumption, the local correlation between the estimated signatures extracted from adjacent frames is used to make the forgery localization approach robust to estimation errors. The refinement procedure is summarized as follows:

qi = 1,  if ( Σ_{j=i−W/2}^{i+W/2} pj − pi ) / W > Rs                  (37)

where pj ∈ {0, 1} indicates whether the jth frame was flagged as spliced in the raw detection phase, and W and Rs represent the window size (e.g., W = 3 or 5) and the similarity score threshold (e.g., Rs ∈ [0.7, 0.9]), respectively. qi = 1 indicates that the ith frame is detected as spliced.

Eq. (37) means that, for the ith suspected frame, if the ratio of its adjacent frames that are also suspected exceeds Rs, the ith frame is labeled as spliced. It has been observed through experimentation that the refinement significantly reduces the false alarm rate and improves the detection performance. Section V provides a performance analysis of the proposed method on both synthetic and real-world recordings.

V. PERFORMANCE EVALUATION

The effectiveness of the proposed methodology is evaluated using two data sets: synthetic data and real-world human speech. Details of the data sets, the experimental setup and the experimental results are provided below.

A. Data Set and Experimental Setting

First, the speech data of the TIMIT corpus [57] was used to train the clean log-spectrum model and to generate the synthetic data. TIMIT consists of 6300 sentences: ten sentences spoken by each of 438 male and 192 female speakers. The data set is divided into a training set (462 speakers) and a test set (168 speakers) with entirely different sentence contents, apart from the dialect diagnostics. Each utterance is approximately 3 seconds long with sampling frequency fs = 16 kHz. For training, the entire training set was processed using Hanning-windowed frames (128 ms each) with an overlap of 50%. Twelve RASTA-MFCCs were calculated for each frame and used to train the GMM with 512 mixtures. For testing, the ten sentences of each speaker in the test set were concatenated to form one utterance with an approximate duration of 30 s. A synthetic room response generated by the source-image method [58] for a rectangular room is convolved with the test speech. Each time, all the parameters used for the room response simulation were randomly selected to generate the RIRs of different environments. Table I shows the candidate intervals of the parameters used.

For the second data set, real-world data consisting of 60 speech recordings were used (the same data set as in [39]). The speeches were recorded in four different environments: (1) outdoor; (2) small office (3.4 m×3.8 m×2.7 m, predominantly furnished with carpet and drywalls); (3) stairs (predominantly ceramic tiles and concrete walls); (4) restroom (5.2 m×4.3 m×2.7 m, predominantly furnished with ceramic tiles). In each recording environment, each of the three speak-
ers (one male S1 , two females S2 and S3 ) read five different
1) Detect the suspected frame using the raw detection texts. The audio samples were captured by a commercial-grade
method and set the label pk of the ith frame to 1 as external microphone mounted on a laptop computer using
follows, Audacity 2.0 software. The audio was originally recorded with
pk = 1, if ρ̇i < T (36) 44.1 kHz sampling frequency, 16 bits/sample resolution, and
then downsampled to 16 kHz. The real world data set will be
2) Refine the results using neighborhood similarity score used to verify the effectiveness of the proposed scheme.
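The two-step procedure above, raw thresholding of ρ̇_i followed by neighborhood refinement, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the neighborhood half-width w, and the toy score values are our assumptions.

```python
import numpy as np

def detect_spliced_frames(rho, T, Rs, w=2):
    """Two-step splicing detector sketch.

    rho: per-frame similarity scores (rho_i) against the reference signature
    T:   decision threshold (frames with rho_i < T are suspected)
    Rs:  minimum ratio of suspected neighbours needed to confirm a frame
    w:   half-width of the neighbourhood examined around each frame
    """
    rho = np.asarray(rho, dtype=float)
    raw = (rho < T).astype(int)              # step 1: raw per-frame labels
    refined = np.zeros_like(raw)
    for i in range(len(raw)):
        lo, hi = max(0, i - w), min(len(raw), i + w + 1)
        neigh = np.delete(raw[lo:hi], i - lo)    # adjacent frames only
        # step 2: confirm the label only if enough neighbours are suspected
        if raw[i] and neigh.size and neigh.mean() >= Rs:
            refined[i] = 1
    return refined

scores = [0.8, 0.8, 0.1, 0.15, 0.12, 0.8, 0.2, 0.8]
print(detect_spliced_frames(scores, T=0.3274, Rs=0.5))
```

With Rs = 0.5, the isolated low-score frame at index 6 is cleared while the consecutive run at indices 2 through 4 survives, mirroring the assumption that tampering spans consecutive frames.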
TABLE I
THE CANDIDATE RANGE OF EACH PARAMETER USED FOR ROOM IMPULSE RESPONSE SIMULATION

Parameter            Interval     | Parameter            Interval
Room height, r_z     [2.5, 3.5]   | Mic position, m_x    [0.5, r_x - 0.5]
Room width, r_x      [2.5, 5]     | Mic position, m_y    [0.5, r_y - 0.5]
Room length, r_y     [r_x, 10]    | Sound position, s_z  [0.4, r_z - 0.8]
T60                  [0.1, 0.7]   | Sound position, s_x  [0.3, r_x - 0.9]
Mic position, m_z    [0.5, 1.0]   | Sound position, s_y  [0.3, r_y - 0.9]

B. Experimental Results

1) Results on Synthetic Data: In this section, synthetically generated reverberant data were used to verify the effectiveness of the proposed algorithm. First, randomly selected parameters from Table I were used to generate two different RIRs, H_A and H_B, which characterize two different environments A and B. Both were convolved with randomly selected speeches from the test set. Shown in Fig. 7 are the magnitude plots of the estimated Ĥ_A and Ĥ_B. It can be observed from Fig. 7 that the RIRs Ĥ_A1 and Ĥ_A2, estimated from the speeches in environment A, are quite similar to each other and different from Ĥ_B, estimated from the speeches in environment B. The normalized correlation coefficient between Ĥ_A1 and Ĥ_A2 is equal to 0.81, which is significantly higher than that between Ĥ_A1 and Ĥ_B (0.001). Similar results were obtained from other tests; more results can be found in [44].

Fig. 7. Magnitude (dB) plots of estimated Ĥ from two different environments: (a) estimated Ĥ_A1 from randomly tested speech 1; (b) estimated Ĥ_A2 from randomly tested speech 2, which is different from speech 1; (c) estimated Ĥ_B from randomly tested speech 3, which is different from speeches 1 and 2.

The optimal threshold T can be determined from the synthetic data as follows. The randomly generated RIRs H_A and H_B were convolved with the test data set, respectively, resulting in two simulated synthetic data sets from two virtual environments A and B. Then, the environmental signatures estimated from the synthetic speeches in environments A and B are represented as Ĥ_A,i and Ĥ_B,i, respectively, where 0 < i ≤ N_t and N_t denotes the total number of test speeches. The distributions of the intra-environment correlation coefficient, ρ_A,i→A,j = NCC(Ĥ_A,i, Ĥ_A,j), i ≠ j, and the inter-environment correlation coefficient, ρ_A,i→B,k = NCC(Ĥ_A,i, Ĥ_B,k), are shown in Fig. 8. As discussed in Section IV-A, ρ_A,i→A,j and ρ_A,i→B,k can be modeled by the extreme value distribution and the generalized Gaussian distribution, respectively. The optimal threshold T for these distributions is determined to be 0.3274 by setting λ = 0.5, which results in a minimal overall error of 2.23%. It is important to mention that, depending on the application at hand, λ and the corresponding optimal decision threshold T can be selected accordingly.

Fig. 8. The true and the fitted distributions of ρ_A,i→A,j and ρ_A,i→B,k, and the optimal decision threshold, T = 0.3274.

We have also evaluated the performance of the proposed method in terms of the Receiver Operating Characteristic (ROC). Fig. 9 shows the ROC curve of the True Positive Rate (TPR, the probability that query audio is classified as forged when in fact it is) versus the False Positive Rate (FPR, the probability that query audio is labeled as forged when in fact it is not) on the test audio set. It can be observed from Fig. 9 that, for FPR > 3%, TPR → 1. This indicates that the proposed audio source authentication scheme has almost perfect detection performance when recaptured reference audio is available. In addition, the forensic analyst can recapture sufficient reference audio samples, which leads to a more accurate and stable estimate of the environment signature and hence better detection performance.

Fig. 9. ROC curve of the proposed audio source authentication scheme.
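The threshold selection described above can be sketched numerically: given samples of intra- and inter-environment correlation scores, sweep T and minimize the λ-weighted sum of the two error rates. The sampled score distributions below are synthetic stand-ins of our own choosing, not the paper's fitted extreme-value and generalized Gaussian models (with those, the paper reports T = 0.3274 at λ = 0.5).

```python
import numpy as np

def optimal_threshold(intra, inter, lam=0.5):
    """Pick T minimizing lam*P(miss) + (1 - lam)*P(false alarm).

    intra: correlation scores for same-environment signature pairs
    inter: correlation scores for cross-environment signature pairs
    """
    intra, inter = np.asarray(intra), np.asarray(inter)
    best_t, best_err = None, np.inf
    for t in np.linspace(-1.0, 1.0, 2001):
        miss = np.mean(intra < t)   # authentic pair scored below T
        fa = np.mean(inter >= t)    # cross-environment pair scored above T
        err = lam * miss + (1.0 - lam) * fa
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# Synthetic stand-in score distributions (NOT the paper's fitted models).
rng = np.random.default_rng(0)
intra = rng.normal(0.8, 0.1, 5000)   # same-environment pairs cluster high
inter = rng.normal(0.0, 0.1, 5000)   # cross-environment pairs cluster near 0
t_opt, err = optimal_threshold(intra, inter, lam=0.5)
print(0.2 < t_opt < 0.6, err < 0.05)
```

Varying λ shifts the threshold toward fewer misses or fewer false alarms, which is the application-dependent choice the text mentions.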
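The synthetic-data construction described in Section V-A, convolving a room response with clean test speech, can be sketched as follows. The paper generates its RIRs with the source-image method [58]; here, purely as an illustrative assumption, a white-noise tail with exponential decay stands in for the RIR, and all names are ours.

```python
import numpy as np

def toy_rir(fs=16000, t60=0.3, length_s=0.4, seed=0):
    """Stand-in RIR: white noise under an exponential decay reaching
    roughly -60 dB at t = T60 (NOT the source-image method of the paper)."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(length_s * fs)) / fs
    h = rng.standard_normal(t.size) * np.exp(-6.9 * t / t60)
    return h / np.max(np.abs(h))

fs = 16000
speech = np.random.default_rng(1).standard_normal(3 * fs)  # stand-in for a TIMIT utterance
h_a = toy_rir(fs, t60=0.3, seed=2)       # "environment A" response, T60 within Table I's range
reverberant = np.convolve(speech, h_a)   # simulated recording made in room A
print(reverberant.size == speech.size + h_a.size - 1)  # full convolution length
```

Convolving the same utterance with two RIRs drawn from different parameter sets yields the pair of reverberant speeches used for the splicing experiments that follow.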
Furthermore, we also evaluated the performance of the proposed splicing detection algorithm on synthetic data. Two randomly selected speeches from the test set were convolved with H_A and H_B, resulting in two reverberant speeches, RS_A and RS_B, and RS_B was inserted at a random location within RS_A. Fig. 10(a) shows the time-domain plot of the resulting audio. It is barely possible to observe any trace of splicing by visual inspection, even for forensic experts. Fig. 10(b) shows the detection output of the proposed algorithm listed in Algorithm ?? with a frame size of 3 s. It can be observed from Fig. 10 that the proposed forgery detection and localization algorithm successfully detected and localized the inserted audio segment, RS_B, with 100% accuracy.

Fig. 10. Forgery detection and localization results (with frame size = 3 s): (a) spliced audio, consisting of segments RS_A1, RS_B, and RS_A2; (b) detection result, with threshold T = 0.3274.

The impact of frame size on the performance of the proposed scheme is investigated next. Fig. 11 shows the detection performance of the proposed forgery detection and localization algorithm as a function of the frame size (used to estimate the environmental signature Ĥ). It is intuitive that a larger frame size results in a more accurate estimate of Ĥ and hence higher detection accuracy. Forgery localization, on the other hand, has a localization granularity of approximately half the frame size and thus calls for a smaller frame size. In other words, decreasing the frame size leads to better localization resolution but a less accurate estimate of the environment signature, which may reduce detection accuracy. It can be observed from Fig. 11 that for frame sizes above 1 s, the spliced parts can be identified and localized with 100% accuracy. For frame sizes between 500 ms and 1 s, the proposed scheme can still detect the spliced segments with a carefully chosen threshold instead of the trained "optimal threshold" of Section IV-A; for instance, in Fig. 11(e), good detection and localization performance is still obtained with a threshold of 0.17. Reducing the frame size further deteriorates both the detection accuracy and the localization granularity [see also Fig. 11(f)]. It has been observed through extensive experimentation that, for the data sets used here, a frame size of 1 s is a good compromise; the optimal frame size should be determined according to the application at hand.

Fig. 11. Detection performance of the proposed scheme with various frame sizes: (a) 3 s, (b) 2 s, (c) 1 s, (d) 750 ms, (e) 500 ms, (f) 400 ms. The red line represents the threshold T = 0.3274 used to classify the suspected forged frames.

The effect of noise on the proposed scheme has also been investigated. To this end, two randomly generated RIRs, H_A and H_B, were convolved with two randomly selected speeches from the test data set, resulting in reverberant speeches RS_A and RS_B, respectively. Noisy speech samples were generated by adding white Gaussian noise at various Signal-to-Noise Ratios (SNRs) to RS_A and RS_B, and noisy forged recordings were generated by inserting RS_B into the middle of RS_A. A total of 1344 noisy spliced audio recordings were obtained by varying the SNR values. Fig. 12 shows the detection results of the proposed scheme under various SNR values. For each forged sample, roughly the 13th through 44th frames belong to RS_B; the others come from RS_A. SNR_A and SNR_B represent the SNRs of the reverberant speeches RS_A and RS_B, respectively. It can be observed from Fig. 12(a)-(f) that if the SNR of either speech recording (say RS_A) is kept relatively high, the proposed scheme can detect and localize the inserted frames with very high accuracy, regardless of the SNR of the other speech (RS_B). As the acoustic environment signature is generally modeled as a combination of the RIR and the background noise, a divergence in background noise is also a trace of counterfeiting [43], [51]. It has been observed that detection performance deteriorates as the SNR level decreases. It can also be observed from Fig. 12(g)-(h) that the FPR of the proposed scheme increases gradually as the SNR decreases. However, even under low-SNR conditions (e.g., 10 dB), the proposed scheme exhibits good detection and localization performance.

2) Results on Real World Data: The effectiveness of the proposed audio splicing detection method has also been verified on real-world speech data. To this end, the data set consisting of 60 real-world speech recordings of three speakers reading five different texts in four acoustically different environments is used. The audio recordings made in the same environment with the same speaker (but different text contents) were downsampled to 16 kHz and concatenated to generate recordings of 90 s to 120 s duration. For fair comparison, each forged audio is created by inserting an audio segment (from another recording) into the middle of a recording.

Fig. 13 shows the detection results of the proposed scheme on real-world recordings created by inserting the speech recording of the 1st speaker in the 1st environment [red part in Fig. 13(a)] into the middle of the speech recording of the 2nd speaker
in the 3rd environment [blue part in Fig. 13(a)]. It can be observed from Fig. 13(a) that all three segments contain strong but similar background noise, and the splicing did not introduce any visual artifacts. The resulting audio is analyzed using the proposed splicing detection and localization method. For frame sizes ≥ 1 s, the optimal decision threshold was set to 0.4, i.e., T = 0.4. It has been observed, however, that detection performance gradually deteriorates for smaller frame sizes [e.g., both the FPR and the FNR (the probability that query audio is labeled as authentic when in fact it is forged) increase]. This is illustrated in Fig. 13(c) and (d). It can be observed that the proposed scheme is still capable of detecting and localizing the inserted frames for frame sizes of 2 s and 1 s. However, optimal decision threshold selection becomes very difficult for smaller frame sizes. Pattern recognition methods such as support vector machines [59] may be applied to address this limitation; nevertheless, such a discussion is not the focus of this paper.

Fig. 12. Detection performance of the proposed scheme under various SNRs. The frame size is 3 s for all runs: (a) SNR_A = 30 dB, SNR_B = 30 dB; (b) 30/25 dB; (c) 30/20 dB; (d) 30/15 dB; (e) 30/10 dB; (f) 30/5 dB; (g) 20/15 dB; (h) 10/10 dB. Blue circles and red triangles represent the frames from RS_A and RS_B, respectively. The red lines are the corresponding "optimal" decision thresholds.

Fig. 13. Detection performance of the proposed scheme on real-world data with various frame sizes: (a) spliced audio; (b) 3 s; (c) 2 s; (d) 1 s. Blue circles and red triangles represent the frames from E1S1 and E3S2, respectively.

It has also been observed through experimentation that some of the recordings in the data set contain stronger background noise than the signal shown in Fig. 13(a), which deteriorates the accuracy of environment signature estimation and hence degrades the performance of the proposed scheme. The noise effect can be reduced by long-term averaging, but at the cost of localization resolution. Experimental results indicate that frame sizes between 2 s and 3 s give good detection performance for medium noise levels, while for strong background noise frame sizes over 5 s may be needed for acceptable performance. Similar results were obtained for other forged recordings.

In our next experiment, the detection accuracy of the proposed scheme on the real-world data set is evaluated. For this experiment, spliced audio is generated from two randomly selected audio recordings of different speakers made in different environments from the real-world data set. Two quantitative measures, TPR and FPR, are used to evaluate the frame-level detection performance. To achieve a smooth ROC plot, each TPR/FPR pair is computed by averaging over 50 runs for each frame size. Fig. 14 shows the resulting ROC curves computed at different frame sizes. It can be observed from Fig. 14 that the overall detection accuracy improves for larger frame sizes. This is expected, as larger frame sizes result in a more accurate and stable estimate of the environment signature. In addition, it can also be observed from Fig. 9 and Fig. 14 that the detection performance of the proposed scheme deteriorates in the presence of ambient noise. Detection performance in the presence of ambient noise can be improved by using larger frame sizes, but at the cost of lower forgery localization accuracy.

Fig. 14. ROC curves of the proposed scheme on real-world data with various frame sizes (4 s, 3 s, 2 s, and 1.5 s).

3) Robustness to MP3 Compression: In this experiment, we tested the robustness of our scheme against MP3 compression attacks. To this end, audio recordings were compressed into MP3 format using FFMPEG [60] with bitrates R ∈
{16, 32, 64, 96, 128, 192, 256} kbps. Spliced audio recordings were generated by randomly selecting compressed audio recordings captured in different environments. It is important to mention that each audio segment used to generate a forged audio was compressed to MP3 format with the same set of parameters, e.g., the same bit-rate.

For forgery detection, the MP3-compressed forged audio is decoded to WAV format and then analyzed using the proposed method. Each experiment was repeated 50 times, and the detection performance was averaged over the 50 runs to reduce the effect of noise. Table II shows the detection accuracy (the probability of correctly identifying both the inserted and the original frames) of the proposed scheme under MP3 attacks for various bit-rates and frame sizes.

TABLE II
DETECTION ACCURACIES OF THE PROPOSED SCHEME UNDER MP3 ATTACKS WITH VARIOUS BIT-RATES AND DIFFERENT FRAME SIZES

Frame   MP3 Compression Bit Rate (bps)
Size    16k    32k    64k    96k    128k   192k   256k
4s      0.86   0.92   0.93   0.93   0.93   0.93   0.94
3s      0.89   0.92   0.92   0.93   0.93   0.93   0.94
2s      0.90   0.92   0.91   0.91   0.93   0.93   0.93
1s      0.89   0.83   0.83   0.85   0.83   0.83   0.84

The following observations can be made from Table II. Firstly, the detection performance improves for higher bit-rates, which introduce less distortion; the detection accuracy remains relatively unchanged (less than 1% loss) for bit-rates R ≥ 32 kbps. This indicates that the proposed scheme is robust to MP3 compression attacks with bit-rates R > 16 kbps.

Secondly, the detection performance deteriorates (by 7.6% on average) when the frame size is reduced from 2 s to 1 s. It was verified in our previous experiments that smaller frame sizes result in a relatively less accurate estimate of the environment signature.

Finally, it can be observed that frame sizes less than 3 s result in relatively unstable detection performance, which can be attributed to random noise. In addition, under MP3 compression attacks the detection performance of the proposed scheme is relatively more sensitive to the frame size. As the performance of the forgery localization method depends on the frame size used for analysis, the forensic analyst can select an appropriate frame size for the application and test audio at hand. It has been observed that, for the data set used, frame sizes above 2 s give a good compromise between the detection and localization accuracies.

4) Comparison Results with Previous Works: In our last set of experiments, the performance of the proposed framework is compared with the existing splicing detection scheme [43], which uses inconsistency in the local noise levels estimated from the query audio for splicing detection.

To this end, a forged (spliced) audio recording is generated by splicing audio recordings made in two different environments. More specifically, the speech recording of the 3rd speaker made in the 3rd environment (E3S3) is inserted into the middle of the speech recording of the 1st speaker made in the 1st environment (E1S1). Fig. 15(a) presents the time-domain plot of the spliced audio assembled from E1S1 (blue) and E3S3 (red). It can be observed from Fig. 15(a) that both original recordings, E1S1 and E3S3, contain the same background noise level. Fig. 15(b) and Fig. 15(c) show the frame-level detection performance of the proposed scheme and of Pan's scheme [43], respectively. It can be observed from Fig. 15(b) and (c) that the proposed scheme is capable of not only detecting the presence of the inserted frames but also localizing them, whereas Pan's scheme is unable to detect or localize the inserted frames. The inferior performance of Pan's scheme can be attributed to the fact that it depends only on the local noise level for forgery detection, which is almost the same throughout the forged audio used for analysis.

Fig. 15. Detection performance of the proposed scheme and Pan's scheme [43]: (a) spliced audio; (b) detection result of the proposed scheme; (c) detection result of Pan's scheme. For fair comparison, the frame size is set to 2 s for both experiments. Red triangles represent the frames of the recording of the 3rd speaker in the 3rd environment (E3S3) and blue circles represent the frames of the recording of the 1st speaker in the 1st environment (E1S1).

To further demonstrate the superiority of the proposed scheme, a forged audio is generated by splicing together two speech recordings with different levels of background noise. More specifically, the speech recording of the 3rd speaker made in the 2nd environment (E2S3) is inserted into the middle of the speech recording of the 2nd speaker made in the 1st environment (E1S2). Fig. 16(a) shows the time-domain plot of the forged audio assembled from E1S2 (blue) and E2S3 (red). It can be observed that the segment from E2S3 exhibits a higher noise level than the segments from E1S2. The resulting forged recording is analyzed using both the proposed scheme and Pan's scheme [43]. Fig. 16(b) and Fig. 16(c) illustrate the frame-level detection performance of the two schemes, respectively. It can be observed from Fig. 16(b) that the proposed scheme successfully detected and localized the inserted frames with much higher accuracy than the result in [43]. In addition, it can be observed from Fig. 16(c) that Pan's scheme is unable to localize the inserted frames. Furthermore, Pan's scheme achieves a detection accuracy close to 0.6, which is far lower than the 0.93 accuracy achieved by the proposed scheme. Finally, the detection performance of both schemes was also evaluated for larger frame sizes. As expected, the performance of the proposed scheme improved significantly
for frame sizes > 2 s. However, no significant improvement is observed for Pan's scheme.

Fig. 16. Detection performance of the proposed scheme and Pan's scheme [43]: (a) spliced audio; (b) detection result of the proposed scheme; (c) detection result of Pan's scheme. For fair comparison, the frame size is set to 2 s for both experiments. Red triangles represent the frames of the recording of the 3rd speaker in the 2nd environment (E2S3) and blue circles represent the frames of the recording of the 2nd speaker in the 1st environment (E1S2).

In our final experiment, the performance of the proposed scheme and of Pan's scheme [43] is evaluated on forged audio clips. Forged audio clips were generated by randomly selecting audio clips (of approximately 3 s duration) from TIMIT and modifying each of them by adding white Gaussian noise (SNR ∈ {15 dB, 20 dB, 25 dB, 30 dB}) to the first 50% of the frames (the other frames were kept clean). It can be observed that Pan's scheme performs reasonably well on this forged data set. However, when this experiment is repeated for audio clips of 30-second duration, the FPRs of Pan's scheme increase significantly. This experiment implies that the generalization of Pan's scheme [43] requires further verification.

The performance comparison of the proposed scheme with Pan's scheme [43] can be summarized as follows: (i) the superior performance of the proposed scheme can be attributed to the RIR, which is an acoustic-environment-related signature and is more reliable than the ambient noise level alone; (ii) the proposed scheme models the acoustic environment signature by considering both the background noise (including its level, type, spectrum, etc.) and the RIR (see (7)); and (iii) the proposed scheme relies on the acoustic channel impulse response, which is relatively hard to manipulate compared with a background-noise-based acoustic signature.

VI. CONCLUSION

In this paper, a novel method for audio splicing detection and localization has been proposed. The magnitudes of the acoustic channel impulse response and the ambient noise have been used to model the intrinsic acoustic environment signature. The acoustic environment signature is jointly estimated using the spectrum classification technique. The distribution of the correlation coefficient between the query audio and the reference audio has been used to determine the optimal decision threshold, which is used for frame-level forgery detection: a frame with a similarity score less than the optimal threshold is labeled as spliced, and as authentic otherwise. For splicing localization, a local correlation based on the relationship between adjacent frames has been adopted to reduce frame-level detection errors. The performance of the proposed scheme has been evaluated on two data sets under numerous experimental settings. Experimental results have validated the effectiveness of the proposed scheme for both forgery detection and localization. The robustness of the proposed scheme to both MP3 compression and additive noise attacks has also been tested, and its performance has been compared against the state of the art in audio splicing detection [43]; the comparison indicates that the proposed scheme outperforms the approach presented in [43]. The proposed scheme may thus be a good candidate for practical digital audio forensics applications.

ACKNOWLEDGMENT

This work is supported by the 2013 Guangdong Natural Science Funds for Distinguished Young Scholar (S2013050014223).

REFERENCES

[1] H. Zhao and H. Malik, "Audio recording location identification using acoustic environment signature," IEEE Transactions on Information Forensics and Security, vol. 8, no. 11, pp. 1746–1759, 2013.
[2] H. Malik and H. Farid, "Audio forensics from acoustic reverberation," in Proceedings of the IEEE Int. Conference on Acoustics, Speech, and Signal Processing (ICASSP'10), Dallas, TX, 2010, pp. 1710–1713.
[3] H. Malik and H. Zhao, "Recording environment identification using acoustic reverberation," in Proceedings of the IEEE Int. Conference on Acoustics, Speech, and Signal Processing (ICASSP'12), Kyoto, Japan, 2012, pp. 1833–1836.
[4] H. Zhao and H. Malik, "Audio forensics using acoustic environment traces," in Proceedings of the IEEE Statistical Signal Processing Workshop (SSP'12), Ann Arbor, MI, 2012, pp. 373–376.
[5] A. Hajj-Ahmad, R. Garg, and M. Wu, "Spectrum combining for ENF signal estimation," IEEE Signal Processing Letters, vol. 20, no. 9, pp. 885–888, 2013.
[6] Y. Panagakis and C. Kotropoulos, "Telephone handset identification by feature selection and sparse representations," in 2012 IEEE International Workshop on Information Forensics and Security (WIFS), 2012, pp. 73–78.
[7] R. Korycki, "Time and spectral analysis methods with machine learning for the authentication of digital audio recordings," Forensic Science International, vol. 230, no. 1–3, pp. 117–126, 2013.
[8] B. Koenig, D. Lacey, and S. Killion, "Forensic enhancement of digital audio recordings," Journal of the Audio Engineering Society, vol. 55, no. 5, pp. 352–371, 2007.
[9] D. Boss, "Visualization of magnetic features on analogue audiotapes is still an important task," in Proceedings of Audio Engineering Society 39th Int. Conference on Audio Forensics, 2010, pp. 22–26.
[10] D. Begault, B. Brustad, and A. Stanle, "Tape analysis and authentication using multitrack recorders," in Proceedings of Audio Engineering Society 26th International Conference: Audio Forensics in the Digital Age, 2005, pp. 115–121.
[11] D. Rodriguez, J. Apolinario, and L. Biscainho, "Audio authenticity: Detecting ENF discontinuity with high precision phase analysis," IEEE Trans. on Information Forensics and Security, vol. 5, no. 3, pp. 534–543, 2010.
[12] C. Grigoras, "Application of ENF analysis method in authentication of digital audio and video recordings," Journal of the Audio Engineering Society, vol. 57, no. 9, pp. 643–661, 2009.
[13] A. J. Cooper, "The electric network frequency (ENF) as an aid to authenticating forensic digital audio recordings: an automated approach," in Proceedings of Audio Engineering Society 33rd Conf., Audio Forensics: Theory and Practice, 2008.
[14] E. Brixen, "Techniques for the authentication of digital audio recordings," in Audio Engineering Society 122nd Convention, 2007.
[15] D. Bykhovsky and A. Cohen, "Electrical network frequency (ENF) maximum-likelihood estimation via a multitone harmonic model," IEEE Transactions on Information Forensics and Security, vol. 8, no. 5, pp. 744–753, 2013.
[16] H. Su, R. Garg, A. Hajj-Ahmad, and M. Wu, "ENF analysis on recaptured audio recordings," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 3018–3022.
[17] R. Garg, A. Hajj-Ahmad, and M. Wu, "Geo-location estimation from electrical network frequency signals," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 2862–2866.
[18] J. Chai, F. Liu, Z. Yuan, R. W. Conners, and Y. Liu, "Source of ENF in battery-powered digital recordings," in 135th Audio Engineering Society Convention, in processing, 2013.
[19] C. Kraetzer, A. Oermann, J. Dittmann, and A. Lang, "Digital audio forensics: a first practical evaluation on microphone and environment classification," in Proceedings of the 9th Workshop on Multimedia and Security, Dallas, TX, 2007, pp. 63–74.
[20] A. Oermann, A. Lang, and J. Dittmann, "Verifier-tuple for audio-forensic to determine speaker environment," in Proceedings of the ACM Multimedia and Security Workshop, New York, USA, 2005, pp. 57–62.
[21] R. Buchholz, C. Kraetzer, and J. Dittmann, "Microphone classification using Fourier coefficients," in Lecture Notes in Computer Science, Springer Berlin/Heidelberg, vol. 5806/2009, 2010, pp. 235–246.
[22] R. Malkin and A. Waibel, "Classifying user environment for mobile applications using linear autoencoding of ambient audio," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, 2005, pp. 509–512.
[23] S. Chu, S. Narayanan, C.-C. Kuo, and M. Mataric, "Where am I? Scene recognition for mobile robots using audio features," in Proceedings of the IEEE International Conference on Multimedia and Expo, 2006, pp. 885–888.
[24] A. Eronen, "Audio-based context recognition," IEEE Trans. on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 321–329, 2006.
[25] D. Garcia-Romero and C. Espy-Wilson, "Speech forensics: Automatic
[37] M. Qiao, A. H. Sung, and Q. Liu, "Improved detection of MP3 double compression using content-independent features," in IEEE International Conference on Signal Processing, Communication and Computing (ICSPCC), 2013, pp. 1–4.
[38] E. Brixen, "Acoustics of the crime scene as transmitted by mobile phones," in Proceedings of Audio Engineering Society 126th Convention, Munich, Germany, 2009.
[39] H. Malik, "Acoustic environment identification and its application to audio forensics," IEEE Transactions on Information Forensics and Security, vol. 8, no. 11, pp. 1827–1837, 2013.
[40] H. Malik, "Securing speaker verification system against replay attack," in Proceedings of AES 46th Conference on Audio Forensics, 2012.
[41] A. J. Cooper, "Detecting butt-spliced edits in forensic digital audio recordings," in Proceedings of Audio Engineering Society 39th Conf., Audio Forensics: Practices and Challenges, 2010.
[42] J. Chen, S. Xiang, W. Liu, and H. Huang, "Exposing digital audio forgeries in time domain by using singularity analysis with wavelets," in Proceedings of the First ACM Workshop on Information Hiding and Multimedia Security, 2013, pp. 149–158.
[43] X. Pan, X. Zhang, and S. Lyu, "Detecting splicing in digital audios using local noise level estimation," in Proceedings of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP'12), Kyoto, Japan, 2012, pp. 1841–1844.
[44] N. D. Gaubitch, M. Brooks, and P. A. Naylor, "Blind channel magnitude response estimation in speech using spectrum classification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2162–2171, 2013.
[45] G. Soulodre, "About this dereverberation business: A method for extracting reverberation from audio signals," in AES 129th Convention, 2010.
[46] M. Wolfel, "Enhanced speech features by single-channel joint compensation of noise and reverberation," IEEE Trans. on Audio, Speech, and Lang. Proc., vol. 17, no. 2, pp. 312–323, 2009.
[47] A. Sehr, R. Maas, and W. Kellermann, "Reverberation model-based decoding in the logmelspec domain for robust distant-talking speech recognition," IEEE Trans. on Audio, Speech and Language Processing, vol. 18, no. 7, pp. 1676–1691, 2010.
[48] B. J. Borgstrom and A. McCree, "The linear prediction inverse modulation transfer function (LP-IMTF) filter for spectral enhancement, with
acquisition device identification,” Journal of the Audio Engineering
applications to speaker recognition,” in IEEE International Conference
Society, vol. 127, no. 3, pp. 2044–2044, 2010.
on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 4065–
[26] C. Kraetzer, M. Schott, and J. Dittmann, “Unweighted fusion in mi-
4068.
crophone forensics using a decision tree and linear logistic regression
[49] N. Tomohiro, Y. Takuya, K. Keisuke, M. Masato, and J. Biing-Hwang,
models,” in Proceedings of the 11th ACM Multimedia and Security
“Speech dereverberation based on variance-normalized delayed linear
Workshop, 2009, pp. 49–56.
prediction,” IEEE Transactions on Audio, Speech, and Language Pro-
[27] Y. Panagakis and C. Kotropoulos, “Automatic telephone handset iden- cessing, vol. 18, no. 7, pp. 1717–1731, 2010.
tification by spares representation of random spectral features,” in [50] D. Simon and M. Marc, “Combined frequency-domain dereverberation
Proceedings of Multimedia and Security, 2012, pp. 91–96. and noise reduction technique for multi-microphone speech enhance-
[28] C. Kraetzer, K. Qian, M. Schott, and J. Dittmann, “A context model for ment,” in Proceedings of the 7th IEEE/EURASIP International Workshop
microphone forensics and its application in evaluations,” in Proceedings on Acoustic Echo and Noise Control (IWAENC01), 2001, pp. 31–34.
of SPIE Media Watermarking,Security, and Forensics III, vol. 7880, [51] S. Ikram and H. Malik, “Digital audio forensics using background
2011, pp. 78 800P–1–78 800P–15. noise,” in Proceedings of IEEE Int. Conf. on Multimedia and Expo,
[29] C. Kraetzer, K. Qian, and J. Dittmann, “Extending a context model 2010, pp. 106–110.
for microphone forensics,” in Proceedings of SPIE Media Watermark- [52] C. M. Bishop, Pattern Recognition and Machine Learning. New York:
ing,Security, and Forensics III, vol. 8303, 2012, pp. 83 030S–83 030S– Springer, 2007.
12. [53] H. Hermansky and N. Morgan, “Rasta processing of speech,” IEEE
[30] S. Ikram and H. Malik, “Microphone identification using higher-order Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 578–
statistics,” in Audio Engineering Society Conference: 46th International 589, 1994.
Conference: Audio Forensics, Jun 2012. [54] S. Kotz and S. Nadarajah, Extreme value distributions. World Scientific,
[31] R. Yang, Z. Qu, and J. Huang, “Detecting digital audio forgeries by 2000.
checking frame offsets,” in Proceedings of the 10th ACM Workshop on [55] S. R. Eddy, “Maximum likelihood fitting of extreme value distributions,”
Multimedia and Security (MM&Sec ’08), 2008, pp. 21– 26. 1997, unpublished work, citeseer.ist.psu.edu/370503.html.
[32] R. Yang, Y. Shi, and J. Huang, “Defeating fake-quality MP3,” in [56] J. A. Dominguez-Molina, G. González-Farı́as, R. M. Rodrı́guez-
Proceedings of the 11th ACM Workshop on Multimedia and Security Dagnino, and I. C. Monterrey, “A practical procedure to
(MM&Sec)’09, 2009, pp. 117– 124. estimate the shape parameter in the generalized gaussian
[33] R. Yang, Y. Shi, and J. Huang, “Detecting double compression of audio distribution,” technique report I-01-18 eng.pdf, available through
signal,” in Proceedings of SPIE Media Forensics and Security II 2010, http://www.cimat.mx/reportes/enlinea/I-01-18 eng.pdf, vol. 1, 2001.
vol. 7541, 2010, pp. 75 410k–1–75 410K–10. [57] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L.
[34] C. Grigoras, “Statistical tools for multimedia forensics,” in Proceedings Dahlgren, and V. Zue, TIMIT Acoustic-Phonetic Continuous Speech
of Audio Engineering Society 39th Conf., Audio Forensics: Practices Corpus. Linguistic Data Consortium, Philadelphia, 1993.
and Challenges, 2010, pp. 27 – 32. [58] E. A. Lehmann and A. M. Johansson, “Diffuse reverberation model
[35] Q. Liu, A. Sung, and M. Qiao, “Detection of double mp3 compression,” for efficient image-source simulation of room impulse responses,” IEEE
Cognitive Computation, Special Issue: Advances in Computational In- Transactions on Audio Speech and Language Processing, vol. 18, no. 6,
telligence and Applications, vol. 2, no. 4, pp. 291–296, 2010. pp. 1429–1439, 2010.
[36] M. Qiao, A. Sung, and Q. Liu, “Revealing real quality of double [59] C. Chang and C. Lin. (2012) Libsvm: A library for support vector
compressed mp3 audio,” in Proceedings of the international conference machines. [Online]. Available: http://www.csie.ntu.edu.tw/∼ cjin/libsvm
on Multimedia, 2010, pp. 1011–1014. [60] www.ffmpeg.org.