You are on page 1of 14

1024 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 14, NO.

5, AUGUST 2020

Recurrent Convolutional Structures for Audio


Spoof and Video Deepfake Detection
Akash Chintha , Bao Thai , Saniat Javid Sohrawardi , Kartavya Bhatt , Andrea Hickerson ,
Matthew Wright , and Raymond Ptucha

Abstract—Deepfakes, or artificially generated audiovisual ren-


derings, can be used to defame a public figure or influence public
opinion. With the recent discovery of generative adversarial net-
works, an attacker using a normal desktop computer fitted with an
off-the-shelf graphics processing unit can make renditions realistic
enough to easily fool a human observer. Detecting deepfakes is
thus becoming important for reporters, social media platforms,
and the general public. In this work, we introduce simple, yet
surprisingly efficient digital forensic methods for audio spoof and
visual deepfake detection. Our methods combine convolutional
latent representations with bidirectional recurrent structures and Fig. 1. Overview of the problem space. Audio and visual tracks are extracted
entropy-based cost functions. The latent representations for both and processed through spoof and deepfake detection models.
audio and video are carefully chosen to extract semantically rich
information from the recordings. By feeding these into a recurrent
framework, we can detect both spatial and temporal signatures
contribute to information anarchy, stoking unfounded fears in
of deepfake renditions. The entropy-based cost functions work the public. At a minimum, false information may manipulate
well in isolation as well as in context with traditional cost func- emotions and opinions. At worst, it could lead to organized
tions. We demonstrate our methods on the FaceForensics++ and and destabilizing public actions united behind false ideas or
Celeb-DF video datasets and the ASVSpoof 2019 Logical Access impressions.
audio datasets, achieving new benchmarks in all categories. We
also perform extensive studies to demonstrate generalization to new
Deepfakes are artificially generated audiovisual renderings
domains and gain further insight into the effectiveness of the new of a person. These recordings, which are typically done without
architectures. consent, can be used to defame a public figure or influence public
opinion. Much like malicious computer viruses, deepfake gen-
Index Terms—Convolution, deep learning, deepfake, entropy,
spoof. eration [3]–[17] and deepfake detection [18]–[27] techniques
are continually evolving. As time passes, deepfake generation
I. INTRODUCTION methods are not only becoming more realistic or believable,
but they are learning to better circumvent detection methods. In
HILE misinformation being spread through interper-
W sonal communication and mass media is hardly new, we
are at a pivotal point in history, where advances in digital com-
parallel, deepfake detection methods are learning to recognize
subtle hints or fingerprints inadvertently introduced by improved
generation methods.
munication and AI threaten to magnify the scale, persistence, and
Deepfake creation methods typically rely on face swap-
consequence of misinformation [1], [2]. Historically, journalists
ping [3], [5] or face synthesis [6] combined with audio dubbing,
have played a significant role in vetting information before
voice conversion or voice synthesis [28]. Detection methods
publication and dissemination. Now, however, with deceptive
have concentrated on structure [18]–[23], [25] or soft biomet-
information harder to verify, journalists accidentally threaten to
rics [24], [26], [27] in visual data, as well as spoof detection for
audio [29]–[31]. Digital forensic methods can be very good at
Manuscript received December 1, 2019; revised March 4, 2020; accepted detecting targeted forgeries, but they struggle at out-of-sample,
May 16, 2020. Date of publication June 1, 2020; date of current version August
24, 2020. The guest editor coordinating the review of this manuscript and or unforeseen creation methods.
approving it for publication was Dr. Luisa Verdoliva. (Corresponding author: Inspired by the XceptionNet [32] architecture and convolu-
Akash Chintha.) tional recurrent neural network methods [20], [33], we introduce
Akash Chintha, Bao Thai, and Raymond Ptucha are with the Department of
Computer Engineering, Rochester Institute of Technology, Rochester, NY 14623 simple, yet effective architectures for the detection of both
USA (e-mail: ac1864@rit.edu; baothai120708@gmail.com; rwpeec@rit.edu). deepfake video and audio. As shown in Fig. 1, in agreement
Saniat Javid Sohrawardi, Kartavya Bhatt, and Matthew Wright are with with current datasets and challenges, we treat audio and video
the Department of Computing Security, Rochester Institute of Technol-
ogy, Rochester, NY 14623 USA (e-mail: js8365@rit.edu; kb8077@g.rit.edu; as separate problems. The audio and video streams are referred
matthew.wright@rit.edu). to as spoof detection and deepfake detection, respectively.
Andrea Hickerson is with the School of Journalism and Mass Commu- The novel contributions of this research include:
nications, University of South Carolina, Columbia, SC 29208 USA (e-mail:
hickera@mailbox.sc.edu). 1) We propose a simple, yet effective, convolutional bidirec-
Digital Object Identifier 10.1109/JSTSP.2020.2999185 tional recurrent architecture for deepfake video detection.
1932-4553 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: VIT University. Downloaded on February 13,2024 at 15:51:06 UTC from IEEE Xplore. Restrictions apply.
CHINTHA et al.: RECURRENT CONVOLUTIONAL STRUCTURES FOR AUDIO SPOOF AND VIDEO DEEPFAKE DETECTION 1025

2) We develop an analogous architecture for audio spoof Zhu et al. [36] and Kim et al. [37] built on the GAN con-
detection. cept and replaced the noise vector with input images. This
3) We explore the use of entropy-based loss functions for modification enabled their cycle-consistent GANs to alter the
deepfake detection in both isolation and ensemble form. domains of the output images based on the input image domains.
4) We demonstrate that our visual and audio methods set Similarly, in deepfake generation, it is possible to retain the
new benchmarks on the FaceForensics++ and Celeb-DF facial expressions of a source person while transferring identities
video datasets and ASVSpoof 2019 Logical Access audio to a target person. Lu et al. [38] proposed an identity-guided
dataset. conditional CycleGAN to translate low-resolution face images
5) We perform ablation studies that show the robustness of to high-resolution face images. Kim et al. [39] accomplished
these methods across a range of conditions. a similar transfer, with the difference being that they transfer
expressions as well as 3D pose into a target image, creating
II. RELATED WORK a video in the process. Faceswap-GAN [5] created realistic
looking imagery using adversarial and perceptual losses. The
A. Video: Deepfake Generation
addition of a perceptual loss was shown to minimize unnat-
Deepfake creation is a relatively new field in digital forgery. ural artifacts such as awkward eyeball movements. Temporal
Although the generation process began with traditional vision smoothing of the frame-to-frame face detection box and an
and voice impersonation, most recent works involve generative attention mask made the created videos appear more realistic.
adversarial networks (GANs). We will first describe the tradi- NeuralTextures [14] is a facial re-enactment forgery that relies
tional approaches, and then the adversarial methods. directly on facial landmarks. These fakes are generated using
In general, there are two types of deepfake generation, face a patch-based adversarial loss alongside a photometric recon-
swapping and face re-enactment. In face swapping, the face of struction loss. In an interesting twist, Suwajanakorn et al. [40]
a target person is overlaid on the face of a source. The overlaid achieved photorealistic results by only requiring audio as an
faces are post-processed to blend the edges to match the source’s input to generate forged videos. Using the weekly addresses
facial outline. This can be used, for example, to make Nicholas of Barack Obama, the raw audio features are first mapped to
Cage (the source actor) appear in a movie in place of the original mouth shapes, then mouth textures, then 3D pose mapping,
(target) actor.1 Faceswaps [34] are a graphical approach, where and finally compositing of the head and torso from stock
the facial landmarks (nose, eyes, eyebrows, lips, chin, and cheek footage.
areas) play a major role in morphing the target’s face with the
source’s, and the output is post-processed with edge polishing
and color correction. B. Video: Deepfake Detection
Facial re-enactment, on the other hand, is used to make a Since the threats posed by deepfake video manipulations
target appear to act and speak like the source. This is used, became apparent in early 2018, several works have looked
for example in the videos where former U.S. President Barack into detecting them. A few of the detection techniques targeted
Obama is made to say whatever the source video says.2 Facial re- handcrafted features [24], [26], [41] like blinking inconsisten-
enactment techniques model both the source and target faces to cies, biological signals, and unrealistic details. Most of these
identify facial landmarks and manipulate the target’s landmarks manually crafted detection features exploit known weaknesses
to match the source’s facial movements. Face2Face [6] is a face in creation methods. Like a cat-and-mouse game, deepfake
re-enactment method that translates the facial expressions of a creation methods quickly adapted to circumvent detection, and
source subject with a target while maintaining the facial features the cycle repeats. Recent detection methods rely on machine
of the target. learning on deepfake datasets to automatically discover forgeries
More recent Deepfake methods are based on Generative Ad- from real videos.
versarial Networks (GANs) [4]. GANs consist of two competing MesoNet [18] uses a shallow convolutional network to detect
neural networks: a generator G that generates fake samples that forgery at a mesoscopic (or intermediate) level of detail, inten-
mimic real samples from a target dataset, and a discriminator tionally avoiding focusing too much on microscopic features
D that tries to tell fakes apart from real samples. These two that could be lost due to video compression. They also intro-
networks are trained simultaneously, such that over the training duced a variant of their model that replaces regular convolution
period, both the generator and discriminator improve. Upon blocks with MesoInception blocks to get slight improvements.
successful convergence, the generator can then be used to create The Capsule-Forensics method proposed by Nguyen et al. [21]
realistic-looking examples. To promote variation, G is seeded uses capsule networks [42] for the detection of replay attacks
with a noise vector, but this noise vector can be paired with a as well as computer-generated images and videos. They argue
latent representation of an object (word, image, sentence), to that the chances of detecting high-quality forgeries would be
constrain the resulting image [35]. increased with the agreement between capsules through dynamic
routing [43]. Cozzolino et al. [23] applied an autoencoder-based
1 [Online]. Available: https://www.theguardian.com/technology/ng- architecture to show its usefulness for transfer learning. Nguyen
interactive/2019/jun/22/the-rise-of-the-deepfake-and-the-threat-to- et al. [25] extended Cozzolino et al.’s network by replacing the
democracy
2 [Online]. Available: https://www.buzzfeednews.com/article/davidmack/ standard decoder with a decoder that additionally generates a
obama-fake-news-jordan-peele-psa-video-buzzfeed mask of the manipulated region through multi-task learning.

Authorized licensed use limited to: VIT University. Downloaded on February 13,2024 at 15:51:06 UTC from IEEE Xplore. Restrictions apply.
1026 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 14, NO. 5, AUGUST 2020

XceptionNet [32] is one of the more promising deep neural the spoofed audio is generated using either a text-to-speech or a
models for feature extraction when trained either from scratch voice conversion software. With a text-to-speech synthesis soft-
or with pre-trained ImageNet [44] weights. As such, we choose ware, a text input can be converted to a speech output with a voice
XceptionNet as an embedding space instead of the networks used similar to that of a target speaker. A voice conversion software
in prior works [20], [33]. The network architecture not only uses allows a speech given by a source speaker to be converted to
skip connections akin to ResNet [45] with an Inception-like [46] a different utterance which has the same linguistic content (i.e.
convolutional arrangement, but it also has a modified version textual information) but with a different speaker identity. The PA
of depth-wise separable convolutional layers for reducing the scenario focuses on security systems that use voice detection.
number of parameters with marginally better performance. In this scenario, a pre-recorded speech from the target speaker
While the previously mentioned methods targeted intra-frame is replayed in order to trick the security system to believe the
inconsistencies, Güera and Delp [20] introduced a spatio- target speaker is actually speaking. For this research, we focus
temporal model with InceptionV3 [46] as the feature extraction on the LA scenario since this type of attack allows the adversary
network to tackle deepfakes. The features from their time- to easily create fake media by falsely representing the identity of
distributed extraction network are forwarded to a unidirectional the speaker or falsely creating the content with a given speaker
Long Short-Term Memory (LSTM) network that ultimately identity.
makes the classification decision. Sabir et al. [47] evaluated the Modern methods of computer-generated audio spoofing using
same architecture with different feature extractors [45], [48]. LA need only a few minutes of recorded audio of a target speaker
Their face-extraction process was adjusted by aligning the faces to create accurate spoofs that are hard to detect by humans.
from consecutive frames using facial landmarks to maintain van den Oord et al. [52] introduced WaveNet, a deep neural
temporal consistency. While their performance on manipulated network model that can generate speech from a text in a fully
videos [20] from the HOHA dataset [49] was quite good, per- probabilistic and autoregressive manner and can be conditioned
formance on the larger and more complex FaceForensics++ [22] on speaker identity. Arik et al. [53] introduced a voice cloning
dataset was not as effective. model based on speaker adaptation and speaker encoding that
In a short paper, Sohrawardi et al. [33] described and very can generate speech with a similar voice using only a few training
briefly evaluated models based on the FaceNet [50] CNN ar- samples. Kameoka et al. [54] used a StarGAN-based [9] model
chitecture, with weights initialized from the VGGFace2 [51] to perform many-to-many voice conversion without the need
facial recognition task. The trained CNN maps the faces to a for parallel training data. By replacing images with spectro-
compact latent representation where similar faces map close and grams and representing speaker identity with domain vectors,
dissimilar faces map far apart. These features were then passed generated spectrograms are converted back to raw audio using
into a ConvLSTM network [20]. the Griffin-Lim algorithm [55]. Tanaka et al. [56] proposed a
Agarwal et al. [27] took an alternative approach to detection sequence-to-sequence recurrent model that uses attention and
by treating the problem as one of anomalous behavior detection. context preservation mechanisms to perform voice conversion
They learn the typical behavior of the subject from real videos with realistic results.
by extracting facial landmarks and temporal actions and use
them to train a one-class SVM. Although this method cannot
be used on unforeseen faces, it is an effective option to pre- D. Audio: Spoof Detection
serve robustness against future improvements in other aspects To counter the spoofing problem, different speech features,
of deepfake generation. For example, they claim there is already such as constant-Q cepstrum coefficients (CQCC) [57], or
sufficient data for most world leaders, although it would be a MFCCs, have been combined with Gaussian Mixture Models
considerable undertaking to train a detector for all government (GMMs) [58] and deep neural networks. Competitions such as
officials, corporate leaders, or persons of interest. the Voice Conversion Challenge 2018 [28] show many promis-
Conclusion. Although all of the aforementioned methods ing parallel and non-parallel voice conversion techniques, which
excel at specific types of deepfake detection, they are not as fuel new ideas and training sets for spoof detection.
effective at handling unforeseen deepfake manipulations on Early speaker identification techniques used vector quantiza-
which they have not been trained, referred to as cross-domain tion [59], GMMs, and feature decomposition paired with Sup-
adaptation. With many possible ways to generate deepfakes, port Vector Machines (SVMs) [60]. Newer methods use neural
it is critical for realistic and robust open-world detection to network based approaches [29]–[31]. Balamurali et al. [61]
have a model that performs well across multiple types of ma- reviewed traditional and newer machine learning spoof de-
nipulated videos. In our work, we not only demonstrate new tection methods. Their best methods use a GMM Universal
benchmark performance on existing datasets, we also show that Background Model to fuse traditional features (MFCC, spec-
our approach leads to robust detection performance in a range trogram, CQCCs, etc.) with an autoencoder neural network
of scenarios including in cross-domain tests. latent vector representation. Chettri et al. [62] used an ensemble
of 2-D convolution-recurrent networks, 1-D convolutional net-
works, GMMs, and SVMs to perform spoofing detection on the
C. Audio: Spoof Generation ASVSpoof 2019 dataset [63]. For the same dataset, Alzantot
There are two main categories of audio spoofing attacks: et al. [64] used a family of three different 2-D CNNs with
Logical Access (LA) and Physical Access (PA). In the LA case, residual skip-connections. The three variants accepted MFCC,

Authorized licensed use limited to: VIT University. Downloaded on February 13,2024 at 15:51:06 UTC from IEEE Xplore. Restrictions apply.
CHINTHA et al.: RECURRENT CONVOLUTIONAL STRUCTURES FOR AUDIO SPOOF AND VIDEO DEEPFAKE DETECTION 1027

module are passed into a first bidirectional-LSTM layer. The


outputs of the first bidirectional-LSTM layer are passed to a
second bidirectional-LSTM layer to produce secondary feature
abstraction. The feature vector from the last LSTM unit of this
second bidirectional layer is passed into a fully-connected layer
and finally to a classification layer. Dropout is added to the
fully-connected layer for regularization.
1) Loss Functions: Faces from real videos are hypothesized
to have their own embedding distribution, while different types
of generated fake videos can either be clustered together or
in disparate distributions. The main goal in our system is to
discriminate the real video distribution from that of the forged
videos. To this end, we compare the use of two loss functions
for learning the discrimination accurately: Cross-entropy and
Kullback-Leibler (KL) divergence.
Cross-entropy loss is the conventional loss used in classifica-
tion and is defined as
 
e yc
Fig. 2. XceptionNet as a feature extraction model, converting each video frame LCE = − log 1+nf (1)
to a latent vector embedding. yj
j=1 e

log-magnitude STFT, and CQCC input features. Das et al. [65] where yc is the ground truth logit value and (1 + nf ) represents
proposed to use long-range acoustic features based on the octave one class for real videos alongside nf deepfake classes.
power spectrum instead of the linear power spectrum for spoof- KL divergence (relative entropy) is a natural approach to
ing detection. While these methods achieve very good results on measure the differences amongst probability distributions. In-
attack methods they were trained on, most fail to detect attacks spired from the application of KL divergence in variational
using unseen methods. autoencoders [67], the high-level idea is to disentangle different
probability distributions via parameter learning. We hypoth-
III. PROPOSED DEEPFAKE DETECTION ARCHITECTURE esize that the probability distribution of the real and ersatz
material [68] when mapped into a two-dimensional space from
Our architecture for deepfake detection is inspired by the the latent feature representation generated after the primary
XceptionNet [32] architecture along with recurrent processing fully-connected layer can be disentangled.
used in ConvLSTM [20] and FaceNetLSTM [33]. In this setting, mean (μ) and variance (σ) of the bivariate
We use a convolutional architecture to obtain a vector rep- normal distribution N are estimated to be computed from the
resentation of a facial region of a frame, fi . A sequence of latent feature vector. The loss function encourages the model
such facial regions from frames, f1 , f2 , . . . , ff are passed into to distinguish one distribution from another. The bivariate KL
a bidirectional LSTM module to learn a latent representation divergence is represented as:
capable of discriminating between facial manipulations and
original faces. DKL (N ((μ1 , μ2 )T
, diag(σ12 , σ22 )) || N (0, I)) =
(2)
λ ( ni=1 σi2 + μ2i − log(σi ) − 1)
A. Preprocessing
We use λ = (1/2).
The dlib [66] face detector determines the primary face over Similar to the Fisher linear discriminant loss, the KL loss is
each frame in the video. Canonical face images are generated by equivalent to the sum of class-wise KL divergences. Intuitively,
cropping to the dlib facial bounding box, resampling to 299 × the KL loss distributes all the learned samples of each class in
299 pixels, and normalized to zero mean and unit variance. a densely packed normal distribution and clusters the samples
As dlib extracts faces individually for each frame, we ob- from one class further away from other classes. The loss com-
served that the subsequent face images have subtle differences position of the KL-divergence metric is defined as:
in the box coordinates. Similar to Faceswap-GAN [5], we apply  
n
a linear smoothing filter over the box coordinates of consecutive yi
LKL = yi log (3)
frames to mitigate the temporal inconsistency introduced in the ŷi
i=1
extraction process.
The choice of using a two-dimensional space was determined
B. Model empirically.
We also propose to use an ensemble of the learning proce-
Canonical faces are encoded using the XceptionNet archi- dures with a convex combination of the cross-entropy and KL
tecture (Fig. 2) [32]. We propose XcepTemporal, a convolu- divergence losses. The ensemble loss function is defined as:
tional recurrent network with multiple levels of temporal fea-
ture abstraction. The spatial features from the XceptionNet LEN = λ1 LKL + λ2 LCE (4)

Authorized licensed use limited to: VIT University. Downloaded on February 13,2024 at 15:51:06 UTC from IEEE Xplore. Restrictions apply.
1028 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 14, NO. 5, AUGUST 2020

Fig. 3. XcepTemporal model. Faces are first extracted and normalized to a canonical representation. These canonical images are passed into a convolutional
recurrent model to make a prediction. μ and σ 2 represent the mean and variance of the distribution in a 2-D space.

where λ1 and λ2 are the weights allocated to respective loss This dataset uses 32 subjects, each with 10 face swap videos of
metrics to provide a flexible overall loss function. We typically both low and high quality for a total of 620 fake videos.
allocate a larger weight to KL divergence than cross-entropy for FaceForensics++ [22] contains four different types of deep-
deepfake detection, as KL divergence aids in driving the real and fakes, namely Face2Face [6], FaceSwap [5], Deepfakes, and
fake distributions further apart. NeuralTextures [14], alongside corresponding real videos. The
2) Variants: We propose four variants of the XcepTemporal dataset contains 1000 real source videos and 1000 of each of the
model: four deepfake generation methods. Since the traditional vision
1) XcepTemporal (CE). and face swap methods used to create the UADFV and Deep-
a) XcepTemporal with cross-entropy loss LCE . fakeTIMIT datasets, respectively, are also in FaceForensics++,
b) It follows the horizontal line path to the blue box most works do not report on UADFV or DeepfakeTIMIT.
labelled “Class Layer” in Fig. 3. The FaceForensics++ dataset was updated on 23 rd August
2) XcepTemporal (KL). 2019 to host the Deep Fake Detection (DFD) dataset from
a) XcepTemporal with KL divergence loss LKL . Google and JigSaw [22], which consists of 360 original source
b) It follows the dashed line path to the yellow box videos and 3000 manipulated videos.
labelled “KL-Divergence” in Fig. 3. Celeb-DF [69] was built from 400 original YouTube videos
3) XcepTemporal (EN). and 800 synthesized videos. Unlike the aforementioned datasets,
a) XcepTemporal with ensemble loss LEN . these fakes are refined to address issues with color inconsistency,
b) This variant has two classes in the classification layer low-frequency smoothing, and temporal flickering. Additional
of the model (real and fake). refinements include synthesis of higher-resolution fakes (256 ×
c) It has a Y-shaped final layer where the outputs are 256), whereas the algorithms used in previous works synthesized
a two-class classification layer and a sample in two- low-resolution (64 × 64) fakes.
dimensional space. Deepfake Detection Challenge (DFDC) [73] is a preview ver-
4) XcepTemporal (EN1+n ). sion of a larger dataset from a challenge developed by Facebook,
a) XcepTemporal with ensemble loss LEN . and it consists of 5000 videos (original and manipulated). There
b) This variant has 1 + nf classes in the classification are two types of fakes available in the dataset, but the pedigree
layer of the model (one real and n fakes). of each has not been released yet.
c) It has a Y-shaped ultimate layer where the outputs are a YouTube (YT) consists of 20 real and 20 fake videos that we
1 + nf -class classification layer and a sample in two- collected from YouTube.com as a way to test accuracy on unseen
dimensional space. samples. The 20 real videos all feature a single camera-facing
d) This variant not only distinguishes the distribution of subject, while the 20 fake videos are from the Ctrl Shift Face
fakes from original faces but also determines the type YouTube channel.3 We do not know which deepfake technique
of fake. the creator of this channel uses, however we do know that he
uses face-swapping as opposed to facial re-enactment methods.

IV. DEEPFAKE EXPERIMENTS B. Baselines


A. Visual Datasets We compare our work with six high-performing deep-
fake detection models proposed in prior work: Capsule-
UADFV [70] uses traditional vision approaches to generate
Forensics [21], ClassNSeg [25], ConvLSTM [20], FaceNetL-
manipulated videos. The generation process maps the facial
STM [33], DenseNetAligned [47], and XceptionNet [32]. We
landmark points of the source to that of the target. This dataset
consists of 49 real and 49 fake videos.
DeepfakeTIMIT [71] was built from the VidTIMIT [72] 3 [Online]. Available: https://www.youtube.com/channel/
dataset using a GAN-based face swap [5] over dataset subjects. UCKpH0CKltc73e4wh0_pgL3g/

Authorized licensed use limited to: VIT University. Downloaded on February 13,2024 at 15:51:06 UTC from IEEE Xplore. Restrictions apply.
CHINTHA et al.: RECURRENT CONVOLUTIONAL STRUCTURES FOR AUDIO SPOOF AND VIDEO DEEPFAKE DETECTION 1029

TABLE I
WITHIN-DOMAIN ACCURACY: FRAME-LEVEL ACCURACY ON THE FACEFORENSICS++ [22] AND CELEB-DF [69] OFFICIAL
TEST SETS. ABBREVIATIONS: FACE2FACE (F2F), FACESWAP (FS), DEEPFAKE (DF), AND NEURALTEXTURES (NT)

use code provided by the authors for this purpose. The perfor- TABLE II
DATA SPLITS FOR OUR COMBINED DATASET, WHICH IS COMPOSED OF THE
mance of most of these models have been previously reported FACEFORENSICS++ AND CELEB-DF DATASETS
on the same datasets [22], [69]. For a fair comparison, however,
we tested them along with our variants of XcepTemporal on
identical train and test splits.
Prior work did not describe how to address making a detection
decision on an entire video instead of a single frame. As we are
interested in results on the entire video and to provide a fair set of
baselines for comparison, we propose a simple mechanism for results, we set the sequence length to eight frames with a stride
converting a frame-level model into a video-level model. We first of eight. All the variants of the model were trained end-to-end.
pass the frame-level results, which produce a higher value for
frames that more likely to be fake, through a median filter with
D. Within-Domain Results
a window size of five, which helps to reduce false positives by
smoothing out the resulting output. We then take the maximum The most basic test for deepfake detection is to train and
output from these windows as the overall result for the video. test on the same datasets. We show accuracy based on eval-
While this can result in some false positives, it is important to uating one frame at a time in Table I, and show accuracy
detect a video as fake when even just a short portion of the video across the entire video in Table III. We find that XceptionNet
is fake. models are very effective almost uniformly across all samples.
We applied this procedure to all of the baseline frame-level In particular, the XceptionNet (KL) and XcepTemporal (KL)
models to produce video-level models, and we report the re- models achieve 100% accuracy for the entire FaceForensics++
sults for both frame-level and video-level models for multiple dataset for both frame-level and video-level detection, while
methods of comparison. XcepTemporal (CE) is close behind at 99.71% for frame-level
detection and 100% for video-level detection.
Among prior works, only XceptionNet [32] and Capsule-
C. Training
Forensics [21] could achieve 97% or above, where XceptionNet
We trained our model and the baseline models on the com- particularly suffers from false positives at over 8% FPR in frame-
bination of the full FaceForensics++ [22] and Celeb-DF [69] level detection and over 11% FPR in video-level detection.
datasets. For the training, validation, and test splits, we used The two ensemble methods do not fare as well in Face-
the instructions provided along with the datasets. As Celeb-DF Forensics++ at about 97% accuracy on frames and 99% ac-
does not offer any validation split, we randomly chose 50 real curacy on full videos, but they reach new standards for accu-
and 134 fake videos from the training data to create a validation racy on the Celeb-DF dataset. XcepTemporal (EN1+n ) has the
split. The test sets for both datasets was left unaltered. The splits overall best accuracy on frames at 97.83% and, together with
used are shown in Table II. XcepTemporal (CE) and XcepTemporal (KL), 99.16% on full
We set the learning rate to 1e-4 with a decay factor of 1e-5. videos.
The optimizer is Adam, with β1 set to 0.9 and β2 set to 0.999, In our experience with training the models, using the 1.4 M
these being the default values suggested by Kingma and Ba’s facial frames in the training set (see Table II), the XcepTemporal
original paper on Adam [74]. A dropout of 0.5 is added to the (CE) takes four epochs to converge on the training data, whereas
first fully-connected layer. Based upon hyperparameter tuning the KL variant takes only two epochs to converge. The models

Authorized licensed use limited to: VIT University. Downloaded on February 13,2024 at 15:51:06 UTC from IEEE Xplore. Restrictions apply.
1030 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 14, NO. 5, AUGUST 2020

TABLE III
WITHIN-DOMAIN RESULTS: VIDEO-LEVEL ACCURACY ON FACEFORENSICS++ [22] AND CELEB-DF [69] OFFICIAL
TEST SETS. ABBREVIATIONS: FACE2FACE (F2F), FACESWAP (FS), DEEPFAKE (DF), AND NEURALTEXTURES (NT)

TABLE IV as fakes. Table IV shows that the KL variant of our model


CROSS-DOMAIN RESULTS: MODELS TRAINED ON FACEFORENSICS++ [22]
AND CELEB-DF [69] AND THEN TESTED ON OUT-OF-SAMPLE DFDC [73]
performs significantly better than the CE variant. We believe
AND YT DATASETS. METHOD-A (M-A) AND METHOD-B (M-B) ARE TWO this is because the KL loss naturally encourages larger margins
TYPES OF MANIPULATED VIDEOS RELEASED IN THE DFDC PREVIEW DATASET between the two different distributions. By encouraging existing
classes (real vs. forgery) to be far apart, unforeseen cross-domain
variations are more likely to fall on the correct side of the
boundary.

F. Compression
A simple but effective way to bypass deepfake detection
is to apply compression to the deepfake video. This section
examines the performance of our models on two different types
of compression techniques (JPEG and MPEG).
JPEG Compression. OpenCV [75] provides an option to
compress individual frames with a reduction in quality as shown
in Fig. 4. The quality factor is a measure of the compression rate
of JPEG images, ranging between 0 and 100. High numbers
with LKL loss tend to converge faster than the cross-entropy between 90-100 represent minimal visual quality loss, while
loss LCE because the latent distribution generated by the model numbers less than 60 represent visually significant quality loss.
is distinct between the original and the forged faces. By contrast, Lower quality factor images take up less disk space. Table V
EN variants of the model take the longest to train (ten epochs) shows the test accuracies for our models as well as for the
because the gradients from both the losses compete with each DenseNetAligned and XceptionNet models, two of the best
other. The convergence is slower in the early stages of the models in our results for uncompressed video. Our models
training, but once both the channels start learning, the rate of perform the best on most settings in the Face2Face, Deepfakes,
convergence nearly doubles. FaceSwap, and Celeb-DF datasets, while XceptionNet performs
The next sections will explore these models further to help the best for JPEG compression on NeuralTextures fakes. We note
understand how the architectural variations affect performance. a severe degradation in deepfake detection performance across
all the models for highly compressed faces. The accuracies are
significantly more affected by compression on re-enactment-
E. Cross-Domain Results
type fakes (Face2Face and NeuralTextures) than the faceswap-
A critical property of deepfake detection models for real- type (Deepfakes and FaceSwap).
world use is good inference performance on deepfake types MPEG Compression. The models were additionally tested on
not included in the training dataset. One way to measure this the effects of video compression using videos saved at two differ-
performance is to train on publicly available deepfake datasets ent H.264 quantization factor levels, as previously explored by
from Table II, and test on the unforeseen cross-domain samples Agarwal et al. [27]. The quantization factor reflects the MPEG
in the Deepfake Detection Challenge preview dataset [73]. As compression rate of the videos. The higher the quantization
shown in Table IV, we use binary class models, such that we factor, the lower the quality and greater the compression. We
expect the unforeseen fakes, M-A and M-B to be both classified selected a quantization value of 20 to represent high-quality

Authorized licensed use limited to: VIT University. Downloaded on February 13,2024 at 15:51:06 UTC from IEEE Xplore. Restrictions apply.
CHINTHA et al.: RECURRENT CONVOLUTIONAL STRUCTURES FOR AUDIO SPOOF AND VIDEO DEEPFAKE DETECTION 1031

TABLE V
COMPRESSION: RESULTS FOR DIFFERENT RATES OF COMPRESSION. RESULTS
IN BOLD INDICATE THE BEST RESULTS ACROSS THE FOUR TESTED MODELS
FOR THE GIVEN SETTING. NOTE THAT HIGH JPEG QUALITY MEANS LESS
COMPRESSION, WHILE HIGH MPEG QUANTIZATION MEANS MORE
COMPRESSION

Fig. 4. Visualization of image compression on a test sample. The original


frame is shown in the top-left corner. Compression using JPEG Quality 75,
50, and 25 are displayed in the top-right, lower-left, and lower-right corners,
respectively.

MPEG videos and 40 to represent low quality MPEG videos.


Table V shows a drastic drop in accuracy at the quantization
factor of 40.
Data Augmentation. As the compression artifacts divulge an
area of weakness for all methods, we considered retraining
models with data augmentation. The models from Table V are
retrained using both the original training data as well as with their
corresponding JPEG compressed faces with a quality factor of
50. The test results are presented in Table VI. With a minimal
samples suffers. We train our model on FaceForensics++ and
loss in accuracy on the original test sets, all the models gained
transfer learn to Celeb-DF. During transfer learning, all the
robustness in detecting forged faces which are compressed at
layers but the classification layer (for CE variant) are frozen.
various compression levels. In this setting, our models become
After convergence, we lower the learning rate and fine-tune all
the top performers in all but three of the 28 configurations tested.
layers. To prevent catastrophic forgetting, we include 50 (out
Impressively, the XcepTemporal (KL) model attains 84.8-100%
of 720) randomly chosen samples from each FaceForensics++
accuracy for JPEG Quality of 25, which has a substantial re-
deepfake type along with the Celeb-DF training data. The results
duction in visual quality, as shown in Figure 4. In contrast,
are shown inTable VII. We note that after the final stage of
neither DenseNetAligned nor XceptionNet got above 81% in
training, our models perform well on all datasets, showing their
any of the tests with this low quality. On MPEG compression,
ability to learn new types of fakes without forgetting how to
our models also perform best in the Face2Face, NeuralTextures,
classify original fakes.
and Deepfakes, with XcepTemporal (CE) getting above 83% on
these datasets. On the FaceSwap dataset, however, XceptionNet
H. Ablation Study
was significantly better – this marks the main exception to our
overall findings that showed our models to be superior. To help understand our model further, we explore several
An interesting observation is that the test accuracies for the aspects and elucidate on the decisions involved in defining the
MPEG compression rise after retraining on datasets augmented model.
just with the JPEG compression samples and not with the MPEG Recurrent layers: To understand the importance of recurrent
samples. layers in the XcepTemporal model, Table VIII shows the com-
binations between uni/bidirectional and single/double LSTM
layers. The results in Table VIII suggest that the backward pass
G. Transfer learning / Domain Adaptation.
within a recurrent layer extracts meaningful temporal informa-
As new deepfake generation algorithms are released, com- tion. In addition, the added parameters from the secondary layer
panion deepfake detection algorithms follow. In a phenomenon help improve results.
referred to as catastrophic forgetting, when new data samples are Length and Stride: We now examine the variation in perfor-
introduced into a training regiment, the performance on original mance for different lengths of the input sequence and different

Authorized licensed use limited to: VIT University. Downloaded on February 13,2024 at 15:51:06 UTC from IEEE Xplore. Restrictions apply.
1032 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 14, NO. 5, AUGUST 2020

TABLE VI
COMPRESSION: MODELS TRAINED ON AUGMENTED TRAINING DATA
CONSISTING OF THE ORIGINAL AND JPEG COMPRESSED (QUALITY-50)
VERSIONS OF ALL THE TRAINING FACES

Fig. 5. CRNNSpoof model (left) consists of five convolution layers to extract


features and downsample the input audio before an LSTM and two fully-
connected layers. The WIRENetSpoof (right) uses log-melspectrogram from the
first four seconds of audio before strided convolution and max-pooling layers
perform feature extraction.

A. Audio Model
CRNNSpoof: Inspired by our XcepTemporal model, our
first model for spoofing detection is a convolution-recurrent
neural network (Fig. 5 left). Since raw audio is used instead of
lengths of strides between two consecutive input sequences. spectral features as the input to the network, five 1-D convolution
Longer sequence lengths enable temporal features of longer layers are used to learn useful representations. Additionally, the
duration but require more compute resources. The combined convolutions are strided to downsample the input signal from
accuracy on both the test sets shown in Table II for different 16 kHz to 100 Hz, which reduces the memory footprint while
lengths and strides is shown in Table IX. While the overall also speeding up the training process. The extracted features
accuracy does not change much, we chose a length and stride are then passed to a bidirectional LSTM layer. The hidden state
of eight because 1) shorter sample lengths translate into more from the last LSTM timestep is used to perform prediction using
training samples; and 2) shorter sample lengths are more capable two fully-connected layers. Dropout and batch normalization
at detecting short sequences of manipulated frames embedded are used after each layer to perform regularization. The network
between real frames. is trained using the negative log-likelihood loss function. Due
to the unbalanced nature of the dataset, a misclassification of a
spoofed speech sample incurs heavier loss than misclassification
V. PROPOSED SPOOF DETECTION ARCHITECTURE of a real speech sample.
None of the deepfake detection datasets include manipulated WideBlock: We introduce a WideBlock (Fig. 6) architecture,
audio. The ASVSpoof2015 challenge [76], ASVSpoof2017 named for the high number of paths in each block. We use this
challenge [29], and ASVSpoof2019 challenge [63] have helped block in a second fully-convolutional audio approach described
spur deepfake audio research, referred to in the research com- in the next section. The architecture of the WideBlock, taking in-
munity as spoof detection. We introduce our spoof detection spiration from ResNeXt blocks used in image classification [77],
methodologies in isolation, while we anticipate the creation consists of several parallel streams, each consisting of bottleneck
of multimodal deepfake datasets, where the visual and audio 1 × 1 convolution layers before and after a normal convolution
channels can mutually benefit one another for more advanced layer. The bottleneck layers reduce the complexity of the model
forgery detection. by reducing the number of parameters required by the middle

Authorized licensed use limited to: VIT University. Downloaded on February 13,2024 at 15:51:06 UTC from IEEE Xplore. Restrictions apply.
CHINTHA et al.: RECURRENT CONVOLUTIONAL STRUCTURES FOR AUDIO SPOOF AND VIDEO DEEPFAKE DETECTION 1033

TABLE VII
TRANSFER LEARNING/DOMAIN ADAPTATION: MODELS ARE INITIALIZED ON FACEFORENSICS++ [22] AND TRAINED ON CELEB-DF [69]

TABLE VIII WIRENetSpoof: The second proposed model is a convolution


ABLATION: ACCURACY OF XCEPTEMPORAL (CE)
WITH DIFFERENT RECURRENT STRUCTURES
neural network (Fig. 5 right), called Wide Inception Residual
Network (WIRENet) Spoof. Since a countermeasure only needs
to produce one score for an entire utterance instead of one
prediction per timestep, the architecture uses strided convolution
and max pooling operations to reduce the length of the feature
map as it passes through the network. The input audio is either
clipped or repeated to a fixed length of four seconds before the
log-mel spectrogram is obtained. The network uses a series
TABLE IX of WideBlocks to capture different levels of temporal infor-
ABLATION: ACCURACY OF XCEPTEMPORAL (CE) ON THE COMBINED TEST mation. Dropout and batch normalization are used after each
SETS FOR VARIOUS LENGTH AND STRIDE COMBINATIONS
layer to avoid overfitting. Similar to the CRNNSpoof model, the
WIRENetSpoof model is also trained using weighted negative
log-likelihood.
For both models, the detection score is defined as:
score = log(p(real|s, θ) − log(p(f ake|s, θ)
where s and θ are the given audio sample and the parameters
of the model.

B. Audio Datasets
ASVSpoof 2019 Challenge [63] consists of two spoofing
scenarios: logical access and physical access. In the logical
access scenario, fake audio is created using speech-synthesis
or speech-conversion software. In the physical access scenario,
fake audio is created by replaying a pre-recorded audio using
a splice of the real speaker data. For this research, we use only
the logical access data, since it is easier and more dangerous to
create speeches from arbitrary text. There are 17 different types
of fakes in the dataset, with six designated as known attacks and
11 as unknown attacks. Only known attacks are present in the
train and development set, while all 17 are present in the test set.
Fig. 6. WideBlock consists of multiple paths, each with different kernel size
to capture different levels of temporal dependencies. A skip-connection is used
to aid with gradient flow. C. Baselines
We compare the results of our model with both baseline results
convolution operation. Instead of keeping the same filter size (Baseline 1 and Baseline 2) provided by the organizers of the
for all paths, we draw inspiration from Inception networks and ASVSpoof 2019 contest and state-of-the-art benchmark systems
employ filters with different sizes in each parallel path. We use that we refer to simply as Benchmarks A-E. Baseline 01 and
nine filter sizes, the widths of which are odd numbers between 3 Baseline 02 are GMM models trained on Linear Frequency Cep-
and 19. The different filter sizes allow the model to pick up both stral Coefficients (LFCC) and Constant-Q cepstral coefficients
short-term and long-term temporal context. The output from (CQCC) input features, respectively. Benchmark A is a CNN
each path is then summed before being added to the input of architecture proposed by Chettri et al. as a countermeasure for
each block, forming a skip connection. replay attacks [78]. Benchmarks B, C, and D are the same CNN

Authorized licensed use limited to: VIT University. Downloaded on February 13,2024 at 15:51:06 UTC from IEEE Xplore. Restrictions apply.
1034 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 14, NO. 5, AUGUST 2020

TABLE X TABLE XI
RESULTS OF PROPOSED COUNTERMEASURES WITH ABLATION: ACCURACY OF CRNNSPOOF WITH
OTHER BENCHMARKS AND BASELINE METHODS DIFFERENT RECURRENT STRUCTURES

TABLE XII
COMPRESSION: RESULTS FOR DIFFERENT RATES OF AUDIO COMPRESSION.
RESULTS IN BOLD INDICATE THE BEST RESULTS ACROSS THE FOUR TESTED
MODELS FOR THE GIVEN SETTING. NOTE THAT LOWER
BITRATE MEANS MORE COMPRESSION

architecture trained using MFCC, power spectrogram, and con-


stant Q cepstrum coefficients input features, respectively [64].
Benchmark E is a fusion model using benchmarks B, C, and D.

D. Audio Results
For consistency with the prior work, we evaluated the spoof-
detection models using metrics provided by the organizers of
the ASVSpoof 2019 challenge: tandem detection cost function
(t-DCF) and equal error rate (EER). In biometric security sys-
tems, EER is used to determine the threshold at which the false
model achieves the 9th best result in terms of t-DCF, the primary
positive rate (FPR) equals the false negative rate (FNR). At that
metric of the challenge, and 6th best in terms of EER when
threshold, F P R = F N R = EER. t-DCF is a metric proposed
evaluated on the evaluation set. It is notable that all models
by the organizers of the ASVSpoof2019 challenge as a means
that perform better in terms of t-DCF, as well as EER, use
to evaluate the performance of a spoofing detection method in
an ensemble of classifiers. When only single classifiers are
conjunction with a given automatic speaker verification (ASV)
considered, the proposed CRNNSpoof model achieves the best
method. The t-DCF is computed as:
result in both t-DCF and EER. The next best non-ensemble
min cm
t − DCFnorm = min{βPmiss (s) + Pfcm
a (s)}, (5) model from the Interspeech 2019 challenge achieves t-DCF of
s
0.1404 and EER of 5.74%.
cm
where Pmiss and Pfcm
a correspond to the FNR and FPR of the Recurrent layers. Table XI shows results from an abla-
countermeasure system, respectively. β represents the perfor- tion study on our CRNNSpoof model. Similar to our visual
mance of the ASV system. For the ASVSpoof 2019 challenge, XcepTemporal models, we explore the usage of single and dou-
the organizer provides the ASV score, so β is a fixed value. β ble recurrent layers, as well as uni- and bi-directional LSTM. The
is inversely proportional to the false acceptance rate of the ASV results suggest that the model with single unidirectional LSTM
system to a specific attack. layer performs best. We believe the other models are overfitting,
While both models fail to perform better than the given which we could solve by reducing the number of parameters
baseline and benchmark methods on the development set, the (filters or fully connected layers), increased regularization, or
CRNNSpoof model achieves significantly better results, both in an increase in the training set size.
t-DCF and EER, compared to other models on the evaluation set. Compression. To test the robustness of our models concerning
The CRNNSpoof model outperforms the best baseline model the quality of the audio, we perform audio compression. Similar
(Baseline 01) by 47% in terms of EER and 37% in terms of to video deepfakes, the greater the audio compression, the more
t-DCF. Meanwhile, the WIRENetSpoof model achieves better difficult the detection of real from a spoof. The reduction in the
EER than other models, except for the CRNNSpoof model, on audio quality is done by decreasing the bitrate of the signals
the evaluation set. However, the t-DCF score of the WIRENet- via the LAME4 encoder using lossy compression schemes.
Spoof model is only better than that of Benchmark C. Table XII shows the effect of compression on our models when
In addition to the results from Table X, we compare our trained on the original audio samples while Table XIII shows a
methods to the results published at Interspeech 2019 [63]. There similar comparison when trained with original audio augmented
is no obvious way to learn what model is evaluated for each entry with compressed audio data. We compare two versions of our
on the Interspeech leaderboard because Interspeech does not
publish the authors or the paper titles. However, our CRNNSpoof 4 [Online]. Available: http://www.harmjschoonhoven.com/mp3-quality.html

Authorized licensed use limited to: VIT University. Downloaded on February 13,2024 at 15:51:06 UTC from IEEE Xplore. Restrictions apply.
CHINTHA et al.: RECURRENT CONVOLUTIONAL STRUCTURES FOR AUDIO SPOOF AND VIDEO DEEPFAKE DETECTION 1035

TABLE XIII REFERENCES


COMPRESSION: MODELS TRAINED ON ORIGINAL AUDIO DATA COMBINED
WITH AUGMENTED AUDIO DATA. THE AUGMENTED DATA INCLUDED [1] J. Fletcher, “Deepfakes, artificial intelligence, and some kind of dystopia:
AUDIO SAMPLES COMPRESSED AT BOTH BITRATE EQUAL TO 192 AND 128 The new faces of online post-fact performance,” Theatre J., vol. 70, no. 4,
Dec. 2018.
[2] A. Ovadya and J. Whittlestone, “Reducing malicious use of synthetic
media research: Considerations and potential release practices for machine
learning,” 2019, arXiv:1907.11274.
[3] I. Korshunova, W. Shi, J. Dambre, and L. Theis, “Fast face-swap using
convolutional neural networks,” in Proc. IEEE Int. Conf. Comput. Vision,
Venice, pp. 3697–3705, 2017.
[4] I. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. Neural
Inf. Process. Syst., 2014, pp. 2672–2680.
[5] S.-A. Lu, “faceswap-GAN,” [Online]. Available: https://github.com/
shaoanlu/faceswap-GAN.
[6] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner,
“Face2face: Real-time face capture and reenactment of rgb videos,”
in Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR),
Jun. 2016.
[7] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of GANs
for improved quality, stability, and variation,” 2017, arXiv:1710.10196.
[8] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture
for generative adversarial networks,” in Proc. IEEE Conf. Comput. Vision
Pattern Recognit., 2019, pp. 4401–4410.
CRNNSpoof (Uni + Single and Bi + Double) as well as our [9] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan:
WireNetSpoof model. The WIRENetSpoof model benefits Unified generative adversarial networks for multi-domain image-to-image
tremendously from an augmented training set when presented translation,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2018,
pp. 8789–8797.
with compressed audio during inference, but is not as good [10] G. Perarnau, J. Van De Weijer, B. Raducanu, and J. M. Álvarez, “Invertible
as the CRNNSpoof models. We note while the Uni + Single conditional gans for image editing,” 2016, arXiv:1611.06355.
CRNNSpoof performs best when training only on original data [11] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-
image translation using cycle-consistent adversarial networks,” 2017,
and the Bi + Double CRNNSpoof performs best when training pp. 2223–2232.
with original data augmented with compressed audio samples. [12] Y. Shen, J. Gu, X. Tang, and B. Zhou, “Interpreting the latent space of
This is as expected as the greater number of parameters in gans for semantic face editing,” 2019, arXiv:1907.10786.
[13] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation
the BI + Double model benefits from the number of training learning with deep convolutional generative adversarial networks,” 2015,
samples, suggesting that an even larger dataset would help this arXiv:1511.06434.
model further. [14] J. Thies, M. Zollhöfer, and M. Nießner, “Deferred neural rendering: Image
synthesis using neural textures,” 2019, arXiv:1904.12356.
VI. CONCLUSION [15] M.-Y. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and
Inspired by recent progress in digital forgery, we introduce the XcepTemporal convolutional recurrent neural network
framework for deepfake detection. We use an XceptionNet CNN to extract salient and efficient facial feature
representations. These representations are passed into bidirectional recurrence layers to aid in detecting temporal
inconsistencies. Our model is trained with both traditional cross-entropy and KL divergence loss functions. On the
audio side, we introduce a companion architecture that stacks multiple convolution modules to obtain audio feature
representations. These audio embeddings are similarly passed into a bidirectional recurrent layer. For visual deepfake
detection, we demonstrate the robustness of our methods to both compression and out-of-sample inference on the popular
FaceForensics++ and Celeb-DF datasets, as well as on the recently introduced Deepfake Detection Challenge. For audio
spoof detection, we demonstrate the robustness of our methods to both compression and out-of-sample inference on the
ASVSpoof 2019 Challenge dataset. For both visual deepfake detection and audio spoof detection, our methods establish
new benchmarks and are shown to generalize well to unforeseen attacks.
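For readers who wish to prototype the overall structure, the following PyTorch-style sketch illustrates the convolutional-recurrent pattern summarized above: per-frame CNN features feeding a bidirectional LSTM, trained with a cross-entropy term plus a KL divergence term. This is not our exact XcepTemporal implementation; a torchvision ResNet-18 stands in for the XceptionNet backbone, the layer sizes are illustrative, and the KL term against a smoothed target distribution is only one plausible instantiation of the combined loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class ConvRecurrentDetector(nn.Module):
    """Per-frame CNN features -> bidirectional LSTM -> real/fake logits.

    A sketch of the convolutional-recurrent idea only; ResNet-18 stands in
    for the XceptionNet backbone and all sizes are illustrative.
    """
    def __init__(self, hidden: int = 256, num_classes: int = 2):
        super().__init__()
        backbone = resnet18()
        # Drop the classification head; keep the 512-d pooled features.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.rnn = nn.LSTM(512, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W) face crops.
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)   # (b*t, 512)
        seq, _ = self.rnn(feats.view(b, t, -1))              # (b, t, 2*hidden)
        return self.head(seq.mean(dim=1))                    # pool over time

def combined_loss(logits: torch.Tensor, labels: torch.Tensor, smooth: float = 0.1):
    """Cross-entropy plus a KL term against a smoothed target distribution
    (one plausible reading of combining cross-entropy and KL divergence)."""
    ce = F.cross_entropy(logits, labels)
    num_classes = logits.size(1)
    target = torch.full_like(logits, smooth / (num_classes - 1))
    target.scatter_(1, labels.unsqueeze(1), 1.0 - smooth)
    kl = F.kl_div(F.log_softmax(logits, dim=1), target, reduction="batchmean")
    return ce + kl
```

Temporal pooling by averaging is the simplest choice in this sketch; attention over the bidirectional states or taking the final hidden state would be natural alternatives.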
ACKNOWLEDGMENT

This effort was funded in part by the Miami Foundation through the Ethics and Governance of the Artificial Intelligence Initiative.
Akash Chintha received the B.E. (Hons.) degree in electrical and electronics engineering from the Birla Institute of Technology and Science - Pilani, Pilani, India, in 2018. He is currently working toward the M.Sc. degree in computer engineering with Rochester Institute of Technology, Rochester, USA. He is a Graduate Research Assistant at the Machine Intelligence Laboratory at RIT. His current research interests include deep learning, data science, computer vision, and natural language processing.

Bao Thai is working towards the B.S./M.S. degree in computer engineering at Rochester Institute of Technology, Rochester, USA. He is currently a Graduate Research Assistant at the Machine Intelligence Laboratory at RIT. His current research interests include machine learning, deep learning, natural language processing, audio processing, and speech recognition.

Saniat Javid Sohrawardi received the B.Sc. degree in computer science and engineering from North South University, Dhaka, Bangladesh, in August 2014. He is currently pursuing the Ph.D. degree in computing and information sciences with RIT and is a Graduate Research Assistant at the Center for Cybersecurity. His current research areas include usable security, information privacy, deep learning, and media forensics.

Kartavya Bhatt received the B.Tech. degree in information and communication technology from Ahmedabad University, Gujarat, India, in May 2019. He is currently pursuing the M.S. degree in computer science with the Rochester Institute of Technology, Rochester, USA. Since October 2019, he has been working as a Graduate Research Assistant at the Center for Cybersecurity at RIT, USA. His current research areas are deep learning, image processing, saliency, computer vision, and machine learning.

Matthew Wright received the B.S. degree in computer science from Harvey Mudd College, and the M.S. and Ph.D. degrees from the Department of Computer Science at the University of Massachusetts, in 2002 and 2005, respectively. His dissertation work examined attacks and defenses of systems that provide anonymity online. He is the Director of Research for the Global Cybersecurity Institute (GCI) at Rochester Institute of Technology (RIT) and a Professor of Computing Security. His other interests include adversarial machine learning and understanding the human element of security. He has been the lead investigator on over $3.7 million in funded projects, including an NSF CAREER award, and he has published 100 peer-reviewed papers, including numerous contributions in the most prestigious venues focused on computer security and privacy.

Andrea Hickerson received the B.A. degree in journalism and international relations from Syracuse University, the M.A. degree in journalism and the M.A. degree in Middle Eastern studies from the University of Texas at Austin, and the Ph.D. degree in communication from the University of Washington in 2009. She is a Professor, Director of the School of Journalism and Mass Communications, and an Associate Dean in the College of Information and Communications at the University of South Carolina, USA. She conducts research on journalism routines with an emphasis on the use of technology and political communication, particularly in transnational and immigrant communities.

Raymond Ptucha received the B.S. degree in computer science and the B.S. degree in electrical engineering from SUNY/Buffalo, the M.S. degree in image science from RIT, and the Ph.D. degree in computer science from the Rochester Institute of Technology (RIT), USA, in 2013. He is an Associate Professor in Computer Engineering and Director of the Machine Intelligence Laboratory at RIT. His research includes machine learning, computer vision, and robotics, with a specialization in deep learning. Ray was a research scientist with Eastman Kodak Company, where he worked on computational imaging algorithms and was awarded 32 U.S. patents. He is a passionate supporter of STEM education, is an NVIDIA certified Deep Learning Institute Instructor, and Chair of the Rochester area IEEE Signal Processing Society.