Recurrent Convolutional Structures For Audio Spoof and Video Deepfake Detection
5, AUGUST 2020
Authorized licensed use limited to: VIT University. Downloaded on February 13,2024 at 15:51:06 UTC from IEEE Xplore. Restrictions apply.
CHINTHA et al.: RECURRENT CONVOLUTIONAL STRUCTURES FOR AUDIO SPOOF AND VIDEO DEEPFAKE DETECTION 1025
2) We develop an analogous architecture for audio spoof detection.
3) We explore the use of entropy-based loss functions for deepfake detection in both isolation and ensemble form.
4) We demonstrate that our visual and audio methods set new benchmarks on the FaceForensics++ and Celeb-DF video datasets and ASVSpoof 2019 Logical Access audio dataset.
5) We perform ablation studies that show the robustness of these methods across a range of conditions.
II. RELATED WORK
A. Video: Deepfake Generation
Deepfake creation is a relatively new field in digital forgery. Although the generation process began with traditional vision and voice impersonation, most recent works involve generative adversarial networks (GANs). We will first describe the traditional approaches, and then the adversarial methods.
In general, there are two types of deepfake generation, face swapping and face re-enactment. In face swapping, the face of a target person is overlaid on the face of a source. The overlaid faces are post-processed to blend the edges to match the source’s facial outline. This can be used, for example, to make Nicholas Cage (the source actor) appear in a movie in place of the original (target) actor.1 Faceswaps [34] are a graphical approach, where the facial landmarks (nose, eyes, eyebrows, lips, chin, and cheek areas) play a major role in morphing the target’s face with the source’s, and the output is post-processed with edge polishing and color correction.
Facial re-enactment, on the other hand, is used to make a target appear to act and speak like the source. This is used, for example in the videos where former U.S. President Barack Obama is made to say whatever the source video says.2 Facial re-enactment techniques model both the source and target faces to identify facial landmarks and manipulate the target’s landmarks to match the source’s facial movements. Face2Face [6] is a face re-enactment method that translates the facial expressions of a source subject with a target while maintaining the facial features of the target.
More recent Deepfake methods are based on Generative Adversarial Networks (GANs) [4]. GANs consist of two competing neural networks: a generator G that generates fake samples that mimic real samples from a target dataset, and a discriminator D that tries to tell fakes apart from real samples. These two networks are trained simultaneously, such that over the training period, both the generator and discriminator improve. Upon successful convergence, the generator can then be used to create realistic-looking examples. To promote variation, G is seeded with a noise vector, but this noise vector can be paired with a latent representation of an object (word, image, sentence), to constrain the resulting image [35].
Zhu et al. [36] and Kim et al. [37] built on the GAN concept and replaced the noise vector with input images. This modification enabled their cycle-consistent GANs to alter the domains of the output images based on the input image domains. Similarly, in deepfake generation, it is possible to retain the facial expressions of a source person while transferring identities to a target person. Lu et al. [38] proposed an identity-guided conditional CycleGAN to translate low-resolution face images to high-resolution face images. Kim et al. [39] accomplished a similar transfer, with the difference being that they transfer expressions as well as 3D pose into a target image, creating a video in the process. Faceswap-GAN [5] created realistic looking imagery using adversarial and perceptual losses. The addition of a perceptual loss was shown to minimize unnatural artifacts such as awkward eyeball movements. Temporal smoothing of the frame-to-frame face detection box and an attention mask made the created videos appear more realistic. NeuralTextures [14] is a facial re-enactment forgery that relies directly on facial landmarks. These fakes are generated using a patch-based adversarial loss alongside a photometric reconstruction loss. In an interesting twist, Suwajanakorn et al. [40] achieved photorealistic results by only requiring audio as an input to generate forged videos. Using the weekly addresses of Barack Obama, the raw audio features are first mapped to mouth shapes, then mouth textures, then 3D pose mapping, and finally compositing of the head and torso from stock footage.
B. Video: Deepfake Detection
Since the threats posed by deepfake video manipulations became apparent in early 2018, several works have looked into detecting them. A few of the detection techniques targeted handcrafted features [24], [26], [41] like blinking inconsistencies, biological signals, and unrealistic details. Most of these manually crafted detection features exploit known weaknesses in creation methods. Like a cat-and-mouse game, deepfake creation methods quickly adapted to circumvent detection, and the cycle repeats. Recent detection methods rely on machine learning on deepfake datasets to automatically discover forgeries from real videos.
MesoNet [18] uses a shallow convolutional network to detect forgery at a mesoscopic (or intermediate) level of detail, intentionally avoiding focusing too much on microscopic features that could be lost due to video compression. They also introduced a variant of their model that replaces regular convolution blocks with MesoInception blocks to get slight improvements. The Capsule-Forensics method proposed by Nguyen et al. [21] uses capsule networks [42] for the detection of replay attacks as well as computer-generated images and videos. They argue that the chances of detecting high-quality forgeries would be increased with the agreement between capsules through dynamic routing [43]. Cozzolino et al. [23] applied an autoencoder-based architecture to show its usefulness for transfer learning. Nguyen et al. [25] extended Cozzolino et al.’s network by replacing the standard decoder with a decoder that additionally generates a mask of the manipulated region through multi-task learning.
1 [Online]. Available: https://www.theguardian.com/technology/ng-interactive/2019/jun/22/the-rise-of-the-deepfake-and-the-threat-to-democracy
2 [Online]. Available: https://www.buzzfeednews.com/article/davidmack/obama-fake-news-jordan-peele-psa-video-buzzfeed
1026 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 14, NO. 5, AUGUST 2020
XceptionNet [32] is one of the more promising deep neural models for feature extraction when trained either from scratch or with pre-trained ImageNet [44] weights. As such, we choose XceptionNet as an embedding space instead of the networks used in prior works [20], [33]. The network architecture not only uses skip connections akin to ResNet [45] with an Inception-like [46] convolutional arrangement, but it also has a modified version of depth-wise separable convolutional layers for reducing the number of parameters with marginally better performance.
While the previously mentioned methods targeted intra-frame inconsistencies, Güera and Delp [20] introduced a spatio-temporal model with InceptionV3 [46] as the feature extraction network to tackle deepfakes. The features from their time-distributed extraction network are forwarded to a unidirectional Long Short-Term Memory (LSTM) network that ultimately makes the classification decision. Sabir et al. [47] evaluated the same architecture with different feature extractors [45], [48]. Their face-extraction process was adjusted by aligning the faces from consecutive frames using facial landmarks to maintain temporal consistency. While their performance on manipulated videos [20] from the HOHA dataset [49] was quite good, performance on the larger and more complex FaceForensics++ [22] dataset was not as effective.
In a short paper, Sohrawardi et al. [33] described and very briefly evaluated models based on the FaceNet [50] CNN architecture, with weights initialized from the VGGFace2 [51] facial recognition task. The trained CNN maps the faces to a compact latent representation where similar faces map close and dissimilar faces map far apart. These features were then passed into a ConvLSTM network [20].
Agarwal et al. [27] took an alternative approach to detection by treating the problem as one of anomalous behavior detection. They learn the typical behavior of the subject from real videos by extracting facial landmarks and temporal actions and use them to train a one-class SVM. Although this method cannot be used on unforeseen faces, it is an effective option to preserve robustness against future improvements in other aspects of deepfake generation. For example, they claim there is already sufficient data for most world leaders, although it would be a considerable undertaking to train a detector for all government officials, corporate leaders, or persons of interest.
Conclusion. Although all of the aforementioned methods excel at specific types of deepfake detection, they are not as effective at handling unforeseen deepfake manipulations on which they have not been trained, referred to as cross-domain adaptation. With many possible ways to generate deepfakes, it is critical for realistic and robust open-world detection to have a model that performs well across multiple types of manipulated videos. In our work, we not only demonstrate new benchmark performance on existing datasets, we also show that our approach leads to robust detection performance in a range of scenarios including in cross-domain tests.
C. Audio: Spoof Generation
There are two main categories of audio spoofing attacks: Logical Access (LA) and Physical Access (PA). In the LA case, the spoofed audio is generated using either a text-to-speech or a voice conversion software. With a text-to-speech synthesis software, a text input can be converted to a speech output with a voice similar to that of a target speaker. A voice conversion software allows a speech given by a source speaker to be converted to a different utterance which has the same linguistic content (i.e. textual information) but with a different speaker identity. The PA scenario focuses on security systems that use voice detection. In this scenario, a pre-recorded speech from the target speaker is replayed in order to trick the security system to believe the target speaker is actually speaking. For this research, we focus on the LA scenario since this type of attack allows the adversary to easily create fake media by falsely representing the identity of the speaker or falsely creating the content with a given speaker identity.
Modern methods of computer-generated audio spoofing using LA need only a few minutes of recorded audio of a target speaker to create accurate spoofs that are hard to detect by humans. van den Oord et al. [52] introduced WaveNet, a deep neural network model that can generate speech from a text in a fully probabilistic and autoregressive manner and can be conditioned on speaker identity. Arik et al. [53] introduced a voice cloning model based on speaker adaptation and speaker encoding that can generate speech with a similar voice using only a few training samples. Kameoka et al. [54] used a StarGAN-based [9] model to perform many-to-many voice conversion without the need for parallel training data. By replacing images with spectrograms and representing speaker identity with domain vectors, generated spectrograms are converted back to raw audio using the Griffin-Lim algorithm [55]. Tanaka et al. [56] proposed a sequence-to-sequence recurrent model that uses attention and context preservation mechanisms to perform voice conversion with realistic results.
D. Audio: Spoof Detection
To counter the spoofing problem, different speech features, such as constant-Q cepstrum coefficients (CQCC) [57], or MFCCs, have been combined with Gaussian Mixture Models (GMMs) [58] and deep neural networks. Competitions such as the Voice Conversion Challenge 2018 [28] show many promising parallel and non-parallel voice conversion techniques, which fuel new ideas and training sets for spoof detection.
Early speaker identification techniques used vector quantization [59], GMMs, and feature decomposition paired with Support Vector Machines (SVMs) [60]. Newer methods use neural network based approaches [29]–[31]. Balamurali et al. [61] reviewed traditional and newer machine learning spoof detection methods. Their best methods use a GMM Universal Background Model to fuse traditional features (MFCC, spectrogram, CQCCs, etc.) with an autoencoder neural network latent vector representation. Chettri et al. [62] used an ensemble of 2-D convolution-recurrent networks, 1-D convolutional networks, GMMs, and SVMs to perform spoofing detection on the ASVSpoof 2019 dataset [63]. For the same dataset, Alzantot et al. [64] used a family of three different 2-D CNNs with residual skip-connections. The three variants accepted MFCC,
log-magnitude STFT, and CQCC input features. Das et al. [65] proposed to use long-range acoustic features based on the octave power spectrum instead of the linear power spectrum for spoofing detection. While these methods achieve very good results on attack methods they were trained on, most fail to detect attacks using unseen methods.
III. PROPOSED DEEPFAKE DETECTION ARCHITECTURE
Our architecture for deepfake detection is inspired by the XceptionNet [32] architecture along with recurrent processing used in ConvLSTM [20] and FaceNetLSTM [33].
We use a convolutional architecture to obtain a vector representation of a facial region of a frame, f_i. A sequence of such facial regions from frames, f_1, f_2, ..., f_f, is passed into a bidirectional LSTM module to learn a latent representation capable of discriminating between facial manipulations and original faces.
A. Preprocessing
The dlib [66] face detector determines the primary face over each frame in the video. Canonical face images are generated by cropping to the dlib facial bounding box, resampling to 299 × 299 pixels, and normalizing to zero mean and unit variance.
As dlib extracts faces individually for each frame, we observed that the subsequent face images have subtle differences in the box coordinates. Similar to Faceswap-GAN [5], we apply a linear smoothing filter over the box coordinates of consecutive frames to mitigate the temporal inconsistency introduced in the extraction process.
B. Model
Canonical faces are encoded using the XceptionNet architecture (Fig. 2) [32]. We propose XcepTemporal, a convolutional recurrent network with multiple levels of temporal feature abstraction. The spatial features from the XceptionNet
where y_c is the ground truth logit value and (1 + n_f) represents one class for real videos alongside n_f deepfake classes.
KL divergence (relative entropy) is a natural approach to measure the differences amongst probability distributions. Inspired by the application of KL divergence in variational autoencoders [67], the high-level idea is to disentangle different probability distributions via parameter learning. We hypothesize that the probability distribution of the real and ersatz material [68] when mapped into a two-dimensional space from the latent feature representation generated after the primary fully-connected layer can be disentangled.
In this setting, the mean (μ) and variance (σ²) of the bivariate normal distribution N are estimated from the latent feature vector. The loss function encourages the model to distinguish one distribution from another. The bivariate KL divergence is represented as:
$$D_{KL}\left(\mathcal{N}\left((\mu_1, \mu_2)^T, \mathrm{diag}(\sigma_1^2, \sigma_2^2)\right) \,\|\, \mathcal{N}(0, I)\right) = \lambda \left( \sum_{i=1}^{n} \sigma_i^2 + \mu_i^2 - \log(\sigma_i) - 1 \right) \tag{2}$$
We use λ = (1/2).
Similar to the Fisher linear discriminant loss, the KL loss is equivalent to the sum of class-wise KL divergences. Intuitively, the KL loss distributes all the learned samples of each class in a densely packed normal distribution and clusters the samples from one class further away from other classes. The loss composition of the KL-divergence metric is defined as:
$$L_{KL} = \sum_{i=1}^{n} y_i \log \frac{y_i}{\hat{y}_i} \tag{3}$$
The choice of using a two-dimensional space was determined empirically.
We also propose to use an ensemble of the learning procedures with a convex combination of the cross-entropy and KL divergence losses. The ensemble loss function is defined as:
$$L_{EN} = \lambda_1 L_{KL} + \lambda_2 L_{CE} \tag{4}$$
where λ1 and λ2 are the weights allocated to respective loss metrics to provide a flexible overall loss function. We typically allocate a larger weight to KL divergence than cross-entropy for deepfake detection, as KL divergence aids in driving the real and fake distributions further apart.
Fig. 3. XcepTemporal model. Faces are first extracted and normalized to a canonical representation. These canonical images are passed into a convolutional recurrent model to make a prediction. μ and σ² represent the mean and variance of the distribution in a 2-D space.
2) Variants: We propose four variants of the XcepTemporal model:
1) XcepTemporal (CE).
   a) XcepTemporal with cross-entropy loss L_CE.
   b) It follows the horizontal line path to the blue box labelled “Class Layer” in Fig. 3.
2) XcepTemporal (KL).
   a) XcepTemporal with KL divergence loss L_KL.
   b) It follows the dashed line path to the yellow box labelled “KL-Divergence” in Fig. 3.
3) XcepTemporal (EN).
   a) XcepTemporal with ensemble loss L_EN.
   b) This variant has two classes in the classification layer of the model (real and fake).
   c) It has a Y-shaped final layer where the outputs are a two-class classification layer and a sample in two-dimensional space.
4) XcepTemporal (EN1+n).
   a) XcepTemporal with ensemble loss L_EN.
   b) This variant has 1 + n_f classes in the classification layer of the model (one real and n fakes).
   c) It has a Y-shaped ultimate layer where the outputs are a 1 + n_f-class classification layer and a sample in two-dimensional space.
   d) This variant not only distinguishes the distribution of fakes from original faces but also determines the type of fake.
This dataset uses 32 subjects, each with 10 face swap videos of both low and high quality for a total of 620 fake videos.
FaceForensics++ [22] contains four different types of deepfakes, namely Face2Face [6], FaceSwap [5], Deepfakes, and NeuralTextures [14], alongside corresponding real videos. The dataset contains 1000 real source videos and 1000 of each of the four deepfake generation methods. Since the traditional vision and face swap methods used to create the UADFV and DeepfakeTIMIT datasets, respectively, are also in FaceForensics++, most works do not report on UADFV or DeepfakeTIMIT.
The FaceForensics++ dataset was updated on 23rd August 2019 to host the Deep Fake Detection (DFD) dataset from Google and JigSaw [22], which consists of 360 original source videos and 3000 manipulated videos.
Celeb-DF [69] was built from 400 original YouTube videos and 800 synthesized videos. Unlike the aforementioned datasets, these fakes are refined to address issues with color inconsistency, low-frequency smoothing, and temporal flickering. Additional refinements include synthesis of higher-resolution fakes (256 × 256), whereas the algorithms used in previous works synthesized low-resolution (64 × 64) fakes.
Deepfake Detection Challenge (DFDC) [73] is a preview version of a larger dataset from a challenge developed by Facebook, and it consists of 5000 videos (original and manipulated). There are two types of fakes available in the dataset, but the pedigree of each has not been released yet.
YouTube (YT) consists of 20 real and 20 fake videos that we collected from YouTube.com as a way to test accuracy on unseen samples. The 20 real videos all feature a single camera-facing subject, while the 20 fake videos are from the Ctrl Shift Face YouTube channel.3 We do not know which deepfake technique the creator of this channel uses, however we do know that he uses face-swapping as opposed to facial re-enactment methods.
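As a concrete illustration of the loss terms behind the XcepTemporal variants, the sketch below evaluates a KL term for a bivariate Gaussian against a standard normal and combines it with a cross-entropy term as in Eq. (4). It is a minimal sketch under stated assumptions, not the authors' implementation: the KL term uses the textbook closed form for a diagonal Gaussian, which may differ in detail from Eq. (2) as printed, and the function names and the 0.7/0.3 weights are hypothetical (the text only says that KL divergence typically receives the larger weight).

```python
import math


def kl_to_standard_normal(mu, var):
    """Textbook closed form of KL( N(mu, diag(var)) || N(0, I) ).

    mu and var are equal-length sequences (length 2 for the bivariate case).
    Assumption: this is the standard formula, not necessarily Eq. (2) verbatim.
    """
    return 0.5 * sum(v + m * m - math.log(v) - 1.0 for m, v in zip(mu, var))


def cross_entropy(probs, true_class):
    """Cross-entropy of one sample, given predicted class probabilities."""
    return -math.log(probs[true_class])


def ensemble_loss(l_kl, l_ce, lam1=0.7, lam2=0.3):
    """Convex combination of the two losses, following Eq. (4).

    The default weights are illustrative only.
    """
    if lam1 < 0 or lam2 < 0 or abs(lam1 + lam2 - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return lam1 * l_kl + lam2 * l_ce


# Hypothetical values for a single sample.
l_kl = kl_to_standard_normal(mu=[0.5, -0.2], var=[1.2, 0.8])
l_ce = cross_entropy(probs=[0.9, 0.1], true_class=0)
print(ensemble_loss(l_kl, l_ce))
```

The KL term is zero only when the estimated distribution already matches the standard normal, so it acts as a pull toward a compact cluster, while the cross-entropy term drives the class decision.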
TABLE I
WITHIN-DOMAIN ACCURACY: FRAME-LEVEL ACCURACY ON THE FACEFORENSICS++ [22] AND CELEB-DF [69] OFFICIAL
TEST SETS. ABBREVIATIONS: FACE2FACE (F2F), FACESWAP (FS), DEEPFAKE (DF), AND NEURALTEXTURES (NT)
use code provided by the authors for this purpose. The performance TABLE II
DATA SPLITS FOR OUR COMBINED DATASET, WHICH IS COMPOSED OF THE
of most of these models has been previously reported FACEFORENSICS++ AND CELEB-DF DATASETS
on the same datasets [22], [69]. For a fair comparison, however,
we tested them along with our variants of XcepTemporal on
identical train and test splits.
Prior work did not describe how to address making a detection
decision on an entire video instead of a single frame. As we are
interested in results on the entire video and to provide a fair set of
baselines for comparison, we propose a simple mechanism for results, we set the sequence length to eight frames with a stride
converting a frame-level model into a video-level model. We first of eight. All the variants of the model were trained end-to-end.
pass the frame-level results, which produce a higher value for
frames that are more likely to be fake, through a median filter with
D. Within-Domain Results
a window size of five, which helps to reduce false positives by
smoothing out the resulting output. We then take the maximum The most basic test for deepfake detection is to train and
output from these windows as the overall result for the video. test on the same datasets. We show accuracy based on eval-
While this can result in some false positives, it is important to uating one frame at a time in Table I, and show accuracy
detect a video as fake when even just a short portion of the video across the entire video in Table III. We find that XceptionNet
is fake. models are very effective almost uniformly across all samples.
We applied this procedure to all of the baseline frame-level In particular, the XceptionNet (KL) and XcepTemporal (KL)
models to produce video-level models, and we report the re- models achieve 100% accuracy for the entire FaceForensics++
sults for both frame-level and video-level models for multiple dataset for both frame-level and video-level detection, while
methods of comparison. XcepTemporal (CE) is close behind at 99.71% for frame-level
detection and 100% for video-level detection.
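The frame-to-video conversion described in this section (a median filter with a window of five over the frame-level outputs, followed by the maximum of the filtered values) can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `video_score`, the truncated windows at the start and end of the clip, and the 0.5 decision threshold are assumptions.

```python
from statistics import median


def video_score(frame_scores, window=5):
    """Median-filter per-frame fake scores, then return the maximum.

    Higher scores mean "more likely fake". Windows are truncated at the clip
    boundaries (an assumption; the text does not say how edges are handled).
    """
    if not frame_scores:
        raise ValueError("frame_scores must not be empty")
    half = window // 2
    n = len(frame_scores)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        smoothed.append(median(frame_scores[lo:hi]))
    return max(smoothed)


# A single-frame spike is smoothed away by the median filter ...
isolated_spike = [0.1, 0.1, 0.9, 0.1, 0.1]
# ... while a short run of consecutive fake frames survives it.
short_fake_run = [0.1, 0.1, 0.9, 0.9, 0.9, 0.1, 0.1]

THRESHOLD = 0.5  # hypothetical decision threshold
print(video_score(isolated_spike) > THRESHOLD)
print(video_score(short_fake_run) > THRESHOLD)
```

Taking the maximum after smoothing reflects the stated design goal: a video should be flagged even when only a short portion of it is manipulated, while isolated frame-level false positives are suppressed.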
Among prior works, only XceptionNet [32] and Capsule-
C. Training
Forensics [21] could achieve 97% or above, where XceptionNet
We trained our model and the baseline models on the com- particularly suffers from false positives at over 8% FPR in frame-
bination of the full FaceForensics++ [22] and Celeb-DF [69] level detection and over 11% FPR in video-level detection.
datasets. For the training, validation, and test splits, we used The two ensemble methods do not fare as well in Face-
the instructions provided along with the datasets. As Celeb-DF Forensics++ at about 97% accuracy on frames and 99% ac-
does not offer any validation split, we randomly chose 50 real curacy on full videos, but they reach new standards for accu-
and 134 fake videos from the training data to create a validation racy on the Celeb-DF dataset. XcepTemporal (EN1+n ) has the
split. The test sets for both datasets were left unaltered. The splits XcepTemporal (EN1+n ) has the
used are shown in Table II. XcepTemporal (CE) and XcepTemporal (KL), 99.16% on full
We set the learning rate to 1e-4 with a decay factor of 1e-5. videos.
The optimizer is Adam, with β1 set to 0.9 and β2 set to 0.999, In our experience with training the models, using the 1.4 M
these being the default values suggested by Kingma and Ba’s facial frames in the training set (see Table II), the XcepTemporal
original paper on Adam [74]. A dropout of 0.5 is added to the (CE) takes four epochs to converge on the training data, whereas
first fully-connected layer. Based upon hyperparameter tuning the KL variant takes only two epochs to converge. The models
TABLE III
WITHIN-DOMAIN RESULTS: VIDEO-LEVEL ACCURACY ON FACEFORENSICS++ [22] AND CELEB-DF [69] OFFICIAL
TEST SETS. ABBREVIATIONS: FACE2FACE (F2F), FACESWAP (FS), DEEPFAKE (DF), AND NEURALTEXTURES (NT)
with L_KL loss tend to converge faster than the cross-entropy loss L_CE because the latent distribution generated by the model is distinct between the original and the forged faces. By contrast, EN variants of the model take the longest to train (ten epochs) because the gradients from both the losses compete with each other. The convergence is slower in the early stages of the training, but once both the channels start learning, the rate of convergence nearly doubles.
The next sections will explore these models further to help understand how the architectural variations affect performance.
E. Cross-Domain Results
A critical property of deepfake detection models for real-world use is good inference performance on deepfake types not included in the training dataset. One way to measure this performance is to train on publicly available deepfake datasets from Table II, and test on the unforeseen cross-domain samples in the Deepfake Detection Challenge preview dataset [73]. As shown in Table IV, we use binary class models, such that we expect the unforeseen fakes, M-A and M-B to be both classified
F. Compression
A simple but effective way to bypass deepfake detection is to apply compression to the deepfake video. This section examines the performance of our models on two different types of compression techniques (JPEG and MPEG).
JPEG Compression. OpenCV [75] provides an option to compress individual frames with a reduction in quality as shown in Fig. 4. The quality factor is a measure of the compression rate of JPEG images, ranging between 0 and 100. High numbers between 90-100 represent minimal visual quality loss, while numbers less than 60 represent visually significant quality loss. Lower quality factor images take up less disk space. Table V shows the test accuracies for our models as well as for the DenseNetAligned and XceptionNet models, two of the best models in our results for uncompressed video. Our models perform the best on most settings in the Face2Face, Deepfakes, FaceSwap, and Celeb-DF datasets, while XceptionNet performs the best for JPEG compression on NeuralTextures fakes. We note a severe degradation in deepfake detection performance across all the models for highly compressed faces. The accuracies are significantly more affected by compression on re-enactment-type fakes (Face2Face and NeuralTextures) than the faceswap-type (Deepfakes and FaceSwap).
MPEG Compression. The models were additionally tested on the effects of video compression using videos saved at two different H.264 quantization factor levels, as previously explored by Agarwal et al. [27]. The quantization factor reflects the MPEG compression rate of the videos. The higher the quantization factor, the lower the quality and greater the compression. We selected a quantization value of 20 to represent high-quality
TABLE V
COMPRESSION: RESULTS FOR DIFFERENT RATES OF COMPRESSION. RESULTS
IN BOLD INDICATE THE BEST RESULTS ACROSS THE FOUR TESTED MODELS
FOR THE GIVEN SETTING. NOTE THAT HIGH JPEG QUALITY MEANS LESS
COMPRESSION, WHILE HIGH MPEG QUANTIZATION MEANS MORE
COMPRESSION
TABLE VI
COMPRESSION: MODELS TRAINED ON AUGMENTED TRAINING DATA
CONSISTING OF THE ORIGINAL AND JPEG COMPRESSED (QUALITY-50)
VERSIONS OF ALL THE TRAINING FACES
lengths of strides between two consecutive input sequences. Longer sequence lengths enable temporal features of longer duration but require more compute resources. The combined accuracy on both the test sets shown in Table II for different lengths and strides is shown in Table IX. While the overall accuracy does not change much, we chose a length and stride of eight because 1) shorter sample lengths translate into more training samples; and 2) shorter sample lengths are more capable at detecting short sequences of manipulated frames embedded between real frames.
V. PROPOSED SPOOF DETECTION ARCHITECTURE
None of the deepfake detection datasets include manipulated audio. The ASVSpoof2015 challenge [76], ASVSpoof2017 challenge [29], and ASVSpoof2019 challenge [63] have helped spur deepfake audio research, referred to in the research community as spoof detection. We introduce our spoof detection methodologies in isolation, while we anticipate the creation of multimodal deepfake datasets, where the visual and audio channels can mutually benefit one another for more advanced forgery detection.
A. Audio Model
CRNNSpoof: Inspired by our XcepTemporal model, our first model for spoofing detection is a convolution-recurrent neural network (Fig. 5 left). Since raw audio is used instead of spectral features as the input to the network, five 1-D convolution layers are used to learn useful representations. Additionally, the convolutions are strided to downsample the input signal from 16 kHz to 100 Hz, which reduces the memory footprint while also speeding up the training process. The extracted features are then passed to a bidirectional LSTM layer. The hidden state from the last LSTM timestep is used to perform prediction using two fully-connected layers. Dropout and batch normalization are used after each layer to perform regularization. The network is trained using the negative log-likelihood loss function. Due to the unbalanced nature of the dataset, a misclassification of a spoofed speech sample incurs heavier loss than misclassification of a real speech sample.
WideBlock: We introduce a WideBlock (Fig. 6) architecture, named for the high number of paths in each block. We use this block in a second fully-convolutional audio approach described in the next section. The architecture of the WideBlock, taking inspiration from ResNeXt blocks used in image classification [77], consists of several parallel streams, each consisting of bottleneck 1 × 1 convolution layers before and after a normal convolution layer. The bottleneck layers reduce the complexity of the model by reducing the number of parameters required by the middle
Authorized licensed use limited to: VIT University. Downloaded on February 13,2024 at 15:51:06 UTC from IEEE Xplore. Restrictions apply.
CHINTHA et al.: RECURRENT CONVOLUTIONAL STRUCTURES FOR AUDIO SPOOF AND VIDEO DEEPFAKE DETECTION 1033
TABLE VII
TRANSFER LEARNING/DOMAIN ADAPTATION: MODELS ARE INITIALIZED ON FACEFORENSICS++ [22] AND TRAINED ON CELEB-DF [69]
B. Audio Datasets
ASVSpoof 2019 Challenge [63] consists of two spoofing
scenarios: logical access and physical access. In the logical
access scenario, fake audio is created using speech-synthesis
or speech-conversion software. In the physical access scenario,
fake audio is created by replaying a pre-recorded audio using
a splice of the real speaker data. For this research, we use only
the logical access data, since it is easier and more dangerous to
create speeches from arbitrary text. There are 17 different types
of fakes in the dataset, with six designated as known attacks and
11 as unknown attacks. Only known attacks are present in the
train and development set, while all 17 are present in the test set.
Fig. 6. WideBlock consists of multiple paths, each with different kernel size
to capture different levels of temporal dependencies. A skip-connection is used
to aid with gradient flow. C. Baselines
We compare the results of our model with both baseline results
convolution operation. Instead of keeping the same filter size (Baseline 1 and Baseline 2) provided by the organizers of the
for all paths, we draw inspiration from Inception networks and ASVSpoof 2019 contest and state-of-the-art benchmark systems
employ filters with different sizes in each parallel path. We use that we refer to simply as Benchmarks A-E. Baseline 01 and
nine filter sizes, the widths of which are odd numbers between 3 Baseline 02 are GMM models trained on Linear Frequency Cep-
and 19. The different filter sizes allow the model to pick up both stral Coefficients (LFCC) and Constant-Q cepstral coefficients
short-term and long-term temporal context. The output from (CQCC) input features, respectively. Benchmark A is a CNN
each path is then summed before being added to the input of architecture proposed by Chettri et al. as a countermeasure for
each block, forming a skip connection. replay attacks [78]. Benchmarks B, C, and D are the same CNN
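The WideBlock described in Section V (Fig. 6) can be illustrated with a minimal sketch. It lists the nine odd kernel widths from 3 to 19, compares the weight count of bottlenecked paths against plain convolutions, and runs a toy single-channel forward pass that sums the parallel paths and adds the input back as a skip connection. The channel count, the bottleneck width, and all function names are assumptions made for illustration; the toy forward pass omits the 1 × 1 bottlenecks, biases, and nonlinearities and is not the authors' implementation.

```python
# Kernel widths stated in the text: odd numbers between 3 and 19 (nine paths).
KERNEL_WIDTHS = list(range(3, 20, 2))


def plain_conv_params(channels, width):
    """Weights of one 1-D convolution mapping `channels` to `channels`."""
    return channels * channels * width


def bottleneck_path_params(channels, bottleneck, width):
    """Weights of one path: 1x1 reduce, width-k convolution, 1x1 expand."""
    reduce = channels * bottleneck            # 1x1 conv: channels -> bottleneck
    middle = bottleneck * bottleneck * width  # k-wide conv on the reduced channels
    expand = bottleneck * channels            # 1x1 conv: bottleneck -> channels
    return reduce + middle + expand


def wideblock_params(channels, bottleneck):
    """Total weights over all nine bottlenecked paths (biases ignored)."""
    return sum(bottleneck_path_params(channels, bottleneck, k) for k in KERNEL_WIDTHS)


def unbottlenecked_params(channels):
    """Total weights if every path were a plain full-width convolution."""
    return sum(plain_conv_params(channels, k) for k in KERNEL_WIDTHS)


def conv1d_same(signal, kernel):
    """Zero-padded 1-D cross-correlation that preserves the input length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def wideblock_forward(signal, kernels):
    """Sum the outputs of all parallel paths, then add the input (skip connection)."""
    outputs = [conv1d_same(signal, kernel) for kernel in kernels]
    summed = [sum(values) for values in zip(*outputs)]
    return [x + s for x, s in zip(signal, summed)]


# Hypothetical sizes: 64 channels, bottleneck of 8.
print(wideblock_params(64, 8), unbottlenecked_params(64))
```

With these assumed sizes the bottlenecked block needs far fewer weights than nine plain convolutions, which is the effect the text attributes to the 1 × 1 layers.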
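The class-weighted loss mentioned in the CRNNSpoof description of Section V, where a misclassified spoofed sample costs more than a misclassified real sample, can be sketched as follows. The label convention (0 = real, 1 = spoof), the weight values, and the averaging over the batch size are assumptions for illustration only; the text does not state the weights that were actually used.

```python
import math


def weighted_nll(log_probs, targets, class_weights):
    """Mean over the batch of -w[target] * log p(target).

    log_probs: list of per-class log-probabilities, one row per sample.
    targets: list of integer class labels (assumed 0 = real, 1 = spoof).
    class_weights: per-class weights; a larger weight means a heavier penalty.
    """
    total = 0.0
    for row, target in zip(log_probs, targets):
        total += -class_weights[target] * row[target]
    return total / len(targets)


# Hypothetical weights: errors on spoofed samples count three times as much.
WEIGHTS = [1.0, 3.0]

# The model assigns probability 0.1 to the true class in both cases.
wrong = [math.log(0.9), math.log(0.1)]      # true class is 1 (spoof)
wrong_real = [math.log(0.1), math.log(0.9)]  # true class is 0 (real)

print(weighted_nll([wrong], [1], WEIGHTS))       # misclassified spoof
print(weighted_nll([wrong_real], [0], WEIGHTS))  # misclassified real
```

Under these assumed weights the same confidence error is penalized three times more when the true label is spoof.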
TABLE X
RESULTS OF PROPOSED COUNTERMEASURES WITH OTHER BENCHMARKS AND BASELINE METHODS
TABLE XI
ABLATION: ACCURACY OF CRNNSPOOF WITH DIFFERENT RECURRENT STRUCTURES
TABLE XII
COMPRESSION: RESULTS FOR DIFFERENT RATES OF AUDIO COMPRESSION. RESULTS IN BOLD INDICATE THE BEST RESULTS ACROSS THE FOUR TESTED MODELS FOR THE GIVEN SETTING. NOTE THAT LOWER BITRATE MEANS MORE COMPRESSION
D. Audio Results
For consistency with prior work, we evaluated the spoof-detection models using metrics provided by the organizers of the ASVSpoof 2019 challenge: tandem detection cost function (t-DCF) and equal error rate (EER). In biometric security systems, EER is used to determine the threshold at which the false positive rate (FPR) equals the false negative rate (FNR). At that threshold, FPR = FNR = EER. t-DCF is a metric proposed by the organizers of the ASVSpoof2019 challenge as a means to evaluate the performance of a spoofing detection method in conjunction with a given automatic speaker verification (ASV) method. The t-DCF is computed as:

$$\text{t-DCF}^{\min}_{\text{norm}} = \min_{s} \left\{ \beta P^{\text{cm}}_{\text{miss}}(s) + P^{\text{cm}}_{\text{fa}}(s) \right\}, \tag{5}$$

where $P^{\text{cm}}_{\text{miss}}$ and $P^{\text{cm}}_{\text{fa}}$ correspond to the FNR and FPR of the countermeasure system, respectively. β represents the performance of the ASV system. For the ASVSpoof 2019 challenge, the organizers provide the ASV scores, so β is a fixed value. β is inversely proportional to the false acceptance rate of the ASV system for a specific attack.

While both of our models fail to perform better than the given baseline and benchmark methods on the development set, the CRNNSpoof model achieves significantly better results, both in t-DCF and EER, compared to other models on the evaluation set. The CRNNSpoof model outperforms the best baseline model (Baseline 01) by 47% in terms of EER and 37% in terms of t-DCF. Meanwhile, the WIRENetSpoof model achieves better EER than other models, except for the CRNNSpoof model, on the evaluation set. However, the t-DCF score of the WIRENetSpoof model is only better than that of Benchmark C.

In addition to the results from Table X, we compare our methods to the results published at Interspeech 2019 [63]. There is no obvious way to learn what model is evaluated for each entry on the Interspeech leaderboard because Interspeech does not publish the authors or the paper titles. However, our CRNNSpoof model achieves the 9th best result in terms of t-DCF, the primary metric of the challenge, and 6th best in terms of EER when evaluated on the evaluation set. It is notable that all models that perform better in terms of t-DCF, as well as EER, use an ensemble of classifiers. When only single classifiers are considered, the proposed CRNNSpoof model achieves the best result in both t-DCF and EER. The next best non-ensemble model from the Interspeech 2019 challenge achieves t-DCF of 0.1404 and EER of 5.74%.

Recurrent layers. Table XI shows results from an ablation study on our CRNNSpoof model. Similar to our visual XcepTemporal models, we explore the usage of single and double recurrent layers, as well as uni- and bi-directional LSTM. The results suggest that the model with a single unidirectional LSTM layer performs best. We believe the other models are overfitting, which we could solve by reducing the number of parameters (filters or fully connected layers), increasing regularization, or increasing the training set size.

Compression. To test the robustness of our models concerning the quality of the audio, we perform audio compression. Similar to video deepfakes, the greater the audio compression, the more difficult it is to distinguish real audio from a spoof. The reduction in the audio quality is done by decreasing the bitrate of the signals via the LAME⁴ encoder using lossy compression schemes. Table XII shows the effect of compression on our models when trained on the original audio samples, while Table XIII shows a similar comparison when trained with original audio augmented with compressed audio data. We compare two versions of our

⁴ [Online]. Available: http://www.harmjschoonhoven.com/mp3-quality.html
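The two metrics defined in the Audio Results section, EER and the minimum normalized t-DCF of Eq. (5), can be illustrated with a minimal sketch over countermeasure scores. It assumes that higher scores mean "more likely bona fide" and that a trial is accepted when its score is at least the threshold s; the toy scores, the value of β, and the function names are hypothetical. In the challenge, β is derived from the ASV system and the cost model, which this sketch does not reproduce.

```python
def error_rates(bonafide, spoof, threshold):
    """Return (P_miss, P_fa) of the countermeasure for one threshold s.

    Assumed convention: a trial is accepted as bona fide when score >= threshold.
    """
    p_miss = sum(score < threshold for score in bonafide) / len(bonafide)
    p_fa = sum(score >= threshold for score in spoof) / len(spoof)
    return p_miss, p_fa


def operating_points(bonafide, spoof):
    """(P_miss, P_fa) at every distinct score, plus one threshold above all scores."""
    scores = sorted(set(bonafide) | set(spoof))
    thresholds = scores + [scores[-1] + 1.0]
    return [error_rates(bonafide, spoof, t) for t in thresholds]


def equal_error_rate(bonafide, spoof):
    """Approximate EER: the operating point where P_miss and P_fa are closest."""
    p_miss, p_fa = min(
        operating_points(bonafide, spoof), key=lambda point: abs(point[0] - point[1])
    )
    return (p_miss + p_fa) / 2


def min_tdcf(bonafide, spoof, beta):
    """Eq. (5): minimum over thresholds of beta * P_miss + P_fa."""
    return min(
        beta * p_miss + p_fa for p_miss, p_fa in operating_points(bonafide, spoof)
    )


# Toy scores for illustration only; they are not taken from the paper.
bonafide_scores = [0.9, 0.8, 0.7, 0.3]
spoof_scores = [0.6, 0.4, 0.2, 0.1]

print(equal_error_rate(bonafide_scores, spoof_scores))
print(min_tdcf(bonafide_scores, spoof_scores, beta=2.0))
```

Because β multiplies only the miss rate, a larger β moves the optimal threshold toward accepting more bona fide trials, so the threshold that minimizes t-DCF need not coincide with the EER threshold.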
[28] J. Lorenzo-Trueba et al., “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” 2018, arXiv:1804.04262.
[29] T. Kinnunen et al., “The asvspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” in Proc. Interspeech, 2017, pp. 2–6. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2017-1111
[30] F. Tom, M. Jain, and P. Dey, “End-to-end audio replay attack detection using deep convolutional networks with attention,” in Proc. Interspeech, 2018, pp. 681–685.
[31] P. Korshunov and S. Marcel, “A cross-database study of voice presentation attack detection,” in Handbook of Biometric Anti-Spoofing. Springer, 2019, pp. 363–389.
[32] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2017, pp. 1800–1807.
[33] S. J. Sohrawardi et al., “Poster: Towards robust open-world detection of deepfakes,” in Proc. 2019 ACM SIGSAC Conf. Comput. Commun. Secur., 2019, pp. 2613–2615.
[34] M. Kowalski, “faceswap,” https://github.com/MarekKowalski/FaceSwap.
[35] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1060–1069.
[36] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” 2017, arXiv:1703.10593v6.
[37] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, “Learning to discover cross-domain relations with generative adversarial networks,” 2017, arXiv:1703.05192v2.
[38] Y. Lu, Y.-W. Tai, and C.-K. Tang, “Attribute-guided face generation using conditional cyclegan,” 2018, arXiv:1705.09966v2.
[39] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Niessner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt, “Deep video portraits,” ACM Trans. Graph., 2018, pp. 1–14.
[40] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman, “Synthesizing obama: Learning lip sync from audio,” ACM Trans. Graph., 2017, pp. 1–13.
[41] F. Matern, C. Riess, and M. Stamminger, “Exploiting visual artifacts to expose deepfakes and face manipulations,” in Proc. IEEE Winter Appl. Comput. Vision Workshops, 2019, pp. 83–92.
[42] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-encoders,” in Int. Conf. Artif. Neural Netw. Springer, 2011, pp. 44–51.
[43] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 3856–3866.
[44] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2009, pp. 248–255.
[45] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015. [Online]. Available: http://arxiv.org/abs/1512.03385
[46] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2016, pp. 2818–2826.
[47] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan, “Recurrent convolutional strategies for face manipulation detection in videos,” Interfaces (GUI), vol. 3, no. 1, 2019.
[48] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2017, pp. 4700–4708.
[49] Q. Wu, Z. Wang, F. Deng, and D. D. Feng, “Realistic human action recognition with audio context,” in Proc. IEEE Int. Conf. Digit. Image Comput.: Techn. Appl., 2010, pp. 288–293.
[50] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2015, pp. 815–823.
[51] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “Vggface2: A dataset for recognising faces across pose and age,” in Proc. IEEE 13th Int. Conf. Autom. Face Gesture Recognit. (FG 2018), 2018, pp. 67–74.
[52] A. van den Oord et al., “WaveNet: A generative model for raw audio,” CoRR, vol. abs/1609.03499, 2016. [Online]. Available: http://arxiv.org/abs/1609.03499
[53] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice cloning with a few samples,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 10019–10029.
[54] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks,” in Proc. IEEE Spoken Lang. Technol. Workshop, 2018, pp. 266–273.
[55] D. Griffin and J. Lim, “Signal estimation from modified short-time fourier transform,” IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236–243, 1984.
[56] K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, “ATTS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2019, pp. 6805–6809.
[57] M. Todisco, H. Delgado, and N. W. Evans, “A new feature for automatic speaker verification anti-spoofing: Constant Q cepstral coefficients,” in Odyssey, vol. 45, 2016, pp. 283–290.
[58] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digit. Signal Process., vol. 10, no. 1–3, pp. 19–41, 2000.
[59] F. K. Soong, A. E. Rosenberg, B.-H. Juang, and L. R. Rabiner, “Report: A vector quantization approach to speaker recognition,” AT&T Tech. J., vol. 66, no. 2, pp. 14–26, 1987.
[60] D. Matrouf, N. Scheffer, B. Fauve, and J.-F. Bonastre, “A straightforward and efficient implementation of the factor analysis model for speaker verification,” in Proc. 8th Annu. Conf. Int. Speech Commun. Assoc., 2007, pp. 1242–1245.
[61] B. BT, K. W. E. Lin, S. Lui, J.-M. Chen, and D. Herremans, “Towards robust audio spoofing detection: A detailed comparison of traditional and learned features,” 2019, arXiv:1905.12439.
[62] B. Chettri, D. Stoller, V. Morfi, M. A. M. Ramírez, E. Benetos, and B. L. Sturm, “Ensemble models for spoofing detection in automatic speaker verification,” 2019, arXiv:1904.04589.
[63] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, “Asvspoof 2019: Future horizons in spoofed and fake audio detection,” 2019, arXiv:1904.05441.
[64] M. Alzantot, Z. Wang, and M. B. Srivastava, “Deep residual neural networks for audio spoofing detection,” in Proc. Conf. Int. Speech Commun. Assoc., 2019, pp. 1078–1082.
[65] R. Das, J. Yang, and H. Li, “Long range acoustic and deep features perspective on ASVspoof 2019,” in Proc. IEEE Autom. Speech Recognit. Understand. Workshop, 2019, pp. 1018–1025.
[66] D. E. King, “Dlib-ml: A machine learning toolkit,” J. Mach. Learn. Res., vol. 10, pp. 1755–1758, Jul. 2009.
[67] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” 2013, arXiv:1312.6114.
[68] G. Allaire, F. Jouve, and A.-M. Toader, “Structural optimization using sensitivity analysis and a level-set method,” J. Comput. Phys., vol. 194, no. 1, pp. 363–393, 2004.
[69] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu, “Celeb-df: A new dataset for deepfake forensics,” 2019, arXiv:1909.12962.
[70] X. Yang, Y. Li, and S. Lyu, “Exposing deep fakes using inconsistent head poses,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2019, pp. 8261–8265.
[71] P. Korshunov and S. Marcel, “Deepfakes: a new threat to face recognition? assessment and detection,” 2018, arXiv:1812.08685.
[72] C. Sanderson and B. C. Lovell, “Multi-region probabilistic histograms for robust and scalable identity inference,” in Proc. Int. Conf. Biometrics, 2009, pp. 199–208.
[73] B. Dolhansky, R. Howes, B. Pflaum, N. Baram, and C. C. Ferrer, “The deepfake detection challenge (DFDC) preview dataset,” 2019, arXiv:1910.08854.
[74] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, arXiv:1412.6980.
[75] G. Bradski, “The OpenCV Library,” Dr. Dobb’s Journal of Software Tools, 2000.
[76] Z. Wu et al., “ASVspoof 2015: The first automatic speaker verification spoofing and countermeasures challenge,” in Proc. 16th Annu. Conf. Int. Speech Commun. Assoc., 2015.
[77] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proc. IEEE Conf. Comput. Vision Pattern Recognition, 2017, pp. 1492–1500.
[78] B. Chettri, S. Mishra, B. L. Sturm, and E. Benetos, “Analysing the predictions of a CNN-based replay spoofing detection system,” in Proc. IEEE Spoken Lang. Technol. Workshop, 2018, pp. 92–97.
Akash Chintha received the B.E. (Hons.) degree in electrical and electronics engineering from the Birla Institute of Technology and Science - Pilani, Pilani, India, in 2018. He is currently working toward the M.Sc. degree in computer engineering with Rochester Institute of Technology, Rochester, USA. He is currently a Graduate Research Assistant at the Machine Intelligence Laboratory at RIT. His current research interests include deep learning, data science, computer vision, and natural language processing.

Bao Thai is working towards the B.S./M.S. degree in computer engineering at Rochester Institute of Technology, Rochester, USA. He is currently a Graduate Research Assistant at the Machine Intelligence Laboratory at RIT. His current research interests include machine learning, deep learning, natural language processing, audio processing, and speech recognition.

Saniat Javid Sohrawardi received the B.Sc. degree in computer science and engineering from North South University, Dhaka, Bangladesh, in August, 2014. He is currently pursuing the Ph.D. degree in computing and information sciences with RIT and is a graduate research assistant at the Center for Cybersecurity. His current research areas include usable security, information privacy, deep learning and media forensics.

Kartavya Bhatt received the B.Tech. degree in information and communication technology from Ahmedabad University, Gujarat, India, in May 2019. He is currently pursuing the M.S. degree in computer science with the Rochester Institute of Technology, Rochester, USA. Since October 2019, he has been working as a Graduate Research Assistant at the Center for Cybersecurity at RIT, USA. His current research areas are deep learning, image processing, saliency, computer vision and machine learning.

Matthew Wright received the B.S. degree in computer science from Harvey Mudd College, and the M.S. and Ph.D. degrees from the Department of Computer Science at the University of Massachusetts, in 2002 and 2005, respectively. His dissertation work examined attacks and defenses of systems that provide anonymity online. He is the Director of Research for the Global Cybersecurity Institute (GCI) at Rochester Institute of Technology (RIT) and a Professor of Computing Security. His other interests include adversarial machine learning and understanding the human element of security. He has been the lead investigator on over $3.7 million in funded projects, including an NSF CAREER award, and he has published 100 peer-reviewed papers, including numerous contributions in the most prestigious venues focused on computer security and privacy.

Andrea Hickerson received the B.A. degree in journalism and international relations from Syracuse University, the M.A. degree in journalism and the M.A. degree in middle eastern studies from the University of Texas at Austin, and the Ph.D. degree in communication from the University of Washington in 2009. She is a Professor, Director of the School of Journalism and Mass Communications, and an Associate Dean in the College of Information and Communications at the University of South Carolina, USA. She conducts research on journalism routines with an emphasis on the use of technology and political communication, particularly in transnational and immigrant communities.

Raymond Ptucha received the B.S. degree in computer science and the B.S. degree in electrical engineering from SUNY/Buffalo, the M.S. degree in image science from RIT, and the Ph.D. degree in computer science from the Rochester Institute of Technology (RIT), USA, in 2013. He is an Associate Professor in Computer Engineering and Director of the Machine Intelligence Laboratory at RIT. His research includes machine learning, computer vision, and robotics, with a specialization in deep learning. Ray was a research scientist with Eastman Kodak Company where he worked on computational imaging algorithms and was awarded 32 U.S. patents. He is a passionate supporter of STEM education, is an NVIDIA certified Deep Learning Institute Instructor, and Chair of the Rochester area IEEE Signal Processing Society.