Professional Documents
Culture Documents
1 Introduction
With present-day rapid evolution of digital technology, the use of digital devices
including smart phones, notebooks and digital cameras, has increased massively
with each passing day. Today, every common man’s day-to-day life encompasses
exchange and sharing of digital media in large volumes, especially digital images
and videos. Such images and videos, in many cases, serve as the primary sources
of legal evidence towards any event or crime, in court rooms. For example, a
CCTV footage, produced as major evidence related to a contentious scene in
the court of law. However, with the advent of low-cost easy-to-use image and
video processing software and desktop tools, manipulation to these forms of
multimedia has become an extremely easy task for even a layman. Hence, it
has become highly crucial to authenticate and the prove the trust-worthiness of
digital multimedia, before being presented/accepted as a court room evidence.
In this paper, we focus on forensic investigation of digital videos. In this
work, our main target is the detection of inter-frame video forgeries, whereby,
an attacker intelligently introduces/eliminates significant object(s) to/from a
video footage, by inserting, deleting or duplicating selected frames of the video
maliciously. Subsequently, the attacker produces the tampered video in the court-
of-law as an evidence, or shares it over online social sites to spread a fake news.
In this work, we deal with the problem of detecting this form of inter-frame
forgeries in videos. Specifically, in this work, we aim to detect frame insertion,
deletion and duplication types of inter-frame forgeries in digital videos. (The
above types of forgeries have been depicted in a schematic diagram in Fig. 1).
Recently, a number of researchers have come up with efficient techniques
to detect inter-frame video forgeries based on the followings: (A) Compression
artifacts, produced during the video encoding and decoding processes [1–4], (B)
Scene dependency, that is to rely on the visual contents of each frame in a
video [5–7]. However, all such techniques are unable to perform efficiently, for
videos with rapid scene changes, or when the number of forged frames in a video
is an integral multiple of its Group of Pictures (GOP) length [8].
In the recent years, research focus has largely moved to the use of Deep
Learning in various application domains, including object recognition [10,11],
classification [12,13] and action recognition [14,15]. Recently, forensic researchers
have also started to explore Deep Learning technique for multimedia forensic
applications. For example, the work by Long et al. in 2017 [16], where the authors
have succeeded to detect frame deletion type of inter-frame forgery in a single
video shot1
In the proposed work, we use Deep Learning technique based on 3D Convolu-
tional Neural Network (CNN) [17], to detect frame insertion, deletion and dupli-
cation types of inter-frame video forgeries in a single video shot. We have demon-
strated the tamper detection efficiency of the proposed method through extensive
experimentation on a wide range of test videos, irrespective of their compression
quality factor. Majority of the state-of-the-art video forensic approaches, such
as [2,4,18,19], are limited by fixed length GOP, and the number of frame dele-
tions, being a multiple of GOP length. We aim to overcome this limitation in the
proposed scheme. Specifically, the proposed method is completely independent
of GOP length, and of its relation to the video compression quality.
The rest of this paper is organized as follows. In Sect. 2, we present an
overview of the existing related work. The proposed deep learning technique
using 3D CNN for inter-frame video forgery detection has been described in
detail in Sect. 3. In Sect. 4, we present the experimental results followed by com-
parison with state–of–the–art techniques. Finally, we conclude the paper with
future research directions in Sect. 5.
2 Related Work
In the past decade, many significant research advances have been made in
the domain of digital multimedia forensics. These include a number of notable
researches towards video forgery and image forgery detection, such as the works
in [1,2,4,7,16,20]. Such works have largely focussed on frame deletion, insertion
and duplication types of video forgery detection. In this section, we present a
brief review of related researches towards detection of the above types of video
forgeries.
According to the existing literature, there exist broadly three approaches to
detect inter-frame video forgery. They are, scene dependency based, compres-
sion artefacts exploitation based, and Deep Learning based approaches. Scene
dependency based video forgery detection techniques [5–7] use the pixel value
of each frame in a video. Compression artefacts based video forgery detection
techniques [1–4] exploit different types of compression artifacts, produced during
the video encoding and decoding processes. Deep Learning based video forgery
detection techniques [16,21], operate by identifying and learning suitable features
from the training samples, automatically. Forgery detection accuracy depends
on the suitability of the identified features, to the specific problem.
1
A video shot is a sequence of frames, which are captured over an uninterrupted
period of time, by a single video recording device.
A Digital Forensic Technique for Inter–Frame Video Forgery Detection 307
Next, we present briefly review of related literature in all three above direc-
tions.
Recently, Zhao et al. [22] proposed a two step verification method to detect frame
insertion, deletion and duplication forgeries in a single shot video. At the first
step, the outliers are detected based on Pearson correlation [23] distribution over
Hue-Saturation (H-S) and Saturation-Value (S-V) color histograms, for every
frame. At the second step, Speed Up Robust Features (SURF) are extracted
from every outlier frames, and then matching with doubtful SURF coded frames
using Fast Library for Approximation Nearest Neighbor (FLANN) algorithm.
If no matching features are found between two adjacent frames at falling peak
locations, then frames at those locations are detected to be forged (by frame
insertion or deletion operation). And when SURF features of two frames exactly
match, those two frames are detected as duplicates of each other.
To address the issue of inter-frame video forgery detection, Aghamaleki et al. [24]
extracted the DCT coefficients of all I-frames, and used the first significant digit
distribution of the extracted DCT coefficients, to identify single compressed and
double compressed videos in spatial domain. However, double compression detec-
tion does not always imply the existence of manipulation in a video. Therefore,
a second module was proposed by Aghamaleki et al. [24] to eliminate such false
positives. This second module was proposed to detect inter-frame forgeries, based
on time domain analysis of residual errors of P-frames. Here, the authors employ
quantization traces in the time domain, to find the forgery locations. In the third
module, output of the first module (double compression detection) and output of
the second module (inter-frame forgery detection), are fed into a decision fusion
box, to classify the tested video into three categories: single compressed videos,
forged double compressed videos, and un-forged double compressed videos.
Another recent work for frame insertion, deletion and duplication detection
in videos, is proposed by Kingra et al. [25]. Here, the authors have proposed a
hybrid model by exploiting optical flow gradient and prediction residual error of
P-frames. The authors performed their experiments on CCTV footage videos.
This technique performs well for complex scene also. But, the performance of
this technique decreases for high illumination videos.
Noteworthy among the works, in which the authors explored deep learning to
detect video forgeries, are the works of [16] and [21]. Long et al. [16] use a
3D CNN for frame deletion detection in a single video shot. At a time, the
authors fed 16 consecutive frames, which are extracted from a video sequence,
308 J. Bakas and R. Naskar
as an input to the CNN. This CNN finally detects whether or not there has
been frame deletion operation between the 8th and 9th frames (out of 16). This
scheme produced high false alarm rate, especially when the video is captured
with camera motion, zoom in or zoom out. To reduce the rate of false alarms,
the authors used a post processing method, known as confidence score, on the
output of the proposed CNN.
In [21], D’Avino et al. proposed an Autoencoder with recurrent neural network
to detect chroma–key composition type of video forgery. In this work, the authors
deal with investigation of video splicing, which is to copy an object from a green
background image, and paste it into a target frame, to manipulate a video. To
detect this kind of forgery in videos, the authors divided the frames into patches
of size 128×128. From every patch, some handcrafted features using a single high-
pass third-order derivative filter, were extracted. The authors use autoencoders
to produce an anomaly score. The handcrafted features of authentic frames are
used to learn the parameters for the autoencoder. In the testing phase, the
feature vector generated from the forged frames, do not fit with the intrinsic
network model (which was trained with authentic frames), and hence produce
large error. The anomaly score is generated based on this error, which is used to
produce a “heat” map for detection of forged frames and their locations.
where ∗ is the convolution operator, Ijl is the j th output map at the lth layer.
l−1
wij (weight) is the trainable convolutional kernel connecting the ith output
feature map at l − 1th layer, and j th output feature map at lth layer. blj is the
trainable bias parameter for the j th output feature map at the lth layer.
After performing convolution operation, an activation function (such as tanh,
sigmoid, ReLUs etc.) is used on each element of the feature map to achieve
non-linearity. Without activation function, a neural network acts as a Linear
Regression Model, with less power and performance [27]. In this paper, we used
Rectified Linear Units (ReLUs) activation function, due to its fast convergence
characteristics [26]. The ReLUs activation function is defined as follows in Eq. 2:
l l
R(Im,n ) = max(0, Im,n ) (3)
l
where Im,n denotes the input patch at location (m, n) in layer l.
310 J. Bakas and R. Naskar
All obtained output feature maps, from the convolution layer, can be used for
classification. However, this would lead to high computation cost and would make
the system highly prone to over-fitting [26] errors. So, to reduce the computation
cost and solve the over-fitting problem, a pooling operation is performed. Pooling
operation is an aggregation operation, where max (/min/mean) of a particular
feature, over a certain region (spatial along with temporal in case of videos) is
considered, as a representative of that feature. In this paper, we use the max
pooling layer, which propagates the maximum value within a certain spatio-
temporal region of a video clip, to the next layer of the CNN.
The last layer of the CNN is a classification layer, which consists of fully con-
nected layers (denoted by ‘FC1’, ‘FC2’ in Fig. 2), dropout layer and softmax
layer. The learned features, which are extracted through convolution layers, pass
thorough fully connected layers, followed by dropout layers and finally fed into
the softmax classifier, the top most layer of CNN. The dropout layers [28] are
used to reduce over-fitting by dropping some neurons randomly during training
phase. Here, softmax layer uses an optimizer algorithm, to update the weight
parameters of the training model. We use the Adam optimizer [29] to train our
model.
In the next subsection, we elaborate on the parameters of the proposed 3D-
CNN for inter-frame video forgery detection.
A Digital Forensic Technique for Inter–Frame Video Forgery Detection 311
For ease of understanding, in the rest of this paper, we shall denote a video clip
with a size of c × f × h × w, where c denotes the number of color channels used
to capture the video, f is the number of frames in the video clip, h is the height
and w is the width of each frame in terms of pixels. We also denote the kernel
size by d × k × k, where d is the temporal depth of the kernel and k is its spatial
size, for 3D convolution and pooling layers.
with 1664 CUDA cores. In this section, we discuss about our experimental data
set, followed by our experimental results, related discussion and comparison with
state-of-the-art.
We have used the UCF101 [30] dataset, which was originally created for human
action recognition. The dataset consists of 13320 videos of different time lengths.
Out of 13320 videos, we select 9000 video for our experiments, randomly. For
the sake of experimentation, we generated our test forged video sequences, from
the UCF101 dataset, using the ffmpeg [31] tool. We manually induced different
types of inter-frame forgeries into the video sequences, viz., frame insertion,
deletion and duplication forgeries. The forged frame sequences were compressed
in MP4 format, with H.264 video coding standard. To encode the videos, we
used chroma sampling 4:2:2, GOP length 6, and Constant Rate Factor (CRF)
0, 8, 16, 20 and 24.
Hence, for our experiments, we generated three more sets of videos from the
original dataset, viz., frame inserted, frame deleted, and frame duplicated video
datasets. Our test dataset consists of static and as well as dynamic background
videos, with single or multiple foreground objects. Here, we have worked only
with single video shots.
In our experiments, we have divided the 9000 authentic videos into three
parts: training dataset, validation dataset and test dataset. Training dataset
consist of 5000 authentic videos and their corresponding forged videos, used for
training the network. Validation dataset contain 2000 authentic videos and their
corresponding forged videos, used to validate the network. And test dataset con-
sists of remaining 2000 authentic videos and their corresponding forged videos,
used to test the efficiency of the trained network.
The output of the CNN is a binary probability matrix, used for mapping, to
identify the authentic and forged videos. In our experiments, we measure the
performance of the proposed model in terms of accuracy, which are defined as
follows:
TP + TN
Accuracy = × 100%
TP + FP + TN + FN
where T P represents the number of True Positives or the number of forged videos
correctly detected to be forged, F P represents the number of False Positives or
the number authentic videos detected as forged, T N represents the number of
True Negatives or the number of authentic videos correctly detected as authentic,
and F N represents the number of False Negatives, that is, the number of forged
videos falsely detected as authentic.
A Digital Forensic Technique for Inter–Frame Video Forgery Detection 313
Table 1. Performance evaluation of the proposed model for frame insertion forgery
detection.
It is evident from Table 1, that the proposed method performs efficiently for
low compressed (CRF = 0), medium compressed (CRF = 8, 16) as well as highly
compressed (CRF = 24) videos. The maximum accuracy, we achieved is 99.35%,
for video quality factor CRF = 0, and number of forged frames 30.
Table 2. Performance evaluation of the proposed model for frame deletion forgery
detection.
duplicated forms, to train our CNN model. To test the proposed model, we
use 2000 authentic and corresponding 2000 forged video, generated by frame
duplication operation. We carried our experiments with light compressed videos
(CRF = 0), medium compressed videos (CRF = 8, 16) and heavy compressed
videos (CRF = 24).
We present the frame duplication detection results in Table 3, for CRF = 0,
8, 16, 24. Here, we performed our experiments videos having 10 and 20 forged
frames. This is due to the fact that, we used 49 frames at a time to train the
network, due to memory constraints. And when we tried to take >= 30 forged
frames, the minimum number of input frame requirement, to train the network,
was 30 × 2 = 60, which exceeded 49.
Table 3. Performance evaluation of the proposed model for frame duplication forgery
detection.
It can be noted from Table 3, that the accuracy of the proposed model for
frame duplication detection varies between 97.86% to 98.4%. Also, it can be
observed that we obtain satisfactory performance results for low, medium, as
well as highly compressed videos.
The scheme proposed by Kingra et al. [25], uses optical flow gradient and
prediction residual error of In the other scheme proposed by Long et al. [16], the
authors have used a C3D CNN with a confidence score, which is a post-processing
method, to detect inter-frame forgery in videos. However, this scheme is capable
to detect only frame deletion type of forgery in videos, whereas the proposed
model detects frame insertion, deletion and duplication forgeries.
The results of this comparison are presented in Table 4. The average accuracy
of the proposed method is 97%, and is significantly higher than Kingra et al.’s
scheme. Long et al.’s scheme follows our accuracy closely; however it is to be
noted here that their accuracy is only for frame deletion detection; whereas the
results of the proposed model is the average of frame insertion, deletion and
duplication types of forgery detection.
References
1. Yu, L., et al.: Exposing frame deletion by detecting abrupt changes in video
streams. Neurocomputing 205, 84–91 (2016)
2. Shanableh, T.: Detection of frame deletion for digital video forensics. Digit. Inves-
tig. 10(4), 350–360 (2013). https://doi.org/10.1016/j.diin.2013.10.004
3. Su, Y., Zhang, J., Liu, J.: Exposing digital video forgery by detecting motion-
compensated edge artifact. In: International Conference on Computational Intelli-
gence and Software Engineering, pp. 1–4, December 2009. https://doi.org/10.1109/
CISE.2009.5366884
4. Aghamaleki, J.A., Behrad, A.: Inter-frame video forgery detection and localization
using intrinsic effects of double compression on quantization errors of video coding.
Sig. Process.: Image Commun. 47, 289–302 (2016)
5. Zhang, Z., Hou, J., Ma, Q., Li, Z.: Efficient video frame insertion and deletion
detection based on inconsistency of correlations between local binary pattern coded
frames. Secur. Commun. Netw. 8(2), 311–320 (2015)
6. Li, Z., Zhang, Z., Guo, S., Wang, J.: Video inter-frame forgery identification based
on the consistency of quotient of MSSIM. Secur. Commun. Netw. 9(17), 4548–4556
(2016)
7. Liu, Y., Huang, T.: Exposing video inter-frame forgery by Zernike opponent chro-
maticity moments and coarseness analysis. Multimed. Syst. 23(2), 223–238 (2017)
8. Sitara, K., Mehtre, B.: Digital video tampering detection: an overview of passive
techniques. Digit. Investig. 18(Supplement C), 8–22 (2016). https://doi.org/10.
1016/j.diin.2016.06.003
9. Bakas, J., Naskar, R., Dixit, R.: Detection and localization of inter-frame video
forgeries based on inconsistency in correlation distribution between Haralick
coded frames. Multimed. Tools Appl. 2018, 1–31 (2018). https://doi.org/10.1007/
s11042-018-6570-8
10. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: Overfeat:
integrated recognition, localization and detection using convolutional networks.
arXiv preprint arXiv:1312.6229 (2013)
11. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
12. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. In: Advances in Neural Information Processing Sys-
tems, pp. 1097–1105 (2012)
13. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional net-
works: visualising image classification models and saliency maps. arXiv preprint
arXiv:1312.6034 (2013)
14. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human
action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
15. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recog-
nition in videos. In: Advances in Neural Information Processing Systems, pp. 568–
576 (2014)
16. Long, C., Smith, E., Basharat, A., Hoogs, A.: A C3D-based convolutional neural
network for frame dropping detection in a single video shot. In: IEEE Conference on
Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1898–1906,
July 2017. https://doi.org/10.1109/CVPRW.2017.237
A Digital Forensic Technique for Inter–Frame Video Forgery Detection 317
17. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotem-
poral features with 3D convolutional networks. In: Proceedings of the IEEE Inter-
national Conference on Computer Vision, ICCV 2015, pp. 4489–4497. IEEE Com-
puter Society, Washington (2015). https://doi.org/10.1109/ICCV.2015.510
18. Wu, Y., Jiang, X., Sun, T., Wang, W.: Exposing video inter-frame forgery based on
velocity field consistency. In: IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pp. 2674–2678, May 2014. https://doi.org/10.
1109/ICASSP.2014.6854085
19. Wang, Q., Li, Z., Zhang, Z., Ma, Q.: Video inter-frame forgery identification based
on consistency of correlation coefficients of gray values. J. Comput. Commun.
2(04), 51 (2014)
20. Su, Y., Nie, W., Zhang, C.: A frame tampering detection algorithm for MPEG
videos. In: 6th IEEE Joint International Information Technology and Artifi-
cial Intelligence Conference, vol. 2, pp. 461–464 (2011). https://doi.org/10.1109/
ITAIC.2011.6030373
21. D’Avino, D., Cozzolino, D., Poggi, G., Verdoliva, L.: Autoencoder with recurrent
neural networks for video forgery detection. Electron. Imaging 2017(7), 92–99
(2017)
22. Zhao, D.N., Wang, R.K., Lu, Z.M.: Inter-frame passive-blind forgery detection for
video shot based on similarity analysis. Multimed. Tools Appl. 77, 1–20 (2018)
23. Hall, G.: Pearson’s correlation coefficient. Other Words 1(9) (2015). http://www.
hep.ph.ic.ac.uk/∼hallg/UG 2015/Pearsons.pdf. Accessed 3 Oct 2018
24. Aghamaleki, J.A., Behrad, A.: Malicious inter-frame video tampering detection
in mpeg videos using time and spatial domain analysis of quantization effects.
Multimed. Tools Appl. 76(20), 20691–20717 (2017)
25. Kingra, S., Aggarwal, N., Singh, R.D.: Inter-frame forgery detection in H. 264
videos using motion and brightness gradients. Multimed. Tools Appl. 76(24),
25767–25786 (2017)
26. Chen, J., Kang, X., Liu, Y., Wang, Z.J.: Median filtering forensics based on con-
volutional neural networks. IEEE Sig. Process. Lett. 22(11), 1849–1853 (2015)
27. Walia, A.S.: Activation functions and it’s types-which is better? (2017). https://
towardsdatascience.com/activation-functions-and-its-types-which-is-better-
a9a5310cc8f. Accessed 13 Aug 2018
28. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn.
Res. 15(1), 1929–1958 (2014)
29. Kingma, D.P., Ba, L.: J., : ADAM: a method for stochastic optimization. In: Learn-
ing Representations (2015)
30. Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions
classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
31. Developers, F.: FFmpeg tool (Version 3.2.4) [Software] (2017). https://www.
ffmpeg.org