
Deep Fake Detection Using CNN

Project Course On Neural Network


(Mohammad Nazmul Haque) - (171285942); (Ishtiaqe Hanif) - (1721397642); (Zubair Azim Miazi) - (1721221); (Asif Al Rafi) - (1711819642)

North South University

Abstract

Deep learning models are being used to synthesize credible Deepfake videos that deceive viewers.
Generative models such as autoencoders and Generative Adversarial Networks (GANs) are used to swap
faces, generate synthetic faces and transplant them into videos, and alter facial expressions and
features, to list a few applications. Rapid development of, and increasing access to, these advanced
technologies have increased the production of forged videos and raised public concern about privacy,
security and communication. Such advancements have also increased the precision of Deepfakes, making
them more challenging for human viewers to detect. This phenomenon has given rise to phishing, scams
and identity theft. Traditionally, Convolutional Neural Networks (CNNs) have been used to build
Deepfake detection models; EfficientNet B7 has so far achieved the best results. In this study we
propose a Convolutional Vision Transformer network with two components: a CNN, as mentioned above,
and a Vision Transformer (ViT). The ViT is a variation of the Transformer network, which has become
popular in the field of Natural Language Processing (NLP) for its attention mechanism. We will use
EfficientNet, a CNN variant, to extract learnable features and then use a ViT to categorize them and
weight them via attention. This approach has the potential to automatically prioritize the important
features, yielding better test performance. The model will be trained on the Deepfake Detection
Challenge (DFDC) dataset. As preparation, MobileNet will be used to establish a working pipeline and
a baseline benchmark.

Keywords
Deep Learning, Convolutional Neural network, EfficientNet, MobileNet, Transformer Network.

1 Introduction

DeepFake detection is currently an active field of research. With the help of technology, credible fake
videos are being generated that look very realistic and have become a privacy and security concern [1].
These technologies are readily available and are promoted in social media and games; face swapping and
face alteration are used in chat apps for recreational purposes. These methods have matured to the
point that it has become increasingly difficult for humans to identify Deepfakes created by bad actors,
and it is now a public concern that forged videos of people can be generated easily. DeepFakes are
manipulated videos or images: they have been altered from their original state, tend to look realistic,
and are created using machine learning techniques. Altered videos and images can be produced with many
available software tools, but when the term "DeepFake" is used, machine learning is involved [2]. Fake
pornography, fake news and financial fraud are some of the harmful uses of DeepFakes. As a result,
interest in detecting these fake faces has been growing, and conferences and competitions such as the
Deepfake Detection Challenge (DFDC) are promoting research in the area. One video that went viral
showed the former president Obama insulting Carson and Trump; it was made by manipulating Obama's lips
to match Jordan Peele's voice.

Figure 1: President Obama’s Deepfake made using Lip-Sync method [4]

This proves that misusing this technology can spread misinformation and leave a negative impression
of the speaker on viewers [2].

2 Related Work

Generative Adversarial Networks (GANs) come into play when Deepfake videos are created: video data is
fed as input, from which manipulated videos are generated. In a GAN, the generator and the
discriminator work against one another until the generator can fool the discriminator, meaning the
fake videos it creates appear authentic. DeepFakes have been gaining popularity, and an app named
FakeApp gives people the ability to create deepfakes very easily [7].
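The adversarial objective described above can be written as two competing scalar losses. The sketch below is a toy illustration of that objective (not training code from this project): the discriminator is penalized when it scores real samples low or fake samples high, while the generator is rewarded when its fakes score high.

```python
import math

def d_loss(d_real: float, d_fake: float) -> float:
    # Discriminator wants to maximize log D(x) + log(1 - D(G(z)));
    # written here as a quantity to minimize.
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss(d_fake: float) -> float:
    # Generator wants the discriminator to score its fakes high:
    # the "non-saturating" form maximizes log D(G(z)).
    return -math.log(d_fake)
```

As the generator improves (D(G(z)) rises toward 1), its loss shrinks while the discriminator's loss grows, which is exactly the tug-of-war that drives GAN training.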

Figure 2: GAN architecture[8]

A Convolutional Vision Transformer (CViT) has also been used previously for deepfake detection: built
on the vision transformer model and trained on the DFDC dataset, it achieved an accuracy of 91% [9].
The CViT has a distinctive feature-learning component built around the attention mechanism of the
transformer model. This component learns the features of the input images, after which the ViT
architecture classifies the video as real or fake.

Figure 3: Convolutional Vision Transformer[9]

3 Methodology

3.1 Model Training

Our model was trained using MobileNetV2 with transfer learning of the CNN architecture, with
TensorFlow as the backend. First the data had to be extracted: the necessary libraries were imported
and a data extraction process was run for two categories, real and fake images. The input shape and
batch size were defined, the data was split for training, and an image generator was used for
augmentation. The input size of the images was 224x224 with 3 color channels.
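The data pipeline above can be sketched with Keras's `ImageDataGenerator`. The augmentation settings, batch size and `frames/` directory layout (one subfolder per class, `real/` and `fake/`) are assumptions for illustration; only the 224x224x3 input size and the real/fake split come from the text.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = (224, 224)   # input size used in the paper: 224x224, 3 channels
BATCH_SIZE = 32         # assumed; the paper does not state the batch size

def make_train_generator(root_dir: str):
    # root_dir is expected to contain one subfolder per class,
    # e.g. root_dir/real/ and root_dir/fake/
    datagen = ImageDataGenerator(
        rescale=1.0 / 255,        # normalize pixel values to [0, 1]
        rotation_range=15,        # illustrative augmentation settings
        horizontal_flip=True,
        validation_split=0.2)     # hold out a validation subset
    return datagen.flow_from_directory(
        root_dir,
        target_size=IMG_SIZE,
        batch_size=BATCH_SIZE,
        class_mode="categorical",  # one-hot labels for real/fake
        subset="training")
```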

Data augmentation gave better training and validation accuracy in the final model training compared
to the initial training runs. The number of epochs was set to 10. The data was first passed through
MobileNetV2 and then through three dense layers so that the model could learn more complex functions
and classify with better results. The three dense layers used ReLU activation and the output layer
used softmax. The optimizer was Adam with categorical cross-entropy loss, and the loss and accuracy
of both the training and validation data were tracked.

Figure 4: The entire design of the model architecture
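The architecture described above (frozen MobileNetV2 backbone, three ReLU dense layers, a softmax head, Adam with categorical cross-entropy) can be sketched in Keras as follows. The dense-layer widths (256/128/64) are assumptions, as the paper does not report them, and `weights=None` stands in for the `weights="imagenet"` transfer-learning setup to keep the sketch self-contained.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_detector(input_shape=(224, 224, 3), num_classes=2):
    # MobileNetV2 backbone without its classifier head; in the actual
    # pipeline weights="imagenet" would be loaded for transfer learning.
    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights=None)
    base.trainable = False  # freeze the backbone, train only the head

    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        # three dense layers with ReLU, as in the paper; the widths
        # below are illustrative assumptions
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),  # real vs fake
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```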

4 Results
Even though the accuracy was around 94% for both training and validation, the predictions were not as
accurate as we expected: the system could identify fake images but also misclassified some real
images as fake. The figure below shows the accuracy of the final model. There was a drop in accuracy,
as can be seen from the graph, but the model was able to recover. We did not fix the learning rate;
it was adjusted adaptively as needed during training. We used early stopping so that training stops
once accuracy is no longer improving. It can be concluded that the loss value decreases as the epochs
increase, while the change in accuracy is small except for the drop that occurred at epoch 5.
Figure 5: Accuracy (left) and loss value (right) as the number of epochs increases.
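The early stopping and adaptive learning-rate behaviour described above are commonly wired up through standard Keras callbacks; the sketch below shows one way to do it. The monitored metrics and patience values are assumptions, as the paper does not report them.

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    # stop once validation accuracy has not improved for 3 epochs,
    # and roll back to the best weights seen so far
    EarlyStopping(monitor="val_accuracy", patience=3,
                  restore_best_weights=True),
    # halve the learning rate when validation loss plateaus
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
]
# usage (train_gen / val_gen being the training and validation generators):
# model.fit(train_gen, validation_data=val_gen, epochs=10,
#           callbacks=callbacks)
```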

5 Discussion and Limitations

From the model training in Figure 5, we can see that since the loss value is decreasing over time and
training accuracy is slowly increasing, better accuracy, and hence a better model, should be
achievable with better hardware and more fine-tuning. Deepfakes are hard to detect and are becoming
more realistic every day; each time a detection method is published, a better and more realistic
generation method appears, which makes detection harder for the AI. Furthermore, the neural network
could be trained more robustly by going deeper into the network architecture. We used Google Colab
Pro to train our model, with an Asus TUF A15 notebook as the local device, and training still
consumed a lot of time; efficient hardware is expensive. Our dataset was fairly large, but due to
resource limitations we were unable to train the model on a larger one, which was a major drawback
for us. Each training run took 4 to 5 hours because of the size of the dataset.

6 Conclusion

In this project we explored the basics of deep learning; more specifically, we applied convolutional
neural networks to the field of deepfakes. In the process we became familiar with the popular
frameworks Keras and TensorFlow. Understanding the consequences of distributing false information is
critical to resolving the issue. Many areas of our society, particularly life-altering judgments such
as forensic and legal video analysis, rely on the accuracy of information. Living in a collaborative
society requires a high level of trust; as we lose trust in the media, we drift further away from it,
resulting in public misrepresentation of feelings. The chaos caused by former president Donald
Trump's deepfake video on climate change showed how crucial it is to control and halt the spread of
false information. With better hardware it is possible to achieve better accuracy and train better
models.

Source Code Repository


https://github.com/nazmulz4/Deepfake

References

1. Tolosana, R., Vera-Rodriguez, R., Fierrez, J., Morales, A., & Ortega-Garcia, J. (2020). Deepfakes and
beyond: A Survey of face manipulation and fake detection. Information Fusion, 64, 131–148.
https://doi.org/10.1016/j.inffus.2020.06.014
2. Pishori, Armaan & Rollins, Brittany & Houten, Nicolas & Chatwani, Nisha & Uraimov, Omar. (2020).
Detecting Deepfake Videos: An Analysis of Three Techniques.
3. Rossler, Andreas & Cozzolino, Davide & Verdoliva, Luisa & Riess, Christian & Thies, Justus &
Niessner, Matthias. (2019). FaceForensics++: Learning to Detect Manipulated Facial Images. 1-11.
10.1109/ICCV.2019.00009.
4. BuzzFeedVideo. (2018, April 17). You Won't Believe What Obama Says In This Video! [Video].
YouTube. https://www.youtube.com/watch?v=cQ54GDm1eL0
5. Zakharov, Egor. 2019. Digital Image. Few-Shot Adversarial Learning of Realistic Neural Talking Head
Models. 15 June 2020. https://arxiv.org/abs/1905.08233
6. Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville and Yoshua Bengio, "Generative Adversarial Nets," Advances in Neural Information
Processing Systems 27 (NIPS 2014), Montreal, Canada, 2014, pp. 1-9,
https://arxiv.org/pdf/1406.2661.pdf
7. Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville and Yoshua Bengio, "Generative Adversarial Nets," Advances in Neural Information
Processing Systems 27 (NIPS 2014), Montreal, Canada, 2014, pp. 1-9,
https://arxiv.org/pdf/1406.2661.pdf
8. Zakharov, Egor. 2019. Digital Image. Few-Shot Adversarial Learning of Realistic Neural Talking Head
Models. 15 June 2020. https://arxiv.org/abs/1905.08233
9. Deressa Wodajo, & Solomon Atnafu. (2021). Deepfake Video Detection Using Convolutional Vision
Transformer.
10. Schwartz, Oscar. "You Thought Fake News Was Bad? Deep Fakes Are Where Truth Goes to Die." The
Guardian, Guardian News and Media, 12 Nov. 2018,
https://www.theguardian.com/technology/2018/nov/12/deep-fakesfake-news-truth.
