
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY, NAGPUR

Department of Electronics and Communication Engineering

Deepfake Detection using deep learning

Group Members:
Abhay Ganvir (BT19ECE017)
Vrushaket Patwardhan (BT19ECE034)
Tejas Bhoye (BT19ECE002)

Under the Supervision of: Dr. Prasad Joshi


Outline:
● Introduction
● Motivation behind research
● Literature Survey
● Deepfake image detection using VGG16 – Convolutional Network
● Deepfake video detection using InceptionResNetV2
● Transfer learning from Speaker Verification to Multispeaker Text-to-speech Synthesis
● Future developments
● References
Introduction
What are Deepfakes?
Deepfake is a technique that uses deep learning algorithms to create fake
media, usually by swapping a person's face or voice from a source onto
another person in a target, resulting in a fake output that is
sometimes hard to detect.

In our project, we aim to detect such media with the use of image
processing and deep learning methods.
Motivation behind research
Deepfakes pose a serious threat to a future in which fake news is everywhere: you
could watch a video of an important public figure or president giving a speech
and not be sure whether what you are seeing is real or fake.

Deepfake videos, often obscene, swap one person's face with someone else's
using neural networks. Deepfakes are a general public concern, so it is
important to develop methods to detect them.

It has therefore become necessary to develop an algorithm or model for
detecting such deepfakes, for the sake of privacy and security.
Literature survey

| Sr. No. | Title | Model | Dataset | Performance/Accuracy |
|---|---|---|---|---|
| 1 | DeepFakes Detection in Videos using Feature Engineering Techniques in Deep Learning Convolution Neural Network Frameworks | Discrete Wavelet Transform + CNN | FaceForensics++ dataset comprising 1000 videos | Performance improved as compared to SWT + CNN |
| 2 | Methods of Deepfake Detection Based on Machine Learning | DenseNet169 | Images from the internet with added blur noise | Highest accuracy of 60.1% with DenseNet169 |
| 3 | Deepfakes Creation and Detection Using Deep Learning | MesoNet (CNN-based architecture) | 5000 images, divided into real images and deepfake images | 80% confidence rate |
| 4 | Deepfake Audio Detection: A Deep Learning Based Solution for Group Conversations | CNN + RNN | Audio from the internet | Speech-to-text conversion achieved 93% accuracy |
| 5 | Optimization and Improvement of fake news detection using deep learning approaches for social benefits | LSTM (Long Short Term Memory) | 40,000 articles, 20,000 each of fake and real news | Accuracy of 99.88% |
Literature survey (continued)

| Sr. No. | Title | Model | Dataset | Performance/Accuracy |
|---|---|---|---|---|
| 6 | Deepfake Noise Investigation and detection | Siamese Neural Network | 512 real videos, 5299 fake videos, 518 high-quality, high-difficulty videos | Accuracy of 99.15% |
| 7 | Deepfake Detection through Deep Learning | Xception and MobileNet | FaceForensics++ videos, all at a frame rate of 30 frames per second | Xception models: 90%; NeuralTextures model: 91%; MobileNets model: 90% |
| 8 | FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset | AI-based model to sync a human face with an artificially generated voice | 20,000 videos in total, with lip-synced audio | Dataset for FakeAV created |
Deepfake image detection using VGG16 – Convolutional Network

● VGG16 is a convolutional neural network model proposed by K. Simonyan and A.
Zisserman from the University of Oxford in the paper “Very Deep Convolutional
Networks for Large-Scale Image Recognition”.

● We also used TensorFlow, a Python deep learning library, in our model.

● The dataset is taken from Kaggle and contains training images and testing images.

Architecture for VGG16 model
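
As a rough guide to the architecture, the standard 16-layer VGG configuration from the Simonyan and Zisserman paper can be written down and checked in a few lines. This is a minimal sketch of the published ImageNet model, not of the exact network trained in this project; the helper name `vgg16_summary` is made up for illustration.

```python
# Standard VGG16 layout (configuration D in Simonyan and Zisserman):
# each tuple is (number of 3x3 conv layers, output channels); every block
# ends with a 2x2 max-pool that halves the spatial size.
CONV_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
FC_UNITS = [4096, 4096, 1000]  # 1000 = ImageNet classes in the original model


def vgg16_summary(input_size=224, in_channels=3):
    """Return (weight_layers, final_feature_map_size, total_parameters)."""
    size, channels, params, layers = input_size, in_channels, 0, 0
    for n_convs, out_channels in CONV_BLOCKS:
        for _ in range(n_convs):
            # 3x3 kernel weights plus one bias per output channel
            params += (3 * 3 * channels + 1) * out_channels
            channels = out_channels
            layers += 1
        size //= 2  # 2x2 max-pool with stride 2
    features = size * size * channels  # flattened input to the first FC layer
    for units in FC_UNITS:
        params += (features + 1) * units  # weights plus biases
        features = units
        layers += 1
    return layers, size, params


layers, size, params = vgg16_summary()
print(layers, size, params)  # 16 7 138357544
```

For a real/fake classifier, the 1000-way output layer would typically be replaced by a smaller head; the slides do not show which head was used here.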

Deep Fake Image Detection

Introduction:
Here we have used convolutional neural network with TensorFlow's

Evaluation on Test Data

The model produced a prediction for every test image and achieved an accuracy of 54%.

After the model is ready for compilation

Final Results

Deepfake video detection using InceptionResNetV2

InceptionResNetV2 [10]:
ResNet and Inception have been central to the largest advances in image recognition performance in recent years, delivering
very good performance at a relatively low computational cost.

Inception-ResNet combines the Inception architecture with residual connections.

The convolutional neural network is 164 layers deep and can classify images into 1000 object categories, such as
keyboard, mouse, pencil, and many animals.

As a result, the network has learned rich feature representations for a wide range of images. The network has an image
input size of 299-by-299, and its output is a list of estimated class probabilities.
Database
The dataset is available on Kaggle.com.

It comprises 800+ .mp4 samples containing real and deepfake videos.

A metadata.json file accompanies each set of .mp4 files and contains the filename, the label
(REAL/FAKE), and the original and split columns. We will be predicting whether or not a
particular video is a deepfake.

In the training data, each video is marked with the string "REAL" or "FAKE" in the label column.
In our submission, we predict the probability that the video is a fake.
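
To illustrate how the labels can be read and checked against the video files, here is a minimal sketch. It assumes metadata.json is a JSON object keyed by filename, with `label`, `split` and `original` fields per entry; that layout, the helper names `load_labels` and `find_mismatches`, and the demo file names are assumptions for illustration, not taken from the project code.

```python
import json
import os
import tempfile


def load_labels(metadata_path):
    """Map each video filename to 1 (FAKE) or 0 (REAL)."""
    with open(metadata_path, encoding="utf-8") as fh:
        metadata = json.load(fh)
    return {name: int(entry["label"] == "FAKE") for name, entry in metadata.items()}


def find_mismatches(labels, video_dir):
    """Return (labelled but missing on disk, on disk but unlabelled)."""
    on_disk = {f for f in os.listdir(video_dir) if f.endswith(".mp4")}
    labelled = set(labels)
    return sorted(labelled - on_disk), sorted(on_disk - labelled)


# Tiny self-contained demo with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    demo = {
        "a.mp4": {"label": "FAKE", "split": "train", "original": "b.mp4"},
        "b.mp4": {"label": "REAL", "split": "train", "original": None},
    }
    with open(os.path.join(tmp, "metadata.json"), "w", encoding="utf-8") as fh:
        json.dump(demo, fh)
    open(os.path.join(tmp, "a.mp4"), "wb").close()  # only one video is present
    labels = load_labels(os.path.join(tmp, "metadata.json"))
    missing, unlabelled = find_mismatches(labels, tmp)
    print(labels, missing, unlabelled)
```

Encoding FAKE as 1 means a model's sigmoid output can be read directly as the probability that a video is fake.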
The Model
● The dataset contains video files, which are labelled as fake or real in a
separate file. The code first matches the dataset against the labelling file and
checks for any missing files.

● After confirming the exact number of unique videos, images are extracted from each
video and stored as frames. We use the OpenCV library in Python for image
extraction and processing.

● The captured frames are sent to the model for pre-processing. After pre-processing,
Inception-ResNetV2 comes in, acting as a transfer learning block.

● The original loss layer of Inception-ResNetV2 is removed and replaced with an output
layer for deepfake detection, called the deepfake detection loss output layer, which
was already defined during preprocessing.
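
The transfer learning step described above could look roughly like the following Keras sketch. The frozen backbone, the single sigmoid output, the Adam optimizer and the function names are assumptions made for illustration; the slides do not specify the actual head or training settings. TensorFlow is imported inside the function so that the pixel-scaling helper can be used without it.

```python
def scale_pixels(value):
    """Map an 8-bit pixel value from [0, 255] to [-1, 1],
    the input range used by Keras' InceptionResNetV2 preprocessing."""
    return value / 127.5 - 1.0


def build_detector(input_size=299):
    """Pretrained InceptionResNetV2 backbone with a single sigmoid output."""
    import tensorflow as tf  # imported lazily; needs the tensorflow package

    backbone = tf.keras.applications.InceptionResNetV2(
        include_top=False,   # drop the 1000-class ImageNet classifier
        weights="imagenet",  # downloads pretrained weights on first use
        input_shape=(input_size, input_size, 3),
        pooling="avg",       # global average pooling gives a feature vector
    )
    backbone.trainable = False  # assumption: train only the new head at first

    inputs = tf.keras.Input(shape=(input_size, input_size, 3))
    features = backbone(inputs, training=False)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(features)  # P(fake)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


print(scale_pixels(0), scale_pixels(255))  # -1.0 1.0
```

Calling `build_detector()` returns a compiled model that expects frames resized to 299-by-299 and scaled with `scale_pixels` (or the equivalent Keras `preprocess_input`).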
Results Obtained:

| Training accuracy | Validation accuracy | Training loss | Validation loss |
|---|---|---|---|
| 97.20% | 85.58% | 0.1163 | 0.3988 |

Confusion Matrix
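
For reference, a confusion matrix for a binary real/fake classifier can be computed from true labels and predicted probabilities as in the sketch below. The 0.5 threshold, the choice of FAKE as the positive class, and the sample values are assumptions for illustration only; they are not the project's results.

```python
def confusion_matrix(y_true, y_prob, threshold=0.5):
    """Return (tn, fp, fn, tp) with FAKE = 1 as the positive class."""
    tn = fp = fn = tp = 0
    for truth, prob in zip(y_true, y_prob):
        pred = int(prob >= threshold)
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def accuracy(tn, fp, fn, tp):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = tn + fp + fn + tp
    return (tn + tp) / total if total else 0.0


# Made-up labels and scores, for illustration only (not project results).
y_true = [1, 1, 1, 0, 0, 1]
y_prob = [0.9, 0.8, 0.3, 0.2, 0.6, 0.7]
cm = confusion_matrix(y_true, y_prob)
print(cm, accuracy(*cm))
```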

Scope of Improvement
Preprocessing: Currently, for deepfake video detection, we capture 5 images from every
video sample, which are then used for training. At the cost of extra computation, we can
increase this count for better testing results.

Dataset: Currently, the dataset has around 800+ samples. Increasing the number of samples and
adding more variety to them will improve the training performance of the model.

Model: The model used for deepfake video detection is InceptionResNetV2, which is a
combination of the Inception and ResNet families.
Likewise, more architectural combinations of neural networks can be explored, which will help in
implementing new techniques to detect deepfakes.
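
The frame count mentioned under Preprocessing can be treated as a parameter. The sketch below picks evenly spaced frame indices and reads them with OpenCV; the even-spacing strategy and the function names are assumptions, since the slides do not say how the 5 frames are chosen. OpenCV is imported inside the function so the index helper works without it.

```python
def sample_indices(total_frames, n_samples=5):
    """Evenly spaced frame indices, one from the middle of each segment."""
    if total_frames <= 0 or n_samples <= 0:
        return []
    n = min(n_samples, total_frames)
    return [int((i + 0.5) * total_frames / n) for i in range(n)]


def extract_frames(video_path, n_samples=5):
    """Read the sampled frames from a video file with OpenCV."""
    import cv2  # imported lazily; needs the opencv-python package

    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in sample_indices(total, n_samples):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the chosen frame
        ok, frame = cap.read()
        if ok:
            frames.append(frame)  # BGR image as a NumPy array
    cap.release()
    return frames


print(sample_indices(300, 5))  # [30, 90, 150, 210, 270]
```

Raising `n_samples` increases the number of training images per video at a proportional cost in decoding time and storage.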

Transfer learning from Speaker Verification to Multispeaker Text-to-speech Synthesis

● We propose to create a dataset of deepfaked audio, generated by collecting the voice frequency characteristics
of many celebrities.

● Existing datasets, such as FakeAVCeleb, are lip-synced audio-video datasets. These will be used later, in the
fourth phase of the project.

● The model is used for text-to-speech (TTS) synthesis and is able to generate speech audio in the voice of different
speakers, including those unseen during training.

DataSet

We used VoxCeleb, a dataset of audio WAV files of Indian celebrities. The total size of the dataset is 1 GB, and it
contains multiple folders of WAV audio files.
Model Architecture
● The model consists of three independently trained neural networks: an encoder, a synthesizer, and a vocoder.

● Speaker Encoder
An LSTM network is used to condition the synthesis network on a reference speech signal from the desired target speaker.

● Synthesizer
The synthesizer is trained on pairs of text transcripts and target audio. The network is trained in a transfer learning configuration,
using a pre-trained speaker encoder to extract a speaker embedding from the target audio.

● Neural Vocoder
A sample-by-sample autoregressive WaveNet is used as a vocoder to invert the synthesized mel spectrograms emitted by the
synthesis network into time-domain waveforms.
The mel spectrogram predicted by the synthesizer network captures all of the relevant details needed for high-quality synthesis of
a variety of voices.
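
The speaker embedding produced by the encoder is a fixed-length vector, and embeddings of the same speaker are expected to point in similar directions. One common way to compare two such embeddings is cosine similarity, sketched below with toy vectors. The vectors and the function name are illustrative assumptions; they are not outputs of the trained encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity is undefined for a zero vector; report 0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors; real speaker embeddings have far more dimensions.
ref = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]
orthogonal = [0.0, 1.0, 0.0]
print(cosine_similarity(ref, same_direction))  # 1.0
print(cosine_similarity(ref, orthogonal))      # 0.0
```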

Result Obtained:

By leveraging the knowledge learned by the discriminative speaker encoder, the synthesizer is able to generate high quality
speech not only for speakers seen during training, but also for speakers never seen before.

Summary of work done till now

Deepfake video detection: Worked on a neural network model using InceptionResNetV2.
Achieved accuracy: validation = 85%, training = 97%.

Deepfake image detection: Worked on a VGG16 model, with an achieved accuracy of 54%.

Deepfake audio detection: Text-to-speech (TTS) synthesis that is able to generate speech
audio in the voice of different speakers.

Future Works and Scope of Improvements

● Going forward, this dataset and model will be used for combined detection of deepfake audio, video,
or lip-synced videos.

● For lip-synced videos, the FakeAVCeleb dataset will be used for training, testing, and verification
of the model.

● Artificially generated audio signals can be detected using these algorithmic models.

References

1. Methods of Deepfake Detection Based on Machine Learning
2. Deepfakes Creation and Detection Using Deep Learning
3. Deepfake Audio Detection: A Deep Learning Based Solution for Group Conversations
4. Optimization and Improvement of fake news detection using deep learning approaches for social benefits
5. Deepfake Noise Investigation and detection
6. Deepfake Detection through Deep Learning
7. Dataset for Deepfake video detection
8. Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
9. Voxceleb1 audio wav files for India celebrity
10. InceptionResNetV2 Simple Introduction

THANK YOU
