
VIRUS DETECTION USING DEEP LEARNING

By
Saurabh Malusare
Rojan Sudev
Rishabh Nrupnarayan

Under The Guidance of


Prof. Anil M. Bhadgale
INTRODUCTION

A computer virus is a program or piece of code
that, when executed, replicates by reproducing
itself or infecting other computer programs by
modifying them.
VIRUS DETECTING TECHNIQUES

• Signature Based Detection


• Heuristic Based Detection
• Detection using Bait
LIMITATIONS OF CONVENTIONAL
TECHNIQUES

• Time lag between the creation of a virus and its detection
• A large signature database has to be maintained
• New virus patterns cannot be detected
PROBLEM DEFINITION

Using deep learning to classify whether a file is
a virus or legitimate, while overcoming the
existing limitations of conventional techniques.
System Architecture
Important fields of PE header:
Feature Selection
• Extract only the features relevant to classification
• Fisher Score algorithm used for feature selection
• Fisher Score assigns each feature a rank between 0 and 1
• The higher the rank, the more relevant the feature
Fisher Score formula:

F(i) = (µi,p − µi,n)² / (σ²i,p + σ²i,n)

where:
• µi,p = mean of positive samples for the ith PE header feature
• µi,n = mean of negative samples for the ith PE header feature
• σi,p = standard deviation of positive samples for the ith PE header feature
• σi,n = standard deviation of negative samples for the ith PE header feature
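The per-feature score above can be sketched in NumPy as follows. This is an illustrative implementation, not the project's actual code; the toy data, the `fisher_scores` name, and the small epsilon (added to avoid division by zero for constant features) are assumptions. Raw Fisher scores are not bounded by 1; to obtain ranks in [0, 1] as the slides describe, the scores would additionally be min-max rescaled.

```python
import numpy as np

def fisher_scores(X, y):
    """Fisher score for each feature column of X given binary labels y.

    X: (n_samples, n_features) array of PE-header feature values
    y: (n_samples,) labels, 1 = virus (positive), 0 = legitimate (negative)
    """
    pos, neg = X[y == 1], X[y == 0]
    mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
    var_p, var_n = pos.var(axis=0), neg.var(axis=0)
    # (mean difference)^2 over summed variances; eps guards constant features
    return (mu_p - mu_n) ** 2 / (var_p + var_n + 1e-12)

# Toy data: feature 0 separates the classes, feature 1 is noise
X = np.array([[5.0, 1.0], [6.0, 0.9], [1.0, 1.1], [0.0, 1.0]])
y = np.array([1, 1, 0, 0])
scores = fisher_scores(X, y)
```

The discriminative feature receives a much higher score than the noise feature, which is exactly how the top 21 features would be ranked.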
Feature Extraction
• Extract the 21 most relevant features as determined
by Fisher Score.
• These features are real-valued.
• Normalize features using min-max
normalization.
• Features are scaled to [0, 1].
• Normalized feature values are then converted
to binary values using the condition:

If feature > mean(feature)
    feature = 1
else
    feature = 0
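The normalization and binarization steps can be sketched together. This is a minimal illustration, assuming per-feature (column-wise) min, max, and mean over the dataset; the function name and sample values are made up.

```python
import numpy as np

def binarize_features(X):
    """Min-max normalize each feature to [0, 1], then threshold at its mean."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    # eps guards features that are constant across the dataset
    X_norm = (X - x_min) / (x_max - x_min + 1e-12)
    # 1 where the normalized value exceeds that feature's mean, else 0
    return (X_norm > X_norm.mean(axis=0)).astype(np.uint8)

# Two toy PE-header features over three files
X = np.array([[10.0, 200.0],
              [20.0, 100.0],
              [40.0, 400.0]])
B = binarize_features(X)
```

The resulting 0/1 vectors are what the binary-valued visible layer of the first RBM consumes.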
DBN
• A Deep Belief Network is obtained by stacking
several RBMs (Restricted Boltzmann Machines)
on top of each other.
• The hidden layer of the RBM at layer `i`
becomes the input of the RBM at layer `i+1`.
• When used for classification, the DBN is
treated as an MLP by adding a logistic
regression layer on top.
RBM

Fig. RBM

Fig. Forward phase

Fig. Backward phase


RBM Training
Contrastive Divergence-k (CD-k):
• Take a training sample v, compute the
probabilities of the hidden units, and sample a
hidden activation vector h from this
probability distribution.
• Compute the outer product of v and h and call
this the positive gradient.
• From h, sample a reconstruction v1 of the
visible units, then resample the hidden
activations h1 from this.
• Repeat the above step k times to obtain vk and
hk; the outer product of vk and hk is the
negative gradient.
• Update the weights by the learning rate times
(positive gradient − negative gradient).
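The CD-k steps above can be sketched for a single training sample. This is an illustrative NumPy version, not the project's code; the learning rate, bias names (`b` visible, `c` hidden), and seeded generator are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k(v0, W, b, c, k=1, lr=0.1):
    """One CD-k update for an RBM with weights W, visible bias b, hidden bias c."""
    # Positive phase: hidden probabilities and a sampled activation vector h
    ph0 = sigmoid(v0 @ W + c)
    h = (rng.random(ph0.shape) < ph0).astype(float)
    vk, hk_prob = v0, ph0
    for _ in range(k):
        # Reconstruct the visibles from h, then resample the hiddens
        pv = sigmoid(h @ W.T + b)
        vk = (rng.random(pv.shape) < pv).astype(float)
        hk_prob = sigmoid(vk @ W + c)
        h = (rng.random(hk_prob.shape) < hk_prob).astype(float)
    # Update: positive gradient (outer(v0, h0)) minus negative gradient (outer(vk, hk))
    W += lr * (np.outer(v0, ph0) - np.outer(vk, hk_prob))
    b += lr * (v0 - vk)
    c += lr * (ph0 - hk_prob)
    return W, b, c

# One update on a toy 4-visible / 3-hidden RBM
W = np.zeros((4, 3)); b = np.zeros(4); c = np.zeros(3)
v0 = np.array([1.0, 0.0, 1.0, 0.0])
W, b, c = cd_k(v0, W, b, c, k=1)
```

Using probabilities rather than samples for the final hidden state in the update is a common variance-reduction choice; either variant fits the description above.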
Training DBN
• The DBN is trained in a semi-supervised way,
in 2 phases:
1) Unsupervised training phase
2) Supervised training phase
Unsupervised Training
Algorithm:
1. Train the first layer as an RBM that models the raw input as its visible
layer.
2. Use that first layer to obtain a representation of the input that will be
used as data for the second layer.
3. Train the second layer as an RBM, taking the transformed data
(samples or mean activations) as training examples (for the visible layer of that RBM).
4. Iterate (2 and 3) for the desired number of layers, each time
propagating upward either samples or mean values.
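The greedy layer-wise procedure above can be sketched end to end. This is a compact illustration under assumed details (CD-1 inside each RBM, binary samples propagated upward, made-up layer sizes and hyperparameters), not the project's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(V, n_hidden, epochs=5, lr=0.1):
    """Train one RBM with CD-1 on binary rows V; return (W, c) for the upward pass."""
    n_visible = V.shape[1]
    W = rng.normal(0, 0.01, (n_visible, n_hidden))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        ph = sig(V @ W + c)                                   # hidden probabilities
        h = (rng.random(ph.shape) < ph).astype(float)         # sampled hiddens
        vk = (rng.random(V.shape) < sig(h @ W.T + b)).astype(float)  # reconstruction
        phk = sig(vk @ W + c)
        W += lr * (V.T @ ph - vk.T @ phk) / len(V)            # pos - neg gradient
        b += lr * (V - vk).mean(axis=0)
        c += lr * (ph - phk).mean(axis=0)
    return W, c

def pretrain_dbn(data, layer_sizes):
    """Steps 1-4: greedily train one RBM per layer on the layer below's output."""
    params, inp = [], data
    for n_hidden in layer_sizes:
        W, c = train_rbm(inp, n_hidden)
        params.append((W, c))
        # Propagate upward: here, binary samples of the hidden units
        inp = (rng.random((len(inp), n_hidden)) < sig(inp @ W + c)).astype(float)
    return params

# 16 toy files, 21 binarized features, two stacked RBMs of 16 and 8 hidden units
layers = pretrain_dbn(rng.integers(0, 2, (16, 21)).astype(float), [16, 8])
```

Each trained layer's weights are then reused as the corresponding MLP layer before supervised fine-tuning.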
Supervised Training
• Uses Logistic Regression on top of the DBN.
• The Logistic Regression model is trained in a
supervised way, using labelled virus and
legitimate files.
• Logistic regression is a probabilistic, linear
classifier parametrized by a weight
matrix W and a bias vector b.
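For the binary virus-vs-legitimate case, the top layer reduces to a single sigmoid unit. A minimal sketch, assuming hypothetical (untrained, made-up) parameter values and a 0.5 decision threshold:

```python
import numpy as np

def logistic_layer(features, W, b):
    """Probability that a file is a virus, given the top DBN layer's output.

    features: (n_hidden,) activation vector from the top RBM
    W: (n_hidden,) weight vector, b: scalar bias (binary-class case)
    """
    return 1.0 / (1.0 + np.exp(-(features @ W + b)))

# Hypothetical trained parameters for a 3-unit top layer
W = np.array([2.0, -1.0, 0.5])
b = -0.25
p = logistic_layer(np.array([1.0, 0.0, 1.0]), W, b)
label = "virus" if p >= 0.5 else "legitimate"
```

During fine-tuning, the cross-entropy gradient of this layer is backpropagated through the pretrained RBM layers, which is what turns the DBN into an ordinary MLP classifier.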
Fine Tuning Parameters

• Number of hidden layers


• Number of processing units per hidden layer
• Learning rate
PERFORMANCE EVALUATION

03/06/17 CS-152 23
SNAPSHOTS
RESULTS
• Feature Extractor capable of extracting
relevant features from dataset and input
file.
• DBN capable of classifying a given PE
structure file as virus or legitimate with an
accuracy of 94.5%.
CONCLUSION
