
Malware Classification using Deep Learning

Mohd Shahril
# whoami

● Mohd Shahril (@mohd_shahril_96)


● CS graduate in AI :)
● Interested in Information Security and Artificial Intelligence
● Software Developer at Pernec Integrated Network Systems (PINS)
● CTF player with local AleJnd team
● Wargames.my Crew
What is this talk about?

● My research
○ Part of my bachelor’s degree final year project
● How does Deep Learning work?
● How to leverage Deep Learning for the malware classification task
# 0x0: What is Deep Learning?

To be less dependent on experts
Good pattern searcher

Digitalogy
# 0x0: What is Deep Learning?

● Find a hyperplane that separates two regions of data
○ E.g. Support Vector Machine
# 0x0: What is Deep Learning?

To be less dependent on experts
Good pattern searcher
(Very!) good pattern searcher

Digitalogy
# 0x0: What is Deep Learning?

● Based on Artificial Neural Networks
● Inspired by biological neural networks (the brain)
# 0x0: What is Deep Learning?

[Figure: Artificial Neural Network with an input layer, a hidden layer, and an output layer]

● Weight: each of the lines between neurons has a real-number value
● Neuron: each neuron represents a non-linear function of the sum of its inputs
# 0x0: What is Deep Learning?

● A neural network is a collection of linear combinations
○ Each neuron sums all of its weighted inputs
○ Applies a non-linear transformation (activation function)
○ Forwards the output value to the next neuron
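
The three steps above can be sketched in a few lines of plain Python. This is only an illustration: the sigmoid activation, the bias term, and all the weight values are assumptions picked for the example, not something taken from the slides.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias=0.0, activation=sigmoid):
    """Sum the weighted inputs, then apply a non-linear activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)

def layer_output(inputs, weight_rows, biases):
    """Every neuron in a layer sees the same inputs but has its own weights."""
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical toy network: 3 inputs -> 2 hidden neurons -> 1 output neuron.
hidden = layer_output(
    [1.0, 0.0, 1.0],
    weight_rows=[[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]],
    biases=[0.0, 0.1],
)
# The hidden outputs are forwarded as inputs to the next neuron.
output = neuron_output(hidden, weights=[1.2, -0.7], bias=0.05)
print(hidden, output)
```

Stacking more calls to `layer_output` is all it takes to make the toy network deeper.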
# 0x0: What is Deep Learning?

● The cleverness of a neural network is determined by its weight values
○ P.S. weight value = the real number on each of the lines
● Training a neural network means adjusting its weight values
○ Backpropagation is a well-known algorithm for training neural networks
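
To make "adjusting the weight values" concrete, here is a minimal sketch that trains a single sigmoid neuron with gradient descent. It is not full backpropagation through many layers, but backpropagation applies the same chain-rule step layer by layer. The squared-error loss, the learning rate, the epoch count, and the toy OR dataset are all assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(inputs, weights, bias):
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)

def train_neuron(samples, learning_rate=0.5, epochs=2000):
    """Fit one sigmoid neuron by gradient descent on squared error."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            out = predict(inputs, weights, bias)          # forward pass
            # Chain rule: d(error)/d(weighted sum) for 0.5 * (out - target)^2.
            delta = (out - target) * out * (1.0 - out)
            # Nudge every weight against its own gradient.
            weights = [w - learning_rate * delta * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * delta
    return weights, bias

# Toy dataset (assumption): the logical OR of two inputs.
or_samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
              ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
weights, bias = train_neuron(or_samples)
for inputs, target in or_samples:
    print(inputs, target, round(predict(inputs, weights, bias), 3))
```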
# 0x0: What is Deep Learning?

● Based on some meth (**math)


● Has the ability to learn patterns from data
○ Just another machine learning algorithm, just… more powerful
● Powerful means it can learn very complex patterns from data
# 0x0: What is Deep Learning?

● When people say deep learning, they basically mean deep neural networks.
● So, what is a deep neural network?
# 0x0: What is Deep Learning?
[Figure: (normal) Neural Network vs. (deep) Neural Network, captioned “Add this moar plz?”]
# 0x0: What is Deep Learning?

● A deeper network means low-level details can be captured
# 0x0: What is Deep Learning?

● A variety of deep learning architectures have been developed
● They have solved different kinds of problems
○ Computer vision (human detection from CCTV in real time)
○ Speech recognition (YouTube’s auto-captions)
○ Essay generator (yes, we have that)
○ etc.
● These problems were hard to solve before deep learning came along
# 0x1: Training DL to Classify Malware

● How can we do this?


● DL is known as a great pattern searcher
○ Can differentiate between different classes of data with high accuracy
# 0x1: Training DL to Classify Malware

● Scope:
○ Problem: given a new suspected malware executable, predict which malware
family it belongs to.
○ Focus on these malware families:
■ Cerber, Cryptowall, GandCrab, Petya, Sality, Wannacrypt
○ Focus on Windows malware executables
# 0x1: Training DL to Classify Malware

● A malware executable is just another Windows executable


# 0x1: Training DL to Classify Malware

● How can we know if an executable is doing something malicious?
○ Psst: it does naughty things when it executes 😣
● The idea is to capture its runtime behavior while it is executing
● Run the executable inside a sandbox, and collect its behavior data
○ Using Cuckoo Sandbox
○ This bypasses common malware “protections”, such as packers and mutation
■ Can’t achieve that by relying only on static data
# 0x1: Training DL to Classify Malware

● Data collection (.exe) using VirusTotal


● For each malware family in scope, get 1,000 executables
○ Total of 6,000 .exe
● Run each sample in Cuckoo Sandbox and collect its behavior logs (in JSON)
# 0x1: Training DL to Classify Malware

● Problem: Deep Learning requires input in a fixed-length format
○ Each malware sample behaves differently, so the length of its behavior data differs
● Idea: Convert the behavior data into a format that DL can understand
# 0x1: Training DL to Classify Malware

● Used a Natural Language Processing (NLP) technique
○ Based on the 1-gram extraction technique
○ Split every JSON file into words
○ Count the occurrences of each word across all JSON files
○ Collect the 10,000 most frequent words
○ Map each JSON file’s words to binary values (whether each top word is present or not)
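
The steps above can be sketched roughly like this. It is a simplified illustration, not the project's actual code: the word-splitting regex, the function names, and the two toy reports (including their `"behavior"` key) are assumptions made up for the example, and the vocabulary is cut to 3 words instead of 10,000.

```python
import json
import re
from collections import Counter

def tokenize(report_text):
    """Split a JSON report (as text) into words.
    Assumption: a word is any run of letters, digits, or underscores."""
    return re.findall(r"[A-Za-z0-9_]+", report_text)

def build_vocabulary(reports, size=10_000):
    """Count word occurrences across all reports and keep the most frequent."""
    counts = Counter()
    for report in reports:
        counts.update(tokenize(report))
    return [word for word, _ in counts.most_common(size)]

def encode(report_text, vocabulary):
    """Fixed-length binary vector: 1 if the top word occurs in the report."""
    present = set(tokenize(report_text))
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical toy reports, far smaller than a real sandbox log.
reports = [
    json.dumps({"behavior": ["mutex", "http", "service"]}),
    json.dumps({"behavior": ["http", "registry"]}),
]
vocabulary = build_vocabulary(reports, size=3)
vectors = [encode(report, vocabulary) for report in reports]
print(vocabulary)
print(vectors)
```

Every report now maps to a vector of exactly `len(vocabulary)` bits, no matter how long the original log was.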
# 0x1: Training DL to Classify Malware

Sample 1’s 1-gram: [“system”, “sections”, “80386”, “Win32”]
Sample 2’s 1-gram: [“service”, “mutex”, “shellcmds”, “http”]

Each sample’s 1-gram goes through the mapper together with the top 1-gram:

Sample 1 → [1, 0, 1, 1]
Sample 2 → [0, 0, 1, 0]
# 0x1: Training DL to Classify Malware

Malware Dataset → Cuckoo Sandbox → (malware’s behaviors) → 1-gram extraction → Fixed-size binary string

Each malware’s behavior is now encoded as a fixed-size binary string of 10,000 bits
# 0x1: Training DL to Classify Malware

● Problem: a fixed-size input of 10,000 values is still large for Deep Learning training
● Curse of dimensionality
● Idea: Apply dimensionality reduction to the binary data
○ Transform the 10,000 binary values (high-dimensional) into 20 real numbers (low-dimensional)
○ Used Deep Autoencoders
■ Special DL architecture for non-linear dimensionality reduction
# 0x1: Training DL to Classify Malware
Deep Autoencoders

Encoder Layer: Original Input [10,000] → [3,000] → [500] → [100] → [20]
Decoder Layer: [20] → [100] → [500] → [3,000] → [10,000] Original Input
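
A rough sketch of the shape of this network, in plain Python. The layer sizes come from the slide; everything else is an assumption for illustration: fully connected layers with biases, sigmoid activations, random untrained weights, and a scaled-down 16 → 8 → 4 → 2 stack so the demo runs instantly. A real autoencoder would be trained until the decoder output reproduces the input.

```python
import math
import random

ENCODER_SIZES = [10_000, 3_000, 500, 100, 20]   # sizes from the slide
DECODER_SIZES = ENCODER_SIZES[::-1]             # mirrored: 20 back up to 10,000

def dense_parameter_count(sizes):
    """Weights plus biases, assuming fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

def random_stack(sizes, rng):
    """One (weights, biases) pair per layer, with small random weights."""
    stack = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        stack.append((weights, [0.0] * n_out))
    return stack

def forward(values, stack):
    """Push a vector through every layer (weighted sum, then sigmoid)."""
    for weights, biases in stack:
        values = [
            1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, values)) + b)))
            for row, b in zip(weights, biases)
        ]
    return values

rng = random.Random(0)
toy_sizes = [16, 8, 4, 2]                       # scaled-down stand-in
encoder = random_stack(toy_sizes, rng)
decoder = random_stack(toy_sizes[::-1], rng)

bits = [rng.randint(0, 1) for _ in range(toy_sizes[0])]
code = forward(bits, encoder)                   # the low-dimensional encoding
reconstruction = forward(code, decoder)         # same length as the input
print(len(bits), len(code), len(reconstruction))
print(dense_parameter_count(ENCODER_SIZES) + dense_parameter_count(DECODER_SIZES))
```

After training, only the encoder half is needed to turn each 10,000-bit string into 20 real numbers.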
# 0x1: Training DL to Classify Malware

Fixed-size binary string → Deep Autoencoders → 20 real numbers
# 0x1: Training DL to Classify Malware

● Now comes the fun part: training a Deep Neural Network to classify malware
● Before training the network, the dataset has to be split
○ 70% goes to the training set (4,200 samples)
○ 30% goes to the validation set (1,800 samples)
● The reason for the split is to observe how well the network predicts on
unseen data
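
A minimal sketch of such a split in plain Python. Shuffling before cutting and the fixed seed are assumptions; the slides do not say how the samples were ordered or whether the split was stratified per family.

```python
import random

def split_dataset(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the samples, then cut it into two parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 6,000 encoded samples: just their indices.
dataset = list(range(6000))
training_set, validation_set = split_dataset(dataset)
print(len(training_set), len(validation_set))
```

The validation samples are never shown to the network during training, which is what makes the accuracy measured on them meaningful.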
# 0x1: Training DL to Classify Malware

● Used a relatively simple Deep Neural Network (DNN) architecture

Input (20 real numbers) → Deep Neural Network → Output (class probability of malware)
Layer sizes shown in the figure: [20], [60], [200], [40], [15], [6]


# 0x1: Training DL to Classify Malware

Example: the network will output the probability of each malware family (output layer)

Cerber       0.00
Cryptowall   0.97
GandCrab     0.02
Petya        0.003
Sality       0.007
Wannacrypt   0.0

Total = 1.0
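
Probabilities that sum to 1.0 are typically produced by a softmax over the raw scores of the output layer. The slides do not name the function used, so treat softmax here as an assumption; the raw scores below are made-up values for illustration.

```python
import math

FAMILIES = ["Cerber", "Cryptowall", "GandCrab", "Petya", "Sality", "Wannacrypt"]

def softmax(scores):
    """Turn arbitrary real scores into probabilities that sum to 1."""
    peak = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs of the six output neurons.
raw_scores = [-2.0, 4.0, 0.1, -1.5, -1.0, -3.0]
probabilities = softmax(raw_scores)
predicted = FAMILIES[probabilities.index(max(probabilities))]

for family, p in zip(FAMILIES, probabilities):
    print(f"{family:<11} {p:.3f}")
print("Predicted:", predicted)
```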
1. Malware Executables → Cuckoo Sandbox
2. Pre-processing
i) Transform samples to 1-gram
ii) Fetch top-frequent 1-grams
iii) Map samples’ 1-gram with top 1-grams
Bit-String → Deep Autoencoders → Transformed Bit-String
3. Training / Evaluate DL
Split Dataset → Training Set (70%) and Validation Set (30%) → Deep Neural Network → Validate for Accuracy
# 0x1: Training DL to Classify Malware

● Well, that’s it, folks 😃


● Two deep networks need to be trained
○ Deep Autoencoders (for dimension reduction)
○ Deep Neural Network (for malware prediction)
● Accuracy = 96.3% on unseen data
# 0x2

Demo
# 0x3: Problems

● If the given executable is non-malicious, the network will still predict that it belongs to one of
these malware families
● The same problem exists if we try to classify malware that does not belong to the original six
malware families
● Two solutions that I can think of:
a. If no predicted family probability is greater than 0.95, assume the
network can’t classify the executable
b. Create another class, “others”, fill it with outside samples, and train on it together with the six families
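
Solution (a) is easy to sketch. The function name and the second probability vector are hypothetical; the 0.95 threshold and the first probability vector come from the slides.

```python
FAMILIES = ["Cerber", "Cryptowall", "GandCrab", "Petya", "Sality", "Wannacrypt"]

def classify_or_reject(probabilities, families=FAMILIES, threshold=0.95):
    """Return the most likely family, or None when the network is not
    confident enough (no probability greater than the threshold)."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best] > threshold:
        return families[best]
    return None

confident = [0.00, 0.97, 0.02, 0.003, 0.007, 0.0]   # example from the slides
unsure = [0.40, 0.35, 0.10, 0.05, 0.05, 0.05]       # hypothetical benign file

print(classify_or_reject(confident))
print(classify_or_reject(unsure))
```

A caveat worth keeping in mind: neural networks can be confidently wrong on inputs unlike anything they were trained on, so a threshold alone may not catch every outside sample.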
# 0x3: Problems

● This method also has one major flaw: it can’t be used for runtime
malware detection
○ Its reliance on the sandbox delays the prediction process
○ The malicious payload has likely already been delivered by the time it is detected
● See the paper “Early-stage malware prediction using recurrent neural networks”
○ Only captures the first 5 seconds of runtime behavior
○ Claims to achieve 94% accuracy
# 0x3: Problems

● Vulnerable to adversarial attacks
○ Generate malware samples that can fool the network
● It attacks the nature of the neural network itself
○ How to defend against this attack is ongoing research
● Theoretical solution:
○ Generate lots of adversarial samples, and train the network together with them
# 0x4

Happy Hacking! 😃
https://github.com/shahril96/Malware-Classification-using-Deep-Learning
