Malware Classification Using Deep Learning: Mohd Shahril

# whoami
● Mohd Shahril (@mohd_shahril_96)

● CS graduate in AI :)
● Interested in Information Security and Artificial Intelligence
● Software Developer at Pernec Integrated Network Systems (PINS)
● CTF player with local AleJnd team
● Wargames.my Crew
# What is this talk about?
● My research
○ Part of my bachelor's degree final-year project
● How Deep Learning works
● How to leverage Deep Learning for a malware classification task
# 0x0: What is Deep Learning?
● To be less dependent on experts
● Good pattern searcher
○ Finds a hyperplane that separates two regions of data
○ E.g. Support Vector Machine
● (Very!) good pattern searcher
● Based on Artificial Neural Network

● Was inspired by biological neural networks (brain)
[Figure: Artificial Neural Network with an input layer, a hidden layer, and an output layer. Each line between neurons is a weight and has a real-number value. Each neuron represents a non-linear function of the sum of its inputs.]
● A neural network is a collection of linear combinations
○ Each neuron sums all of its weighted inputs
○ Applies a non-linear transformation (activation function)
○ Forwards the output value to the next neuron
● The cleverness of a neural network is determined by its weight values
○ p/s: weight value = the real number on each of the lines
● Training a neural network means adjusting its weight values
○ Backpropagation is a well-known algorithm for training neural networks
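A minimal sketch of what a neuron does, as described above: sum the weighted inputs, apply an activation function, forward the result. This is not the talk's code. The sigmoid activation, the bias terms, and every weight value are assumptions picked for illustration.

```python
import math

def sigmoid(x):
    """Non-linear activation function (an assumed choice)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias=0.0):
    """Sum the weighted inputs, then apply the non-linear transformation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(total)

def layer(inputs, weight_rows, biases):
    """Run every neuron of one layer on the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical weights: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden_weights = [[0.5, -0.5], [1.0, 1.0]]
hidden_biases = [0.0, 0.0]
output_weights = [[1.0, -1.0]]
output_biases = [0.0]

hidden = layer([1.0, 1.0], hidden_weights, hidden_biases)
output = layer(hidden, output_weights, output_biases)
print(hidden, output)
```

Training would adjust the numbers in `hidden_weights` and `output_weights`; the forward pass itself stays the same.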
● Based on some meth (**math)

● Has ability to learn patterns from data
○ Just another machine learning algorithm, just… more powerful
● Powerful means it can learn very complex patterns from data
● When people say deep learning, they basically mean deep neural networks.
● So, what is a deep neural network?
[Figure: a (normal) neural network next to a (deep) neural network, captioned “Add this moar plz?”]
● A deeper network means low-level details can be captured
● A variety of deep learning architectures have been developed
● They have solved different kinds of problems
○ Computer vision (human detection from CCTV in real time)
○ Speech recognition (YouTube’s auto-captions)
○ Essay generator (yes, we have that)
○ etc.
● These problems were hard to solve before deep learning came along
# 0x1: Training DL to Classify Malware
● How can we do this?

● DL is known as a great pattern searcher
○ Can differentiate between different classes of data with high accuracy
● Scope:
○ Problem: Given a new suspected malware executable, predict which malware family it belongs to.
○ Focus on these malware families:
■ Cerber, Cryptowall, GandCrab, Petya, Sality, Wannacrypt
○ Focus on Windows malware executables
● A malware executable is just another Windows executable
● How can we know if an executable is doing something malicious?
○ psst: It does naughty things when it executes 😣
● The idea is to capture its runtime behavior while it is executing
● Run the executable inside a sandbox and collect its behavior data
○ Using Cuckoo Sandbox
○ Bypasses common malware “protections”, such as packers and mutation
■ Can’t achieve that by relying only on static data
● Data collection (.exe) using VirusTotal
● For each malware family in scope, get 1,000 executables
○ Total of 6,000 .exe
● Run each sample in Cuckoo Sandbox and collect its behavior logs (in JSON)
● Problem: Deep Learning requires its input in a fixed-length format
○ Each malware behaves differently, so the length of its behavior data differs
● Idea: Convert the behavior data into another format that DL can understand
● Used a Natural Language Processing (NLP) technique
○ Based on the 1-gram extraction technique
○ Split every JSON file into words
○ Count the occurrences of each word inside every JSON file
○ Collect the 10,000 most frequent words
○ Map each JSON file's words to binary (word exists or not) based on the most frequent words
[Figure: 1-gram mapping example]
Sample 1’s 1-gram: [“system”, “sections”, “80386”, “Win32”] → Top 1-gram Mapper → [1, 0, 1, 1]
Sample 2’s 1-gram: [“service”, “mutex”, “shellcmds”, “http”] → Top 1-gram Mapper → [0, 0, 1, 0]
[Figure: Malware Dataset → Cuckoo Sandbox → Malware Behaviors → 1-gram extraction → Fixed-size binary string]
● Each malware's behavior is now encoded as a fixed-size binary string of 10,000 bits
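The 1-gram steps can be sketched roughly as below. This is an illustration, not the code from the talk: the tokenizer regex, the function names, and the two tiny reports are all assumptions, and the real vocabulary would hold 10,000 words rather than 4.

```python
import re
from collections import Counter

def one_grams(report_text):
    """Split a raw JSON report into words (assumed tokenizer: alphanumeric runs)."""
    return re.findall(r"[A-Za-z0-9_]+", report_text)

def top_one_grams(reports, k):
    """Count words across all reports and keep the k most frequent."""
    counts = Counter()
    for text in reports:
        counts.update(one_grams(text))
    return [word for word, _ in counts.most_common(k)]

def to_bit_string(report_text, vocabulary):
    """1 if the vocabulary word occurs in the report, else 0."""
    present = set(one_grams(report_text))
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical, heavily shortened reports standing in for Cuckoo JSON logs.
reports = [
    '{"info": "system sections 80386 Win32"}',
    '{"info": "service mutex shellcmds http 80386"}',
]
# Hand-picked vocabulary for readability; top_one_grams(reports, 10_000)
# would build it from the data instead.
vocabulary = ["system", "sections", "80386", "Win32"]
bits = [to_bit_string(text, vocabulary) for text in reports]
print(bits)
```

Every sample ends up with a vector of the same length as the vocabulary, no matter how long its report was.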
● Problem: A fixed-size input of 10,000 values is still large for Deep Learning training
● Curse of dimensionality
● Idea: Apply dimensionality reduction to the binary data
○ Transform the 10,000 binary values into 20 real numbers (high-dimension → low-dimension)
○ Used Deep Autoencoders
■ A special DL architecture for non-linear dimensionality reduction
[Figure: Deep Autoencoders. Encoder layers: [10,000] → [3,000] → [500] → [100] → [20]. Decoder layers: [20] → [100] → [500] → [3,000] → [10,000]. The decoder reconstructs the original input.]
Fixed-size binary string → Deep Autoencoders → 20 real numbers
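To get a feel for the size of this autoencoder, here is a small worked calculation over the layer sizes shown in the figure. It assumes plain fully connected layers with one bias per neuron, which the slides do not confirm, so treat the totals as a rough estimate.

```python
# Layer sizes taken from the figure; the decoder mirrors the encoder.
ENCODER_SIZES = [10_000, 3_000, 500, 100, 20]
DECODER_SIZES = list(reversed(ENCODER_SIZES))

def dense_parameters(sizes):
    """Weights plus biases for a stack of fully connected layers (assumed)."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

encoder_params = dense_parameters(ENCODER_SIZES)
decoder_params = dense_parameters(DECODER_SIZES)
print(encoder_params, decoder_params, encoder_params + decoder_params)
```

Under that assumption almost all of the weights sit between the 10,000-wide and 3,000-wide layers. After training, only the encoder half is needed to turn a bit-string into 20 real numbers.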
● Now comes the fun part: train a Deep Neural Network to classify malware
● Before training the network, the dataset has to be split
○ 70% goes to the training set (4,200 samples)
○ 30% goes to the validation set (1,800 samples)
● The reason for the split is to observe how well the network predicts on unseen data
● Used a relatively simple Deep Neural Network (DNN) architecture
[Figure: Deep Neural Network. Input: 20 real numbers. Output: class probability of malware. Layer sizes shown: [20], [60], [200], [40], [15], [6].]
Example: the network will output the probability of each malware family at the output layer
Cerber 0.00
Cryptowall 0.97
GandCrab 0.02
Petya 0.003
Sality 0.007
Wannacrypt 0.0
Total = 1.0
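One common way to get output probabilities that total 1.0 is a softmax over the six output neurons. The slides do not name the output activation, so softmax is an assumption here, and the raw scores are made up.

```python
import math

FAMILIES = ["Cerber", "Cryptowall", "GandCrab", "Petya", "Sality", "Wannacrypt"]

def softmax(scores):
    """Turn raw output-layer scores into probabilities that sum to 1.0."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores from the six output neurons.
scores = [-3.0, 4.0, 0.1, -1.5, -1.0, -4.0]
probabilities = softmax(scores)
predicted = FAMILIES[probabilities.index(max(probabilities))]

for family, p in zip(FAMILIES, probabilities):
    print(f"{family:<11}{p:.3f}")
print("Predicted family:", predicted)
```

The predicted family is simply the one with the highest probability.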
[Figure: overall pipeline, in three numbered stages]
● Malware Executables → Cuckoo Sandbox
● Pre-processing
○ i) Transform samples to 1-grams
○ ii) Fetch the top-frequent 1-grams
○ iii) Map each sample's 1-grams against the top 1-grams
● Bit-String → Deep Autoencoders → Transformed Bit-String
● Training / Evaluate DL
○ Split Dataset: Training Set (70%), Validation Set (30%)
○ Train the Deep Neural Network, then validate for accuracy
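The 70/30 split in the pipeline can be sketched as follows. Shuffling with a fixed seed and the `split_dataset` name are assumptions for illustration; the talk does not say how the samples were shuffled or whether the split was balanced per family.

```python
import random

def split_dataset(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the samples, then cut it into training and validation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 6,000 encoded malware samples.
samples = list(range(6000))
training_set, validation_set = split_dataset(samples)
print(len(training_set), len(validation_set))
```

The validation set is never shown to the network during training, which is what makes its accuracy a fair estimate for unseen data.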
● Well, that’s it, folks 😃

● Two deep networks need to be trained
○ Deep Autoencoders (for dimension reduction)
○ Deep Neural Network (for malware prediction)
● Accuracy = 96.3% on unseen data
# 0x2: Demo
# 0x3: Problems
● If the given executable is non-malicious, the network will still predict that it belongs to one of these malware families
● The same problem exists if we try to predict malware that does not belong to the original six malware families
● Two solutions that I can think of:
a. If no predicted family probability is greater than 0.95, assume the network can’t classify the executable
b. Create another class, “others”, fill it with outside samples, and train it together with the six families
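Solution (a) is easy to sketch. The 0.95 threshold comes from the slide; the `classify` function name and the second probability set are hypothetical.

```python
THRESHOLD = 0.95  # from the talk: at or below this, treat the sample as unclassifiable

def classify(probabilities, threshold=THRESHOLD):
    """Return the most probable family, or None when no probability exceeds the threshold."""
    family, probability = max(probabilities.items(), key=lambda item: item[1])
    return family if probability > threshold else None

# Probabilities from the talk's output-layer example.
confident = {"Cerber": 0.00, "Cryptowall": 0.97, "GandCrab": 0.02,
             "Petya": 0.003, "Sality": 0.007, "Wannacrypt": 0.0}
# Hypothetical output for a benign or out-of-scope executable.
uncertain = {"Cerber": 0.30, "Cryptowall": 0.25, "GandCrab": 0.20,
             "Petya": 0.10, "Sality": 0.10, "Wannacrypt": 0.05}

print(classify(confident))
print(classify(uncertain))
```

Note that a network can still be confidently wrong on benign input, so a threshold alone may not be enough; that is where solution (b) comes in.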
● This method also has one major flaw: it can’t be used for runtime malware detection
○ Its reliance on the sandbox delays the prediction process
○ The malicious payload has likely already been delivered by the time it is detected
● See the paper “Early-stage malware prediction using recurrent neural networks”
○ Only captures the first 5 seconds of runtime behavior
○ Claims to achieve 94% accuracy
● Vulnerable to adversarial attacks
○ Generate malware samples that can fool the network
● It attacks the nature of the neural network itself
○ How to defend against this attack is still ongoing research
● Theoretical solution:
○ Generate lots of adversarial samples and train the network together with them
# 0x4: Happy Hacking! 😃
https://github.com/shahril96/Malware-Classification-using-Deep-Learning
