Conference Paper · November 2018
DOI: 10.1109/KSE.2018.8573374


Comparison of Three Deep Learning-based
Approaches for IoT Malware Detection
Khanh Duy Tung Nguyen
Le Qui Don Technical University, Hanoi
Email: tungkhanhmta@gmail.com

Son Hai Le
Le Qui Don Technical University, Hanoi
Email: lehaisonmath6@gmail.com

Tran Minh Tuan
University of Engineering and Technology, Vietnam National University, Hanoi
Email: tuantmtb@gmail.com

Anh Phan Viet
Le Qui Don Technical University, Hanoi
Email: anhpv@mta.edu.vn

Mizuhito Ogawa
Japan Advanced Institute of Science and Technology
Email: mizuhito@jaist.ac.jp

Nguyen Le Minh
Japan Advanced Institute of Science and Technology
Email: nguyenml@jaist.ac.jp

Abstract—The development of IoT brings many opportunities but also many challenges. Recently, more and more malware has appeared that targets IoT devices. Machine learning is one of the typical techniques used in malware detection. In this paper, we survey three approaches for IoT malware detection based on the application of convolutional neural networks to different data representations, including sequences, images, and assembly code. The comparison was conducted on the task of distinguishing malware from non-malware. We also analyze the results to assess the pros and cons of each method.

I. INTRODUCTION

The malware threat becomes more serious every year. According to the McAfee report for the first quarter of 2018 [1], there are on average 45 million malicious files, 57 million malicious URLs, and 84 million malicious IP addresses per day. We focus on IoT malware, which has doubled each year since 2015.

For PC malware, commercial antivirus software investigates syntax patterns analyzed from known malicious samples, often with machine learning techniques, e.g., finding characteristic byte n-grams [10]. However, recent PC malware has evolved with advanced obfuscation techniques [20], [15], which make it difficult to identify semantic similarity from syntax patterns [19]. Indeed, Symantec conceded that antivirus software could detect only 45% of PC malware as of May 2015. Dynamic analysis in a sandbox, e.g., CWSandbox [23] or ANUBIS¹, is another typical approach, which observes the behaviors on registers [5], API calls [18], and memory [9]. However, anti-debugging and anti-tampering techniques may recognize the sandbox, and trigger-based behavior, e.g., malicious actions that occur only at a specific time, will rarely be detected. One of the authors developed a binary code analyzer, BE-PUM, based on dynamic symbolic execution on x86 [6]. It overcomes obfuscation techniques and provides precise disassembly of malware. The drawback is the heavy load inherent in dynamic symbolic execution.

Compared to PC malware, IoT malware often does not use obfuscation techniques. Thus, we can apply statistical methods such as machine learning quite directly, and can easily disassemble samples with commercial disassemblers such as IDA Pro.

This paper compares three different convolutional neural networks on 1,000 real IoT malware samples for x86, collected by IoTPOT² of Yokohama National University. The first model adopts the features of fixed-sized byte sequences, which is basic and easy to implement. The second uses the features of fixed-sized color images with the AlexNet CNN; from the entropy of a binary code, the image is generated along a Hilbert curve. The last model adopts the features of assembly instruction sequences, which are generated by objdump together with the abstraction of register names and memory addresses. Different from a standard CNN, this model accepts variable-sized sequences; thus, the training requires more time and effort. We compare the effectiveness of the models experimentally and address future directions.

The rest of the paper is organized as follows. Section 2 gives preliminaries. Three CNN models for detecting IoT malware are presented in Section 3. The experiments are presented in Section 4. Section 5 concludes with a discussion.

1 http://anubis.seclab.tuwien.ac.at
2 https://github.com/IoTPOT/IoTPOT

II. PRELIMINARIES

A typical convolutional neural network (CNN) includes three types of layers: convolution, pooling, and fully-connected. The success of the network mostly depends on the convolutional layers, which are responsible for automatically learning data features from low to high levels of abstraction. Next, we describe a simple CNN from the input to the output layer.
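The forward pass of such a simple CNN (convolution, pooling, fully-connected, softmax) can be sketched in NumPy as follows; the layer sizes and the use of global max pooling here are illustrative assumptions, not the configurations of the models compared later.

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1-D convolution: slide filter w over sequence x."""
    h = len(w)
    return np.array([np.dot(x[i:i + h], w) + b for i in range(len(x) - h + 1)])

def relu(z):
    return np.maximum(0, z)

def global_max_pool(z):
    """Reduce a variable-length feature map to a single value."""
    return z.max()

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal(100)             # toy input sequence
filters = rng.standard_normal((8, 5))    # 8 filters of width 5 (arbitrary choices)
biases = rng.standard_normal(8)

# Convolution + ReLU + global max pooling -> fixed-size feature vector
features = np.array([global_max_pool(relu(conv1d(x, w, b)))
                     for w, b in zip(filters, biases)])

# Fully-connected layer + softmax -> class probabilities (2 classes)
W, b = rng.standard_normal((2, 8)), rng.standard_normal(2)
probs = softmax(W @ features + b)
print(probs)
```

Note that the global max pooling step is what makes the feature vector independent of the input length, which is exactly the property exploited later for variable-sized assembly sequences.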
A. Feature extraction and data structures

There are many kinds of features for detecting whether a file is malicious. File features can be extracted from contents and execution traces/logs, and stored in a data structure, which is classified as either fixed-sized or variable-sized.

A fixed-sized data structure means that different files are represented in a data structure of the same size, e.g., vectors with the same dimension, or images. A variable-sized data structure means that different files are left in various sizes, e.g., sequences, trees, and graphs. With variable-sized data structures, classical machine learning models need to be adapted to fit them.

B. Convolutional layer

All three models use a layer called the convolutional layer, which is the core block of a Convolutional Neural Network (CNN) [14], [13]. CNNs are known to be effective, especially for image classification. The basic idea of the convolution is to combine neighborhoods to emphasize local characteristics. The intuition is that localized concepts among points close to each other share more correlations. For example, in an image, pixels next to each other are likely similar unless there is an edge, and an edge is an important feature. In text, words next to each other probably share the same semantic context. In speech data, audio signals do not change abruptly.

According to [7], many studies have tried to generalize CNNs to other data structures, such as acoustic data [8], videos [12], and Go boards [21].

The convolution layers are composed either sequentially or in parallel. The sequential composition transfers the output of a convolution layer to the input of the next layer in the network. The parallel composition, as in [4], combines the outputs of several convolution layers into a single output, which is intended to avoid the vanishing gradient problem caused by sigmoid activation functions.

C. Pooling layer

A convolutional layer is often followed by a pooling layer, whose function summarizes the neighborhoods to reduce the size of the representation and the number of parameters in the network, and to control over-fitting. Examples of the neighborhoods are close pixels of an image, adjacent nodes of a graph, and close time regions of acoustic data. The typical functions are the max, the mean, and the sum, which output the maximum, the mean (average), and the sum value of each group, respectively.

There are two typical types of pooling layers: a local pooling layer and a global pooling layer. The operation of the local pooling layer is depicted in Fig. 1, in which the output size depends on w, h, the size of the sliding window, and the padding. Recently, the global pooling layer has been considered to minimize overfitting by drastically reducing the number of parameters in the model. For example, Fig. 2 shows the reduction of the dimensions h ∗ w ∗ d to 1 ∗ 1 ∗ d in a global pooling layer. It reduces the data by mapping each h ∗ w feature map to its mean value. It is often used at the backend of a CNN with dense layers to fix the shape of the data, instead of flattening. Another use of the global pooling layer is to transform variable-sized data into fixed-sized data as in Fig. 2, in which the size of the output is always 1 ∗ 1 ∗ d, independent of w and h.

Fig. 1. The summarization of the local pooling layer

Fig. 2. The summarization of the global pooling layer

D. Fully-connected layer

The output from the convolutional layers represents high-level features in the data. While the output can be flattened and connected directly to the output layer, adding a fully-connected layer is a cheap way of learning these features. Neurons in the fully-connected layer have full connections to all activations in the previous layer, as in regular neural networks. The fully-connected layer is often associated with a softmax layer (Fig. 3), which outputs the probability of each class.

III. THREE APPROACHES

A. CNN on byte sequences (CNN_SEQ)

Fig. 4 shows CNN_SEQ, inspired by MalConv [17]. CNN_SEQ is simple and easy to implement and scale.

1) Feature extraction and data representation: Let X = {0, ..., 255} be the integer representation of a byte. A binary code is composed of k bytes of data (x1, ..., xk ∈ X):

• if the binary code is shorter than k bytes, zeros are padded as a suffix until k bytes;
• if the binary code is larger than k bytes, bytes are selected from the sections with the executable and writable permissions, e.g., (.init, .text, .data), with lower priority for read-only data segments, e.g., (.rodata, .bss). They are extracted by the "readelf --sections" tool (Fig. 5).

Each byte xj is weighted as zj = Φ(xj) (where the mapping Φ is learned by the network during training), and the weights compose a matrix Z.

Fig. 3. Combination of the fully-connected and the softmax layers

Fig. 4. Architecture of CNN_SEQ for IoT malware detection

Fig. 5. Example of the data segment extraction

2) Description of the model architecture: Two parallel convolutional layers are prepared for processing the matrix Z, in which the activation functions are

f(x) = max(0, x) (ReLU, Rectified Linear Unit) and f(x) = 1/(1 + e^(−x)) (Sigmoid)

They are combined through the gating [4], which multiplies element-wise the matrices computed by the two layers. This avoids the vanishing gradient problem caused by the sigmoid activation function. The result is forwarded to a temporal max pooling layer, which performs a 1-dimensional max pooling, followed by a fully-connected layer with the ReLU activation. This results in a 2-dimensional vector (x0, x1). To avoid over-fitting, we follow [17], which applies DeCov regularization [2] to minimize the cross-covariance. The last step, the softmax activation, evaluates the probability ai (for i = 0, 1) as follows, where a0 and a1 are the probabilities of being goodware and malware, respectively. If a1 ≥ 0.5, we conclude malware.

ai = exp(xi) / (exp(x0) + exp(x1)) for i = 0, 1    (1)

B. CNN on color images (CNN_IMG)

For IoT malware detection, the conversion to a greyscale image was previously tried, and the accuracy reached 94% [22]. Instead, we convert a binary code into a fixed-sized color image, and AlexNet is used for the data classification.

1) Feature extraction and data representation:

A. Calculate the entropy of a binary file: Similar to CNN_SEQ, let X = {0, ..., 255} be the representation of a byte. First, we compute the sequence of entropies of the byte sequence. The entropy shows how much the data is disordered, and we use the Shannon entropy

H(x) = − Σ_{i=1}^{255} P(xi) log_b P(xi)    (2)

where x is a sliding window, xi is the number of occurrences of the byte value i in x, and P is the ratio (i.e., |xi|/|x|). We set the size of the sliding window to 32x32 and the base b to 10.

B. Convert entropy to RGB color: The entropy is converted to a color following BINVIS³:

r = 255 ∗ F(x − 0.5),  g = 0,  b = 255 ∗ x²    (3)

where x is the entropy, F(x) = (4x − 4x²)⁴, and r, g, b are the red, green, and blue values, respectively.

C. Convert the color sequence to an image: A space-filling curve fills the 2-dimensional unit square with a bent line, e.g., Zigzag, Z-order, or Hilbert (Fig. 6).

3 https://github.com/cortesi/scurve
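The computation in steps A and B can be sketched in Python as follows. The non-overlapping 1024-byte (32x32) windows, the normalization of the entropy to [0, 1] by its maximum log_10(256), and the clamping of the color values are our assumptions for illustration, since the paper states Eqs. (2)–(3) without these details.

```python
import math
from collections import Counter

def shannon_entropy(window, base=10):
    """Eq. (2): H(x) = -sum P(x_i) log_b P(x_i) over byte values in the window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

def entropy_to_rgb(x):
    """Eq. (3): map a normalized entropy x in [0, 1] to an RGB triple."""
    F = lambda t: (4 * t - 4 * t * t) ** 4
    clamp = lambda v: max(0, min(255, int(v)))   # clamping is an assumption
    return (clamp(255 * F(x - 0.5)), 0, clamp(255 * x * x))

data = bytes(range(256)) * 8          # toy stand-in for a binary code
W = 32 * 32                           # 32x32 = 1024-byte window (assumption)
H_MAX = math.log(256, 10)             # highest possible entropy, for normalization

pixels = []
for i in range(0, len(data) - W + 1, W):   # non-overlapping windows (assumption)
    h = shannon_entropy(data[i:i + W])
    pixels.append(entropy_to_rgb(h / H_MAX))
print(pixels[0])  # maximally disordered window -> bright magenta (255, 0, 255)
```

The resulting pixel sequence would then be laid out along the Hilbert curve in step C to form the square image.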
Fig. 6. Curves: (a) Zigzag, (b) Z-order, (c) Hilbert

We choose the Hilbert curve for its locality preservation, i.e., it keeps elements that are close in 1 dimension as near as possible in 2 dimensions. The function drawmapsquare in BINVIS is used as the Hilbert curve by setting the options: map = square, size (of the image) = 224, color = entropy. Fig. 7 shows an example visualization of busybox in Ubuntu.

Fig. 7. Visualization of the binary file busybox with the Hilbert curve

2) Description of the model architecture: After converting a binary file to a square color image, AlexNet is applied [11]. The model uses square convolutions sliding on the image and square pooling to combine the outputs of neuron clusters. The architecture and the details of each layer in AlexNet are shown in Fig. 8 and Fig. 9, respectively.

Fig. 8. AlexNet architecture

Fig. 9. Details of the layers in AlexNet

C. CNN on assembly sequences (CNN_ASM)

For IoT malware detection, previous studies use handcrafted features (e.g., n-grams and API calls) and different machine learning algorithms. Instead, we directly analyze the assembly code obtained by a commercial disassembler. The disassembled code is abstracted on register names and memory addresses (which are often changed by the offset) and is tailored as a variable-sized vector. Fig. 10 shows the overview of the process, which was inspired by [16].

Fig. 10. CNN on assembly instructions for IoT malware detection

1) Feature extraction and data representation:

• Disassembling binary files: The first step disassembles binary executable files into assembly code. By reading the file headers, all of our IoT malware samples are in the ELF file format on multiple CPU architectures (Table I). To disassemble them, we use the objdump command in Ubuntu, which is a multi-architecture disassembler. Among them, we target only x86 in the experiments.

TABLE I
CPU ARCHITECTURE STATISTICS

Architecture   Number
MIPS           2814
ARM            2774
i386           2353
PowerPC        1247
sh-linux       1199
x86            1196
m68k           1153
SPARC          1140

• Vector representations: An instruction may vary in its name and operands, some of which may change with the offset and the use of different registers. To abstract away such differences, the operands for block names, register names, and literal values are simplified to the symbols "name", "reg", and "val", respectively. For instance, the instruction addq $32, %rsp is converted to addq val reg. As in NLP techniques, we encode each word as a 30-dimensional real-valued vector, which is chosen randomly. Then, the i-th instruction is encoded as

x̄i = (1/C) Σ_{j=1}^{C} x̄i,j    (4)

where C is the number of words and x̄i,j is the encoding of the j-th word in the i-th instruction, and the sum is computed element-wise. Then, an assembly sequence with n instructions is the concatenation x̄1:n = x̄1.x̄2.···.x̄n.
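The abstraction and the encoding of Eq. (4) can be sketched as follows; the operand classification rules (a '$'-prefixed or hexadecimal literal becomes "val", a '%'-prefixed operand becomes "reg", anything else "name") are a simplified assumption about AT&T-syntax objdump output, not the authors' exact rules.

```python
import re
import numpy as np

def abstract_instruction(ins):
    """Replace literal and register operands with the symbols 'val' and 'reg'."""
    mnemonic, _, operands = ins.partition(' ')
    tokens = [mnemonic]
    for op in filter(None, (o.strip() for o in operands.split(','))):
        if op.startswith('$') or re.fullmatch(r'-?(0x)?[0-9a-f]+', op):
            tokens.append('val')      # immediate or address literal
        elif op.startswith('%'):
            tokens.append('reg')      # register operand
        else:
            tokens.append('name')     # block names / labels
    return tokens

rng = np.random.default_rng(0)
vocab = {}                            # word -> random 30-dimensional vector

def embed(word):
    if word not in vocab:
        vocab[word] = rng.standard_normal(30)
    return vocab[word]

def encode_instruction(ins):
    """Eq. (4): the instruction vector is the element-wise mean of its word vectors."""
    return np.mean([embed(w) for w in abstract_instruction(ins)], axis=0)

print(abstract_instruction('addq $32, %rsp'))     # ['addq', 'val', 'reg']
x = np.stack([encode_instruction(i) for i in ['addq $32, %rsp', 'movl %eax, %ebx']])
print(x.shape)                                    # (2, 30)
```

Stacking the per-instruction vectors, as in the last line, yields the variable-sized n × 30 input that the convolutional layers below consume.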

2) Description of the model architecture:

Convolutional layers: The convolutional layers automatically learn features from instruction sequences. We design a set of feature detectors (filters) to capture local dependencies in the original sequence. Each filter is a convolution with a sliding window that produces a feature map, i.e., at position i, the feature value c^l_i of the l-th filter is:

c^l_i = f(W_l · x̄i:i+h−1 + b_l)    (5)

where W_l ∈ R^{h×k}, x̄i:i+h−1 = x̄i.x̄i+1.···.x̄i+h−1, f is an activation function, and b_l is a bias.

In general, deeper neural networks potentially achieve better performance [13]. However, using many layers leads to more parameters, which require large datasets for training. In this work, two layers of convolutions are prepared for our 1,000 IoT malware samples compiled for x86.

Pooling layers: Often, a pooling layer is inserted between successive convolutional layers to reduce the dimensions of the feature maps. In our case, the input sequence has up to thousands of instructions, and the feature map length is similar. We choose max pooling, which is expected to work better [3]. In the model, the intermediate convolutions are followed by local max-pooling with a filter size of 2. For the last convolution, global max-pooling is applied to generate the vector representation for the corresponding view, in which each element is the result of pooling one feature map.

IV. EXPERIMENTS

A. Dataset

We prepare the dataset for the experiments, with both IoT malware and goodware in the ELF format.

• 15,000 IoT malware samples are supplied by Prof. Katsunari Yoshioka (Yokohama National University). They run on various platforms, such as ARM, MIPS, and x86. For the experiments, we select 1,000 malware samples of x86 binaries.

• 1,000 goodware samples are taken from x86 binaries of Ubuntu 16.04.

We mix all of them into a single dataset. Then, we randomly split it into 5 parts and evaluate by 5-fold cross-validation.

B. Comparison and discussion

The three approaches are compared on several aspects: the pre-processing, the training data and the execution time, the extensibility, and the accuracy.

• Pre-processing: CNN_SEQ and CNN_IMG are quite simple, as they only perform data extraction. However, in CNN_ASM, the disassembly process depends heavily on CPU architectures. Fortunately, IoT malware rarely uses obfuscation techniques compared to PC malware.

• Training data generation and execution time: The byte sequences and the color images are fixed-sized data structures, and we can set the input size, which is subject to the tradeoff between the accuracy and the execution time of the training. For instance, CNN_SEQ has a balanced tradeoff at a length of 2M bytes.

Length of bytes   Accuracy   Training time
5M bytes          91.6%      > 2 hours
2M bytes          90.58%     ∼ 1 hour
1M bytes          83.86%     < 1 hour

In contrast, the assembly code sequence is a variable-sized data structure. Instead of setting the input size, we set the number of convolution layers, which is subject to the tradeoff between the accuracy and the execution time of the training (equivalently, the number of parameters to train). Since the current dataset for our preliminary experiments contains 1,000 malware samples (thus 800 training samples for each fold), we use fairly shallow models with two layers.

• Extensibility: All models can be easily adapted to other malware datasets. We also tried an LSTM model for byte sequences (in place of CNN_SEQ), and the result was lower than with the CNN. We observe that LSTM seems to work well for files ≤ 0.5MB, whereas the average size of our IoT malware samples is 1.0MB.

• Accuracy: We estimate the accuracy by the average hit rate. The accuracy of each method is shown in Table II. CNN_IMG and CNN_ASM achieve higher accuracy than CNN_SEQ. Fig. 11 shows the convergence of each method, which is generally good.
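The 5-fold evaluation protocol described in the dataset section can be sketched as follows; the majority_baseline classifier is a hypothetical stand-in for any of the three CNN models, and the dataset here is a placeholder.

```python
import numpy as np

def five_fold_indices(n, seed=0):
    """Randomly split n sample indices into 5 roughly equal parts."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, 5)

def cross_validate(X, y, train_and_score):
    """Average accuracy over the 5 folds; each fold serves as the test set once."""
    folds = five_fold_indices(len(X))
    scores = []
    for k, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        scores.append(train_and_score(X[train], y[train], X[test], y[test]))
    return float(np.mean(scores))

def majority_baseline(Xtr, ytr, Xte, yte):
    """Stand-in model: predict the majority class of the training labels."""
    pred = np.bincount(ytr).argmax()
    return float(np.mean(yte == pred))

X = np.zeros((2000, 1))                  # 1,000 malware + 1,000 goodware placeholders
y = np.array([1] * 1000 + [0] * 1000)    # 1 = malware, 0 = goodware
print(cross_validate(X, y, majority_baseline))
```

Reporting the mean over the five held-out folds, as above, is how the per-fold and average numbers in Table II are obtained.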
We observe two more points among the results.

– As Fig. 12 shows, the color images of malware and non-malware are visually different, and non-malware mostly looks darker. This means that the entropy of malware is higher than that of non-malware.

– We take the goodware samples from Ubuntu, which are generally much smaller (the average is 0.07 MB) than the malware samples (the average is 1.0 MB). The accuracy of CNN_ASM may be biased by this size difference.

TABLE II
COMPARISON OF THE ACCURACY

Fold   CNN_SEQ   CNN_IMG   CNN_ASM (NoOp)   CNN_ASM (Ops)
1      86.3      100       100              100
2      82.8      100       100              100
3      97.5      100       100              98.25
4      97.5      100       100              100
5      88.8      100       100              100
Avg.   90.58     100       100              99.65

Fig. 11. Average accuracy transition during the training process

Fig. 12. Dataset: (a) malware and (b) goodware

V. CONCLUSION

This paper compared three CNN-based approaches for IoT malware detection on 1,000 IoT malware samples for x86. The approaches vary in their input data structures, i.e., byte sequences, color images, and assembly instruction sequences. Among them, the first two data structures are fixed-sized, and the last is variable-sized. Experimental results showed that every approach works quite well, probably partially because IoT malware does not use obfuscation techniques. We also observed and compared them on several criteria, e.g., the complexity of the pre-processing step, the training data and the execution time, the extensibility, and the accuracy. Our experiments are preliminary, and we would like to try larger sets of malware (as well as platforms other than x86) to confirm our current observations.

REFERENCES

[1] McAfee Labs threat report. Technical report, March 2018.
[2] Michael Cogswell, Faruk Ahmed, Ross Girshick, Larry Zitnick, and Dhruv Batra. Reducing overfitting in deep networks by decorrelating representations. arXiv preprint arXiv:1511.06068, 2015.
[3] Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. Very deep convolutional networks for text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1107–1116, 2017.
[4] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083, 2016.
[5] Mahboobe Ghiasi, Ashkan Sami, and Zahra Salehi. Dynamic malware detection using registers values set analysis. In Information Security and Cryptology (ISCISC), 2012 9th International ISC Conference on, pages 54–59. IEEE, 2012.
[6] Nguyen Minh Hai, Mizuhito Ogawa, and Quan Thanh Tho. Obfuscation code localization based on CFG generation of malware. In International Symposium on Foundations and Practice of Security, pages 229–247. Springer, 2015.
[7] Yotam Hechtlinger, Purvasha Chakravarti, and Jining Qin. Convolutional neural networks generalization utilizing the data graph structure. 2016.
[8] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
[9] Xuxian Jiang, Xinyuan Wang, and Dongyan Xu. Stealthy malware detection through VMM-based out-of-the-box semantic view reconstruction. In Proceedings of the 14th ACM Conference on Computer and Communications Security, pages 128–138. ACM, 2007.
[10] Jeffrey O. Kephart et al. A biologically inspired immune system for computers. In Artificial Life IV: Proceedings of the Fourth International Workshop on the Synthesis and Simulation of Living Systems, pages 130–139, 1994.
[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[12] Quoc V. Le, Will Y. Zou, Serena Y. Yeung, and Andrew Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3361–3368. IEEE, 2011.
[13] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[14] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[15] Philip O'Kane, Sakir Sezer, and Kieran McLaughlin. Obfuscation: The hidden malware. IEEE Security & Privacy, 9(5):41–47, 2011.
[16] Anh Viet Phan and Minh Le Nguyen. Convolutional neural networks on assembly code for predicting software defects. In Intelligent and Evolutionary Systems (IES), 2017 21st Asia Pacific Symposium on, pages 37–42. IEEE, 2017.
[17] Edward Raff, Jon Barker, Jared Sylvester, Robert Brandon, Bryan Catanzaro, and Charles Nicholas. Malware detection by eating a whole exe. arXiv preprint arXiv:1710.09435, 2017.
[18] Chandrasekar Ravi and R. Manoharan. Malware detection using Windows API sequence and machine learning. International Journal of Computer Applications, 43(17):12–16, 2012.
[19] Kevin A. Roundy and Barton P. Miller. Binary-code obfuscations in prevalent packer tools. ACM Computing Surveys (CSUR), 46(1):4, 2013.
[20] Monirul Sharif, Andrea Lanzi, Jonathon Giffin, and Wenke Lee. Automatic reverse engineering of malware emulators. In Security and Privacy, 2009 30th IEEE Symposium on, pages 94–109. IEEE, 2009.
[21] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[22] Jiawei Su, Danilo Vasconcellos Vargas, Sanjiva Prasad, Daniele Sgandurra, Yaokai Feng, and Kouichi Sakurai. Lightweight classification of IoT malware based on image recognition. arXiv preprint arXiv:1802.03714, 2018.
[23] Carsten Willems, Thorsten Holz, and Felix Freiling. Toward automated dynamic malware analysis using CWSandbox. IEEE Security & Privacy, 5(2), 2007.
