Nguyen Le Minh
Japan Advanced Institute of Science and Technology
Email: nguyenml@jaist.ac.jp
Abstract—The development of IoT brings many opportunities but also many challenges. Recently, more and more malware has appeared that targets IoT devices. Machine learning is one of the typical techniques used in malware detection. In this paper, we survey three approaches for IoT malware detection based on the application of convolutional neural networks to different data representations, including byte sequences, images, and assembly code. The comparison was conducted on the task of distinguishing malware from non-malware. We also analyze the results to assess the pros and cons of each method.

I. INTRODUCTION

The malware threat becomes more serious every year. According to the McAfee report for the first quarter of 2018 [1], there are on average 45 million malicious files, 57 million malicious URLs, and 84 million malicious IP addresses per day. We focus on IoT malware, which has doubled each year since 2015.

For PC malware, commercial antivirus software investigates syntax patterns that are extracted from known malicious samples, often with machine learning techniques, e.g., finding characteristic byte n-grams [10]. However, recent PC malware has evolved with advanced obfuscation techniques [20], [15], which make it difficult to identify semantic similarity from syntax patterns [19]. Indeed, Symantec admitted that antivirus software could detect only 45% of PC malware as of May 2015. Dynamic analysis in a sandbox, e.g., CWSandbox [23] or ANUBIS¹, is another typical approach, which observes behaviors on registers [5], API calls [18], and memory [9]. However, anti-debugging and anti-tampering techniques may recognize the sandbox, and trigger-based behavior, e.g., malicious actions that occur only at a specific timing, will rarely be detected. One of the authors developed a binary code analyzer, BE-PUM, based on dynamic symbolic execution on x86 [6]. It overcomes obfuscation techniques and provides precise disassembly of malware. The drawback is the heavy load inherent to dynamic symbolic execution.

Compared to PC malware, IoT malware often does not use obfuscation techniques. Thus, we can rather directly apply statistical methods such as machine learning, and easily disassemble samples using commercial disassemblers such as IDA Pro.

This paper compares three different convolutional neural networks on 1,000 real IoT malware samples for x86, collected by IoTPOT² of Yokohama National University. The first model adopts the features of fixed-sized byte sequences, which is basic and easy to implement. The second uses the features of fixed-sized color images fed to an AlexNet CNN; from the entropy feature of a binary code, the image is generated along a Hilbert curve. The last model adopts the features of assembly instruction sequences, which are generated by objdump with the abstraction of register names and memory addresses. Different from a standard CNN, this model accepts variable-sized sequences; thus, the training requires more time and effort. We compare the effectiveness of the models experimentally and address future directions.

The rest of the paper is as follows. Section 2 gives preliminaries. Three CNN models for detecting IoT malware are presented in Section 3. The experiments are presented in Section 4. Section 5 concludes with a discussion.

¹ http://anubis.seclab.tuwien.ac.at
² https://github.com/IoTPOT/IoTPOT

II. PRELIMINARIES

A typical convolutional neural network (CNN) includes three types of layers: convolution, pooling, and fully-connected. The success of the network mostly depends on the convolutional layers, which are responsible for automatically learning data features from low to high levels of abstraction. Next, we describe a simple CNN from the input layer to the output layer.
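To make the three layer types concrete before they are described individually, the following is a minimal NumPy sketch of a CNN from the input to the output layer: one convolutional layer, one global pooling layer, and one fully-connected softmax layer. All shapes, filter counts, and the ReLU activation are illustrative choices, not the paper's configuration.

```python
import numpy as np

def conv1d(x, filters, bias):
    """Valid 1-D convolution with ReLU.
    x: (length, channels); filters: (width, channels, n_filters); bias: (n_filters,)."""
    width, _, n_filters = filters.shape
    out_len = x.shape[0] - width + 1
    out = np.empty((out_len, n_filters))
    for i in range(out_len):
        window = x[i:i + width]  # slice of the input covered by the sliding window
        out[i] = np.tensordot(window, filters, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)  # ReLU activation

def global_avg_pool(x):
    """Global pooling: collapse the spatial axis, (length, d) -> (d,)."""
    return x.mean(axis=0)

def dense_softmax(x, w, b):
    """Fully-connected layer followed by a softmax over classes."""
    z = x @ w + b
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

def tiny_cnn(x, params):
    """Input -> convolution -> global pooling -> fully-connected softmax -> output."""
    f, fb, w, wb = params
    return dense_softmax(global_avg_pool(conv1d(x, f, fb)), w, wb)
```

With untrained (here, zero) dense weights the output is uniform over the classes; in practice all four parameter arrays would be fitted by gradient descent.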
A. Feature extraction and data structures

There are many kinds of features for detecting whether a file is malicious. File features can be extracted from contents and from execution traces/logs, and are stored in a data structure, which is classified as either fixed-sized or variable-sized.

A fixed-sized data structure means that different files are represented in data structures of the same size, e.g., vectors with the same dimension, or images. A variable-sized data structure means that different files are left in various sizes, e.g., sequences, trees, and graphs. With variable-sized data structures, classical machine learning models need to be adapted to fit them.

B. Convolutional layer

All three models use a layer called the convolutional layer, which is the core block of a convolutional neural network (CNN) [14], [13]. CNNs are known to be effective, especially in image classification. According to [7], many studies have tried to generalize CNNs to other data structures, such as acoustic data [8], videos [12], and Go boards [21].

The convolution layers are composed either sequentially or in parallel. The sequential composition transfers the output of a convolution layer to the input of the next layer in the network, as in [4]. The parallel composition combines the outputs of several convolution layers into a single output, which is intended to avoid the vanishing gradient problem caused by sigmoid activation functions.

C. Pooling layer

A convolutional layer is often followed by a pooling layer, whose function summarizes neighborhoods to reduce the size of the representation and the number of parameters in the network, and to control over-fitting. Examples of neighborhoods are close pixels of an image, adjacent nodes of a graph, and close time regions of acoustic data. The typical functions are the max, the mean, and the sum, which output the maximum, the mean (average), and the sum of each group, respectively.

There are two typical types of pooling layers: the local pooling layer and the global pooling layer. The operation of the local pooling layer is depicted in Fig. 1, in which the output size depends on w and h, the size of the sliding window, and the padding. Recently, the global pooling layer has been considered to minimize overfitting by drastically reducing the number of parameters in the model. For example, Fig. 2 shows the reduction of the dimensions from h ∗ w ∗ d to 1 ∗ 1 ∗ d in a global pooling layer; it reduces the data by mapping each h ∗ w feature map to its mean value. It is often used at the back end of a CNN, before dense layers, to shape the data instead of flattening. Another use of the global pooling layer is to transform variable-sized data into fixed-sized data as in Fig. 2, in which the size of the output is always 1 ∗ 1 ∗ d, independent of w and h.

Fig. 2. The summarization of the global pooling layer.

D. Fully-connected layer

The output from the convolutional layers represents high-level features in the data. While the output can be flattened and connected directly to the output layer, adding a fully-connected layer is a cheap way of learning these features. Neurons in the fully-connected layer have full connections to all activations in the previous layer, as in regular neural networks. The fully-connected layer is often followed by a softmax layer (Fig. 3), which outputs the probability of each class.

Fig. 3. [figure: fully connected layers feeding a softmax layer]

III. THREE APPROACHES

A. CNN on byte sequences (CNN SEQ)

Fig. 4 shows CNN SEQ, inspired by MalConv [17]. CNN SEQ is simple and easy to implement and scale.

1) Feature extraction and data representation: Let X = {0, ..., 255} be the integer representation of a byte. A binary code is composed of the k bytes of data (x1, ..., xk ∈ X),
• if the binary code is shorter than k bytes, zeros are padded as the suffix until k bytes.
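The byte-sequence representation just described can be sketched as follows. This is a minimal illustration: k is a free parameter, and the truncation of binaries longer than k bytes is an assumption here, not a rule stated by the paper.

```python
def bytes_to_fixed_sequence(data: bytes, k: int) -> list:
    """Represent a binary as k integers in 0..255:
    zero-pad shorter files as a suffix; truncate longer ones to k bytes (assumed)."""
    seq = list(data[:k])           # bytes -> integers, at most k of them
    return seq + [0] * (k - len(seq))  # suffix zero-padding up to length k
```

In MalConv-style models [17], these integers are typically mapped through a learned embedding before being fed to the convolutional layers.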
Fig. 7. Visualization of the binary file busybox with the Hilbert curve.

2) Description of the model architecture: After converting a binary file to a square color image, AlexNet is applied [11]. The model uses square convolutions sliding on the image and square pooling to combine the outputs of neuron clusters. The architecture and the details of each layer in AlexNet are shown in Fig. 8 and Fig. 9, respectively.

TABLE I
CPU ARCHITECTURE STATISTICS

Architecture   Number
MIPS           2814
ARM            2774
i386           2353
PowerPC        1247
sh-linux       1199
x86            1196
m68k           1153
SPARC          1140
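The image construction used by the second model (entropy features laid out along a Hilbert curve, cf. Fig. 7) can be sketched as follows. The window size and the image side are illustrative assumptions, not the paper's parameters.

```python
import math
from collections import Counter

def hilbert_d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) on an n*n grid (n a power of two)."""
    x = y = 0
    s = 1
    while s < n:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:                      # rotate/flip the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

def window_entropy(chunk):
    """Shannon entropy (bits per byte) of a byte window."""
    counts = Counter(chunk)
    total = len(chunk)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def binary_to_entropy_image(data, side=64, window=32):
    """Lay per-window entropies of a binary along a Hilbert curve in a side*side grid."""
    img = [[0.0] * side for _ in range(side)]
    for d in range(side * side):
        chunk = data[d * window:(d + 1) * window]
        if chunk:                        # remaining cells stay at 0.0 for short files
            x, y = hilbert_d2xy(side, d)
            img[y][x] = window_entropy(chunk)
    return img
```

Because the Hilbert curve keeps consecutive windows in adjacent pixels, nearby regions of the binary stay nearby in the image, which is what makes square convolutions meaningful on it; a colorization step (not shown) would then turn the entropy grid into the color image fed to AlexNet.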