
Malware Detection with LSTM using Opcode Language
Renjie Lu,
University of Chinese Academy of Sciences
Beijing, China
Email: lurenjie17@mails.ucas.ac.cn

Abstract—Nowadays, with the booming development of the Internet and the software industry, more and more malware variants are designed to perform various malicious activities. Traditional signature-based detection methods cannot detect variants of malware. In addition, most behavior-based methods require a secure and isolated environment to perform malware detection, which is vulnerable to contamination. In this paper, similar to natural language processing, we propose a novel and efficient approach to static malware analysis, which can automatically learn the opcode sequence patterns of malware. We propose modeling malware as a language and assess the feasibility of this approach. First, we use the disassembly tool IDA Pro to obtain the opcode sequence of each sample. Then the word embedding technique is used to learn the feature vector representation of opcodes. Finally, we propose a two-stage LSTM model for malware detection, which uses two LSTM layers and one mean-pooling layer to obtain the feature representations of the opcode sequences of malware. We perform experiments on a dataset that includes 969 malware and 123 benign files. In terms of malware detection and malware classification, the evaluation results show that our proposed method can achieve an average AUC of 0.99 and, in the best case, an average AUC of 0.987, respectively.

Index Terms—Malware detection and classification, Static analysis, Opcode language, Long short-term memory

I. INTRODUCTION

Malicious software, referred to as malware, is designed to perform various malicious activities, such as stealing private information, gaining root authority, disabling the targeted host and so on. Meanwhile, with the booming development of the Internet and the software industry, more and more variants of malware are emerging almost everywhere. According to a 2018 McAfee threats report [1], the total number of malware samples has grown almost 34% over recent quarters to more than 774 million samples. It can be seen that the number of malware samples continues to increase. Hence, malware detection remains an attractive and meaningful issue.

A large amount of research has been published on how to detect malware. Malware detection can be simply considered a binary classification problem, and traditional anti-virus software usually relies on static signature-based detection methods [2], which have a significant limitation: some minor changes in malware can change the signature, so malware can easily evade signature-based detection through encryption, obfuscation or packing. Meanwhile, zero-day malware can also evade this detection approach. Dynamic analysis is not susceptible to code obfuscation techniques [3], so it is a more effective malware detection method. Dynamic behavior-based malware detection methods [4] [5] usually need a secure and controlled environment, such as a virtual machine, simulator, sandbox, etc. The behavior analysis is then performed using the interaction information with the environment, such as API calls and DLL calls. Although these techniques have been widely studied, they have also been shown to be insufficiently efficient when applied to large datasets [6]. Dynamic behavior-based malware detection methods are quite time-consuming and require considerable attention to protect the operating environment from contamination.

At present, a number of malware detection methods combined with machine learning techniques have been proposed. Reference [7] first proposed a malware detection method using data mining techniques, which uses three different types of static features: PE header, string sequence, and byte sequence. Kolter and Maloof [8] proposed to use n-grams instead of byte sequences and compared the performance of naive Bayes, decision trees and support vector machines for malware detection. Later, artificial neural networks [9] [10] were also used for malware detection. Meanwhile, there are also some novel ideas for malware detection. Both [11] and [12] utilize image processing techniques to detect malware. In terms of malware detection, these previous works have achieved good performance. However, most of these methods manually extract malware features which are then used to train a machine learning classifier.

To reduce the cost of manual feature engineering, in this paper we propose a novel and efficient method to detect whether a Windows executable file is malware. First, we use the disassembly tool IDA Pro to obtain the assembly format file of each executable file. Next, we develop an algorithm to extract the opcode sequence from each assembly format file. Then, similar to natural language processing (NLP), word embedding technology [13] is used to learn the feature vector representation of opcodes, and long short-term memory (LSTM) [14] is used to automatically learn the opcode sequence patterns of malware. To increase the invariance of the local feature representation, we also introduce a mean-pooling layer after the second LSTM layer. To verify the effectiveness of our proposed method, we perform a series of experiments on a dataset that includes 969 malware and 123 benign files.
In the experimental section, we evaluate the effect of the second LSTM layer on malware detection performance, and we also make a detailed performance comparison with other related work. In terms of malware detection and malware classification, the evaluation results show that our proposed method can achieve an average AUC of 0.99 and an average AUC of 0.987, respectively.

In summary, we make the following contributions in this paper:

• We present a novel and efficient malware detection approach, which makes use of word embedding technology and LSTM to automatically learn the opcode sequence patterns of malware. It can greatly reduce the cost of manual feature engineering.

• We propose and implement a two-stage LSTM model for malware detection, which uses two LSTM layers and one mean-pooling layer to automatically obtain a comprehensive feature representation of malware.

• We conduct a series of evaluation experiments covering malware detection and malware classification. The experimental results demonstrate the effectiveness of our proposed method.

The rest of this paper is organized as follows. Related work on neural networks is discussed in Section II. Section III describes the proposed malware detection framework. Experiments and evaluation are presented in Section IV. Section V concludes the paper and discusses future work.

II. RELATED WORK

A. Word Embedding

The Recurrent Neural Network Based Language Model (RNNLM) [15] is a language model using a recurrent neural network, which can predict the next word from previous input. Later, Mikolov [13] proposed the CBOW and Skip-gram language models for efficient estimation of word representations in vector space. The CBOW model predicts the current word from the given context. In contrast, the Skip-gram model predicts the context from the given current word. Each word is then converted to a feature vector which stores the semantic information of the word, and the correlation between words can also be calculated using these feature vectors.

B. Recurrent Neural Network

A neural network (NN) is a kind of mathematical model which consists of many neuron layers. The Recurrent Neural Network (RNN) is a typical NN structure, and it has a special memory unit which can retain the state information of the previous hidden layer. RNNs show good results in various fields which use sequential data, such as natural language processing and speech recognition. A large amount of research using RNNs for malware detection has been published. El-Bakry [16] proposed that time delay neural networks could be used for malware classification, but the paper did not carry out any experiments to validate the idea. Pascanu et al. [17] proposed a malware detection method using RNNs. However, [17] uses API calls as the original feature for malware detection. Tobiyama et al. [18] proposed a malware detection method with a deep neural network using process behavior. However, because of the discovery of the vanishing and exploding gradient problem, RNNs became unpopular until the LSTM was proposed. LSTM [14] is a special type of RNN, which can greatly mitigate the vanishing and exploding gradient problem.

III. MALWARE DETECTION METHODOLOGY

In this section, we introduce the proposed malware detection approach in detail, which is similar to natural language processing. As shown in Figure 1, the malware detection methodology can be simply divided into a data processing stage and a modeling stage. In the data processing stage, we first use the disassembly tool IDA Pro, which can resolve an executable file into an Intel x86 assembly format file. Next, we develop an algorithm to extract the opcode sequence from each assembly format file. In the modeling stage, word embedding technology is used to learn the correlation between opcodes and to obtain the feature vector representation of each opcode. Then, a two-stage LSTM model is used to learn the opcode sequence patterns of each sample and to generate the predictive model. Finally, we use the predictive model to perform malware detection on the testing set in order to evaluate its performance.

A. Data Processing

In order to obtain the features of the opcode sequence, we need to extract the opcode sequence from each assembly format file. Typically, this type of assembly format file contains four basic predefined segments: the .text segment, .idata segment, .rdata segment, and .data segment. Since only the .text segment stores program instructions and the remaining segments are data segments, we only consider the contents of the .text segment when extracting the opcode sequence. Meanwhile, we also find some meaningless opcodes such as 'dd', 'db', 'align' and so on. To obtain opcodes that are really beneficial for malware detection, we need to filter out these meaningless opcodes. The resulting opcode sequence reflects the program execution logic of the corresponding executable file. The pseudocode of this opcode extraction algorithm is shown in Algorithm 1.

Algorithm 1 Extract Opcode Sequence
Input: Each assembly format file
Output: Corresponding opcode sequence
1: pattern ← predefined matching pattern for extracting opcodes
2: filter ← {'align', 'dd', 'db', ...}
3: file ← open(assembly format file)
4: for eachline in file do
5:   if eachline starts with '.text' then
6:     result ← match(pattern, eachline)
7:     if result is not null and result not in filter then
8:       add result to the corresponding opcode sequence
9:     end if
10:   end if
11: end for
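A minimal Python sketch of Algorithm 1 might look like the following. The regular expression used to match mnemonics in IDA Pro .asm output and the exact contents of the filter set are assumptions for illustration; the paper only names 'align', 'dd' and 'db' as filtered tokens.

```python
import re

# Assumed pattern: in IDA Pro .asm output, instruction lines in the .text segment
# look like ".text:00401000        push    ebp"; the mnemonic is the first
# lower-case token after the address field.
PATTERN = re.compile(r'^\.text:[0-9A-Fa-f]+\s+([a-z]+)\b')
# Meaningless "opcodes" to drop; only 'align', 'dd' and 'db' are named in the
# paper, the rest of the set is assumed.
FILTER = {'align', 'dd', 'db', 'dw', 'dq', 'proc', 'endp'}

def extract_opcode_sequence(asm_path):
    """Return the opcode sequence of one assembly format file (cf. Algorithm 1)."""
    opcodes = []
    with open(asm_path, 'r', errors='ignore') as f:
        for line in f:
            if line.startswith('.text'):        # only the .text segment holds instructions
                match = PATTERN.match(line)
                if match and match.group(1) not in FILTER:
                    opcodes.append(match.group(1))
    return opcodes
```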
Fig. 1. The overview of the malware detection methodology. The data processing stage includes the conversion of an executable file to a .asm file and the extraction of the opcode sequence. The modeling stage consists of the word embedding and the generation of the malware detection model.

B. Opcode Representations in Vector Space

Common word representations in NLP are the one-hot representation, bag-of-words, or n-grams. However, the largest defect of these local representations is that any two words are isolated, so they cannot reflect the semantic correlation between words. We therefore use the word embedding technique to automatically learn the feature vector representation of opcodes. As shown in Figure 2(a), the Skip-gram model tries to predict the context of an input opcode according to the word window size. In contrast, the CBOW model predicts the current word from the given context, as shown in Figure 2(b). If the word window size is set to i, then i is the maximum distance between the current opcode and the predicted opcode.

Fig. 2. Two word embedding techniques: Skip-gram and CBOW.

As the model cannot directly process opcodes in the form of strings, each opcode is first converted to a one-hot representation. Hence, we compute frequency statistics over all opcode sequence files and filter out low-frequency opcodes to build an opcode vocabulary. In the end, the opcode vocabulary we created contains 391 different and valuable opcodes. Accordingly, the one-hot representation of an opcode is a 391-dimensional vector which contains only one non-zero element, like [0, 0, 0, 1, 0, ..., 0], and each opcode gets a unique one-hot representation.

We then use the Gensim Python library [19] for word embedding to obtain the feature vector representation of opcodes. After multiple experimental evaluations, we set the dimension of the feature vector to 100. In this paper, we use the CBOW model to implement our proposed malware detection method. In the experimental evaluation, we will discuss the impact of different word window sizes and different word embedding techniques (Skip-gram and CBOW) on malware detection accuracy in detail. Next, we use a two-stage LSTM model to learn the comprehensive feature representation of the entire opcode sequence.
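The opcode embeddings described above could be trained roughly as follows with the Gensim library [19]. This is a sketch under assumptions: the parameter names follow recent Gensim releases (older versions use size instead of vector_size), and the min_count cutoff used to drop low-frequency opcodes is not specified in the paper.

```python
from gensim.models import Word2Vec

# 'sequences' is a list with one opcode list per .asm file, e.g. the output of
# extract_opcode_sequence() applied to every sample.
def train_opcode_embedding(sequences, window=10, use_cbow=True):
    return Word2Vec(
        sentences=sequences,
        vector_size=100,          # 100-dimensional opcode vectors, as in the paper
        window=window,            # word window size; 10 performs best in Table III
        sg=0 if use_cbow else 1,  # sg=0 selects CBOW, sg=1 selects Skip-gram
        min_count=5,              # assumed cutoff for filtering low-frequency opcodes
        workers=4,
    )

# Example usage:
# model = train_opcode_embedding(sequences)
# push_vec = model.wv['push']                  # 100-dimensional feature vector
# print(model.wv.most_similar('push', topn=5)) # semantically correlated opcodes
```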

C. Feature Representation by LSTM

1) Long short-term memory: As a typical and improved recurrent neural network, long short-term memory (LSTM) [14] is suitable for processing and predicting time series problems. The LSTM model introduces a new structure called a memory cell, which is composed of three main elements, an input gate, a forget gate and an output gate, to control the transmission of information, as seen in Figure 3. Because of this special structure, LSTM can alleviate the vanishing and exploding gradient problem.

Fig. 3. Illustration of a standard memory cell.

Let x_t denote the input to the memory cell at time t; let W_i, W_f, W_c, W_o, U_i, U_f, U_c, U_o and V_o be weight matrices; let b_i, b_f, b_c and b_o be bias vectors. Formally, the equations below describe how a memory cell is updated at time t.

• First, we calculate the value of the forget gate f_t and the value of the input gate i_t, and compute the candidate cell state \tilde{C}_t.

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)    (1)
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)    (2)
\tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)    (3)

• Second, given the value of the input gate, the value of the forget gate and the candidate state \tilde{C}_t, we can calculate the new state of the memory cell, C_t, at time t.

C_t = i_t * \tilde{C}_t + f_t * C_{t-1}    (4)

• With the new state of the memory cell, we can calculate the value of the output gate and the output of the memory cell.

o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o C_t + b_o)    (5)
h_t = o_t * \tanh(C_t)    (6)

In the above formulas, \sigma is the logistic sigmoid function, so the values of the gating vectors i_t, f_t and o_t lie in [0, 1]; \tanh is the hyperbolic tangent function and * is the pointwise multiplication operation.
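For reference, Equations (1)-(6) translate directly into a single update step. The NumPy sketch below only illustrates the memory-cell arithmetic; it is not the Keras implementation used in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, C_prev, W, U, b, V_o):
    """One memory-cell update following Eqs. (1)-(6). W, U and b are dicts holding
    the weight matrices and bias vectors of the 'f', 'i', 'c' and 'o' gates."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])              # Eq. (1)
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])              # Eq. (2)
    C_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])          # Eq. (3)
    C_t = i_t * C_tilde + f_t * C_prev                                  # Eq. (4)
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + V_o @ C_t + b['o'])  # Eq. (5)
    h_t = o_t * np.tanh(C_t)                                            # Eq. (6)
    return h_t, C_t
```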
2) Two-stage LSTM: We conduct detailed statistics on the frequency of each opcode appearing in the dataset. The statistical results are shown in Figure 4 and Figure 5. Figure 4 presents the average number of opcodes for each type of sample and the top 10 most used opcodes in the dataset. Figure 5 presents the 10 most frequent opcodes for each type of sample.

Fig. 4. The average number of opcodes for each type of sample and the top 10 opcodes in the dataset.

Fig. 5. The 10 most frequent opcodes for each type of sample in the dataset.

Based on the statistical results, we observe that most instructions are executed sequentially when the program runs; that is, jump instructions and function call instructions occupy only a small proportion. This process is very similar to natural language processing. Hence, as shown in Figure 6, we can make an analogy between a long article and an assembly instruction file. We also make an equivalent analogy between an assembly function and a complete sentence in the article. Similarly, each instruction is analogous to a word in the sentence.

Fig. 6. An equivalent analogy between NLP and malware detection.

The positional relationship between words constitutes a sentence, and the context between sentences constitutes an article. Similarly, the positional relationship between instructions constitutes an assembly function, and the mutual calls between functions form an assembly instruction file. Although instructions consist of opcodes and operands, in this paper we only use the opcodes, which represent the specific operational behaviors, to replace instructions. In the end, the experimental results also show that this analogy is indeed feasible.

Therefore, we propose a two-stage LSTM model for malware detection, which can handle opcode sequences of different lengths. Figure 7 shows the structure of the two-stage LSTM. We use two LSTM layers and one mean-pooling layer to obtain the feature representation of each opcode sequence file. In this paper, every opcode is represented as a 100-dimensional vector. The input of the first LSTM layer is all the word embeddings in an opcode sequence, and its output is the corresponding function vector representation. Because the input of an LSTM at the current timestep contains both the output of the previous timestep and the input of the current timestep, we can use the output of the last timestep in each function as the function vector representation. All the function vectors are then the input of the second LSTM layer.

In addition, we add a mean-pooling layer after the second LSTM layer, which can enhance the invariance of the feature representation. Assuming that the second LSTM layer produces a representation sequence h_1, h_2, ..., h_{AveLen}, the mean-pooling layer averages over all timesteps, resulting in the representation h_{average}. That is, the hidden vectors obtained at every timestep in the second LSTM layer are averaged to acquire the feature representation of the entire opcode sequence. Because of this two-stage LSTM model, we can not only consider the positional relationship between opcodes, but also combine semantic information between functions to obtain a comprehensive feature representation of the opcode sequence file. Finally, this representation is fed as the feature to a softmax layer, which outputs the probability of the executable file belonging to each class.
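Since Section IV states that the framework is implemented with TensorFlow [20] and Keras [21], a hedged Keras sketch of the two-stage architecture described above and shown in Figure 7 is given below. The hidden sizes, the padding lengths MAX_FUNCS and MAX_OPS, and the use of TimeDistributed are illustrative assumptions, not the authors' exact implementation.

```python
from tensorflow.keras import layers, models

# Assumed padding scheme: each file is padded/truncated to MAX_FUNCS functions and
# each function to MAX_OPS opcodes; EMB_DIM = 100 follows the paper, the rest are
# illustrative values.
MAX_FUNCS, MAX_OPS, EMB_DIM, NUM_CLASSES = 64, 200, 100, 6

inputs = layers.Input(shape=(MAX_FUNCS, MAX_OPS, EMB_DIM))

# First-stage LSTM: runs over the opcode embeddings of every function and keeps
# the last-timestep output as that function's vector representation.
func_vectors = layers.TimeDistributed(layers.LSTM(128))(inputs)

# Second-stage LSTM: runs over the sequence of function vectors and returns the
# hidden state at every timestep.
hidden_seq = layers.LSTM(128, return_sequences=True)(func_vectors)

# Mean-pooling over all timesteps yields the file-level feature representation.
file_repr = layers.GlobalAveragePooling1D()(hidden_seq)

# Softmax layer outputs the class probabilities (Eq. (7)); NUM_CLASSES = 2 for the
# binary malware/benign setting and 6 for the multi-classification setting.
outputs = layers.Dense(NUM_CLASSES, activation='softmax')(file_repr)

model = models.Model(inputs, outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```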
Fig. 7. The model consists of two LSTM hidden layers, a mean-pooling layer and a softmax layer.

The softmax function is calculated as follows, where N is the number of executable file classes and S_i is the probability of belonging to the i-th class:

S_i = \frac{\exp(v_i)}{\sum_{j=1}^{N} \exp(v_j)}    (7)

IV. EXPERIMENTS AND EVALUATIONS

In this section, we present the experimental results to demonstrate the malware detection performance of our proposed method. We use the Python language to implement this malware detection framework based on TensorFlow [20] and Keras [21], and all the experiments are conducted on a computer with a 2.8 GHz Intel Core i7 CPU, an NVIDIA GeForce GTX 1050 GPU and 8 GB of memory.

A. Data Collection

The raw collection consists of 969 malware samples and 123 benign samples. Part of them are from the malware samples provided by Microsoft [22]. For each malware sample, two file formats are provided: the hexadecimal representation of the file's binary content and the corresponding assembly (.asm) format file; we only use the .asm file generated by IDA Pro. The others are executable files we collected, including benign applications such as browsers and system programs, as well as various malware collected from public malware websites such as MalwareDB [23] and VirusShare [24]. By reference to Microsoft's classification format and VirusTotal's analysis results [25], these malware samples are divided into the following categories: Worm, Adware, Backdoor, Trojan and Downloader. Then, we use the disassembly tool IDA Pro to obtain the corresponding .asm files to form the entire dataset. The details of the dataset can be seen in Table I. Finally, we randomly choose 70% of the samples as the training set and 30% of the samples as the testing set.

TABLE I
THE DETAILS ON THE DATASET

Collections    Numbers
Worm           155
Adware         248
Backdoor       442
Trojan         48
Downloader     76
Benign         123

B. Evaluation Metric

For the sake of convenience, we first introduce the confusion matrix for malware detection, as shown in Table II.

TABLE II
CONFUSION MATRIX

                           Predicted class
                           Malware    Benign
Actual class   Malware     TP         FN
               Benign      FP         TN

In the evaluation section, we use the following metrics: true positive rate (TPR), false positive rate (FPR) and accuracy (ACC). TPR is equal to TP/(TP+FN), FPR is equal to FP/(FP+TN), and ACC is equal to (TP+TN)/(TP+FN+FP+TN). An excellent model will have a higher TPR, a higher ACC, and a lower FPR. Hence, we also use the receiver operating characteristic (ROC) curve in order to measure both TPR and FPR comprehensively. The ROC curve typically features TPR on the Y axis and FPR on the X axis. That is, the top left corner of the plot is the ideal point, with an FPR of zero and a TPR of one, so a larger area under the curve (AUC) is usually better.
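For concreteness, the metrics above can be computed from the predicted malware probabilities as in the sketch below; scikit-learn is not mentioned in the paper and is used here only for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_true, y_score, threshold=0.5):
    """y_true: 1 for malware, 0 for benign; y_score: predicted malware probability."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn)                   # TPR = TP / (TP + FN)
    fpr = fp / (fp + tn)                   # FPR = FP / (FP + TN)
    acc = (tp + tn) / (tp + fn + fp + tn)  # ACC = (TP + TN) / total
    auc = roc_auc_score(y_true, y_score)   # area under the ROC curve
    return tpr, fpr, acc, auc
```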
C. Experimental Results

We conduct extensive experiments to evaluate the detection performance of our proposed malware analysis method. The experiments are divided into two parts. On one hand, we collectively refer to the Worm, Adware, Backdoor, Trojan and Downloader samples as malware; in other words, it is a binary classification problem, either malware or benign file. On the other hand, we conduct more detailed malware classification experiments, which is a multi-classification problem. In both cases, we first use the training set to train the two-stage LSTM model (the predictive model), and then use the trained model to predict the testing set. For the sake of generality, all of the models mentioned in this paper are used to predict the testing set after training for 100 epochs on the training set.

1) Preferences: First, we discuss the impact of different word window sizes and different word embedding techniques (Skip-gram and CBOW) on malware detection performance. Table III lists the experimental results. In general, the malware detection performance obtained with the CBOW model as the word embedding technique is better than that obtained with the Skip-gram model. It can also be seen that setting the word window size to 10 and choosing the CBOW model as the word embedding technique is the best option. Hence, we use the CBOW model as the word embedding technique and set the word window size to 10 in the following experiments.

2) The effect of the second LSTM layer: Figure 8 and Figure 9 show the impact of the second LSTM layer on malware detection performance. It can be seen that the malware detection performance of the two-stage LSTM model is always better than that of the 'no second LSTM layer' model for both the binary classification problem and the multi-classification problem.

Fig. 8. Binary classification: the performance comparison of the two-stage LSTM model and the 'no second LSTM layer' model.

Fig. 9. Multi-classification: the performance comparison of the two-stage LSTM model and the 'no second LSTM layer' model.

3) The comparison with other approaches: Figure 10 and Figure 11 show the performance comparison of the two-stage LSTM model and other related malware detection approaches. The related approaches include a convolutional neural network (CNN), an RNN and a multilayer perceptron (MLP). It can be seen that the performance of the two-stage LSTM model is significantly better than that of the RNN for both binary classification and multi-classification. It can also be seen that the two-stage LSTM model is better than the MLP and slightly better than the CNN for both binary classification and multi-classification. Although the performance of the CNN and the two-stage LSTM model is comparable, the time cost of training the CNN model is much greater than that of training the two-stage LSTM model because the CNN model usually has a very large number of parameters.

Fig. 10. Binary classification: the performance comparison of the two-stage LSTM model and other related malware detection approaches.

Based on the experimental results, we can conclude that the proposed method performs excellently on malware detection and malware classification.
TABLE III
DETECTION ACCURACY OF DIFFERENT WORD WINDOW SIZES AND DIFFERENT WORD EMBEDDING TECHNIQUES

Word window size   Skip-gram (ACC)                               CBOW (ACC)
                   Binary-classification  Multi-classification   Binary-classification  Multi-classification
5                  96.34%                 92.07%                 96.95%                 89.94%
7                  96.95%                 91.16%                 96.95%                 91.16%
10                 97.26%                 92.99%                 97.87%                 94.51%
15                 97.18%                 91.77%                 97.56%                 90.55%
30                 95.73%                 91.46%                 95.73%                 92.38%

Fig. 11. Multi-classification: the performance comparison of the two-stage LSTM model and other related malware detection approaches.

V. CONCLUSIONS

In this paper, we propose a novel and efficient malware detection method. Similar to natural language processing, the word embedding technique and LSTM are used to automatically learn the correlation between opcodes and the feature representation of opcode sequences, respectively. In addition, to increase the invariance of the local feature representation, we also introduce a mean-pooling layer after the second LSTM layer. To verify the availability of the proposed method, we perform a series of experiments on a dataset that includes 969 malware and 123 benign files. The experimental results demonstrate the effectiveness of our proposed method in malware detection and malware classification.

On the other hand, instructions consist of opcodes and operands, yet we only use the opcodes to conduct the experiments. Although the opcodes can reflect the specific operational behavior of a program, the operands may also reveal some sensitive differences between malware and benign files. Therefore, the use of operands for malware detection will be our future work.
REFERENCES

[1] "McAfee Labs threats report, September 2018," https://www.mcafee.com/enterprise/en-us/assets/reports/rp-quarterly-threats-sep-2018.pdf.
[2] I. Santos, Y. K. Penya, J. Devesa, and P. G. Bringas, "N-grams-based file signatures for malware detection," ICEIS (2), vol. 9, pp. 317–320, 2009.
[3] A. Moser, C. Kruegel, and E. Kirda, "Limits of static analysis for malware detection," in Computer Security Applications Conference (ACSAC 2007), Twenty-Third Annual. IEEE, 2007, pp. 421–430.
[4] L. Martignoni, E. Stinson, M. Fredrikson, S. Jha, and J. C. Mitchell, "A layered architecture for detecting malicious behaviors," in International Workshop on Recent Advances in Intrusion Detection. Springer, 2008, pp. 78–97.
[5] C. Willems, T. Holz, and F. Freiling, "Toward automated dynamic malware analysis using CWSandbox," IEEE Security & Privacy, vol. 5, no. 2, 2007.
[6] P. Li, L. Liu, D. Gao, and M. K. Reiter, "On challenges in evaluating malware clustering," in International Workshop on Recent Advances in Intrusion Detection. Springer, 2010, pp. 238–255.
[7] M. G. Schultz, E. Eskin, F. Zadok, and S. J. Stolfo, "Data mining methods for detection of new malicious executables," in Proceedings of the 2001 IEEE Symposium on Security and Privacy (S&P 2001). IEEE, 2001, pp. 38–49.
[8] J. Z. Kolter and M. A. Maloof, "Learning to detect and classify malicious executables in the wild," Journal of Machine Learning Research, vol. 7, pp. 2721–2744, 2006.
[9] G. E. Dahl, J. W. Stokes, L. Deng, and D. Yu, "Large-scale malware classification using random projections and neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 3422–3426.
[10] J. Saxe and K. Berlin, "Deep neural network based malware detection using two dimensional binary program features," in 2015 10th International Conference on Malicious and Unwanted Software (MALWARE). IEEE, 2015, pp. 11–20.
[11] L. Nataraj, S. Karthikeyan, G. Jacob, and B. Manjunath, "Malware images: visualization and automatic classification," in Proceedings of the 8th International Symposium on Visualization for Cyber Security. ACM, 2011, p. 4.
[12] L. Nataraj, V. Yegneswaran, P. Porras, and J. Zhang, "A comparative assessment of malware classification using binary texture analysis and dynamic analysis," in Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence. ACM, 2011, pp. 21–30.
[13] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[14] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," 1999.
[15] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[16] H. M. El-Bakry, "Fast virus detection by using high speed time delay neural networks," Journal in Computer Virology, vol. 6, no. 2, pp. 115–122, 2010.
[17] R. Pascanu, J. W. Stokes, H. Sanossian, M. Marinescu, and A. Thomas, "Malware classification with recurrent networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 1916–1920.
[18] S. Tobiyama, Y. Yamaguchi, H. Shimada, T. Ikuse, and T. Yagi, "Malware detection with deep neural network using process behavior," in 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC), vol. 2. IEEE, 2016, pp. 577–582.
[19] R. Rehurek and P. Sojka, "Software framework for topic modelling with large corpora," in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010.
[20] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean et al., "TensorFlow: a system for large-scale machine learning," in OSDI, vol. 16, 2016, pp. 265–283.
[21] F. Chollet et al., "Keras," https://github.com/fchollet/keras, 2015.
[22] R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, and M. Ahmadi, "Microsoft malware classification challenge," arXiv preprint arXiv:1802.10135, 2018.
[23] "MalwareDB," http://malwaredb.malekal.com/.
[24] "VirusShare," https://virusshare.com/.
[25] "VirusTotal," https://www.virustotal.com/#/home.
