Android Malware Family Classification Based On Sensitive Opcode Sequence

IEEE Symposium on Computers and Communications (ISCC)
Jianguo Jiang 1,2, Song Li 1,2, Min Yu 1,2,*, Gang Li 3,*, Chao Liu 1, Kai Chen 1, Hui Liu 4, Weiqing Huang 1
1 Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
2 School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
3 School of Information Technology, Deakin University, VIC, Australia
4 School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
* Corresponding author
yumin@iie.ac.cn, gang.li@deakin.edu.au
Abstract: Android malware family classification is an advanced task in Android malware analysis, detection and forensics. Existing methods and models have achieved some success in Android malware detection, but their accuracy and efficiency still fall short of expectations, especially for multi-class classification with imbalanced training data. To address these challenges, we propose an Android malware family classification model that analyzes the code's specific semantic information based on the sensitive opcode sequence. We construct a sensitive semantic feature, the sensitive opcode sequence, from opcodes, sensitive APIs, sensitive STRs and sensitive actions, and use it to generate a semantic related vector for Android malware family classification. In addition, for minority families we adopt an oversampling technique based on the sensitive opcode sequence. Finally, we evaluate our method on the Drebin dataset, selecting the top 40 malware families for experiments. The experimental results show that the Total Accuracy and Average AUC (Area Under Curve) reach 99.50% and 98.86%, at 45.17 s per Android malware sample, and that the results remain good as the number of malware families increases.

Keywords: Android malware, family classification, sensitive opcode, semantic

I. INTRODUCTION
Android has been the main target of mobile threats, and many novel families and variants of malware have been discovered [1]. These pose a serious challenge to most anti-malware systems. At the same time, Zhou et al. [2] found that 86% of Android malware samples are repackaged, produced by injecting malicious components into legitimate apps. Accordingly, a promising approach to malware analysis is to classify Android malware into families, which can be further exploited for malware detection and inspection. Android malware family classification is also an essential step for depicting malware behaviors in digital forensics.

Various solutions have been proposed for Android malware family classification. Most methods are based on static analysis and have achieved some success, but family classification remains challenging because of the limitations of existing features. For example, AndroidManifest.xml based methods [3-5] extract permission, component and intent information for analysis, and their performance deteriorates when permissions are abused by developers [6]. Fan et al. [7] and Zhang et al. [8] classified Android families via the API call graph, with the potential risk of path explosion. Suarez et al. [9] proposed Dendroid, which automatically classifies Android malware into families based on code structures. However, the accuracy of these methods is limited because polymorphic variants of Android malware can evade detection through name obfuscation. Opcode based methods have shown potential in Android malware family classification: they overcome issues such as name obfuscation and path explosion, and they carry a large amount of code content and structural information. Android malware family analysis methods using opcodes are mostly based on statistics and modeling [10-12], but they overlook some semantic information, which can cause a high false positive rate. Opcodes carry rich semantic and structural information, but they have not been fully explored for Android malware family analysis, for two reasons: first, analysts use only the opcode as a feature and ignore sensitive semantic information; second, the large number of opcodes in each Android app brings extensive computation cost.

To address the challenges above, we transform the family classification task into a text semantic classification task and analyze the code's specific semantic information based on the sensitive opcode sequence. We construct a sensitive semantic feature, the sensitive opcode sequence, based on three observations. First, Android malware usually invokes various sensitive APIs and executes commands with sensitive STRs (such as "/system/bin/sh") to perform malicious activities. Second, Android malware often listens for a specific action through components and, once triggered, begins to execute the malicious load. Third, malware and its variants in the same family tend to perform malicious acts through similar patterns. The sensitive opcode sequence is therefore generated using sensitive APIs, sensitive STRs and sensitive actions. Because we keep only the code contained in methods with sensitive elements, we claim that our method can represent
malware behaviors well and reduce the computational complexity of generating vectors. Moreover, we adopt an oversampling technique based on the sensitive opcode sequence to increase the number of training vectors. Our contributions are as follows:

- We propose a model for Android malware family classification that analyzes the code's specific semantic information. We construct the sensitive opcode sequence, which represents this semantic information, to characterize Android malware behaviors, and convert the sensitive opcode sequence to semantic related vectors for classification.
- We adopt a technique for classifying minority families based on the sensitive opcode sequence. Experiments show that with this oversampling technique the Average AUC can be improved significantly without seriously affecting the Total Accuracy.

The remainder of the paper is organized as follows. Section II gives the necessary preliminaries for our study. Section III introduces our model. Section IV presents the experimental results and compares them with related work. Section V concludes the paper and outlines future work.

II. PRELIMINARIES
In this section, we introduce the concepts behind the features we use to generate the sensitive opcode sequence.

A. Dalvik and Opcode
As the heart of the Android system, ART (Android runtime) is a fast, ahead-of-time compiled runtime with modern garbage collection designed to scale. Android applications are compiled to Dalvik bytecode and run with ART [13].

Opcodes describe the set of instructions that operate on data. Because Dalvik is register-based, its instruction set differs from that of the JVM and is closer to x86 assembly instructions. According to their function, the opcodes can be divided into the categories shown in Figure 1.
[Figure 1 groups Dalvik opcodes by function: data definition (e.g., const/4, const-wide/16, const-string), object operation (e.g., new-instance, instance-of), data calculation (add-type, sub-type, mul-type, div-type, rem-type, and-type, or-type, xor-type, shl-type, shr-type), field operation (aget, aput, iget, iput and their -wide, -object, -boolean, -byte and -char variants), method call (invoke-virtual, invoke-super, invoke-direct, invoke-static, invoke-interface, their /range forms, invoke-polymorphic, invoke-custom), data operation (move, move/from, move/16, move-object, move-result), array operation (new-array, fill-array-data), comparison (cmpl-float, cmpg-float, cmpl-double), method return (return-void, return, return-wide, return-object), jump (goto, packed-switch, if-test), data conversion (int-to-long, float-to-int), synchronization, and throw.]
Figure 1: The opcodes of Dalvik virtual machine
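The grouping in Figure 1 can be approximated by looking at the prefix of each smali mnemonic. The sketch below collapses mnemonics into the eight coarse operating codes that Section III.C uses (MOVE, IF, INVOKE, CMP, GET, PUT, GO, RETURN). The paper names these eight codes but does not publish the exact mapping, so the prefix table, the inclusion of the static-field opcodes (sget, sput) and the function names are assumptions made for illustration only.

```python
# Assumed prefix table: the paper lists the eight coarse codes, not the mapping.
GET_PREFIXES = ("aget", "iget", "sget")
PUT_PREFIXES = ("aput", "iput", "sput")
OTHER_PREFIXES = (
    ("move", "MOVE"),
    ("if-", "IF"),
    ("invoke", "INVOKE"),
    ("cmp", "CMP"),
    ("goto", "GO"),
    ("return", "RETURN"),
)


def coarse_token(mnemonic):
    """Map one smali mnemonic to a coarse operating code, or None if dropped."""
    m = mnemonic.lower()
    if m.startswith(GET_PREFIXES):
        return "GET"
    if m.startswith(PUT_PREFIXES):
        return "PUT"
    for prefix, token in OTHER_PREFIXES:
        if m.startswith(prefix):
            return token
    return None  # e.g. const-string, new-instance: not one of the eight codes


def tokens_for_method(smali_lines):
    """Turn the body of one smali method into a list of coarse operating codes."""
    tokens = []
    for line in smali_lines:
        stripped = line.strip()
        # Skip blanks, directives (.method, .locals), comments and labels (:cond_0).
        if not stripped or stripped.startswith((".", "#", ":")):
            continue
        token = coarse_token(stripped.split()[0])
        if token is not None:
            tokens.append(token)
    return tokens


example = [
    ".method public run()V",
    '    const-string v0, "/system/bin/sh"',
    "    invoke-static {}, Ljava/lang/Runtime;->getRuntime()Ljava/lang/Runtime;",
    "    move-result-object v1",
    "    if-eqz v1, :cond_0",
    "    iget-object v2, p0, Lcom/example/A;->b:Ljava/lang/String;",
    "    goto :goto_0",
    "    :cond_0",
    "    return-void",
    ".end method",
]
print(" ".join(tokens_for_method(example)))
```

Mnemonics that match none of the prefixes are simply dropped here; whether the original system drops them or folds them into one of the eight codes is not stated in the text.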
B. Sensitive Elements
Android provides sensitive APIs for diverse functions, which can be exploited by malware to perform malicious acts. For example, the API android.telephony.SmsManager.sendTextMessage() can be called to send short messages, and the API java.lang.Runtime.exec() can be invoked to execute an external command. Sensitive STRs are used to execute a specific shell script through commands such as "/system/bin/sh".

Broadcast is a widely used mechanism for transmitting information among apps. Broadcast Receiver is the component that filters broadcasts and responds to those that pass the filter. To send a broadcast, the information used for filtering, such as action and category, is loaded into an intent object, which is then sent by calling sendOrderedBroadcast() or sendStickyBroadcast(). When the intent is sent, every registered Broadcast Receiver checks whether its registered intentFilter matches the transmitted intent. Upon a match, the onReceive() method of the Broadcast Receiver is invoked. Based on manual analysis of many Android malware samples, we observe that malware typically chooses a suitable time to execute its malicious load by listening for sensitive actions with a Broadcast Receiver. In addition, the Activity and Service components also use the actions in their intentFilters to match intent messages.

III. SYSTEM ARCHITECTURE
In this section, we give the system overview and describe the main processes, key methods and related concepts of our model.

A. Overall Architecture
Figure 2 gives the architecture of our model. It consists of four main modules: pre-processing, generating the sensitive opcode sequence, generating the feature vector, and training the classifier. Module 1 unzips the APK file and decompiles it to smali files. Here, smali is the interpretative language of Dalvik. Module 2 extracts opcodes, sensitive APIs, sensitive actions and sensitive STRs from the smali files and generates the sensitive opcode sequence. Module 3 generates the feature vector from the sensitive opcode sequence text; at the same time, the model uses an oversampling technique to generate similar sensitive opcode sequences for families with a small number of samples. Module 4 trains the classifier on these feature vectors to produce the Android malware family classification model.
Figure 2: System overview
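To make Module 2 of the overview concrete, the following minimal sketch keeps only the methods that contain a sensitive element and emits one line of tokens per kept method. It is an illustration under stated assumptions, not the authors' implementation: the three-entry SENSITIVE dictionary and its numbered tags (API1, API2, STR1) are hypothetical stand-ins for the paper's full element lists, the prefix table for the eight coarse operating codes is assumed, and the variable-alignment step of Algorithm 1 is omitted.

```python
import re

# Hypothetical element dictionary; the paper's real lists are much larger.
SENSITIVE = {
    "Landroid/telephony/SmsManager;->sendTextMessage": "API1",
    "Ljava/lang/Runtime;->exec": "API2",
    "/system/bin/sh": "STR1",
}

# Assumed mapping from mnemonic prefixes to the eight coarse operating codes.
COARSE = (
    ("move", "MOVE"), ("if-", "IF"), ("invoke", "INVOKE"), ("cmp", "CMP"),
    ("aget", "GET"), ("iget", "GET"), ("sget", "GET"),
    ("aput", "PUT"), ("iput", "PUT"), ("sput", "PUT"),
    ("goto", "GO"), ("return", "RETURN"),
)


def split_methods(smali_text):
    """Return the text of every .method ... .end method block."""
    return re.findall(r"^\.method.*?^\.end method", smali_text, flags=re.S | re.M)


def method_sequence(method_text):
    """Tokens for one method, or None if it holds no sensitive element."""
    tokens, hit = [], False
    for line in method_text.splitlines():
        s = line.strip()
        if not s or s[0] in ".#:":  # directives, comments, labels
            continue
        opcode = s.split()[0]
        for prefix, token in COARSE:
            if opcode.startswith(prefix):
                tokens.append(token)
                break
        for needle, tag in SENSITIVE.items():
            if needle in s:  # the element tag follows the instruction using it
                tokens.append(tag)
                hit = True
    return tokens if hit else None


def sensitive_opcode_sequence(smali_text):
    """One line per sensitive method; methods without elements are dropped."""
    blocks = []
    for method in split_methods(smali_text):
        seq = method_sequence(method)
        if seq:
            blocks.append(" ".join(seq))
    return "\n".join(blocks)


SAMPLE = """\
.method public a()V
    const-string v0, "/system/bin/sh"
    invoke-static {}, Ljava/lang/Runtime;->getRuntime()Ljava/lang/Runtime;
    move-result-object v1
    invoke-virtual {v1, v0}, Ljava/lang/Runtime;->exec(Ljava/lang/String;)Ljava/lang/Process;
    return-void
.end method

.method public b()I
    const/4 v0, 0x1
    return v0
.end method
"""
print(sensitive_opcode_sequence(SAMPLE))
```

Method b contains no sensitive element, so only method a contributes a line. Dropping such methods is what keeps the sequence short compared with extracting every opcode in the app.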
B. Preprocessing
We unzip the APK file to extract the dex file, and then decompile the dex file into multiple smali files using apktool. Following the order of the smali files in the folders, our method gathers all the smali files and keeps only the code of the methods that contain sensitive elements for generating the semantic feature vector.

C. Generating Sensitive Opcode Sequence
The method for generating the sensitive opcode sequence is introduced in the following three parts.

1) Sensitive Elements Statistics
To describe the malicious behaviors of Android malware, we treat sensitive APIs, sensitive actions and sensitive STRs as the sensitive elements, and generate the sensitive opcode sequence from opcodes and these elements. Activity is usually associated with the user interface, while the other two components, receiver and service, may run in the background. To better distinguish whether the response to an action happens in the foreground or in the background, we divide sensitive actions into two parts according to their corresponding components. The details are shown in Table 1. The first column gives the type of sensitive element, the second column the number of sensitive elements, and the third column some examples.

TABLE 1
SENSITIVE ELEMENTS
Element            Number  Examples
Sensitive API      648     android/location/LocationManager;->getLastLocation
                           android/net/wifi/WifiManager;->removeNetwork
Sensitive action1  103     android.intent.action.MEDIA_EJECT
                           android.intent.action.AIRPLANE_MODE
Sensitive action2  44      android.intent.action.INSERT
                           android.intent.action.SAVE_BATTERY_PRO_IGNORE_LIST
Sensitive STR      8       /system/bin/sh
                           mount -o remount
                           /system/bin/rm

Algorithm 1: Aligning code of sensitive elements
Input: Smali files of one APK (Android Package).
Output: Sensitive opcode sequence file.
1  Initialize Dic1 = {sensitive element: prefix}, Dic2 = {global variable or method name: sensitive element}, txt = a list, the opcode list and their prefixes.
2  For line in smali files:
3      if (line is an assignment statement for a global variable && the value is a sensitive element) or (line contains "return" && the calculated return value is a sensitive element):
4          Dic2.put(variable or method name: value)
5      if (line uses GMS (a global variable, a method or a sensitive element)):
6          txt.append(the opcodes of the method in which the line is located & GMS)
7  For method in txt:
8      if (GMS in Dic2.key):
9          method.replace(GMS, Dic1[Dic2[GMS]])
10     else: txt.remove(method)
11 Write txt to a file.

2) Aligning Code of Sensitive Elements
In order not to lose sensitive method modules, we use Algorithm 1 to align code with sensitive elements. First, we check whether the value of a global variable, or the string value returned by a method, is a sensitive element; if it is, the global variable or the method is stored in the sensitive variable dictionary (Lines 1-4). When the global variable appears as a parameter, or the method in the sensitive variable list is called in other methods,
the variable name or method name at the corresponding position is replaced by the hard-coded sensitive element (Lines 5-10). The details of this process are given in Algorithm 1. It helps to overcome transformation attacks to a certain extent and preserves the context information needed for semantic analysis.

3) Sensitive Opcode Sequence
First, we use eight operating codes to represent all opcodes according to their semantics: "MOVE", "IF", "INVOKE", "CMP", "GET", "PUT", "GO", "RETURN". To simplify the presentation, we use acronyms to express the sensitive elements uniformly. Their prefixes are API, BA, AA and STRs, respectively. The generated sensitive opcode sequence is shown in Figure 3. One block represents one method in the code. For each method containing sensitive elements, we extract the eight operating codes and the sensitive elements, in context order, as the sensitive opcode sequence.

[Figure 3 shows the sensitive opcode sequence of one sample, one block per method. Legible blocks include:]

GET INVOKE INVOKE API199 MOVE GET INVOKE INVOKE INVOKE API145 INVOKE MOVE IF INVOKE
INVOKE MOVE GET INVOKE MOVE INVOKE MOVE INVOKE GO RETURN MOVE INVOKE INVOKE MOVE
GET INVOKE MOVE INVOKE MOVE INVOKE GO

GET INVOKE INVOKE API199 MOVE IF PUT MOVE GO RETURN MOVE MOVE INVOKE INVOKE GET IF
INVOKE INVOKE MOVE INVOKE GET INVOKE INVOKE INVOKE API145 INVOKE MOVE IF PUT MOVE GO
INVOKE MOVE GO

GET IF GO RETURN PUT INVOKE INVOKE BA10 INVOKE INVOKE INVOKE MOVE INVOKE INVOKE MOVE
IF INVOKE GO

GET INVOKE MOVE AA11 INVOKE GET INVOKE RETURN

INVOKE MOVE AA11 INVOKE INVOKE GET INVOKE MOVE INVOKE GO RETURN MOVE INVOKE MOVE IF
INVOKE INVOKE MOVE INVOKE MOVE INVOKE GO

Figure 3: Sensitive opcode sequence

4) Comparing with Related Work
Several existing works [12], [16-19] used the opcode sequence as a feature for Android and PE (Portable Executable) malware detection and malware family classification. Unlike our way of generating the opcode sequence, they typically extract the opcode sequence from all methods in the code. A method based on an opcode sequence that contains only the code structure easily leads to overfitting, and it also easily leads to underfitting because it lacks the necessary semantic information. Both limitations reduce accuracy. Moreover, extracting all opcodes produces a large number of opcodes, which brings extensive computation cost when generating feature vectors.

D. Oversampling Technique for the Families of Minority
Through manual analysis of many Android malware samples, we find that malicious code may change the order of code execution without affecting the actual function. To alleviate the problem that some families have only a small number of training samples, we adopt a technique similar to SMOTE (Synthetic Minority Oversampling Technique) that generates additional sensitive opcode sequences for minority families based on a special rule, in order to enhance the generalization ability of the model. As shown in Figure 3, we treat the operating code "GO", the sensitive elements together with the two operating codes immediately before and after them, and newlines as stagnation points. The region between two stagnation points is an exchangeable area, in which we can delete one operating code or exchange any two operating codes. In this way, we can generate more sensitive opcode sequences for minority families.

E. Generating Semantic Related Vector and Classification Algorithms
Doc2vec is an extension of Word2vec [14] that learns to correlate labels and words, rather than words with other words. The main function of Doc2vec is to transform a paragraph, a sentence or a document into a semantic related vector. The feature we construct, the sensitive opcode sequence, retains the sensitive semantic information while cutting a large amount of unrelated content. Doc2vec is therefore a suitable model for generating the semantic related vector from our feature.

The same feature gives somewhat different results under different classification algorithms. To choose a suitable classification algorithm, we use nine common machine learning algorithms: KNN (K-Nearest Neighbors), LiSVM (Linear SVM), RSVM (RBF SVM), DT (Decision Tree), RF (Random Forest), Ada (AdaBoost), LR (Logistic Regression), GB (Gradient Boosting Classifier) and MLP (Multi-layer Perceptron Classifier) [15].

IV. EXPERIMENTS AND RESULTS

A. Dataset and Metrics
The experiment is implemented in Python, using the 5560 malware samples from 178 families in Drebin [5] as experimental samples, on a computer with an Intel Core i7-6500HQ 2.50GHz CPU and 16GB memory. We sort the families by the number of samples in each family, and the subsequent experiments are carried out in this order.

We use Doc2vec to generate the feature vector. The parameters used for Doc2vec are shown in Table 2.

TABLE 2
PARAMETERS OF DOC2VEC
Parameter  Value  Description
size       50     Dimensionality of the feature vectors.
window     8      The maximum distance between the current and predicted word within a sentence.
min_count  2      Ignores all words with total frequency lower than this.
workers    8      Use these many worker threads to train the model.

B. Evaluation Measures
This article defines a correct classification as a correct tuple and a classification error as a wrong tuple. For a given family, t_pos (true positives) means that the classifier classifies samples of the family into this family; t_neg (true negatives) means that the classifier classifies samples of other families into other families; f_pos (false positives) means that the classifier classifies samples not in the family into this family; f_neg (false negatives) means that the classifier
assigns a sample of this family to another family [20]. K is the number of families, N is the number of samples, and IAUC represents the individual AUC of each predicted column. The evaluation measures for the classifiers are:

\[ \mathrm{Precision} = \frac{t\_pos}{t\_pos + f\_pos} \]
\[ \mathrm{Recall} = \frac{t\_pos}{t\_pos + f\_neg} \]
\[ \mathrm{F1\text{-}score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]
\[ \mathrm{Total\ Accuracy} = \frac{1}{N} \sum_{i=1}^{K} t\_pos_i \]
\[ \mathrm{Average\ AUC} = \frac{1}{K} \sum_{i=1}^{K} \mathrm{IAUC}_i \]

C. Experimental Comparison of Nine Algorithms
To choose a suitable classification algorithm, we use nine machine learning algorithms, select the top 40 families from Drebin, and create two sets of data from the full training set to perform a 2-fold cross validation. Figure 4 shows the Total Accuracy, Average AUC, training time and testing time of the nine algorithms. KNN, DT and GB have a high Total Accuracy and Average AUC in Figure 4 (a), but GB takes more than 50 times the training time of KNN and DT in Figure 4 (b). We therefore choose KNN, which scores higher than DT, as the classification algorithm for the following experiments.

(a) Total Accuracy, Average AUC of nine algorithms
(b) Training time, testing time of nine algorithms
Figure 4: Experimental comparison of nine algorithms

D. Result
To verify our method, we use the samples of the first 40 families for experiments and create two sets of data from the full training set to perform a 2-fold cross validation. The confusion matrix of the classification is shown in Figure 5. In Figure 5, almost all families' samples are classified correctly, which indicates that our method achieves a high accuracy on these 40 Android malware families.

Figure 5: Confusion matrix on top 40 families

We also calculate the Precision, Recall and F1-score of each of the 40 families in Table 3. The 40 families are sorted by number of samples, and the family serial number is the rank of the family. As shown in Table 3, 31 families are identified perfectly, while the Precision, Recall or F1-score of the 37th and 38th families is relatively low. Combined with the confusion matrix, this shows that some samples of the 38th family are classified into the 37th family.

TABLE 3
THE RESULT OF CLASSIFICATION OF 40 FAMILIES
Family serial number  Precision  Recall  F1-score  Support
03                    0.9967     1.0000  0.9984    303
04                    0.9956     0.9956  0.9956    228
05                    1.0000     0.9940  0.9970    167
26                    0.8889     1.0000  0.9412    8
27                    1.0000     0.9167  0.9565    12
30                    0.8889     1.0000  0.9412    8
31                    1.0000     0.9167  0.9565    12
37                    0.4286     1.0000  0.6000    6
38                    1.0000     0.2727  0.4286    11
others                1.0000     1.0000  1.0000
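As a worked check of how the per-family measures follow from a confusion matrix, the sketch below computes Precision, Recall and F1-score for a two-family sub-matrix. The counts are an assumption inferred from Table 3 rather than data published in the paper: if all 6 samples of family 37 are classified correctly and 8 of the 11 samples of family 38 are assigned to family 37, the resulting values match the rows for families 37 and 38. The function name per_class_metrics is introduced here for illustration.

```python
def per_class_metrics(confusion, labels):
    """confusion[i][j] = number of samples of true class i predicted as class j.

    Returns (accuracy, {label: (precision, recall, f1, support)}).
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(labels)))
    report = {}
    for k, label in enumerate(labels):
        t_pos = confusion[k][k]
        f_neg = sum(confusion[k]) - t_pos                 # family k sent elsewhere
        f_pos = sum(row[k] for row in confusion) - t_pos  # others sent to family k
        precision = t_pos / (t_pos + f_pos) if t_pos + f_pos else 0.0
        recall = t_pos / (t_pos + f_neg) if t_pos + f_neg else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1, sum(confusion[k]))
    return correct / total, report


# Assumed sub-matrix for families 37 and 38 (rows: true family, columns: predicted).
confusion = [
    [6, 0],  # family 37: all 6 samples recognised
    [8, 3],  # family 38: 8 of 11 samples assigned to family 37
]
accuracy, report = per_class_metrics(confusion, ["37", "38"])
for label, (p, r, f1, support) in report.items():
    print(label, round(p, 4), round(r, 4), round(f1, 4), support)
```

Under this assumption the low Precision of family 37 and the low Recall of family 38 are two views of the same 8 misclassified samples, which is consistent with the observation drawn from the confusion matrix in the text.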
E. Comparison and Discussion

1) Comparison with Opcode Based Methods
Because we choose only the opcodes in sensitive code areas, we claim that our method generates vectors more efficiently than a method that uses all opcodes. We again use the samples of the first 40 families for the experiment. In Figure 6, ATS represents the average time to generate each sensitive opcode sequence, and GTV represents the time to generate the vector. The training and testing of KNN are denoted by TTN and TETN, respectively. Generating the vector from the sensitive opcode sequence takes less time than generating it from the full opcode sequence. The sequence generation time, the training time and the testing time of the two methods are nearly the same. In addition, when generating vectors, the opcode based method uses more memory (41,908 KB) than our method (28,612 KB). This shows that the method based on the sensitive opcode sequence is more efficient.

Figure 6: The time of four main processes

To verify the performance of our method on more families, we use more families in the experiment. Table 4 summarizes the data set information. Training set 1 is drawn from all samples of each family and is used to generate the opcode sequence and the sensitive opcode sequence. In training set 2, we use the oversampling technique to generate mixed sensitive opcode sequences for the families with fewer than 20 training samples in training set 1, bringing the number of sensitive opcode sequences of each such family up to 20. This increases the number of training samples. The test set is divided equally from all samples of each family.

TABLE 4
EXPERIMENTAL DATA SET
Class sequence  Training set1  Training set2  Test set
01              478            478            478
02              328            328            328
…               …              …              …
21              20             20             20
22              11             20             11
…               …              …              …
100             2              20             1

(a) Trend of Total Accuracy
(b) Trend of Average AUC
Figure 7: Performance with different numbers of families

Figure 7 shows how the performance changes as the number of families increases. Figure 7 (a) shows the change in Total Accuracy and Figure 7 (b) the change in Average AUC. As shown in Figure 7 (a), SOSM (the method based on the sensitive opcode sequence), OSM (the method based on the opcode sequence) and MSOSM (the method based on the mixed sensitive opcode sequence) all keep a high Total Accuracy, above 85%, as the number of families increases. The Total Accuracy of SOSM and MSOSM is consistently higher. As shown in Figure 7 (b), the Average AUC of SOSM, OSM and MSOSM increases with the number of families. The Average AUC of MSOSM is higher than that of SOSM throughout, and both SOSM and MSOSM maintain a high Average AUC of over 85%. Figure 7 thus shows that SOSM and MSOSM perform well in Android malware classification and achieve better Total Accuracy and Average AUC than OSM. In addition, MSOSM is effective in improving the performance on minority samples.

2) Comparison with Related Works
To ensure a fair comparison, we only compare with work done on the same dataset as ours. Table 5 shows the comparison with the work of Drebin and of Canfora G. Our method performs best, with high Total Accuracy and a low false positive rate. In particular, all families show a detection
rate of over 97%, compared with more than 90% for Drebin, and fourteen of them are identified perfectly.

TABLE 5
COMPARISON WITH OTHER WORKS
Works            Total Accuracy  False Positive  Identified perfectly  Support (families)
Drebin           93.00%          1.00%           3                     20
Canfora G [12]   94.69%          2.56%           4                     10
Our method       99.82%          0.03%           14                    20

3) Discussion
Our model exploits the semantic information of code for Android malware family analysis. Opcodes carry rich semantic information, but on their own they are less capable for Android malware family classification because they ignore sensitive semantic information. We construct the sensitive opcode sequence and analyze the code's specific semantic information for family classification to overcome this limitation. We show that the method based on the sensitive opcode sequence performs better than methods using only the opcode sequence in terms of Total Accuracy and computational efficiency. We also adopt an oversampling technique and claim that it is an effective approach for minority families. Compared with other works, the method based on our model performs better on Android malware family classification. In short, our model provides a stable and efficient method for Android malware family classification.

V. CONCLUSIONS AND FUTURE WORK
In this paper, we propose an Android malware family classification model based on the sensitive opcode sequence, which is generated from opcodes, sensitive APIs, sensitive STRs and sensitive actions. In addition, we adopt an oversampling technique, based on the sensitive opcode sequence, for families with a small number of samples. Experiments show that our method has a high Total Accuracy and Average AUC on Android malware classification, and that our oversampling technique helps to improve the model's performance on minority families without seriously affecting the Total Accuracy.

However, our model still has limitations. For malware (about 100 samples) that uses advanced obfuscation technologies such as strong cryptographic methods, the proposed model may not extract the sensitive elements successfully. Because the model extracts only the opcodes contained in methods with sensitive elements, it may then fail to generate the sensitive opcode sequence for further analysis. Besides these sensitive elements, the model needs other typical elements to represent the malicious gene of Android malware families that use obfuscation technologies. In addition, the classification algorithm we use is supervised learning, so only known Android malware families can be distinguished, and the model is ineffective for zero-day malware families. We will explore unsupervised learning and transfer learning in further investigation.

Acknowledgment: This work is supported by National Key R&D Program of China (No.2018YFB0803402), the International Cooperation Project of Institute of Information Engineering, Chinese Academy of Sciences under Grant No. Y7Z0511101, and Kai Chen was supported in part by NSFC U1836211.

REFERENCES
[1] FORENSIC BLOG [EB/OL]. https://forensics.spreitzenbarth.de/android-malware, 2018-05-14.
[2] Zhou Y, Jiang X. Dissecting Android Malware: Characterization and Evolution[C]// IEEE Symposium on Security and Privacy. IEEE Computer Society, 2012:95-109.
[3] Wang Y, Zheng J, Sun C, et al. Quantitative Security Risk Assessment of Android Permissions and Applications[M]// Data and Applications Security and Privacy XXVII. Springer Berlin Heidelberg, 2013:226-241.
[4] Merlo A, Georgiu G C. RiskInDroid: Machine Learning-Based Risk Analysis on Android[C]// IFIP International Conference on ICT Systems Security and Privacy Protection. Springer, Cham, 2017:538-552.
[5] Arp D, Spreitzenbarth M, Hübner M, et al. DREBIN: Effective and Explainable Detection of Android Malware in Your Pocket[C]// Network and Distributed System Security Symposium. 2014.
[6] Aswini A M, Vinod P. Droid permission miner: Mining prominent permissions for Android malware analysis[C]// Applications of Digital Information and Web Technologies. IEEE, 2014:81-86.
[7] Fan M, Liu J, Luo X, et al. Frequent Subgraph based Familial Classification of Android Malware[J]. IEEE International Symposium on Software Reliability Engineering, 2016(ISSRE):24-35.
[8] Zhang M, Duan Y, Yin H, et al. Semantics-Aware Android Malware Classification Using Weighted Contextual API Dependency Graphs[J]. 2014, 7(9):1105-1116.
[9] Suarez-Tangil G, Tapiador J E, Peris-Lopez P, et al. Dendroid: A Text Mining Approach to Analyzing and Classifying Code Structures in Android Malware Families[J]. Expert Systems with Applications, 2014, 41(4):1104-1117.
[10] Bilar D. Opcodes as predictor for malware[J]. International Journal of Electronic Security & Digital Forensics, 2008, 1(2):156-168.
[11] Rad B B, Masrom M. Metamorphic Virus Variants Classification Using Opcode Frequency Histogram[C]// Wseas International Conference on Computers. 2010.
[12] Canfora G, Lorenzo A D, Medvet E, et al. Effectiveness of Opcode ngrams for Detection of Multi Family Android Malware[C]// International Conference on Availability, Reliability and Security. IEEE Computer Society, 2015:333-340.
[13] ART & Dalvik [DB/OL]. https://source.android.com/devices/tech/dalvik/, 2018-05-29.
[14] Word2vec [EB/OL]. https://en.wikipedia.org/wiki/Word2vec/, 2018-05-29.
[15] Supervised learning [EB/OL]. http://scikit-learn.org/stable/supervised_learning.html#supervised-learning, 2018-05-29.
[16] Shabtai A. Detecting unknown malicious code by applying classification techniques on OpCode patterns[J]. Security Informatics, 2012, 1(1):1.
[17] Moskovitch R, Feher C, Tzachar N, et al. Unknown Malcode Detection Using OPCODE Representation[C]// Intelligence and Security Informatics, First European Conference, EuroISI 2008, Esbjerg, Denmark, December 3-5, 2008. Proceedings. DBLP, 2008:204-215.
[18] Santos I, Brezo F, Nieves J, et al. Idea: Opcode-Sequence-Based Malware Detection[C]// International Conference on Engineering Secure Software and Systems. Springer-Verlag, 2010:35-43.
[19] Santos I, Brezo F, Ugarte-Pedrero X, et al. Opcode sequences as representation of executables for data-mining-based unknown malware detection[J]. Information Sciences, 2013, 231(9):64-82.
[20] Confusion matrix [EB/OL]. https://en.wikipedia.org/wiki/Confusion_matrix, 2018-05-29.
