Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Brief papers
Article info

Article history:
Received 27 December 2019
Revised 12 October 2020
Accepted 17 October 2020
Available online 29 October 2020
Communicated by Steven Hoi

Keywords:
Graph convolutional network
Android malware detection
Function embedding
Function call graph

Abstract

Analyzing the runtime behaviors of Android apps is crucial for malware detection. In this paper, we attempt to learn the behavior level features of an app from function calls. The challenges of this task are twofold. First, the absence of function attributes hinders the understanding of app behaviors. Second, the graphical representation of function calls cannot be directly processed by classical machine learning algorithms. In this paper, we develop two methods to overcome these challenges. Based on function embedding, we first propose the concept of enhanced function call graphs (E-FCGs) to characterize app runtime behaviors. We then develop a Graph Convolutional Network (GCN) based algorithm to obtain vector representations of E-FCGs. Extensive experiments show that the features learned by our method can achieve surprisingly high detection performance on a variety of classifiers (e.g., LR, DT, SVM, KNN, RF, MLP and CNN), significantly outperforming the traditional static features.
© 2020 Elsevier B.V. All rights reserved.

https://doi.org/10.1016/j.neucom.2020.10.054
0925-2312/© 2020 Elsevier B.V. All rights reserved.
M. Cai, Y. Jiang, C. Gao et al. Neurocomputing 423 (2021) 301–307
(1) Function embedding. Intuitively, the functions in an app program are analogous to the words in a document. Borrowing ideas from word embedding, we embed the functions into a low-dimensional vector space. Every vector in this space denotes the attributes of a function, which can characterize the action of the function, the functional similarity among functions, and the relation with other functions. Accordingly, we create enhanced FCGs (E-FCGs) by assigning function attributes to every node in FCGs.
(2) GCN based feature learning. Neither FCGs nor E-FCGs can be directly processed by classifiers. Inspired by graph embedding [12], we develop a GCN based algorithm to learn behavior level features from E-FCGs. The useful and dense vector representations obtained by our algorithm can be used by a variety of classifiers for malware detection, as shown in Fig. 1.

Our main contributions are summarized as follows. First, we propose the concept of E-FCGs, which can accurately characterize the runtime behavior of apps. Second, we develop an effective algorithm to extract behavior level features from E-FCGs, i.e., BLFE (Behavior Level Features Extraction algorithm). To our knowledge, we are the first to introduce GCNs to learn features for Android malware detection. Extensive experiments demonstrate that our features perform better than the traditional static features, markedly improving detection performance on a variety of classifiers (e.g., LR, DT, SVM, KNN, RF, MLP and CNN).
The rest of this paper is organized as follows. In Section 2, we discuss the related work on static features and graph embedding. Section 3 shows how to construct E-FCGs. Section 4 proposes the BLFE algorithm to learn features from E-FCGs. Experiments are carried out in Section 5 to evaluate the proposed approaches, followed by the concluding remarks in Section 6.

2. Related work

2.1. Static features

As the most common static features, permission requirements, intent actions and function calls have been widely used in Android malware identification [5–8]. Permission requirements indicate what sensitive user data (e.g., contacts and SMS) need to be accessed by an app. Intent actions tell Android which standard operations an activity can perform. As for function calls, they indicate what functions are called by an app. In [5,6], function calls together with permissions and intent actions are used as detection features, and then fed into both shallow learning and deep learning models. In [7,8], permissions and function calls are chosen as detection features, which are further processed by an SVM model and a deep learning model, respectively.
In much of the literature, function calls are simply represented as a binary vector, in which every element indicates whether the corresponding function is called or not. Although such vector representations reveal some information about apps, they do not describe the interaction among functions and hence cannot accurately characterize apps' behaviors.
A better way to utilize function calls is to construct an FCG. The FCG gives the topology information that can be employed to infer apps' runtime behaviors. However, the FCG does not offer node attributes (i.e., function attributes), which are important for understanding app behaviors. In addition, FCGs belong to the category of graph data. Hence they cannot be directly handled by classical classifiers.

2.2. Graph embedding

Graph embedding [12] aims to learn a mapping that embeds nodes or (sub)graphs as points in a low-dimensional vector space. The learned embeddings can be used as feature inputs for downstream classification tasks, e.g., Android malware detection.
DeepWalk [13] and node2vec [14] are two classical graph embedding methods. DeepWalk is the first deep learning based graph embedding method, which employs random walks on graphs to obtain node representations. Node2vec extends DeepWalk by introducing a biased random walking procedure. However, these two methods and the other unsupervised graph embedding methods (e.g., LINE [16]) leverage only topology information, and cannot consider node attributes [13,14]. Pektas and Acarman [15] explored the effect of node2vec, DeepWalk and other methods on the FCG embedding, but they did not consider the node attributes of the graph. For our app classification task, node attributes are of great importance for the understanding of app runtime behavior.
To leverage both topology information and node attributes, we introduce GCNs to obtain vector representations from E-FCGs. GCNs extend existing convolutional neural networks (CNNs) to process graph data. They iteratively aggregate the embeddings of neighbors for a node, and use a function of the obtained embedding and its previous embedding to obtain the new embedding. By applying GCNs to E-FCGs, we can get vector representations for apps, which can be used for high-accuracy Android malware detection.

3. Construction of E-FCGs

For an Android app, the information of function calls can be obtained through processing its classes.dex file. Accordingly, an FCG can be constructed for the app.¹ In the following, we study how to obtain function attributes for an FCG. A naive solution is to use a one-hot encoded vector as function attributes. However, one-hot encoded vectors cannot measure the similarity between any two functions, hence providing little information for app behavior analysis.
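The limitation of one-hot vectors noted above can be illustrated with a minimal sketch. The three function names are taken from the example discussed in Section 5.2; the 2-D dense vectors are made-up values chosen only for illustration and are not learned embeddings from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def one_hot(index, size):
    """One-hot vector with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


# Small illustrative vocabulary (names from Section 5.2).
vocab = ["getLongitude", "getLatitude", "setHomeActionContentDescription"]
one_hot_vectors = {name: one_hot(i, len(vocab)) for i, name in enumerate(vocab)}

# Hypothetical dense vectors; the values are assumptions, not results.
dense_vectors = {
    "getLongitude": [0.9, 0.1],
    "getLatitude": [0.8, 0.2],
    "setHomeActionContentDescription": [-0.7, 0.6],
}

# Any two distinct one-hot vectors are orthogonal, so every pair looks equally unrelated.
sim_one_hot = cosine(one_hot_vectors["getLongitude"], one_hot_vectors["getLatitude"])

# Dense vectors can place related functions close together.
sim_related = cosine(dense_vectors["getLongitude"], dense_vectors["getLatitude"])
sim_unrelated = cosine(
    dense_vectors["getLongitude"],
    dense_vectors["setHomeActionContentDescription"],
)
```

With one-hot encoding the similarity between any two different functions is always zero, whereas a dense representation can express that two location functions are closer to each other than to an unrelated UI function.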
Realizing that the functions in a program are analogous to the words in a document, we use a method similar to word embedding to obtain function attributes. This method, called function embedding in this paper, aims to convert every function to a dense vector representation. Accordingly, we can make the functions with similar attributes occupy close spatial positions in the embedded space.
The main procedure of function embedding is described as follows. For every app in the dataset, we first create a file (i.e., function call record) to store the order of function calls. We then build a corpus through putting together all the function call records. Following the method of CBOW (Continuous Bag Of Words) [17], we train a fully-connected neural network with one hidden layer to convert a function to an N-dimension vector.

Fig. 3. An enhanced function call graph (E-FCG).
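The corpus-building step and the CBOW training samples described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `build_corpus` and `cbow_pairs`, the example call records, and the choice to skip positions without a full window are all hypothetical and are not taken from the paper.

```python
def build_corpus(call_records):
    """Put all per-app function call records together into one corpus.

    Each record is the ordered list of functions called by one app.
    Empty records are dropped.
    """
    return [list(record) for record in call_records if record]


def cbow_pairs(sequence, k):
    """Yield (context, target) pairs for CBOW training.

    The context holds the k functions before and the k functions after
    the target, i.e. 2k functions in total. Positions without a full
    window are skipped (a simplifying assumption).
    """
    for t in range(k, len(sequence) - k):
        context = sequence[t - k:t] + sequence[t + 1:t + k + 1]
        yield context, sequence[t]


# Hypothetical function call records for two apps.
records = [
    ["onCreate", "getLongitude", "getLatitude", "sendTextMessage"],
    ["onCreate", "setHomeActionContentDescription", "onResume"],
]
corpus = build_corpus(records)

# All training samples with a window of k = 1 on each side.
samples = [pair for seq in corpus for pair in cbow_pairs(seq, k=1)]
```

Each sample asks the network to predict the target function from its surrounding functions; the actual training of the one-hidden-layer network is left to a word2vec-style implementation.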
Our function embedding method is illustrated in Fig. 2. As shown in the left side of this figure, the neural network processes the one-hot encoded vectors of $C$ functions at a time. All these functions share the same weights, denoted by $W_{V \times N}$, where $V$ is the number of functions occurring in the corpus. To train this neural network, we repeatedly feed it with a sequence of functions $\{F_{t-k}, \ldots, F_{t-1}, F_{t+1}, \ldots, F_{t+k}\}$ obtained from the corpus, and guide it to predict the function $F_t$. More specifically, we maximize

$$\frac{1}{C} \sum_{t=k}^{C+k} \log p(F_t \mid F_{t-k}, \ldots, F_{t-1}, F_{t+1}, \ldots, F_{t+k}), \tag{1}$$

where $C = 2k$ and $p(B \mid A)$ denotes the probability of the function $B$ being predicted given a function sequence $A$. More details about model training can be found in [17]. When the training phase terminates, we can use the weights $W_{V \times N}$ to derive an N-dimension vector representation for every function, as shown in the right side of Fig. 2.
With function embedding, we can construct an enhanced FCG (E-FCG) by assigning a vector representation to every node in an FCG, as depicted in Fig. 3.

¹ Please find the details about how to get function calls and construct an FCG in Section 5.1.

4. Learning features from E-FCGs

Classical classification models (e.g., SVM and CNN) cannot process E-FCGs, since E-FCGs are non-Euclidean and belong to the category of graph data. To tackle this problem, we propose to employ a GCN to learn features from E-FCGs. GCNs can operate directly on graphs and leverage their structural information.
Given an E-FCG, the GCN takes as input: 1) a feature matrix $X$ where each row is a vector representation (i.e., feature) of a node, and 2) an adjacency matrix $A$ representing the graph structure. At each layer, the features are aggregated to form the next layer's features using a propagation rule. Accordingly, each hidden layer, say $l$, can be expressed as $H^{(l)} = f(H^{(l-1)}, A)$, where $H^{(0)} = X$. In this way, features become increasingly more abstract at each consecutive layer. Following the widely used spectral propagation rule given in [18], we have

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right), \tag{2}$$
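One layer of the propagation rule in (2) can be sketched in plain Python as below. This is a minimal illustration, not the authors' implementation: it assumes that the matrix with a tilde is the adjacency matrix plus the identity (as in [18]), that the degree of a node is the corresponding row sum, and that the activation is ReLU. The function names and the small example graph are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalized_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2}, with D~ built from row sums of A + I."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]  # at least 1 because of the self-loop
    return [
        [a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


def relu(m):
    """Element-wise ReLU (assumed choice for the activation sigma)."""
    return [[max(0.0, x) for x in row] for row in m]


def gcn_layer(a_hat, h, w):
    """One propagation step: sigma(a_hat @ h @ w)."""
    return relu(matmul(matmul(a_hat, h), w))


# Hypothetical 3-node call graph: function 0 calls 1, function 1 calls 2.
adjacency = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one 2-D attribute vector per node
weights = [[0.5, -0.5], [0.25, 0.75]]            # 2 x 2 layer weights (made up)

a_hat = normalized_adjacency(adjacency)
hidden = gcn_layer(a_hat, features, weights)     # 3 x 2 output, one row per node
```

Stacking several such layers, each with its own weight matrix, gives the layer-by-layer aggregation described in the text.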
FCGs, we calculate the diagonal matrix $\tilde{D}$ as $d_{ii} = \sum_j a_{ij}$, where $d_{ii}$ is the diagonal element of $\tilde{D}$ and $a_{ij}$ is the element of the matrix $\tilde{A}$.
Now we develop an algorithm to learn behavior level features from E-FCGs, which is termed BLFE. In the training phase, BLFE attempts to train a model with two components: 1) a feature extractor and 2) a classifier. The feature extractor is composed of several graph convolution layers and a ReadOut Layer, while the classifier is implemented by a fully-connected neural network. In the ReadOut Layer, we take the output $H$ of the last graph convolution layer and calculate the sum of every column in the matrix, which gives the behavior level feature vector $F$, as shown in (3).

$$F_j = \sum_i H_{ij} \tag{3}$$

During training, the whole model is iteratively updated through minimizing the following loss function

$$L = \mathbb{E}\left(-\sum_i y'_i \log(y_i)\right) + \lambda \sum_i |w_i^2|, \tag{4}$$

where $y'_i$ represents the ground truth value, $y_i$ denotes the predicted value, and $w_i$ is the weight to learn. In (4), the first term is the binary cross-entropy, and the second is introduced to alleviate overfitting. In the test phase, BLFE processes E-FCGs and tells whether the corresponding apps are malicious or not. The details of BLFE are given in Algorithm 1, where $\omega_m$ denotes the weights in our model and $\eta$ is the learning rate.
Once the model is well trained, the feature extractor can be used to learn behavior level features from an E-FCG. These features can then be fed to any classifier (e.g., SVM and KNN) or to some advanced malware detection methods (e.g., [19,20]) for app classification.

Algorithm 1. The BLFE Algorithm
Stage I: Initialization
  Create a corpus with all apps;
  Find a vector representation for every function through function embedding;
  Construct E-FCGs for apps in training dataset;
Stage II: Training
  In each epoch:
  – Sample a batch of E-FCGs;
  – Update $\omega_m$ with gradient descent, i.e.,
    $\omega_m \leftarrow \omega_m - \eta \nabla_{\omega_m} L$  (5)
Stage III: Test
  Repeat for every app:
  – Construct an E-FCG for the app;
  – Make a decision based on the E-FCG;

samples, in which 7,362 samples are malicious and the others are benign.³ The apps in our experiments were from the Androzoo database [21], Google APP Store, VirusShare [22] and so on.
To extract the static features such as permission requirements, intent actions and API calls, we decompress every apk file into two files: AndroidManifest.xml and classes.dex. Then we extract permission requirements and intent actions by parsing AndroidManifest.xml. After decompiling classes.dex into a series of smali files, we obtain the information on function calls. To construct an FCG, we just need to create an adjacency matrix, where each element a[i, j] is 1 if function i invokes function j, and 0 otherwise. With the method proposed in Section 3, we obtain node attributes, and then construct an E-FCG by assigning node attributes to the corresponding FCG.
In our experiments, the vector representation for every function is 100-dimensional, and the GCN model has three convolutional layers, which contain 100, 100 and 60 neurons, respectively. Hence, the vector representation for behavior level features is 60-dimensional. To implement the BLFE algorithm, we build a fully-connected neural network with two hidden layers, which acts as the classifier and is connected with the GCN-based feature extractor.

5.2. Effects of function embedding

To show the effects of function embedding,⁴ we first embed the functions in our experiments into a 2D space, and then show the representations of three functions in Fig. 4. The functions getLongitude() and getLatitude() are both used to get the location, and they are often called together in apps. Hence they occupy close spatial positions in the embedded space. The function setHomeActionContentDescription() is used to set an alternate description for the Home/Up action, which is totally different from getLongitude() and getLatitude(). Therefore, its position in the embedded space is far away from those of getLongitude() and getLatitude(), and the angle between the vectors corresponding to setHomeActionContentDescription() and getLongitude() (or getLatitude()) is large.

5.3. Convergence of the BLFE algorithm

BLFE is an iterative algorithm, and its convergence has been verified by our experiments. For illustration, Fig. 5 depicts the iterations of loss (4) in a certain experiment. For convenience of depiction, the vertical axis of Fig. 5 provides the average values of loss over 20 iterations. Accordingly, each point in the horizontal axis represents 20 iterations. It can be seen from this figure that the loss rapidly decreases during training. Starting from the 100-th point in the horizontal axis, the loss keeps getting close to zero and the BLFE algorithm converges.
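For reference, the ReadOut step in (3), the loss in (4) and the 20-iteration averaging used for the curve in Fig. 5 can be sketched as follows. This is an illustration under stated assumptions: the cross-entropy is written in its standard two-term binary form and averaged over the batch, which is one reading of the "binary cross-entropy" wording, and the function names, the clipping constant `eps` and the example inputs are hypothetical.

```python
import math


def readout(h):
    """Sum every column of the last layer output H: one value per feature (eq. 3)."""
    return [sum(column) for column in zip(*h)]


def blfe_loss(y_true, y_pred, weights, lam, eps=1e-12):
    """Mean binary cross-entropy plus lam times the sum of squared weights (eq. 4)."""
    bce = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        bce -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    bce /= len(y_true)
    return bce + lam * sum(w * w for w in weights)


def window_average(losses, window=20):
    """Average the loss over consecutive, non-overlapping windows of iterations."""
    return [
        sum(losses[i:i + window]) / window
        for i in range(0, len(losses) - window + 1, window)
    ]


# Hypothetical last-layer output for a 3-node E-FCG with 2 features per node.
feature_vector = readout([[0.5, 1.0], [0.0, 2.0], [1.5, 0.0]])  # [2.0, 3.0]

# Hypothetical batch of two apps: labels, predicted probabilities, model weights.
loss_value = blfe_loss([1, 0], [0.9, 0.2], weights=[0.3, -0.4], lam=0.01)
```

Plotting `window_average(per_iteration_losses)` would give one point per 20 iterations, matching the way the horizontal axis of Fig. 5 is described.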
² If $I_N$ is not considered here, multiplication with $A$ means that, for every node, we sum up all the feature vectors of all neighboring nodes but not the node itself. In order to take the node itself into account, we simply add the identity matrix $I_N$ to $A$.
³ We set the proportion between malicious sample number and total sample number to 17%, in accordance with the ratio of Android malware in real life.
⁴ In our experiments, our function embedding adopts as many functions as possible. For those unadopted functions, their feature vectors are set to a vector of all entries 1. Since the number of the unadopted functions is small, they do not significantly impact our function embedding method.
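The E-FCG construction described in Section 5.1, together with the all-ones fallback for unadopted functions mentioned in the footnote above, can be sketched as follows. The helper name `build_efcg`, the example functions, the call edges and the embedding values are hypothetical and serve only as an illustration.

```python
def build_efcg(functions, call_edges, embeddings, dim):
    """Build an E-FCG as (adjacency matrix, node feature matrix).

    adjacency[i][j] is 1 if function i invokes function j, and 0 otherwise.
    Functions without a learned embedding get a vector of all ones.
    """
    index = {name: i for i, name in enumerate(functions)}
    n = len(functions)
    adjacency = [[0] * n for _ in range(n)]
    for caller, callee in call_edges:
        adjacency[index[caller]][index[callee]] = 1
    features = [list(embeddings.get(name, [1.0] * dim)) for name in functions]
    return adjacency, features


# Hypothetical app with three functions; "helper" has no learned embedding.
functions = ["onCreate", "getLongitude", "helper"]
call_edges = [("onCreate", "getLongitude"), ("onCreate", "helper")]
embeddings = {"onCreate": [0.1, 0.2], "getLongitude": [0.9, 0.1]}  # made-up 2-D vectors

adjacency, features = build_efcg(functions, call_edges, embeddings, dim=2)
```

The pair returned here corresponds to the adjacency matrix and feature matrix that a GCN takes as input.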
Table 1
Performance evaluation.
Appendix A. How to process a real-world Android malicious sample

Here we explain how to classify a real-world Android malicious app with our proposed method. For illustration, we use an Android malicious app Amazing Submarine, which was injected with malicious code. The process flow for this app is depicted in Fig. 6. The procedure consists of four main steps, which are given below.

(1) Unpack and decompile the APK file of this app into smali files, and extract the function call graph (FCG) from these files.
(2) With the method of word2vec, we obtain node attributes and then construct an Enhanced-FCG (E-FCG).
(3) We use a GCN based method to learn features from the E-FCG, which can characterize the behaviour of the app.
(4) Finally, the features are sent to a classifier for malware detection.

References

[1] J. Li, L. Sun, Q. Yan, Z. Li, W. Srisa-an, H. Ye, Significant permission identification for machine-learning-based Android malware detection, IEEE Trans. Ind. Inform. 14 (7) (Jul. 2018) 3216–3225.
[2] G. DATA, 8,400 new android malware samples every day. [Online]. Available: https://www.gdatasoftware.com/blog/2017/04/29712-8-400-new-android-malware-samples-every-day.
[3] M. Yang, S. Wang, Z. Ling, Y. Liu, Z. Ni, Detection of malicious behavior in android apps through API calls and permission uses analysis, Concurrency and Computation: Practice and Experience 29 (19) (2017) e4172.
[4] P. Vinod, A. Zemmari, M. Conti, A machine learning based approach to detect malicious android apps using discriminant system calls, Future Generation Computer Systems 94 (2019) 333–350.
[5] H. Li, S. Zhou, W. Yuan, Adversarial-example attacks toward android malware detection system, IEEE Systems Journal (2019).
[6] W. Yuan, Y. Jiang, H. Li, M. Cai, A lightweight on-device detection method for android malware, IEEE Transactions on Systems, Man, and Cybernetics: Systems (2019).
[7] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, K. Rieck, DREBIN: Effective and Explainable Detection of Android Malware in Your Pocket, in: Proceedings of NDSS, 2014, pp. 1–16.
[8] Z. Yuan, Y. Lu, Z. Wang, Y. Xue, Droid-sec: deep learning in android malware detection, ACM SIGCOMM Computer Communication Review 44 (4) (2014) 371–372.
[9] K. Tam, A. Feizollah, et al., The evolution of android malware and android analysis techniques, ACM Computing Surveys 49 (4) (2017).
[10] M. Fan, J. Liu, W. Wang, H. Li, Z. Tian, T. Liu, DAPASA: detecting android piggybacked apps through sensitive subgraph analysis, IEEE Transactions on Information Forensics and Security 12 (8) (2017) 1772–1785.
[11] J. Zhou, G. Cui, Z. Zhang, et al., Graph neural networks: a review of methods and applications, arXiv:1812.08434, 2019.
[12] P. Goyal, E. Ferrara, Graph embedding techniques, applications, and performance: a survey, Knowledge Based Systems 151 (2018) 78–94.
[13] B. Perozzi, R. Alrfou, S. Skiena, Deepwalk: online learning of social representations, in: Proceedings of KDD 2014, 2014, pp. 701–710.
[14] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings of KDD 2016, 2016, pp. 855–864.
[15] A. Pektas, T. Acarman, Deep learning for effective Android malware detection using API call graph embeddings, Soft Computing 24 (2020) 1027–1043.
[16] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, Q. Mei, Line: Large-scale information network embedding, in: Proceedings of WWW 2015, 2015, pp. 1067–1077.
[17] T. Mikolov, G.S. Corrado, K. Chen, J. Dean, Efficient estimation of word representations in vector space, in: Proceedings of ICLR 2013, 2013, pp. 1–12.
[18] T.N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: Proceedings of ICLR 2017, 2017.
[19] T. Kim, B. Kang, M. Rho, S. Sezer, E. Im, A multimodal deep learning method for android malware detection using various features, IEEE Transactions on Information Forensics and Security 14 (3) (2019) 773–778.
[20] S.Y. Yerima, S. Sezer, DroidFusion: a novel multilevel classifier fusion approach for android malware detection, IEEE Transactions on Cybernetics 49 (2) (2019) 453–466.
[21] L. Li, et al., AndroZoo++: Collecting millions of android apps and their metadata for the research community, arXiv:1709.05281, 2017.
[22] VirusShare. [Online]. Available: https://virusshare.com/.

Minghui Cai received the B.E. degree in electronic engineering from the Huazhong University of Science and Technology, China in 2018. He is currently pursuing the master's degree in School of Electronic Information and Communications, Huazhong University of Science and Technology, China. His current research interests include computer vision and machine learning.
Yuan Jiang received the B.E. and M.E. degrees in communication engineering from Huazhong University of Science and Technology, China, in 2016 and 2019, respectively. Now he is working for Tencent. His current research interests include machine learning and Android app development.

Cuiying Gao received the B.E. degree in computer science from the Nanchang University, Nanchang, China in 2019. She is currently pursuing the master's degree in School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China. Her current research interests include network security and machine learning.

Heng Li received the B.E. degree in communication engineering from Huazhong University of Science and Technology, China, in 2017. He is currently working toward the Ph.D. degree at the same institution. His current research interests include artificial intelligence, information security, and image/signal processing.

Wei Yuan received the B.E. degree in electronic engineering from Wuhan University, China, in 1999, and the Ph.D. degree in electronic engineering from the University of Science and Technology of China, Hefei, in 2006. He is currently a professor with the School of Electronic Information and Communications, Huazhong University of Science and Technology, China. His current research interests include machine learning and information security.