Cai Et Al. - 2021 - Learning Features From Enhanced Function Call Grap

Neurocomputing 423 (2021) 301–307
Contents lists available at ScienceDirect
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Brief papers
Learning features from enhanced function call graphs for Android malware detection
Minghui Cai a,1, Yuan Jiang a,b, Cuiying Gao a,1, Heng Li a, Wei Yuan a,⇑
a 1037 Luoyu Road, Huazhong University of Science and Technology, Wuhan, China
b Kexing Science Park, No. 15, Keyuan Road, Nanshan District, Shenzhen, China
Article info
Article history:
Received 27 December 2019
Revised 12 October 2020
Accepted 17 October 2020
Available online 29 October 2020
Communicated by Steven Hoi
Keywords:
Graph convolutional network
Android malware detection
Function embedding
Function call graph

Abstract
Analyzing the runtime behaviors of Android apps is crucial for malware detection. In this paper, we attempt to learn the behavior level features of an app from function calls. The challenges of this task are twofold. First, the absence of function attributes hinders the understanding of app behaviors. Second, the graphical representation of function calls cannot be directly processed by classical machine learning algorithms. In this paper, we develop two methods to overcome these challenges. Based on function embedding, we first propose the concept of enhanced function call graphs (E-FCGs) to characterize app runtime behaviors. We then develop a Graph Convolutional Network (GCN) based algorithm to obtain vector representations of E-FCGs. Extensive experiments show that the features learned by our method can achieve surprisingly high detection performance on a variety of classifiers (e.g., LR, DT, SVM, KNN, RF, MLP and CNN), significantly outperforming the traditional static features.
© 2020 Elsevier B.V. All rights reserved.
1. Introduction

Android is the target of 97% of mobile malware [1], and almost 8,400 new Android malware instances are found every day [2]. Up to now, a variety of machine learning based detection methods have been developed to tackle Android malware. Based on the extracted features, these methods use a classifier to decide whether an app is malicious. In practice, the performance of malware detection depends heavily on the features.

The existing features for Android malware detection fall into two categories: dynamic [3,4] and static [5–9]. Dynamic features are able to reflect the runtime behaviors of apps, hence providing important clues for malware detection. However, extracting dynamic features requires monitoring the execution of apps, which causes overhead and inconvenience. In contrast, static features do not require the execution of apps. They can be obtained merely by analyzing an app installation file, i.e., an Android application package (apk). In the existing literature (e.g., [5–9]), permission requirements, intent actions and function calls (i.e., API calls) are often used as static features for malware detection. However, these static features cannot accurately depict the runtime behaviors of apps. To address this problem, we propose to learn behavior level features from function calls.

Function calls are usually represented as a binary vector or a directed graph, i.e., a Function Call Graph (FCG) [10]. In an FCG, each node represents a function and every edge denotes a function call. Compared to the vector representation, the graph representation provides more information about how an app works, and is therefore more helpful for malware detection. For illustration, we consider an FCG containing an edge between onReceive() and startService(). Through analyzing this FCG, we realize that the app calls onReceive() to receive the system broadcast BOOT_COMPLETED, and then calls startService() to start the Service component for private user data collection. This behavior reveals that the app is likely to be malicious.

However, there exist two main barriers to applying FCGs to Android malware detection. First, FCGs do not tell us about node/function attributes, e.g., the meanings of functions. Unfortunately, function attributes are crucial for malware detection. For example, without knowing the meaning of startService(), we cannot understand how the app behaves. Second, classical machine learning algorithms cannot directly process FCGs due to their non-Euclidean structure [11].

In this paper, we use two techniques to overcome the above challenges, i.e., 1) function embedding and 2) GCN (Graph Convolutional Network) based feature learning.

⇑ Corresponding author.
E-mail addresses: caiminghui@hust.edu.cn (M. Cai), petejiang@tencent.com (Y. Jiang), gaoc@hust.edu.cn (C. Gao), liheng@hust.edu.cn (H. Li), yuanwei@mail.hust.edu.cn (W. Yuan).
1 Minghui Cai and Cuiying Gao are co-first authors of this paper.
https://doi.org/10.1016/j.neucom.2020.10.054
0925-2312/© 2020 Elsevier B.V. All rights reserved.
(1) Function embedding. Intuitively, the functions in an app program are analogous to the words in a document. Borrowing ideas from word embedding, we embed the functions into a low-dimensional vector space. Every vector in this space denotes the attributes of a function, which can characterize the action of the function, the functional similarity among functions, and the relation with other functions. Accordingly, we create enhanced FCGs (E-FCGs) by assigning function attributes to every node in FCGs.
(2) GCN based feature learning. Neither FCGs nor E-FCGs can be directly processed by classifiers. Inspired by graph embedding [12], we develop a GCN based algorithm to learn behavior level features from E-FCGs. The dense vector representations obtained by our algorithm can be used by a variety of classifiers for malware detection, as shown in Fig. 1.

Fig. 1. Applying our features to a variety of classifiers.

Our main contributions are summarized as follows. First, we propose the concept of E-FCGs, which can accurately characterize the runtime behavior of apps. Second, we develop an effective algorithm to extract behavior level features from E-FCGs, i.e., BLFE (Behavior Level Features Extraction algorithm). To our knowledge, we are the first to introduce GCNs to learn features for Android malware detection. Extensive experiments demonstrate that our features perform better than the traditional static features, markedly improving detection performance on a variety of classifiers (e.g., LR, DT, SVM, KNN, RF, MLP and CNN).

The rest of this paper is organized as follows. In Section 2, we discuss the related work on static features and graph embedding. Section 3 shows how to construct E-FCGs. Section 4 proposes the BLFE algorithm to learn features from E-FCGs. Experiments are carried out in Section 5 to evaluate the proposed approaches, followed by the concluding remarks in Section 6.

2. Related work

2.1. Static features

As the most common static features, permission requirements, intent actions and function calls have been widely used in Android malware identification [5–8]. Permission requirements indicate what sensitive user data (e.g., contacts and SMS) need to be accessed by an app. Intent actions tell Android which standard operations an activity can perform. As for function calls, they indicate what functions are called by an app. In [5,6], function calls together with permissions and intent actions are used as detection features, and then fed into both shallow learning and deep learning models. In [7,8], permissions and function calls are chosen as detection features, which are further processed by an SVM model and a deep learning model, respectively.

In much of the literature, function calls are simply represented as a binary vector, in which every element indicates whether the corresponding function is called or not. Although vector representations reveal some information about apps, they do not describe the interaction among functions and hence cannot accurately characterize apps' behaviors.

A better way to utilize function calls is to construct an FCG. The FCG gives the topology information that can be employed to infer apps' runtime behaviors. However, the FCG does not offer node attributes (i.e., function attributes), which are important for understanding app behaviors. In addition, FCGs belong to the category of graph data. Hence they cannot be directly handled by classical classifiers.

2.2. Graph embedding

Graph embedding [12] aims to learn a mapping that embeds nodes or (sub)graphs as points in a low-dimensional vector space. The learned embeddings can be used as feature inputs for downstream classification tasks, e.g., Android malware detection.

DeepWalk [13] and node2vec [14] are two classical graph embedding methods. DeepWalk is the first deep learning based graph embedding method, which employs random walks on graphs to obtain node representations. Node2vec extends DeepWalk by introducing a biased random walking procedure. However, these two methods and the other unsupervised graph embedding methods (e.g., LINE [16]) leverage only topology information, and cannot consider node attributes [13,14]. Pektas and Acarman [15] explored the effect of node2vec, DeepWalk and other methods on FCG embedding, but they did not consider the node attributes of the graph. For our app classification task, node attributes are of great importance for the understanding of app runtime behavior.

To leverage both topology information and node attributes, we introduce GCNs to obtain vector representations from E-FCGs. GCNs extend existing convolutional neural networks (CNNs) to process graph data. They iteratively aggregate the embeddings of neighbors for a node, and use a function of the obtained embedding and its previous embedding to obtain the new embedding. By applying GCNs to E-FCGs, we can get vector representations for apps, which can be used for high-accuracy Android malware detection.

3. Construction of E-FCGs

For an Android app, the information of function calls can be obtained through processing its classes.dex file. Accordingly, an FCG can be constructed for the app.1 In the following, we study how to obtain function attributes for an FCG. A naive solution is to use a one-hot encoded vector as function attributes. However, one-hot encoded vectors cannot measure the similarity between any two functions, hence providing little information for app behavior analysis.

Fig. 2. Function embedding.
Realizing that the functions in a program are analogous to the words in a document, we use a method similar to word embedding to obtain function attributes. This method, called function embedding in this paper, aims to convert every function to a dense vector representation. Accordingly, we can make the functions with similar attributes occupy close spatial positions in the embedded space.
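To make the contrast with one-hot encoding concrete, the following minimal Python sketch compares cosine similarities under both encodings. The function names are borrowed from the example in Section 5.2; the two-dimensional dense vectors are made-up values chosen purely for illustration and are not embeddings learned by the method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


functions = ["getLongitude", "getLatitude", "setHomeActionContentDescription"]

# One-hot encoding: every function gets its own axis.
one_hot = {
    name: [1.0 if i == j else 0.0 for j in range(len(functions))]
    for i, name in enumerate(functions)
}

# Hypothetical dense vectors (assumed values, not learned ones).
dense = {
    "getLongitude": [0.9, 0.1],
    "getLatitude": [0.8, 0.2],
    "setHomeActionContentDescription": [-0.1, 0.9],
}

# Any two distinct one-hot vectors are orthogonal, so similarity is always 0.
sim_one_hot = cosine(one_hot["getLongitude"], one_hot["getLatitude"])

# Dense vectors can place related functions close together.
sim_dense_close = cosine(dense["getLongitude"], dense["getLatitude"])
sim_dense_far = cosine(
    dense["getLongitude"], dense["setHomeActionContentDescription"]
)

print(sim_one_hot, sim_dense_close, sim_dense_far)
```

With one-hot vectors every pair of distinct functions looks equally unrelated, whereas a dense representation leaves room for related functions to end up close to each other.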
The main procedure of function embedding is described as follows. For every app in the dataset, we first create a file (i.e., function call record) to store the order of function calls. We then build a corpus through putting together all the function call records. Following the method of CBOW (Continuous Bag Of Words) [17], we train a fully-connected neural network with one hidden layer to convert a function to an N-dimension vector.

Fig. 3. An enhanced function call graph (E-FCG).
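As a rough illustration of how training samples could be drawn from a function call record, the sketch below builds (context, target) pairs in the CBOW style, taking k functions on each side of the target. The helper name `cbow_pairs`, the sample record and the window size are assumptions made for this example; the actual training procedure follows [17].

```python
def cbow_pairs(call_record, k):
    """Return (context, target) pairs with k functions on each side.

    Positions that do not have k neighbours on both sides are skipped
    (a simplifying assumption of this sketch).
    """
    pairs = []
    for t in range(k, len(call_record) - k):
        context = call_record[t - k:t] + call_record[t + 1:t + k + 1]
        pairs.append((context, call_record[t]))
    return pairs


# Hypothetical function call record of a single app, in call order.
record = [
    "onReceive",
    "startService",
    "getLongitude",
    "getLatitude",
    "sendTextMessage",
]

pairs = cbow_pairs(record, k=1)
for context, target in pairs:
    print(context, "->", target)
```

Each pair asks the network to predict the middle function from its neighbours, so the context always contains 2k functions.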
Our function embedding method is illustrated in Fig. 2. As shown in the left side of this figure, the neural network processes the one-hot encoded vectors of C functions at a time. All these functions share the same weights, denoted by $W_{V \times N}$, where V is the number of functions occurring in the corpus. To train this neural network, we repeatedly feed it with a sequence of functions $\{F_{t-k}, \ldots, F_{t-1}, F_{t+1}, \ldots, F_{t+k}\}$ obtained from the corpus, and guide it to predict the function $F_t$. More specifically, we maximize

$$\frac{1}{C} \sum_{t=k}^{C+k} \log p(F_t \mid F_{t-k}, \ldots, F_{t-1}, F_{t+1}, \ldots, F_{t+k}), \qquad (1)$$

where $C = 2k$ and $p(B \mid A)$ denotes the probability of the function B being predicted given a function sequence A. More details about model training can be found in [17]. When the training phase terminates, we can use the weights $W_{V \times N}$ to derive an N-dimension vector representation for every function, as shown in the right side of Fig. 2.

With function embedding, we can construct an enhanced FCG (E-FCG) by assigning a vector representation to every node in an FCG, as depicted in Fig. 3.

1 Please find the details about how to get function calls and construct an FCG in Section 5.1.

4. Learning features from E-FCGs

Classical classification models (e.g., SVM and CNN) cannot process E-FCGs, since E-FCGs are non-Euclidean and belong to the category of graph data. To tackle this problem, we propose to employ a GCN to learn features from E-FCGs. GCNs can operate directly on graphs and leverage their structural information.

Given an E-FCG, the GCN takes as input: 1) a feature matrix X where each row is a vector representation (i.e., feature) of a node, and 2) an adjacency matrix A representing the graph structure. At each layer, the features are aggregated to form the next layer's features using a propagation rule. Accordingly, each hidden layer, say l, can be expressed as $H^{l} = f(H^{(l-1)}, A)$, where $H^{0} = X$. In this way, features become increasingly more abstract at each consecutive layer. Following the widely used spectral propagation rule given in [18], we have

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{l} W^{l}\right), \qquad (2)$$
where $\sigma$ is an activation function (e.g., ReLU), $\tilde{D}$ is the degree matrix, and $W^{l}$ is the weight matrix of layer l. In (2), $\tilde{A} = A + I_N$ is introduced to make the aggregated representation of a node include its own features, where $I_N$ is an identity matrix [18].2

It can be seen from (2) that the aggregated features of a node correspond to the weighted sum of the neighbors' features. When aggregating features for a node, we only choose the neighbors pointed to by that node. That is, we only consider which functions it invokes when evaluating a function. Note that the spectral propagation rule is designed mainly for undirected graphs. To apply it to our E-FCGs, we calculate the diagonal matrix $\tilde{D}$ as $d_{ii} = \sum_j a_{ij}$, where $d_{ii}$ is the diagonal element of $\tilde{D}$ and $a_{ij}$ is the element of the matrix $\tilde{A}$.

Now we develop an algorithm to learn behavior level features from E-FCGs, which is termed BLFE. In the phase of training, BLFE attempts to train a model with two components: 1) feature extractor and 2) classifier. The feature extractor is composed of several graph convolution layers and a ReadOut Layer, while the classifier is implemented by a fully-connected neural network. In the ReadOut Layer, we take the output H of the last graph convolution layer and calculate the sum of every column in the matrix, which is the behavior level feature vector, as shown in (3).

$$F_j = \sum_i H_{ij} \qquad (3)$$

During training, the whole model is iteratively updated through minimizing the following loss function

$$L = E\Big(-\sum_i y'_i \log(y_i)\Big) + \lambda \sum_i |w_i^2|, \qquad (4)$$

where $y'_i$ represents the ground truth value, $y_i$ denotes the predicted value, and $w_i$ is the weight to learn. In (4), the first term is the binary cross-entropy, and the second is introduced to alleviate overfitting.

In the phase of test, BLFE processes E-FCGs and tells whether the corresponding apps are malicious or not. The details of BLFE are given in Algorithm 1, where $\omega_m$ denotes the weights in our model and $\eta$ is the learning rate.

Once the model is well trained, the feature extractor can be used to learn behavior level features from an E-FCG. One can feed these features to any classifier (e.g., SVM and KNN) or to some advanced malware detection methods (e.g., [19,20]) for app classification.

Algorithm 1. The BLFE Algorithm
Stage I: Initialization
  Create a corpus with all apps;
  Find a vector representation for every function through function embedding;
  Construct E-FCGs for apps in training dataset;
Stage II: Training
  In each epoch:
  – Sample a batch of E-FCGs;
  – Update $\omega_m$ with gradient descent, i.e.,
    $\omega_m \leftarrow \omega_m - \eta \nabla_{\omega_m} L$    (5)
Stage III: Test
  Repeat for every app:
  – Construct an E-FCG for the app;
  – Make a decision based on the E-FCG;

5. Performance evaluation

Here we conduct extensive experiments to evaluate the effectiveness of function embedding, the convergence of BLFE, and the effectiveness of the features learned by BLFE.

5.1. Dataset & settings

In our experiments, we use a large-size dataset with 43,310 samples, in which 7,362 samples are malicious and the others are benign.3 The apps in our experiments were from Androzoo database [21], Google APP Store, VirusShare [22] and so on.

To extract the static features such as permission requirements, intent actions and API calls, we decompress every apk file to obtain two files: AndroidManifest.xml and classes.dex. Then we extract permission requirements and intent actions by parsing AndroidManifest.xml. After decompiling classes.dex into a series of smali files, we obtain the information on function calls. To construct an FCG, we just need to create an adjacency matrix, where each element a[i, j] is 1 if function i invokes function j, and 0 otherwise. With the method proposed in Section 3, we obtain node attributes, and then construct an E-FCG by assigning node attributes to the corresponding FCG.

In our experiments, the vector representation for every function is 100-dimensional, and the GCN model has three convolutional layers, which contain 100, 100 and 60 neurons, respectively. Hence, the vector representation for behavior level features is 60-dimensional. To implement the BLFE algorithm, we build a fully-connected neural network with two hidden layers, which acts as the classifier and is connected with the GCN-based feature extractor.

5.2. Effects of function embedding

To show the effects of function embedding,4 we first embed the functions in our experiments into a 2D space, and then show the representations of three functions in Fig. 4. The functions getLongitude() and getLatitude() are both used to get the location, and they are often called together in apps. Hence they occupy close spatial positions in the embedded space. The function setHomeActionContentDescription() is used to set an alternate description for the Home/Up action, which is totally different from getLongitude() and getLatitude(). Therefore, its position in the embedded space is far away from those of getLongitude() and getLatitude(), and the angle between the vectors corresponding to setHomeActionContentDescription() and getLongitude() (or getLatitude()) is large.

5.3. Convergence of the BLFE algorithm

BLFE is an iterative algorithm, and its convergence has been verified by our experiments. For illustration, Fig. 5 depicts the evolution of the loss (4) in a certain experiment. For convenience of depiction, the vertical axis of Fig. 5 provides the average values of loss over 20 iterations. Accordingly, each point in the horizontal axis represents 20 iterations. It can be seen from this figure that the loss rapidly decreases during training. Starting from the 100-th point in the horizontal axis, the loss stays close to zero and the BLFE algorithm converges.

2 If $I_N$ is not considered here, multiplication with A means that, for every node, we sum up all the feature vectors of all neighboring nodes but not the node itself. In order to take the node itself into account, we simply add the identity matrix $I_N$ to A.
3 We set the proportion between malicious sample number and total sample number to 17%, in accordance with the ratio of Android malware in real life.
4 In our experiments, our function embedding covers as many functions as possible. For the functions that are not covered, the feature vectors are set to the all-ones vector. Since the number of such functions is small, they do not significantly impact our function embedding method.
Fig. 4. Illustration of function embedding.

Fig. 5. The iterations of loss (4) during training.

Table 1
Performance evaluation.

Classifier  Features  Dimension  Precision  Recall    Accuracy  F1-score
LR          A         106        30.69%     87.55%    64.71%    47.20%
            B         273        59.52%     87.68%    86.57%    71.88%
            C         379        60.58%     89.72%    87.57%    72.33%
            Ours      60         96.59%     100.00%   99.34%    98.27%
DT          A         106        32.12%     88.64%    64.20%    47.15%
            B         273        81.28%     88.65%    94.30%    84.80%
            C         379        82.26%     87.26%    94.42%    84.68%
            Ours      60         99.46%     99.23%    99.78%    99.38%
SVM         A         106        32.34%     91.01%    63.91%    47.73%
            B         273        82.97%     88.02%    94.47%    85.42%
            C         379        82.48%     90.13%    94.87%    86.14%
            Ours      60         99.27%     98.80%    99.64%    99.04%
KNN         A         106        30.37%     81.45%    64.94%    44.56%
            B         273        89.31%     86.28%    95.71%    87.78%
            C         379        87.03%     88.21%    95.53%    87.62%
            Ours      60         98.94%     99.68%    99.75%    99.31%
RF          A         106        32.60%     90.59%    64.35%    47.94%
            B         273        90.40%     86.52%    95.89%    88.42%
            C         379        90.31%     89.34%    96.36%    89.83%
            Ours      60         99.47%     99.84%    99.87%    99.65%
MLP         A         106        32.07%     89.49%    63.43%    47.22%
            B         273        89.71%     86.32%    95.68%    87.99%
            C         379        92.00%     85.83%    96.13%    88.81%
            Ours      60         99.18%     99.67%    99.80%    99.43%
CNN         A         106        32.19%     90.45%    64.52%    47.50%
            B         273        84.20%     84.10%    94.29%    84.15%
            C         379        85.59%     87.61%    94.56%    85.02%
            Ours      60         99.18%     99.73%    99.81%    99.45%
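As a quick sanity check on how the metric columns of Table 1 relate, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes it for the DT classifier with feature group A; since the reported numbers come from 5-fold cross validation, recomputing from the averaged precision and recall should only be expected to agree approximately.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit)."""
    return 2 * precision * recall / (precision + recall)


# DT with feature group A in Table 1: precision 32.12%, recall 88.64%.
f1_dt_a = f1_score(32.12, 88.64)
print(round(f1_dt_a, 2))
```

The low precision of feature group A dominates the harmonic mean, which is why its F1-score stays below 50% even though recall is close to 90%.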

5.4. Detection performance

To evaluate the features obtained by our method, we applied them to seven mainstream classifiers: Decision Tree (DT), k-Nearest Neighbor (KNN), Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), Multi-Layer Perceptron (MLP) and CNN. For comparison, we also consider three kinds of traditional static features, including: (1) 106 sensitive API calls (denoted by A in Table 1), (2) the combination of 147 permissions and 126 intent actions (denoted by B), and (3) the combination of 106 sensitive API calls, 147 permissions and 126 intent actions (denoted by C). The effects of these static features have been verified in existing literature (e.g., [5,6,8]). The experimental results for these static features are also given in Table 1. Note that we used 5-fold cross validation to accurately evaluate the performance of our features and feature groups A, B and C.

From Table 1, we can draw the following conclusions. First, the feature groups A, B and C are effective for malware detection, but the proposed features significantly outperform them with respect to all the metrics. Among these feature groups, A is the most similar to ours, since it is based on the information of function calls. When compared to A, our features improve detection accuracy by at least 34 percentage points on all the classifiers. Second, among the feature groups A, B and C, the best one is C in most cases. This is because the feature group C utilizes the greatest amount of information, including permission requirements, function calls, and intent actions. Finally, our features perform better than the feature group C on all the classifiers, even though our features only use the information of function calls. Take the classifier CNN for example. When compared to C, our features improve F1-score by at least 14 percentage points for CNN, as indicated by the last two rows of Table 1.

6. Conclusion

Analyzing the runtime behaviors of Android apps helps in malware detection. Since monitoring the execution of apps is costly in practice, we propose to extract or learn behavior level features from the information of function calls, which can be obtained prior to app execution. To that end, we first put forward a new concept of E-FCGs to accurately characterize the runtime behaviors of apps, based on our proposed function embedding method. We then develop a GCN based algorithm to learn behavior level features from E-FCGs. Experiments demonstrate that the features obtained by our method achieve satisfactory performance on seven main classifiers, significantly surpassing the traditional static features.

CRediT authorship contribution statement

Minghui Cai: Conceptualization, Methodology, Validation, Investigation, Writing - original draft. Yuan Jiang: Data curation, Investigation, Methodology, Writing - original draft. Cuiying Gao: Investigation, Writing - Review & Editing, Validation. Heng Li: Conceptualization, Methodology. Wei Yuan: Writing - original draft, Conceptualization, Methodology, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grants 61571205 and 61772220.
Fig. 6. The process flow of an Android app.

Appendix A. How to process a real-world Android malicious sample

Here we explain how to classify a real-world Android malicious app with our proposed method. For illustration, we use an Android malicious app Amazing Submarine, which was injected with malicious code. The process flow for this app is depicted in Fig. 6. The procedure consists of four main steps, which are given below.

(1) Unpack and decompile the APK file of this app into smali files, and extract the function call graph (FCG) from these files.
(2) With the method of word2vec, we obtain node attributes and then construct an Enhanced-FCG (E-FCG).
(3) We use a GCN based method to learn features from the E-FCG, which can characterize the behaviour of the app.
(4) Finally, the features are sent to a classifier for malware detection.

References

[1] J. Li, L. Sun, Q. Yan, Z. Li, W. Srisa-an, H. Ye, Significant permission identification for machine-learning-based Android malware detection, IEEE Trans. Ind. Inform. 14 (7) (Jul. 2018) 3216–3225.
[2] G. DATA, 8,400 new android malware samples every day. [Online]. Available: https://www.gdatasoftware.com/blog/2017/04/29712-8-400-new-android-malware-samples-every-day.
[3] M. Yang, S. Wang, Z. Ling, Y. Liu, Z. Ni, Detection of malicious behavior in android apps through API calls and permission uses analysis, Concurrency and Computation: Practice and Experience 29 (19) (2017) e4172.
[4] P. Vinod, A. Zemmari, M. Conti, A machine learning based approach to detect malicious android apps using discriminant system calls, Future Generation Computer Systems 94 (2019) 333–350.
[5] H. Li, S. Zhou, W. Yuan, Adversarial-example attacks toward android malware detection system, IEEE Systems Journal (2019).
[6] W. Yuan, Y. Jiang, H. Li, M. Cai, A lightweight on-device detection method for android malware, IEEE Transactions on Systems, Man, and Cybernetics: Systems (2019).
[7] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, K. Rieck, DREBIN: Effective and Explainable Detection of Android Malware in Your Pocket, in: Proceedings of NDSS, 2014, pp. 1–16.
[8] Z. Yuan, Y. Lu, Z. Wang, Y. Xue, Droid-sec: deep learning in android malware detection, ACM SIGCOMM Computer Communication Review 44 (4) (2014) 371–372.
[9] K. Tam, A. Feizollah, et al., The evolution of android malware and android analysis techniques, ACM Computing Surveys 49 (4) (2017).
[10] M. Fan, J. Liu, W. Wang, H. Li, Z. Tian, T. Liu, DAPASA: detecting android piggybacked apps through sensitive subgraph analysis, IEEE Transactions on Information Forensics and Security 12 (8) (2017) 1772–1785.
[11] J. Zhou, G. Cui, Z. Zhang, et al., Graph neural networks: a review of methods and applications, arXiv:1812.08434, 2019.
[12] P. Goyal, E. Ferrara, Graph embedding techniques, applications, and performance: a survey, Knowledge Based Systems 151 (2018) 78–94.
[13] B. Perozzi, R. Alrfou, S. Skiena, Deepwalk: online learning of social representations, in: Proceedings of KDD 2014, 2014, pp. 701–710.
[14] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings of KDD 2016, 2016, pp. 855–864.
[15] A. Pektas, T. Acarman, Deep learning for effective Android malware detection using API call graph embeddings, Soft Computing 24 (2020) 1027–1043.
[16] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, Q. Mei, Line: Large-scale information network embedding, in: Proceedings of WWW 2015, 2015, pp. 1067–1077.
[17] T. Mikolov, G.S. Corrado, K. Chen, J. Dean, Efficient estimation of word representations in vector space, in: Proceedings of ICLR 2013, 2013, pp. 1–12.
[18] T.N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: Proceedings of ICLR 2017, 2017.
[19] T. Kim, B. Kang, M. Rho, S. Sezer, E. Im, A multimodal deep learning method for android malware detection using various features, IEEE Transactions on Information Forensics and Security 14 (3) (2019) 773–778.
[20] S.Y. Yerima, S. Sezer, DroidFusion: a novel multilevel classifier fusion approach for android malware detection, IEEE Transactions on Cybernetics 49 (2) (2019) 453–466.
[21] L. Li, et al., AndroZoo++: Collecting millions of android apps and their metadata for the research community, arXiv:1709.05281, 2017.
[22] [online] Available at https://virusshare.com/.

Minghui Cai received the B.E. degree in electronic engineering from the Huazhong University of Science and Technology, China in 2018. He is currently pursuing the master's degree in School of Electronic Information and Communications, Huazhong University of Science and Technology, China. His current research interests include computer vision and machine learning.
Yuan Jiang received the B.E. and M.E. degrees in communication engineering from Huazhong University of Science and Technology, China, in 2016 and 2019, respectively. Now he is working for Tencent. His current research interests include machine learning and Android app development.

Cuiying Gao received the B.E. degree in computer science from the Nanchang University, Nanchang, China in 2019. She is currently pursuing the master's degree in School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China. Her current research interests include network security and machine learning.

Heng Li received the B.E. degree in communication engineering from Huazhong University of Science and Technology, China, in 2017. He is currently pursuing the Ph.D. degree at the same institution. His current research interests include artificial intelligence, information security, and image/signal processing.

Wei Yuan received the B.E. degree in electronic engineering from Wuhan University, China, in 1999, and the Ph.D. degree in electronic engineering from the University of Science and Technology of China, Hefei, in 2006. He is currently a professor with the School of Electronic Information and Communications, Huazhong University of Science and Technology, China. His current research interests include machine learning and information security.
