
2021 IEEE 2nd International Conference on Information Technology,Big Data and Artificial Intelligence (ICIBA 2021)

Android Malware Detection Based on Image Analysis
Xu Ke1, Yang XiaoHui2,3
2021 IEEE 2nd International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA) | 978-1-6654-2877-4/21/$31.00 ©2021 IEEE | DOI: 10.1109/ICIBA52610.2021.9688179

1. School of Cyber Science and Engineering, Southeast University, Nanjing, Jiangsu


2. School of Information Science and Engineering, Southeast University, Nanjing, Jiangsu
3. State Key Laboratory of Mobile Communications (Southeast University), Nanjing, Jiangsu
1198641651@qq.com, xhyang08@163.com
Corresponding Author: Yang XiaoHui Email: xhyang08@163.com

Abstract: Current Android malware detection methods rely on a single feature dimension, which makes it difficult to capture the multi-dimensional characteristics of malware. To address this problem, this article proposes an Android malware detection method based on image analysis. The method visualizes the software's DEX file, extracts shallow texture features and deep abstract features and combines them, and finally uses the Light Gradient Boosting Machine for detection. Experiments show that with the fused features the detection accuracy reaches 98.7%, a significant improvement over using a single feature.

Keywords: Android malware; texture feature; feature fusion; Light Gradient Boosting Machine

I. INTRODUCTION
According to the global mobile operating system market share released by Statista [1], as of January 2021 the Android operating system accounted for 71.93% of the global market, well ahead of the second-ranked iOS. To meet consumers' software needs, new Android applications keep appearing on the market. At the same time, the amount of malware on the Android platform has also increased significantly, which poses a great threat to users' privacy and property. It is therefore necessary to develop efficient and accurate detection methods.

In 2011, Nataraj et al. [2] first applied binary code visualization to malware detection. By extracting the GIST features of the grayscale image and classifying them with the KNN algorithm, they achieved 98% accuracy. In 2014, Arp et al. [3] proposed DREBIN for static analysis. Based on the manifest file and the disassembled DEX code, they extracted multiple feature sets and then used a support vector machine for detection, reaching an accuracy of 94%. In 2018, Xia et al. [4] visualized binary files as grayscale images, spliced bytecode grayscale images and list grayscale images, and then classified them using high-order convolutional neural networks. In 2018, Li et al. [5] converted the DEX executable file into a grayscale value vector, then added color channels to construct a pixel-normalized RGB map, and finally designed a convolutional neural network classification system to detect Android malicious code. In 2020, Zheng et al. [6] collected the API call sequences of samples through the Cuckoo Sandbox platform and used a bidirectional LSTM model to classify 6681 malware samples, achieving 99.28% accuracy.

The above research shows that, compared with dynamic analysis, static analysis takes less time and has lower hardware requirements. Compared with the traditional static method of decompiling the APK, image analysis offers high efficiency and good portability. Therefore, this paper proposes an Android malware detection method based on image analysis.

The specific contributions of this article are as follows:
(1) Select a bytecode file to generate a grayscale image, and use an interpolation algorithm to normalize the image to a uniform size.
(2) Improve the convolutional neural network model with the semantic embedding branch method, extract and fuse the deep and shallow features of the network layers, and normalize them together with GIST texture features to form a new hybrid feature that makes the model more expressive.
(3) Use an algorithm that combines the Light Gradient Boosting Machine and Logistic Regression, and conduct comparative experiments with a variety of machine learning algorithms. The results show that the algorithm has the best performance, improves the generalization ability of the model, and reduces the possibility of overfitting.

II. THE SURVIVABILITY ASSOCIATION MODEL OF LARGE-SCALE NETWORK

A. Model Description
Fig. 1 shows the Android malware detection model based on image analysis. The model consists of three parts: source image generation, feature extraction, and machine learning classifier training and testing. In the source image generation stage, the bytecode file of the Android application package is selected, and a grayscale image with gray values in the range 0-255 is generated by

reading the file 8 bits at a time. On this basis, a CNN is used to extract multi-dimensional features of the image and the GIST algorithm is used to extract texture features, and the obtained features are fused. Finally, the input features are preprocessed and fed into different machine learning classifiers for training and prediction.

Fig. 1. Android malware detection model based on image analysis

B. Source Code File Generation
An Android Application Package (APK) contains all the files needed for installation and execution. In the package, the classes.dex file is a bytecode file that runs on the Dalvik virtual machine and contains the classes that the program runs. It is the core of the entire Android program. Therefore, it is feasible to determine whether the software is malicious by analyzing the DEX file.

1) Visualization of Bytecode File
The process of converting DEX files into grayscale images is shown in Fig. 2. The DEX file is read 8 bits at a time, and each 8-bit value is converted into a decimal number to generate an array. To facilitate image generation, we convert the one-dimensional array into a two-dimensional array with equal numbers of rows and columns, whose entries are gray values in the range 0-255. The grayscale image of the bytecode file is shown in Fig. 3.

Fig. 2. The process of converting DEX files into grayscale images

2) Image Size Standardization
To facilitate subsequent processing, this paper uses the bicubic interpolation algorithm [7] to scale bytecode file images of different sizes to a uniform size. Bicubic interpolation selects the 16 pixels nearest to the target point $P$ to calculate the pixel value of the new image. An interpolation basis function gives the weight of each point, and the pixel values of these 16 points are then weighted and summed to obtain the pixel value of $P$ in the new image. A commonly used interpolation basis function is:

$$W(x) = \begin{cases} (a+2)|x|^3 - (a+3)|x|^2 + 1, & |x| \le 1 \\ a|x|^3 - 5a|x|^2 + 8a|x| - 4a, & 1 < |x| < 2 \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

Here $x$ is the distance between the abscissas (or ordinates) of two points, and $a$ is a fixed parameter. Commonly used values of $a$ are -0.5, -0.75, and -1; this article uses $a = -1$. Using this function, we can calculate the pixel value of the new image. Suppose the differences between the abscissas and the ordinates of the target point $P$ and one of the 16 nearby points $L_{ij}$ are $u$ and $v$ respectively. Then the weight of $L_{ij}$ is $W(u) W(v)$, and the contribution $d_{ij}$ of $L_{ij}$ is its pixel value $l_{ij}$ multiplied by this weight, that is, $d_{ij} = l_{ij} W(u) W(v)$. The pixel value $\mathrm{Pix}_P$ of point $P$ in the new image is then:

$$\mathrm{Pix}_P = \sum_{i=0}^{3} \sum_{j=0}^{3} d_{ij} \tag{2}$$

Using the bicubic interpolation algorithm, we scale the DEX file image to four fixed sizes: 64*64, 128*128, 224*224, and 256*256. These are common input image sizes for convolutional neural networks.

C. Convolutional Neural Network Feature Extraction

1) Convolutional Neural Network Model
A convolutional neural network (CNN) [8] generally consists of an input layer, convolutional layers, excitation layers, pooling layers, fully connected layers, and an output layer. Fig. 4 shows the CNN model used in this article. Outside the dotted area is the basic architecture of the CNN model, which is composed of 6 convolutional layers, 3 pooling layers, and 2 fully connected layers. Inside the dotted area is the multi-layer feature fusion module, which fuses pooling layer two and pooling layer three.
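
As an illustration of the bytecode visualization step in Section II.B, the following Python sketch maps a byte string to a square array of gray values. The zero-padding rule, the helper name bytes_to_square_gray, and the sample bytes are assumptions made for this example; the text only states that the one-dimensional array is reshaped into equal rows and columns.

```python
import math


def bytes_to_square_gray(data: bytes) -> list:
    """Map raw bytes to a square 2-D list of gray values in 0..255."""
    if not data:
        raise ValueError("input must contain at least one byte")
    # Assumption: side = ceil(sqrt(n)), and the tail is padded with zeros.
    side = math.isqrt(len(data) - 1) + 1
    padded = data + bytes(side * side - len(data))
    # Each byte (8 bits) becomes one pixel; iterating bytes yields ints.
    return [list(padded[r * side:(r + 1) * side]) for r in range(side)]


# Hypothetical stand-in for the contents of a classes.dex file (18 bytes).
sample = b"dex\n035\x00" + bytes(range(10))
image = bytes_to_square_gray(sample)
```

For the 18-byte sample the result is a 5×5 array whose last 7 entries are padding. A real pipeline would read the bytes of classes.dex from the APK instead of the hand-written sample.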

Fig.3. Grayscale image of bytecode file
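
The interpolation basis function of Section II.B and the weighted sum over the 16 neighbouring pixels can be sketched in Python as follows. The function names and the 4×4 patch layout (target point lying between rows 1 and 2 and columns 1 and 2 of the patch) are assumptions for illustration; the value a = -1 follows the text.

```python
def bicubic_weight(x: float, a: float = -1.0) -> float:
    """Interpolation basis function W(x) with fixed parameter a."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0


def interpolate(patch, fx: float, fy: float, a: float = -1.0) -> float:
    """Pixel value at fractional offset (fx, fy) inside a 4x4 patch.

    patch[i][j] is the pixel value l_ij at row i, column j. The target
    point sits at column 1 + fx and row 1 + fy, with 0 <= fx, fy < 1.
    """
    total = 0.0
    for i in range(4):
        for j in range(4):
            u = (1 + fx) - j  # horizontal distance to column j
            v = (1 + fy) - i  # vertical distance to row i
            total += patch[i][j] * bicubic_weight(u, a) * bicubic_weight(v, a)
    return total


flat = [[100] * 4 for _ in range(4)]
ramp = [[10 * i + j for j in range(4)] for i in range(4)]
```

The 16 weights sum to one, so a constant patch such as flat is reproduced unchanged, and at offset (0, 0) the result equals the pixel at patch[1][1].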

Fig. 4. CNN network structure

2) Feature Map Selection
In a convolutional neural network, the input image passes through layer-by-layer mapping and generally yields a single final feature extraction result. Many works show that efficiently fusing features from multiple convolutional layers, so that they complement each other, can improve network performance [9].

Zeiler et al. [10] proposed a deconvolution visualization process, which shows how to visualize a convolutional network by projecting features back to the pixel space of the input layer, so that the evolution of features during training can be observed. Fig. 5, Fig. 6, and Fig. 7 show the feature images reconstructed by deconvolution from the CNN model used in this article.

Fig. 5. Reconstruction of the first pooling layer

Fig. 6. Reconstruction of the second pooling layer

Fig. 7. Reconstruction of the third pooling layer

It can be seen from the figures that the low-level feature maps have higher resolution and retain more position and detail information, but they also contain more noise. The high-level feature maps carry stronger semantic information, but their ability to perceive details is poor. Therefore, we extract feature maps from the above three pooling layers, select three subsets from different levels, fuse them with the tensor concatenation operation used in GoogLeNet [11], and use them to train the entire system. The results are shown in Table I.

TABLE I RESULTS OF FUSION OF DIFFERENT FEATURE MAP SUBSETS

Feature maps             Accuracy
{pool3}                  0.933
{pool3, pool2}           0.944
{pool3, pool2, pool1}    0.943

Fusing multiple feature maps improves the classification quality of the model, but the gain quickly saturates and can even decline, showing diminishing returns. Moreover, simple tensor concatenation is not sufficient to improve

accuracy. Therefore, this article drops the first pooling layer from the feature fusion and adopts a new feature fusion method that can integrate more semantic information and detail information into the features.

3) CNN Feature Fusion
Zhang et al. [12] proposed ExFuse to address image semantic segmentation. Their paper proposes the Semantic Embedding Branch (SEB) method, which can obtain more detailed information from high-level features. The specific process is shown in Fig. 8. This method is used in the feature fusion module of this article. The high-level features first pass through a 3×3 convolution kernel, and bilinear interpolation up-sampling then adjusts their size to match that of the upper-level feature map. Finally, a new feature map is generated by element-wise multiplication.

Fig. 8. Semantic Embedding Branch method

D. GIST Texture Feature Extraction
GIST [13] is a macroscopic way of describing image features that captures the contour and texture of an image. This paper uses the GIST512 method to extract 512-dimensional features. A filter bank with a convolution kernel size of 4 and a direction dimension of 8 forms a 4×8 channel filter that performs Gabor filtering on the image. The filtered image is then divided into 4×4 small blocks, and the average of each block is taken, which finally gives the 512-dimensional GIST feature:

$$G_{gist}(x, y) = \mathrm{cat}_{n_c}\left[ f(x, y)\, g_{mn}(x, y) \right] \tag{3}$$

Here $f(x, y)$ is the pixel value of the image at $(x, y)$, $g_{mn}(x, y)$ is a two-dimensional Gabor filter, and $\mathrm{cat}_{n_c}$ represents a filter bank with $n_c$ channels. In this article $n_c$ is 32, and with each image divided into 16 areas there are 32 × 16 = 512 feature dimensions in total.

E. Improved LightGBM Algorithm
GBDT and Random Forest are both ensemble algorithms, whose main idea is to train multiple weak classifiers and combine them into a stronger model. In 2016, Chen Tianqi et al. [14] proposed the XGBoost algorithm, an optimization of GBDT that has become one of the most important algorithms in the field of machine learning. In 2017, Ke et al. [15] proposed the LightGBM algorithm based on XGBoost. Compared with XGBoost, this algorithm has the advantages of less memory consumption, higher accuracy, and faster training speed. In 2014, Facebook [16] proposed a method that integrates GBDT and a logistic regression (LR) model for advertising click-through rate (CTR) prediction. This article follows that method and combines LR with the LightGBM algorithm. The specific process is shown in Fig. 9.

Fig. 9. LightGBM and LR fusion algorithm model

First, a LightGBM model is trained on the original sample features, which yields $n$ trees with $m_1, m_2, m_3, \ldots, m_n$ leaf nodes respectively. An input sample eventually falls into one leaf node of each tree. One-hot encoding these leaf positions gives a sparse vector of length $m_1 + m_2 + m_3 + \cdots + m_n$, in which $n$ elements are one and the remaining elements are zero. This new feature is combined with the normalized source features and used as the input of the LR algorithm, which produces the final classification result. This method makes good use of the nonlinear information extracted by LightGBM, improves the generalization ability of the model, and reduces the possibility of overfitting.

III. EXPERIMENT AND RESULT ANALYSIS

A. Experimental Environment
The experiments were run on a computing node with an Intel(R) Core(TM) i7-10700K processor, 16 GB of RAM, an NVIDIA GeForce RTX 3080 graphics card with 10 GB of memory, and the Ubuntu 18.04 operating system.

B. Experimental Data
This article selects 5000 Android applications, including 2500 benign applications and 2500 malicious applications. The benign applications come from Xiaomi's official application store, and the malicious applications come from the Android virus database provided by the VirusShare website [17]. The ratio of the training set to the test set is 4:1.

C. Evaluation Index
The experiments evaluate the algorithm with the standard measures for classification algorithms: accuracy, precision, recall, and F1-score.
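
For reference, the four measures can be computed from the confusion counts as in the Python sketch below. It assumes binary labels with malware as the positive class (1); the text does not state which class is treated as positive or whether the reported values are averaged over classes, and the label lists are made up for the example.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malware caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malware missed
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 4 malicious (1) and 4 benign (0) samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

In this example one malicious sample is missed and one benign sample is flagged, so all four measures equal 0.75.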

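The leaf-encoding step of the LightGBM and LR fusion described in Section II.E can also be sketched in plain Python. The tree sizes, leaf indices, and source feature values below are hypothetical and supplied by hand; with the LightGBM Python package, per-tree leaf indices are typically obtained by calling predict with pred_leaf=True, which this sketch does not do.

```python
def encode_leaves(leaf_indices, leaves_per_tree):
    """One-hot encode the leaf reached in each tree into one sparse vector.

    leaf_indices[k] is the leaf index of the sample in tree k, and
    leaves_per_tree[k] is the number of leaves m_k of that tree.
    """
    if len(leaf_indices) != len(leaves_per_tree):
        raise ValueError("one leaf index is required per tree")
    vector = [0] * sum(leaves_per_tree)  # length m_1 + ... + m_n
    offset = 0
    for leaf, m in zip(leaf_indices, leaves_per_tree):
        if not 0 <= leaf < m:
            raise ValueError("leaf index out of range for its tree")
        vector[offset + leaf] = 1  # exactly one active slot per tree
        offset += m
    return vector


# Hypothetical ensemble of n = 3 trees with 4, 3 and 5 leaves.
leaves_per_tree = [4, 3, 5]
leaf_indices = [2, 0, 4]
encoded = encode_leaves(leaf_indices, leaves_per_tree)

# Assumed normalized source features, appended to form the LR input.
source_features = [0.25, 0.9]
lr_input = encoded + source_features
```

The encoded vector has length 12 with exactly 3 ones, one per tree, and lr_input would then be passed to a logistic regression model.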
D. Experimental results
Following Section II.B, images of four different sizes are used as the input of the convolutional neural network for feature extraction. To verify the effectiveness of the improved feature fusion method, the schemes in Table II are used for comparative experiments. Each scheme is trained 5 times. In each run, the model with the smallest loss value is selected as the best model of that run, and the average accuracy of the 5 best models on the validation set is calculated. The final accuracy is shown in Fig. 10. The maximum number of iterations in a single training run is set to 20, and the batch_size is set to 6. The average training time of a single epoch under the three feature extraction schemes is shown in Fig. 11.

TABLE II THREE SCHEMES OF CNN FEATURE EXTRACTION

Scheme number    Scheme description
Scheme 1         Predict with the basic CNN network model
Scheme 2         Use the concat function to merge the second and the third pooling layer
Scheme 3         Use the SEB method to merge the second and the third pooling layer

Fig. 10. The prediction accuracy of three different schemes

Fig. 11. Training time of a single epoch under three different schemes

It can be seen from Fig. 10 that the accuracy of the second scheme, which fuses two pooling layers, is higher than that of the first scheme, which uses only the basic network. This supports the effectiveness of fusing the features of the two layers. The improved feature fusion scheme in this article further improves accuracy over the second scheme, which uses simple concat splicing: at image sizes of 64*64, 128*128, 224*224, and 256*256, the accuracy increases by 0.42%, 0.70%, 0.78%, and 0.74% respectively. In most cases, the detection accuracy increases with the image resolution, since a higher resolution allows more convolutional feature information to be collected, which improves the performance of the model. However, when the image resolution reaches 256*256, the accuracy is not much better than at 224*224, and it even drops slightly for scheme 3. This suggests that, because the number of samples is not large, there are too many parameters to train and the model is prone to overfitting. From Fig. 11, we find that time and resource consumption grow as the image resolution increases. Weighing performance against resource consumption, this article chooses 224*224 as the best input image size.

Source images of this size are used to train the best convolutional neural network model, and the input of the last fully connected layer is then taken, which gives 128-dimensional convolutional neural network features. As described in Section II.D, the GIST texture feature of the bytecode image is obtained by Gabor filtering and contains 512 dimensions. After normalization, it is spliced with the convolutional neural network features and used as the input of the improved LightGBM classifier. The LightGBM parameters are a learning rate of 0.05, 100 decision trees, and 31 leaf nodes. Table III shows the detection results of the improved LightGBM classifier under different features.

In order to verify the superiority of the method proposed in this paper, this paper uses multiple machine

learning algorithms (SVM, RF, LightGBM) and the improved LightGBM in comparative experiments, and the resulting metrics are shown in Table IV.

TABLE III RESULTS OF LIGHTGBM+LR UNDER DIFFERENT FEATURES

Features    Accuracy    Precision    Recall    F1-Score
GIST        0.942       0.9422       0.9418    0.9419
CNN         0.982       0.9820       0.9819    0.9819
GIST+CNN    0.987       0.9869       0.9870    0.9870

TABLE IV RESULTS OF DIFFERENT CLASSIFICATION ALGORITHMS

Algorithms     Accuracy    Precision    Recall    F1-Score
SVM            0.931       0.9466       0.9272    0.9289
RF             0.959       0.9598       0.9601    0.9589
LightGBM       0.982       0.9819       0.9820    0.9819
LightGBM+LR    0.987       0.9869       0.9870    0.9870

Table III shows that the accuracy of detection with the fused CNN+GIST features is 98.7%, which is 0.5 percentage points higher than using CNN features alone and 4.5 percentage points higher than using GIST features alone. Precision, recall, and F1-score also improve to different degrees. The experiments show that the fused CNN+GIST features are more effective for malware detection. Table IV shows that, with the fused GIST+CNN features, the LightGBM+LR method performs best on all metrics, and the improved LightGBM method is 0.5 percentage points more accurate than the original LightGBM. Compared with the other algorithms, the improved LightGBM method has a clear advantage.

IV. CONCLUSIONS
This article selects bytecode files from the Android application package, uses an interpolation algorithm to generate grayscale images of uniform size, extracts texture features and deep features from them, fuses these features, and then builds a model with the LightGBM+LR algorithm to detect Android malware. The experimental results show that combining the surface texture features of the image with the deep features extracted by the convolutional neural network can effectively improve accuracy. In the experiments, we found that the detection accuracy of the deep features extracted by the CNN is not high when the image size is large. Our preliminary judgment is that the number of samples is insufficient. The next step is to increase the size of the data set and further optimize the CNN model, while trying to fuse other features, such as bytecode sequence features and API sequence features, to improve accuracy and obtain a better detection algorithm.

REFERENCES
[1] Mobile operating systems' market share worldwide from January 2012 to July 2020 [EB/OL]. https://www.statista.com/statistics/272698/global-market-share-held-by-mobile-operating-systems-since-2009/A
[2] NATARAJ L, KARTHIKEYAN S, JACOB G, et al. Malware Images: Visualization and Automatic Classification [C]. ACM. 8th International Symposium on Visualization for Cyber Security, July 20, 2011, Pittsburgh, Pennsylvania, USA. New York: ACM, 2011: 4-11.
[3] Arp D, Spreitzenbarth M, Hubner M, et al. Drebin: Effective and explainable detection of android malware in your pocket [C]. NDSS. 2014, 14: 23-26.
[4] Xia Xiaoling. Research on classification method of malicious code image and text features based on deep learning [D]. Harbin: Harbin Institute of Technology, 2018.
[5] Li Yuanyuan. Malicious code detection based on data visualization [D]. Xi'an: Xidian University, 2018.
[6] Zheng Rui, Wang Qiuyun, Fu Jianming, et al. A malware family classification model based on deep learning [J]. Journal of Information Security, 2020, 5(1): 1-9.
[7] Summary of image interpolation algorithms [EB/OL]. https://www.cnblogs.com/laozhanghahaha/p/12580822.html
[8] Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification with Deep Convolutional Neural Networks [C]. Advances in Neural Information Processing Systems, 2012: 1097-1105.
[9] Deep feature fusion---high and low-level (multi-scale) feature fusion [EB/OL]. https://blog.csdn.net/xys430381_1/article/details/88370733.
[10] Zeiler M.D., Fergus R. (2014) Visualizing and Understanding Convolutional Networks [C]. ECCV 2014. Lecture Notes in Computer Science, vol 8689. Springer, Cham. pp 818-833
[11] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich; Going Deeper With Convolutions [C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1-9
[12] Zhang Z., Zhang X., Peng C., Xue X., Sun J. ExFuse: Enhancing Feature Fusion for Semantic Segmentation [C]. In: Ferrari V., Hebert M., Sminchisescu C., Weiss Y. (eds) Computer Vision – ECCV 2018. pp 273-288
[13] Aude Oliva & Antonio Torralba. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope [J]. January 22, 2001. pp 145-175
[14] Chen, Tianqi and Guestrin, Carlos. XGBoost: A Scalable Tree Boosting System [C]. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. August 2016, Pages 785–794
[15] Ke G, Meng Q, Finley T, et al. LightGBM: A highly efficient gradient boosting decision tree [J]. Advances in neural information processing systems, 2017, 30: 3146-3154.
[16] He, Xinran, J. Pan, Ou Jin, T. Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, R. Herbrich, S. Bowers and J. Q. Candela. “Practical Lessons from Predicting Clicks on Ads at Facebook.” ADKDD'14 (2014).
[17] VirusShare.com - Because Sharing is Caring [EB/OL]. https://virusshare.com
