ICISS 2020 - Shivi - 9 April 2020 - IIT Jammu

Machine learning based Android vulnerability
detection: A roadmap
Shivi Garg1 , Niyati Baliyan2*

1
Faculty of Computer Engineering
J.C. Bose University of Science and Technology, YMCA
Faridabad, India
2
Information Technology Department
Indira Gandhi Delhi Technical University for Women
Delhi, India
{shivi1989@gmail.com, niyatibaliyan@igdtuw.ac.in}
Abstract. Cyber-security risk is increasing at an alarming rate because of
global connectivity. It is important to protect sensitive information and maintain
privacy. Machine Learning and Deep Learning algorithms have proved more
effective than conventional methods at detecting and predicting cyber threats.
This paper discusses the role of such algorithms in cyber-security, particularly
for the Android mobile operating system. A statistical analysis identifies the
different vulnerabilities affecting Android and their trend from 2009 to 2019.
Vulnerability assessment is performed on the confidentiality, integrity and
availability aspects of information security. Concrete research gaps are
identified in existing works, and various methods to improve machine learning
and deep learning techniques are presented.
Keywords: Android, Cyber-security, Deep Learning, Machine Learning,
Malware.
1 Introduction
Machine Learning (ML) [1] plays a crucial role in analyzing voluminous data,
because improved hardware and sophisticated algorithms are readily available and
continue to evolve. The approaches that ML algorithms use to solve real-world
problems are as follows:
1.1 Supervised Learning
It is also known as the task-driven approach, in which the available data points
have predefined labels for the output variable. The ML model makes decisions about
new data on the basis of this labeled data.
1.2 Unsupervised Learning
It is also known as the data-driven approach, in which labels for the output variable
are not available for any data point. Based on different properties of the data, the
ML model can reveal interesting patterns. It is generally used to find anomalies in data.
1.3 Semi-supervised learning
This approach combines the advantages of both supervised and unsupervised learning
approaches, when there is some labeled data available.
1.4 Reinforcement learning
It is also known as the environment-driven approach. It is generally used when the
behavior of data changes with the environment.
Deep Learning (DL) [2] is a subclass of ML in which neural networks comprising one
or more hidden layers (each comprising one or more perceptrons) progressively extract
features from the raw data. The problems for which ML/ DL can prove highly efficient
can be categorized into prediction, classification, clustering, recommendation
(association rule mining), dimensionality reduction, etc., which are discussed as
follows:
Prediction – it is the task of forecasting the subsequent numeric/ continuous value
based on the previous values. Prediction is also referred to as regression in ML.
Different regression methods used in ML are linear, ridge, polynomial regression,
Decision Trees (DT), Support Vector Regression (SVR), Random Forest (RF), etc.
DL models used in regression are Artificial Neural Network (ANN), Recurrent Neural
Network (RNN), Neural Turing Machines (NTM) and Differentiable Neural
Computer (DNC).
Classification – it is a task of categorizing things into different discrete classes.
Different ML models used in classification are Logistic Regression (LR), k-Nearest
Neighbors (k-NN), Support Vector Machine (SVM), Naive Bayes, DT, and RF. DL
methods for classification are ANN and Convolutional Neural Networks (CNN).
Clustering – it is the task of grouping things on the basis of similarity, since the
classes are not labeled beforehand. ML models used in clustering are k-nearest
neighbours (k-NN), k-means, DBSCAN, Bayesian, Gaussian Mixture Model and
Agglomerative Mean-shift. Self-Organizing Maps (SOM), or Kohonen networks, are
DL methods used in clustering.
Recommendation – it is also known as Association Rule Mining (ARM). It is the
task of predicting future preferences on the basis of past experiences. ML models
used in ARM are Apriori, Eclat and FP-Growth. Deep Restricted Boltzmann Machine
(RBM), Deep Belief Network (DBN) and Stacked Autoencoder are used as DL models.
Dimensionality Reduction – it is also called generalization. It is the process of
selecting important and appropriate features from a large data set to reduce
redundancy. ML models used in dimensionality reduction are Linear Discriminant
Analysis (LDA), Principal Component Analysis (PCA), t-distributed Stochastic
Neighbor Embedding (t-SNE), Singular Value Decomposition (SVD), Latent
Semantic Analysis (LSA), Factor Analysis (FA), Independent Component Analysis
(ICA) and Non-negative Matrix Factorization (NMF).
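To make the classification task above concrete, the following minimal sketch implements k-NN in plain Python on binary feature vectors, using Hamming distance. The feature names, the toy samples and their labels are hypothetical and serve only as an illustration; they are not taken from any dataset discussed in this paper.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, query, k=3):
    """Return the majority label among the k training samples nearest to query."""
    neighbours = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: each vector marks presence (1) or absence (0) of
# [SEND_SMS, READ_CONTACTS, INTERNET, CAMERA]; labels are illustrative only.
TRAIN = [
    ((1, 1, 1, 0), "malware"),
    ((1, 1, 0, 0), "malware"),
    ((1, 0, 1, 0), "malware"),
    ((0, 0, 1, 1), "benign"),
    ((0, 0, 1, 0), "benign"),
    ((0, 0, 0, 1), "benign"),
]

print(knn_predict(TRAIN, (1, 1, 1, 1)))
print(knn_predict(TRAIN, (0, 0, 1, 1)))
```

The labeled training set is what makes this a supervised method: the prediction for a new sample is only as good as the labels of its nearest neighbours.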
ML/ DL is an intersection of data science, data mining and classical programming
[3], as shown in Figure 1. ML exemplifies the principles of data mining, but is also
capable of making automatic correlations and learning from them to apply to new
algorithms.
Fig. 1. Venn Diagram for ML/ DL terminology
Cyber-security [4] is defined as the technologies and practices designed to
protect data, programs, systems and networks from attacks, damage or
unauthorized access. Such attacks can access, modify or destroy confidential
information, extort money from users, and interrupt business processes.
ML/ DL have proved to be efficient in different areas of cyber-security. There are
three dimensions in which ML/ DL can be applied: why, what and how.
The first dimension answers “why”, i.e., the reason to perform a cyber-security task.
The goal is to detect threats, predict oncoming attacks, etc. Following Gartner’s
Predict, Prevent, Detect, and Respond (PPDR) model [5], security tasks are divided
into five categories, as shown in Table 1.
Table 1. Categories of security tasks.
| Tasks | Description | Tools/ Techniques |
|---|---|---|
| Prediction | forecasts the attacks | Vulnerability Management tools, Penetration Testing Solutions, Security Configuration Management |
| Prevention | prevents the attacks | Firewalls, Intrusion Prevention System, Encryption, Virtual Private Network (VPN) |
| Detection | identifies potential anomalies and attacks which have bypassed the blocking systems | Log Management, Threat Analytics, Deception Tools, Fraud Monitoring |
| Response | deals with the several measures once an incident occurs | Forensic tools, incident management and response, IT service management tools |
| Monitoring | focuses on risk analysis and security monitoring, which are in compliance with specific standards | Governance, Risk, and Compliance (GRC), Threat and Vulnerability Management, Security Operations, Analytics, Reporting (SOAR) |
The second dimension answers “what”, i.e., the technical level at which issues are
monitored. The layers in this dimension are network, endpoint, application, user
and process. Table 2 describes how ML/ DL techniques can be used to monitor
anomalies in the different layers.
Table 2. ML applications in cyber security areas.
| Protection Layer | Sub Layer | ML Techniques | Role of ML |
|---|---|---|---|
| Network | SCADA systems, Ethernet, Virtual networks (e.g., SDNs) | Regression | Prediction and comparison of network packet parameters with normal packets |
| | | Classification | Identification of varied network attacks like scanning and spoofing |
| | | Clustering | Forensic analysis |
| Endpoint | IoT device, Mobile, Server, Cloud instance, Workstation | Regression | Prediction of the next system call for an executable and comparison with real ones |
| | | Classification | Categorizing programs into malware, spyware, adware, ransomware |
| | | Clustering | Malware protection on secure email gateways |
| Application Security | Databases, ERP systems, Web applications | Regression | Anomaly detection in HTTP requests |
| | | Classification | Detection of known attacks like SQL injection, XSS, etc. |
| | | Clustering | Detection of mass exploitation and DDoS attacks |
| User Behavior | - | Regression | Anomaly detection in user actions |
| | | Classification | Peer-group analysis based on different users |
| | | Clustering | Detection of outliers |
| Process Behavior | - | Regression | Prediction and detection of outliers like credit card fraud |
| | | Classification | Detection of known frauds |
| | | Clustering | Detection of outliers and comparison with business processes |
The third dimension answers “how” security mechanisms are checked: on historical
data, on data at rest, or on data in transit in real time.
There is a widespread need for cyber-security since the number of software
vulnerabilities is increasing enormously. A software vulnerability can be defined as a
weakness in system procedures, information systems or policy that allows
attackers to compromise information security. In this paper, we focus on the Android
operating system and its vulnerabilities. The reason for choosing Android is that it
was the most vulnerable operating system of the year 2016 [6].
The paper is organized as follows. Section 2 discusses the related work in this
area. Section 3 describes the Android architecture. Section 4 lists the different
types of vulnerabilities affecting Android. Section 5 presents the trends of
vulnerabilities in Android from 2009 to 2019, with a detailed breakdown for 2016 to
2019. Section 6 deals with ways of improving ML/ DL methods to make them more
efficient and accurate. Finally, Section 7 concludes the paper.
2 Related Work
There are two aspects of related research work summarized in this paper: Role of ML/
DL in cyber-security domains and Role of ML/ DL in detecting Android
vulnerabilities.
2.1 Role of ML/ DL in cyber-security domains
Ahmed et al. [7] presented an in-depth analysis of anomaly detection techniques
grouped into four categories: classification, statistical, information theory and
clustering. Nguyen et al. [8] proposed a one-class collective anomaly detection model
based on Long Short-Term Memory Recurrent Neural Network (LSTM-RNN).
Conventional approaches to anomaly detection are based on learning normal and
malicious behavior, and do not use memory to record previous events when
classifying new ones. Le et al. [9] proposed a DL based malware classification
approach. This approach is data driven for identifying features and complex
patterns. Bahnsen et al. [10] used RNN to detect phishing attacks with an
accuracy of 98.7%. This method eliminates the need to create features manually.
Tuor et al. [11] detected anomalous network activity from system logs in real time
using an online unsupervised DL approach. Dong et al. [12] proposed opinion fraud
detection using an end-to-end trainable unified model that leverages the
properties of autoencoders and random forest.
2.2 Role of ML/ DL in detecting Android vulnerabilities
Tchakounté and Hayata [13] used supervised ML to detect Android malware. They
used permissions as a feature to detect malicious behavior. Maier et al. [14]
demonstrated that Android malware can bypass many AntiVirus (AV) tools and
Google Bouncer. Hussain et al. [15] presented a conceptual framework for improving
user privacy and securing medical data in Android Mobile Health (mHealth)
applications. Liang et al. [16] proposed an end-to-end DL model for
Android malware detection using raw system call sequences. They achieved an
accuracy of 93.16%. Ganesh et al. [17] presented a CNN based malware detection
solution using permissions. This solution detected malware with an accuracy of 93%.
Garg and Baliyan [18] proposed a novel parallel classifier scheme for detection of
vulnerabilities in Android with an accuracy of 98.27%. Details of the data collection
and various steps of preprocessing are discussed in [19].
3 Android Architecture
Android is a Linux-based mobile operating system with features such as a shared
memory mechanism, the Binder IPC mechanism, a power manager, etc. Five software
layers sit on top of the Linux kernel, namely, the hardware abstraction layer,
native libraries, Android runtime, application framework and system applications
(apps) [20], as shown in Figure 2.
Hardware Abstraction Layer (HAL) - It acts as an interface through which the
Android application/framework communicates with hardware-specific device drivers
such as camera, Bluetooth, etc. HAL is hardware-specific and its implementation
varies from vendor to vendor.
Native Libraries - Core system components and services of Android, like Android
Runtime (ART) and Hardware Abstraction Layer (HAL), are built from native
libraries written in C/C++. There are different libraries, such as application
framework libraries and libraries for building the user interface, graphics drawing
and database access.
Android Runtime (ART) - ART was introduced as a new runtime environment in
newer Android versions (version 5.0 onwards). It uses ahead-of-time (AOT)
compilation during app installation, together with just-in-time (JIT) compilation,
to compile the Dalvik bytecode into native binaries (ELF format). This optimizes
garbage collection and power consumption and achieves high runtime performance.
Application framework - The Android SDK provides tools and API libraries to develop
Android applications in Java. This framework is known as the Android Application
Framework. Important features are a database for storing data, support for audio,
video and image formats, debugging tools, etc.
System applications - Applications are located at the topmost layer of the
Android stack. These consist of both native applications and third-party applications
installed by the user, such as web browser, email, SMS messenger, etc.
Fig. 2. Android Architecture
4 Android Vulnerabilities
Android is a complex open network of different collaborating companies. Android is
customized by many hardware and network providers to meet their requirements. This
makes the Android system more vulnerable than other mobile operating systems like iOS.
The different vulnerabilities present in Android are enumerated as follows:
Denial of Service (DoS) - The DoS vulnerability makes resources unavailable by
tampering with network packets, logic, programming, etc. Services are denied to
legitimate users when there is a large number of requests. Arbitrary code can be
injected and executed while performing DoS attacks to access critical information.
DoS attacks have a direct impact on availability by introducing large response
delays, service interruptions and excessive losses.
Code Execution - This vulnerability is exploited by executing arbitrary code.
It is caused by improper input/output data validation. Arbitrary code can take
control of privileges and change or delete data using complete user rights.
Overflow - An overflow vulnerability can occur when a malicious program places more
data in a storage area than was originally allocated for it. The excess data can
corrupt/ overwrite the existing data. It can also carry special instructions that
trigger a response to damage files, access personal information or change the data.
Memory Corruption - A memory corruption vulnerability can occur in a system
when memory is altered without an explicit assignment. Programming errors can
enable attackers to execute arbitrary code, which can modify the contents of a
memory location.
SQL Injection - An SQL query is injected via the input data from the client to the
application. This query can access sensitive data from the database, modify the
database, shut down the entire database management system, etc.
Cross Site Scripting (XSS) - XSS vulnerability is due to the injection of some
malicious scripts into benign and trusted websites. Attacker uses a browser side script
to send malicious code to the end user. The web application takes an input from the
user and generates an output without any validation. The malicious script can change
the HTML content of a web page, access tokens of the sessions, cookies, or any other
sensitive information used by the browser.
Directory Traversal - A directory traversal vulnerability, also known as path
traversal, allows access to directories and files that are stored outside the web
root folder. Arbitrary files and directories or critical system files can be accessed
by manipulating absolute file paths. In the case of Android, it takes the form of an
HTTP exploit in which attackers carry out a path traversal attack in the context of
a user application and read/write files inside internal storage.
HTTP Response Splitting - This vulnerability occurs when malicious characters
like carriage return (\r) and line feed (\n) are inserted in the HTTP response header
and sent to the end user without any validation. These characters not only give
attackers direct control of the remaining headers and body of the response the
application intends to send, but also allow them to create additional responses
entirely under their control.
Bypass something - This vulnerability occurs when attackers can bypass
authentication mechanisms. Attackers can access unprotected files and can attack
protected applications by evading the authentication system.
Gain Information - This vulnerability allows attackers to gain privileges via a
malicious program in the affected application. It allows local users to gain privileges
via a crafted application that makes an API call to access sensitive information in the
registry.
Gain Privileges - It can occur when an attacker exploits design or configuration
flaws in an application or operating system to gain access to resources and
confidential data. The resources are then unavailable to the users.
Attackers can steal credentials and other sensitive information and can execute
arbitrary code.
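Two of the input-validation flaws enumerated above, SQL injection and directory traversal, can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not an Android-specific exploit or a complete defence: the table `users`, the helper names and the base directory `/data/app/files` are hypothetical.

```python
import os
import sqlite3

def find_user_unsafe(conn, name):
    # Vulnerable: user input is concatenated directly into the SQL text.
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()

def find_user_safe(conn, name):
    # Parameterized query: the input is bound as data, never parsed as SQL.
    return conn.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchall()

def is_inside(base_dir, requested):
    """True if the requested path resolves to a location inside base_dir."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, requested))
    return os.path.commonpath([base, target]) == base

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

payload = "x' OR '1'='1"
print(len(find_user_unsafe(conn, payload)))  # injected condition matches every row
print(len(find_user_safe(conn, payload)))    # payload is treated as a literal name

print(is_inside("/data/app/files", "notes.txt"))
print(is_inside("/data/app/files", "../../../etc/passwd"))
```

In both cases the fix is the same idea: validate or bind untrusted input before it reaches the interpreter (the SQL engine or the file system) rather than after.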
5 Trends in Android Vulnerabilities
It is important to analyze the trend of Android vulnerabilities from 2009 to
2019. The number of vulnerabilities increases continuously until 2017, followed by
a steep decrease in 2018 and 2019.
Fig. 3. Android Vulnerabilities trend from the year 2009 to 2019
This decrease in the number of vulnerabilities is due to the better detection rates
achieved using ML and DL algorithms. Table 3 shows the detection rates of ML/ DL
algorithms from 2016 onwards.
Table 3. ML/DL Algorithms Detection rates for Android Vulnerabilities.
| Published Work | Key Feature | # of Apps | Algorithm used | Performance Metric(s) |
|---|---|---|---|---|
| 2016 [21] | Static | 7972 | DBN | F1 score = 97.3; Precision = 98.2; Recall = 94.4 |
| 2016 [22] | Static | 6334 | DBN | F1 score = 94.5; Precision = 93.09; Recall = 94.5 |
| 2016 [23] | Dynamic | 3000 | SAE | Accuracy = 93.68 |
| 2017 [24] | Static | 2500 | CNN-AlexNet | Accuracy = 93 |
| 2017 [25] | Static data flow analysis | 11000 | DBN | F1 score = 95.05 |
| 2017 [26] | API calls | 5000 | DBN SAE | Accuracy = 96.66 |
| 2017 [27] | Opcode Sequence | 27377 | CNN | Accuracy = 98; Precision = 99; Recall = 95; F1 score = 97 |
| 2017 [28] | Dynamic | 10000 | CNN | Accuracy = 93.1; Precision = 95.75; F1 score = 86.57 |
| 2017 [29] | Dynamic | 7100 | CNN | Accuracy = 85-95 |
| 2018 [30] | Static | 2800 | DBN | Recall = 94.28 |
| 2018 [31] | Static | 10770 | CNN | Precision = 96.6; Recall = 98.3; Accuracy = 97.4; F1 score = 97.4 |
| 2018 [32] | Static | 6965 | DBN | Accuracy = 95.7 |
| 2018 [33] | Static | 23000 | CNN | Accuracy = 99.8; Recall = 99.91; F1 score = 99.82 |
| 2018 [34] | Static | 33000 | CNN | F1 score = 96.29; Precision = 96.29; Recall = 96.29 |
| 2018 [35] | Static | 110440 | CNN - LSTM | Accuracy = 97.74 |
| 2018 [36] | Static | 129013 | DNN | Precision = 97.15; Recall = 94.18; F1 score = 95.64 |
| 2018 [37] | Dynamic/Static features | 4208 | DNN | Accuracy = 95 |
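The metrics in Table 3 are related: the F1 score is the harmonic mean of precision and recall. The short sketch below recomputes it for two entries that report all three values, [31] and [36]; the function name `f1_score` is our own and the inputs are percentages copied from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall (in %) as reported in Table 3 for [31] and [36].
for ref, p, r in [("[31]", 96.6, 98.3), ("[36]", 97.15, 94.18)]:
    print(ref, round(f1_score(p, r), 2))
```

For both entries the recomputed value agrees with the reported F1 score to the precision given in the table. Accuracy cannot be recovered from precision and recall alone, which is why works that report only accuracy are not directly comparable with those that report only F1.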
ML and DL are effective tools for detecting several categories of vulnerabilities.
Vulnerabilities like memory corruption, bypass, information leak and privilege
escalation have reduced from 2016 to 2019 due to the higher detection rate of ML/ DL
algorithms.
Table 4 shows that Memory corruption vulnerability instances have reduced from
38 to 19 from 2016 to 2019. This can be partially justified with the work of Grieco et
al. [38] in which memory corruption vulnerability was detected with an accuracy of
83%.
It is evident from Table 4 that privilege escalation vulnerability instances have
reduced from 250 to 1 from 2016 to 2019. This can be partially explained by the
work of Xin et al. [39], which established that privilege escalation vulnerability
can be detected using seven ML classifiers (LR, DT, kNN, NB, SVM, RF and AdaBoost)
with an accuracy of >80%.
Denial of Service (DoS) vulnerability instances have reduced from 106 to 35 and
Bypass vulnerability instances have reduced from 48 to 30 from 2016 to 2019, as
shown in Table 4. This can be partially explained by the work of Zhou et al. [40],
which established that DoS and Bypass vulnerabilities can be detected using the
AdaBoost classifier with an accuracy of ~77%.
Table 4 shows that information leak (gain information) vulnerability instances
have reduced from 99 to 16 from 2016 to 2019. This can be partially explained by
the work of Amr Amin et al. [41], which established that information leak
vulnerabilities can be detected using an ML classifier with an accuracy of 84%.
Table 4. Vulnerability Trends from 2016 - 2019.
| Vulnerability | 2016 | 2017 | 2018 | 2019 | Total | % share of each vulnerability (2016- 2019) |
|---|---|---|---|---|---|---|
| Code Execution | 73 | 206 | 84 | 89 | 452 | 19% |
| Overflow | 92 | 168 | 141 | 34 | 435 | 18% |
| Gain Privileges | 250 | 36 | 3 | 1 | 290 | 12% |
| Gain Information | 99 | 109 | 64 | 16 | 288 | 12% |
| DoS | 106 | 87 | 32 | 35 | 260 | 11% |
| Bypass something | 48 | 31 | 17 | 30 | 126 | 5% |
| Memory Corruption | 38 | 32 | 12 | 19 | 101 | 4% |
| Others | -181 (not mutually exclusive) | 174 | 260 | 190 | 443 | 18% |
| # of Vulnerabilities | 525 | 843 | 613 | 414 | 2395 | 100% |
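The "Others" row of Table 4 appears to be a residual: the yearly total minus the sum of the seven listed categories. The sketch below checks this reading with the counts copied from the table; the reading itself is our inference from the numbers, not a formula stated in the paper.

```python
# Yearly counts copied from Table 4, in the order 2016, 2017, 2018, 2019.
LISTED = {
    "Code Execution":    [73, 206, 84, 89],
    "Overflow":          [92, 168, 141, 34],
    "Gain Privileges":   [250, 36, 3, 1],
    "Gain Information":  [99, 109, 64, 16],
    "DoS":               [106, 87, 32, 35],
    "Bypass something":  [48, 31, 17, 30],
    "Memory Corruption": [38, 32, 12, 19],
}
TOTALS = [525, 843, 613, 414]

# Sum the listed categories per year, then subtract from the yearly total.
listed_sums = [sum(col) for col in zip(*LISTED.values())]
others = [total - s for total, s in zip(TOTALS, listed_sums)]
print(others, sum(others))
```

A negative residual, as in 2016, is possible only when the listed categories overlap, i.e., when one vulnerability is counted under more than one category, which is consistent with the "not mutually exclusive" note in the table.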
Table 5 shows the change in the distribution pattern of vulnerabilities during
2016 - 2019. It is calculated in terms of percentage points (pp), defined as the
difference between two percentages. In particular, the share of the gain privileges
vulnerability decreased by 48 pp (from 48% to 0%) over the four-year period. The
shares of the other major vulnerabilities, like overflow, gain information, DoS,
bypass and memory corruption, also reduced. This decrease is a result of the
increased use of ML/ DL techniques, implying better detection rates and accuracy
over time.
Table 5. Change in Distribution pattern of Vulnerabilities during 2016-2019.
| Vulnerability | 2016 | 2019 | % share in 2016 | % share in 2019 | % difference (2016- 2019) |
|---|---|---|---|---|---|
| Code Execution | 73 | 89 | 14% | 21% | 8 pp |
| Overflow | 92 | 34 | 18% | 8% | -9 pp |
| Gain Privileges | 250 | 1 | 48% | 0% | -48 pp |
| Gain Information | 99 | 16 | 19% | 4% | -15 pp |
| DoS | 106 | 35 | 20% | 8% | -12 pp |
| Bypass something | 48 | 30 | 9% | 7% | -2 pp |
| Memory Corruption | 38 | 19 | 7% | 5% | -3 pp |
| # of Vulnerabilities | 525 | 414 | 100% | 100% | 0 pp |
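The percentage-point change in Table 5 can be reproduced as follows. This is a minimal sketch with our own helper names (`share`, `pp_change`); the counts are copied from Table 5, and shares are taken relative to the yearly totals of 525 and 414.

```python
def share(count, total):
    """Share of a category in a given year, in percent."""
    return 100.0 * count / total

def pp_change(count_start, total_start, count_end, total_end):
    """Change in share between two years, in percentage points."""
    return share(count_end, total_end) - share(count_start, total_start)

TOTAL_2016, TOTAL_2019 = 525, 414
# (2016 count, 2019 count) copied from Table 5.
ROWS = {
    "Overflow": (92, 34),
    "Gain Information": (99, 16),
    "DoS": (106, 35),
}

for name, (c16, c19) in ROWS.items():
    change = pp_change(c16, TOTAL_2016, c19, TOTAL_2019)
    print(f"{name}: {share(c16, TOTAL_2016):.1f}% -> "
          f"{share(c19, TOTAL_2019):.1f}% ({change:+.1f} pp)")
```

Note that the difference is computed here from unrounded shares, so a result can differ by one point from subtracting the already rounded percentages shown in the table.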
The assessment is carried out not merely on the impact score of each vulnerability,
but also on its number of occurrences. Table 6 shows the average impact score of
each vulnerability, and the count of CVE IDs gives the number of instances of each
vulnerability. The number of instances depicts how widespread a vulnerability is,
and the impact score depicts its severity level. This ensures that we capture the
effect of both the volume and the impact of each vulnerability.
Table 6. Mean impact score of each vulnerability.
| Vulnerability | Mean Score | COUNT of CVE ID | Total Impact |
|---|---|---|---|
| Unclassified (Others) | 6.3 | 836 | 29.9% |
| Exec Code | 8.3 | 310 | 14.7% |
| Priv | 8.7 | 215 | 10.6% |
| Overflow | 7.1 | 256 | 10.3% |
| Info | 4.6 | 259 | 6.8% |
| DoS | 6.2 | 191 | 6.7% |
| DoS Exec Code Overflow Mem. Corr. | 9.8 | 66 | 3.7% |
| Bypass | 5.8 | 97 | 3.2% |
| Exec Code Overflow | 8.3 | 56 | 2.7% |
| Overflow +Priv | 8.4 | 40 | 1.9% |
| Exec Code +Priv | 9.2 | 31 | 1.6% |
| Exec Code Overflow Mem. Corr. | 9.1 | 31 | 1.6% |
| Bypass +Info | 4.8 | 40 | 1.1% |
| Classified (Others) | 7.0 | 135 | 5.4% |
Further, impact analysis on a single instance is not enough to judge the severity
of a vulnerability. A large number of instances of a vulnerability with an average
impact score shows how widespread it is, and a higher number of cases makes it
more dangerous. This is shown in Figure 4.
Fig. 4. Total impact of different types of vulnerabilities
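The paper does not state how the Total Impact column of Table 6 is obtained. One reading that is consistent with the numbers is that each class is weighted by its mean impact score multiplied by its count of CVE IDs, and the weights are then normalized to 100%. The sketch below tests that assumption against the table; the variable names are our own.

```python
# (name, mean impact score, count of CVE IDs) copied from Table 6.
ROWS = [
    ("Unclassified (Others)", 6.3, 836),
    ("Exec Code", 8.3, 310),
    ("Priv", 8.7, 215),
    ("Overflow", 7.1, 256),
    ("Info", 4.6, 259),
    ("DoS", 6.2, 191),
    ("DoS Exec Code Overflow Mem. Corr.", 9.8, 66),
    ("Bypass", 5.8, 97),
    ("Exec Code Overflow", 8.3, 56),
    ("Overflow +Priv", 8.4, 40),
    ("Exec Code +Priv", 9.2, 31),
    ("Exec Code Overflow Mem. Corr.", 9.1, 31),
    ("Bypass +Info", 4.8, 40),
    ("Classified (Others)", 7.0, 135),
]

# Assumed reading: weight each class by mean score x instance count,
# then normalize so that all classes sum to 100%.
weighted = {name: score * count for name, score, count in ROWS}
grand_total = sum(weighted.values())
impact_share = {name: 100.0 * w / grand_total for name, w in weighted.items()}

for name, pct in sorted(impact_share.items(), key=lambda kv: -kv[1]):
    print(f"{name:36s} {pct:5.1f}%")
```

Under this reading the computed shares agree with the Total Impact column to within about 0.1 percentage point, a gap that rounding of the mean scores can account for. It also makes the argument of this section explicit: a class with a moderate mean score, such as Unclassified (Others), can dominate the total impact purely through its number of instances.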
Software vulnerabilities affect the Confidentiality, Integrity and Availability (CIA)
triad, on the basis of which impact scores are calculated.
Confidentiality – It states that information should be kept secret from any
unauthorized access. Information can be kept confidential through passwords, PIN
numbers, CVV details of a credit card, etc. Some common attacks on confidentiality
are packet sniffing, ping sweeps, password attacks, phishing, key-logging, etc.
Integrity – It states that information, once generated, should not be tampered with
by any unauthorized entity, i.e., the actual information should not be compromised.
Integrity attacks include session hijacking, man-in-the-middle attack, salami attack,
trust relationship attack, etc.
Availability – It states that if any information is requested then it should be
available without any interruption. Some of the common availability attacks are DoS
attack, DDoS attack, SYN flood attack, etc.
CIA is the critical tenet of information security; therefore, it is important to
understand the impact of vulnerabilities on CIA. Table 7 shows the impact of
vulnerability scores on confidentiality. Out of all the vulnerabilities,
medium-level vulnerabilities dominate with a 51% share, whereas the high-level and
low-level shares are 5% and 44%, respectively. Further analysis reveals that
high, medium and low-level vulnerabilities mostly have a complete impact on
confidentiality (61% overall), i.e., the attacker has access to all the sensitive
information and confidentiality is totally compromised.
Table 7. Impact Score on Confidentiality.
| Complexity | Complete (%) | None (%) | Partial (%) | Total (%) |
|---|---|---|---|---|
| High | 4 | 0 | 1 | 5 |
| Medium | 33 | 5 | 13 | 51 |
| Low | 25 | 4 | 15 | 44 |
| Total (%) | 61 | 9 | 29 | 100 |
Table 8 shows the impact of vulnerability scores on integrity. High, medium and
low-level vulnerabilities mostly have a complete impact on integrity (60% overall),
i.e., the attacker can modify the information and integrity is totally compromised.
Table 8. Impact Score on Integrity.
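| Complexity | Complete (%) | None (%) | Partial (%) | Total (%) |
|---|---|---|---|---|
| High | 4 | 0 | 0 | 4 |
| Medium | 32 | 11 | 8 | 51 |
| Low | 24 | 11 | 10 | 45 |
| Total (%) | 60 | 22 | 18 | 100 |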
Table 9 shows the impact of vulnerability scores on availability. High, medium and
low-level vulnerabilities mostly have a complete impact on availability (66%
overall), i.e., the information is no longer accessible because all ongoing
processes are shut down and all processing is frozen.
Table 9. Impact Score on Availability.
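| Complexity | Complete (%) | None (%) | Partial (%) | Total (%) |
|---|---|---|---|---|
| High | 4 | 0 | 0 | 4 |
| Medium | 35 | 7 | 8 | 50 |
| Low | 27 | 7 | 11 | 45 |
| Total (%) | 66 | 15 | 19 | 100 |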
6 ML/ DL Roadmap for Android Security
Existing ML/ DL approaches for vulnerability detection in Android are effective;
however, they can be improved further to achieve higher detection rates [42]. First,
there should be a reasonable collection of malicious and benign apps. Three
different approaches can be used to compile the appropriate datasets. A highly
efficient tool should be used for downloading benign apps from the Google Play store.
For malicious apps, there should be a rich, central, open-source repository like AMD,
built on literature research and made available to the public. Secondly, appropriate
tools should be used to automate and parallelize feature extraction in ML based
detection frameworks. Several features can be extracted from an Android app, like
APIs, permissions, services, etc. To illustrate, suppose there is an n-dimensional
vector $F_i$, where $i$ denotes the type of feature. A binary variable is
associated with each dimension of the vector, where 1 corresponds to the presence of
that feature in the app and 0 otherwise. Depending on its capabilities, an app $A_j$
gains access to a subset of these features $F_i$. Formally, each feature vector $F_i$
is written as $F_i = \{X_0, X_1, X_2, \ldots, X_n\}$. Feature selection is an
important step in ML algorithms since it is not known in advance which features are
relevant. The data dimension should be reduced to improve the classification time
without sacrificing performance beyond a certain threshold. There are two categories
of feature selection algorithms. Wrapper methods determine significant features using
the ML algorithms that are to be applied to the data. Filter methods, on the other
hand, determine the dominance of features using heuristics based on general
characteristics of the data.
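The binary feature encoding and a filter-style selection step can be sketched as follows. The vocabulary, the toy apps and their labels are hypothetical. The scoring heuristic, the absolute difference in how often a feature occurs in malicious versus benign apps, is a simple stand-in chosen for illustration and is not a method prescribed by this paper.

```python
# Hypothetical feature vocabulary; a real one would hold thousands of entries.
VOCAB = ["SEND_SMS", "READ_CONTACTS", "INTERNET", "CAMERA"]

def to_vector(app_features, vocab=VOCAB):
    """Binary vector: 1 if the app uses the feature, 0 otherwise."""
    return [1 if feature in app_features else 0 for feature in vocab]

# Toy apps as (set of features, label), with 1 = malicious and 0 = benign.
APPS = [
    ({"SEND_SMS", "INTERNET"}, 1),
    ({"SEND_SMS", "READ_CONTACTS", "INTERNET"}, 1),
    ({"INTERNET", "CAMERA"}, 0),
    ({"INTERNET"}, 0),
]

def filter_scores(vectors, labels):
    """Filter heuristic: |P(feature | malicious) - P(feature | benign)|."""
    pos = [v for v, y in zip(vectors, labels) if y == 1]
    neg = [v for v, y in zip(vectors, labels) if y == 0]
    scores = []
    for j in range(len(vectors[0])):
        rate_pos = sum(v[j] for v in pos) / len(pos)
        rate_neg = sum(v[j] for v in neg) / len(neg)
        scores.append(abs(rate_pos - rate_neg))
    return scores

vectors = [to_vector(features) for features, _ in APPS]
labels = [label for _, label in APPS]
scores = filter_scores(vectors, labels)

# Rank features from most to least discriminative under this heuristic.
ranked = sorted(zip(VOCAB, scores), key=lambda pair: -pair[1])
print(ranked)
```

Because the score depends only on the data and not on any classifier, this is a filter method. A wrapper method would instead retrain the chosen classifier on candidate feature subsets and keep the subset that performs best, which is usually more accurate but considerably more expensive.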
Thirdly, the most suitable ML/ DL algorithms should be chosen based on their inherent
properties to achieve the best results with less computational power and time.
7 Conclusions
Cyber-attacks are becoming more prevalent since a large amount of data is
transferred over computer networks. Therefore, new techniques and methodologies
are required that can detect, predict and identify these attacks faster to assist
cyber-security experts. This paper presents an insight into the role of ML/ DL in
handling cyber-threats. The focus is on Android and its vulnerabilities, since
Android was the most vulnerable OS, and it is important to study how different
ML/ DL techniques have detected these vulnerabilities over time. The paper also
discusses which types of vulnerabilities are seen in Android and which classes of
vulnerabilities are detected by ML/ DL. The severity scores of different
vulnerabilities are analyzed to assess their impact on the CIA triad. In addition,
different ways to improve ML/ DL techniques are suggested to obtain the best results
with optimum time and cost.
References
[1] A. Narayanan, M. Chandramohan, L. Chen, and Y. Liu, “A multi-view context-aware approach to
Android malware detection and malicious code localization,” Empirical Software Engineering, vol.
23(3), pp. 1222-1274, 2018.
[2] M. K. Alzaylaee, S. Y. Yerima, and S. Sezer, “DL-Droid: Deep learning based android malware
detection using real devices,” Computers & Security, vol. 89, p. 101663, 2020.
[3] A. Polyakov. (2020, Apr.). Machine Learning for Cybersecurity [Online]. Available:
https://towardsdatascience.com/machine-learning-for-cybersecurity-101-7822b802790b
[4] R. Von Solms and J. V. Niekerk, “From information security to cyber security,” Computers &
Security, vol. 38, pp. 97-102, 2013.
[5] P. Carpenter. (2016, Apr.). Using the Predict, Prevent, Detect, Respond Framework to Communicate
Your Security Program Strategy [Online]. Available: https://www.gartner.com
[6] S. Garg, R. K. Singh, and A. K. Mohapatra, "Analysis of software vulnerability classification based
on different technical parameters," Information Security Journal: A Global Perspective, vol. 28,
pp. 1-19, 2019.
[7] M. Ahmed, A.N. Mahmood, and J. Hu, "A survey of network anomaly detection techniques," Journal
of Network and Computer Applications, vol. 60, pp. 19-31, 2016.
[8] T. N. Nguyen, and N.A.L. Khac, "One-class collective anomaly detection based on LSTM-RNNs,"
Transactions on Large-Scale Data-and Knowledge-Centered Systems XXXVI, pp. 73-85, 2017.
[9] Q. Le, O. Boydell, B. Mac Namee, and M. Scanlon, “Deep learning at the shallow end: Malware
classification for non-domain experts,” Digital Investigation, vol. 26, pp. S118-S126, 2018.
[10] A.C. Bahnsen, E.C. Bohorquez, S. Villegas, J. Vargas, and F.A. González, “Classifying phishing
URLs using recurrent neural networks,” in 2017 APWG symposium on electronic crime research
(eCrime), pp. 1-8.
[11] A. Tuor, S. Kaplan, B. Hutchinson, N. Nichols, and S. Robinson, “Deep learning for unsupervised
insider threat detection in structured cybersecurity data streams,” in 2017 workshops at the
Thirty-First AAAI Conference on Artificial Intelligence.
[12] M. Dong, L. Yao, X. Wang, B. Benatallah, C. Huang, and X. Ning, “Opinion fraud detection via
neural autoencoder decision forest,” in 2018 Pattern Recognition Letters.
[13] F. Tchakounté, and F. Hayata, “Supervised Learning Based Detection of Malware on Android,” in
2017 Mobile Security and Privacy, pp. 101-154.
[14] D. Maier, M. Protsenko, and T. Müller, “A game of Droid and Mouse: The threat of split-personality
malware on Android,” Computers & Security, vol. 54, pp. 2-15, 2015.
[15] M. Hussain, A.A. Zaidan, B. Zidan, S. Iqbal, M.M. Ahmed, O.S. Albahri, and A.S. Albahri,
“Conceptual framework for the security of mobile health applications on android platform,”
Telematics and Informatics, vol. 35(5), pp. 1335-1354, 2018.
[16] H. Liang, Y. Song, and D. Xiao, “An end-to-end model for Android malware detection,” in 2017
IEEE International Conference on Intelligence and Security Informatics (ISI), pp. 140-142.
[17] M. Ganesh, P. Pednekar, P. Prabhuswamy, D.S. Nair, Y. Park, and H. Jeon, “Cnn-based android
malware detection,” in 2017 International Conference on Software Security and Assurance (ICSSA),
pp. 60-65.
[18] S. Garg and N. Baliyan, “A novel parallel classifier scheme for vulnerability detection in android,”
Computers & Electrical Engineering, vol. 77, pp. 12-26, 2019.
[19] S. Garg and N. Baliyan, “Data on vulnerability detection in android,” Data in Brief, vol. 22, pp. 1081-
1087, 2019.
[20] K. Tam, A. Feizollah, N. B. Anuar, R. Salleh, and L. Cavallaro, “The evolution of android malware
and android analysis techniques,” ACM Computing Surveys (CSUR), vol. 49(4), pp. 1-41, 2017.
[21] X. Su, D. Zhang, W. Li, and K. Zhao, “A deep learning approach to android malware feature learning
and detection,” in 2016 IEEE Trustcom/BigDataSE/ISPA, pp. 244-251.
[22] Z. Wang, J. Cai, S. Cheng, and W. Li, “DroidDeepLearner: Identifying Android malware using deep
learning,” in 2016 IEEE 37th Sarnoff Symposium, pp. 160-165.
[23] S. Hou, A. Saas, L. Chen, and Y. Ye, “Deep4MalDroid: A deep learning framework for Android
malware detection based on Linux kernel system call graphs,” Proc. - 2016 IEEE/WIC/ACM Int.
Conf. Web Intell. Work. WIW 2016, pp. 104–111, 2017.
[24] M. Ganesh, P. Pednekar, P. Prabhuswamy, D. S. Nair, Y. Park, and H. Jeon, “CNN-Based Android
Malware Detection,” in 2017 International Conference on Software Security Assurance, pp. 60–65,
2017.
[25] D. Zhu, M. H. Jin, D. Wu, M. Y. Yang, and W. Chen, “DeepFlow: Deep Learning-Based Malware
Detection by Mining Android Application for Abnormal Usage of Sensitive Data,” pp. 0–5, 2017.
[26] S. Hou, A. Saas, L. Chen, Y. Ye, and T. Bourlai, “Deep neural networks for automatic Android
malware detection,” Proc. 2017 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Mining, ASONAM
2017, pp. 803–810, 2017.
[27] N. McLaughlin, J. Martinez del Rincon, B. Kang, S. Yerima, P. Miller, S. Sezer, and G. Joon Ahn,
“Deep Android Malware Detection,” Proc. Seventh ACM Conf. Data Appl. Secur. Priv. - CODASPY
’17, pp. 301–308, 2017.
[28] H. Liang, Y. Song, and D. Xiao, “An end-to-end model for Android malware detection,” 2017 IEEE
Int. Conf. Intell. Secur. Informatics Secur. Big Data, ISI 2017, pp. 140–142, 2017.
[29] F. Martinelli, F. Marulli, and F. Mercaldo, “Evaluating Convolutional Neural Network for Effective
Mobile Malware Detection,” Procedia Comput. Sci., vol. 112, pp. 2372–2381, 2017.
[30] W. Li, Z. Wang, J. Cai, and S. Cheng, “An Android Malware Detection Approach Using Weight-
Adjusted Deep Learning,” 2018 Int. Conf. Comput. Netw. Commun., pp. 437–441, 2018.
[31] Y. Zhang, Y. Yang, and X. Wang, "A Novel Android Malware Detection Approach Based on
Convolutional Neural Network," Proceedings of the 2nd International Conference on Cryptography,
Security, and Privacy, pp. 144-149, 2018.
[32] L. Shiqi, T. Shengwei, Y. Long, Y. Jiong, and S. Hua, “Android malicious code Classification using
Deep Belief Network,” KSII Trans. Internet Inf. Syst., vol. 12, no. 1, pp. 454–475, 2018.
[33] W. Wang, M. Zhao, and J. Wang, “Effective Android malware detection with a hybrid model based
on deep autoencoder and convolutional neural network,” J. Ambient Intell. Humaniz. Comput., vol. 0,
no. 0, pp. 1–9, 2018.
[34] E. M. B. Karbab, M. Debbabi, A. Derhab, and D. Mouheb, “MalDozer: Automatic framework for
Android malware detection using deep learning,” Digit. Investig., vol. 24, no. March, pp. S48–S59,
2018.
[35] K. Xu, Y. Li, R. H. Deng, and K. Chen, “DeepRefiner: Multi-layer Android Malware Detection
System Applying Deep Neural Networks,” 2018 IEEE Eur. Symp. Secur. Priv., pp. 473–487, 2018.
[36] D. Li, Z. Wang, and Y. Xue, “Fine-grained Android Malware Detection based on Deep Learning,”
2018 IEEE Conf. Commun. Netw. Secur., vol. 1, no. L, pp. 1–2, 2018.
[37] H. Alshahrani, H. Mansourt, S. Thorn, A. Alshehri, A. Alzahrani, and H. Fu, “DDefender: Android
application threat detection using static and dynamic analysis,” 2018 IEEE Int. Conf. Consum.
Electron., pp. 1–6, 2018.
[38] G. Grieco, G.L. Grinblat, L. Uzal, S. Rawat, J. Feist, and L. Mounier, “Toward large-scale
vulnerability discovery using machine learning,” in 2016 Proceedings of the Sixth ACM Conference
on Data and Application Security and Privacy, pp. 85-96.
[39] J. Xin, Z. Wen, N. Shaozhang, and X. Yiming, “Analysis and Detection of Android App Privilege
Escalation Vulnerability Based on Machine Learning,” in 2018 International Conference on
Intelligent Information Hiding and Multimedia Signal Processing, pp. 117-123.
[40] L. Zhuo, G. Zhimin, and C. Cen, “Research on Android intent security detection based on machine
learning,” in 2017 4th International Conference on Information Science and Control Engineering
(ICISCE), pp. 569-574.
[41] A. Amin, A. Eldessouki, M.T. Magdy, N. Abdeen, H. Hindy, and I. Hegazy, “AndroShield:
Automated Android Applications Vulnerability Detection, a Hybrid Static and Dynamic Analysis
Approach,” Information, vol. 10(10), p. 326, 2019.
[42] D. Geneiatakis, G. Baldini, I.N. Fovino, and I. Vakalis, “Towards a mobile malware detection
framework with the support of machine learning,” in 2018 International ISCIS Security Workshop,
pp. 119-129.
