
Chapter 6

Feature-Based Semi-supervised Learning to Detect Malware from Android

Arvind Mahindru and A. L. Sangal

Abstract Malware is potentially as harmful to the Android operating system as it is to desktop operating systems. With the exponential growth of Android devices, we observe that Android malware is also growing day by day and poses a serious security threat to users' privacy. Previously developed frameworks and virus-protection software are capable of detecting only "known" malware. In previous studies, researchers applied distinct supervised machine-learning approaches to detect "unknown" malware, but practicality is far from being achieved because these approaches need a wide range of labeled data for training. In this work, we present a unique procedure to detect malware by employing a renowned semi-supervised learning technique. The approach presented in this chapter helps us select the best features by applying feature sub-set selection methods and to establish a malware detection model. We performed an empirical validation to demonstrate that semi-supervised machine-learning techniques sustain accuracy rates as high as the supervised machine-learning techniques used in the literature.

Keywords Android apps · Permissions model · API calls · LLGC · Feature selection · Intrusion detection

A. Mahindru (B) · A. L. Sangal
Department of Computer Science and Engineering, Dr. B.R. Ambedkar National Institute
of Technology, Jalandhar 144001, India
e-mail: sangalal@nitj.ac.in

A. Mahindru
Department of Computer Science and Applications, D.A.V. University, Sarmastpur,
Jalandhar 144001, India

© Springer Nature Switzerland AG 2020
S. C. Satapathy et al., Automated Software Engineering: A Deep
Learning-Based Approach, Learning and Analytics in Intelligent Systems 8,
https://doi.org/10.1007/978-3-030-38006-9_6

6.1 Introduction

Android has conquered the smartphone industry, acquiring 87.7% of the market share by the end of the second quarter of 2019 [1]. Android's official app store, i.e., Google Play, had 3.7 million apps available at the end of December 2018 [2], whereas the total app download count reached 205.4 billion by the end of 2018 [3]. The popularity of Android apps invites cybercriminals who view it as a lucrative target. According to the McAfee 2018 mobile threat report [4], 16 million malware samples were found in the Google Play store, double the number from previous years. More than 4000 mobile threats are present in Android [5]. According to Google Play Protect [6], it secures more than 2 billion devices on a daily basis. However, the report published by McAfee in 2018 [4] contradicts this claim: Google Play Protect failed to protect against the most common malware threats. Research carried out in this direction found that 4,964,460 devices were infected by pre-installed apps in the first quarter of 2018 [6].
Like desktop operating systems, mobile phones have anti-malware software too. The efficiency of anti-malware software depends upon a signature-based approach, which follows the concept of a unique sequence of bytes that is constantly present inside malware-infected software. The critical problem with this process is that it fails to find new malware: a malware analyst must wait for new malware to appear in the market before a signature file can be generated and a solution provided to users. This approach is effective only when a significant number of new malware signatures are present in the database.
A machine-learning approach is generally followed to overcome the problem of the signature-based approach and to detect unknown malware in Android [7]. Machine-learning-based approaches train classification algorithms on data sets composed of several characteristics or features that come from both malicious and benign apps. In a machine-learning approach, feature selection is the main step that determines the accuracy rate of the classifier.
In previous studies, researchers applied various supervised machine-learning techniques [8–11] to predict whether an app is infected with malware or not. Supervised machine-learning techniques need a significant amount of labeled data for each family. It is very hard to collect a considerable amount of labeled data from the real world, such as known malicious Android apps. Gathering labeled data for both classes is a time-consuming process, and in such a process some malware apps can evade detection.
From the research done so far [8–11], it has been observed that supervised learning assumes the testing data is distributed similarly to the training data; the performance of supervised machine-learning algorithms degrades when they are tested on samples that are not distributed identically to the training data. Therefore, to overcome the problems faced by supervised techniques, a semi-supervised machine-learning technique is helpful, in which a limited amount of labeled data is available for both classes. Semi-supervised machine-learning techniques are trained with the help of a supervised classifier using labeled data and then detect the label for each unlabeled

Fig. 6.1 Flow chart of the proposed Android malware detection approach

instance. These techniques help enhance accuracy even when most of the data set is unlabeled.
In this work, a malware detection approach is developed based on the principle of semi-supervised machine learning. We apply LLGC (Learning with Local and Global Consistency) [12, 13] to our collected data set, which consists of the dynamic behavior of Android apps. The unique and novel contributions of this study are given below:
– To develop a malware detection model by using distinct feature sets that are suitable for all categories of Android apps.
– To demonstrate that a semi-supervised machine-learning approach is as good as a supervised machine-learning approach.
The steps we followed in building a capable Android malware detection model are demonstrated in Fig. 6.1. To build an effective malware detection model, we collect Android application packages (.apk) from the different sources mentioned in Sect. 6.3. Next, it is important to identify the class (i.e., benign or malware) of each .apk file. Further, features are extracted from the collected .apk files by using tools available in the literature, and these collected features form our data set. Next, the right set of features is chosen by implementing feature sub-set selection methods. These selected features are used as input to build a model using a semi-supervised machine-learning approach. Finally, the developed model is validated with the proposed detection framework to check its capability to detect malware from real-world apps. The rest of the chapter is arranged as follows. In Sect. 6.2, we discuss previously developed Android malware detection models. In Sect. 6.3, we present the description of the data set. In Sect. 6.4, we present the feature sub-set selection methods. Section 6.5 explains the semi-supervised machine-learning classifier. Section 6.6 provides the proposed detection framework. Section 6.7 provides the performance evaluation parameters. Sections 6.8 and 6.9 present the experimental setup, experimental results, and the conclusion of this chapter with future scope.

6.2 Related Work

The combination of artificial intelligence and statistics gives machine learning its foundation of probabilistic models and data-driven parameter estimation [13]. Machine-learning techniques can be categorized into three distinct learning principles, viz. supervised, unsupervised, and semi-supervised [7]. Table 6.1 shows the

Table 6.1 Existing developed frameworks or approaches

Approach/framework | Description | Findings
AndroSimilar [21] | Uses a signature-based approach to detect unknown malware | Achieves an accuracy of 60%
DroidAnalytics [24] | Detects malware in Android apps at three levels | 2494 malware samples detected from 102 families
Sato et al. [27] | Lightweight malware detection mechanism | Achieves 90% accuracy
Huang et al. [30] | After retrieving the features, labels the apps as benign or malware | Achieves an accuracy of 81%
PUMA [9] | Uses ML techniques to analyze permissions from the app | 80% accuracy
Androguard [40] | Disassembles and decompiles Android apps | Calculates normalized compression distance among method pairs
DREBIN [33] | Capable of detecting malware | An accuracy of 94% is achieved
Kang et al. [41] | Incorporates the creator's information as a feature and classifies malicious applications into similar groups | Shows detection and classification performance with 98 and 90% accuracy
Octeau et al. [42] | Uses the principle of communication between applications | 636 million ICC relationships in a corpus of 11,267 apps took 30 min
CrowDroid [34] | Anomaly detection | Achieves accuracy between 85 and 100% depending upon the malware type
AntiMalDroid [35] | Monitors the behavior of the applications | High detection rates
Andromaly [10] | Anomaly detection approach | Achieves accuracy between 80 and 90%
DroidScope [37] | Emulation-based technique | Accuracy depends upon the quality of the features
STREAM [43] | Enables rapid, large-scale validation | Shows detection rates from 68.75 to 81.25%
DroidAPIMiner [31] | Extracts permission features at the API level and evaluates distinct classifiers utilizing these features | Achieves an accuracy of 99%
DroidDolphin [44] | Leverages the technologies of GUI-based testing and machine learning for Android applications | Achieves an accuracy of 86.1%
Sheen et al. [45] | Monitors permission-based features and API-call-based features | Precision and recall ranging between 83 and 95%
MODroid [46] | Behavioral-based malware detection technique | The detection rate of malware is 60.16% with 39.43% false positives
Vinayakumar et al. [47] | Uses network parameters on all extracted features | Android malware detection of 0.939 on dynamic analysis and 0.975 on static analysis

existing frameworks and approaches developed in the past. The first column of Table 6.1 gives the name of the framework or approach developed in the literature. The second column describes the principle followed by the researchers to detect malware in Android apps. The third column presents the findings of the previously developed approaches.
There are two techniques utilized to identify malware in Android, i.e., static and dynamic. The static technique involves examining and disassembling the code to validate its function and helps us evaluate apps without running them [7]. Static detection methods are further divided into three parts, i.e., permission-based, signature-based, and Dalvik-based. The dynamic technique is based on the principle that detection of Android malware takes place while the app is executing. The dynamic technique is further segregated into three methods based on the principles of taint analysis, anomaly detection, and emulation [7].
In supervised learning, we train a classifier with the help of labeled classes and, after it learns from the features, test it on the remaining data set. Traditional methods include Support Vector Machines [14, 15], Decision Trees [11, 16], Nearest Neighbor [17], Naïve Bayes [11], Random Forest [11], etc. In unsupervised learning, we train with unlabeled classes and, after learning from the features, test on the remaining data set. Traditional methods include K-means [18], Model-Based Clustering [19], Hierarchical Clustering [20], etc. Semi-supervised learning lies between unsupervised and supervised learning: we train with a small amount of labeled data together with a large amount of unlabeled data, and testing is performed using both the labeled and unlabeled data. Typical methods include hidden Markov models, low-density separation, and so on.
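As an illustration of the semi-supervised setting (not the chapter's exact experimental setup), the sketch below trains scikit-learn's LabelSpreading, an implementation of the Zhou et al. consistency method that underlies LLGC, on synthetic data in which about 90% of the labels are hidden; the data, split ratio, and `gamma` value are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

# Synthetic stand-in for a permission/API-call feature matrix:
# 200 apps, 20 features, two classes (0 = benign, 1 = malware).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Hide most labels: semi-supervised estimators mark unlabeled points with -1.
rng = np.random.RandomState(0)
y_train = y.copy()
unlabeled = rng.rand(len(y)) < 0.9       # keep only ~10% of the labels
y_train[unlabeled] = -1

# LabelSpreading implements the consistency method of Zhou et al. (LLGC family).
model = LabelSpreading(kernel="rbf", gamma=0.25, alpha=0.2)
model.fit(X, y_train)

# Transductive labels are inferred for every point, including unlabeled ones.
acc = (model.transduction_[unlabeled] == y[unlabeled]).mean()
print(f"transductive accuracy on unlabeled apps: {acc:.2f}")
```

The key point is that only the small labeled fraction plus the geometry of the unlabeled points drives the final labeling.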
Faruki et al. [21] proposed AndroSimilar, which identifies unknown malware on the basis of signature length; signatures are compared with those already existing in its database to detect malware apps. Felt et al. [22] developed a tool to examine whether Android apps are over-privileged or not; the authors implemented their proposed approach on a collection of distinct Android apps and concluded that 33% of apps were over-privileged. Tang et al. [23] developed a model that works on the principle of security distance, built on the key idea that if an app demands more than one feature, i.e., permission, within a time window, it raises a security issue for Android-based devices. DroidAnalytics [24] utilizes the signature of the app together with API calls to determine malware apps.
Wognsen et al. [25] developed an initial formalized version of Dalvik byte-code based on Java's reflective features; the developed technique is utilized to determine malware by employing data-flow analysis. PUMA [9] obtained an accuracy of 80% by utilizing machine-learning techniques with the help of extracted permissions to detect malware in Android apps. KIRIN [26] is a lightweight certificate-based framework utilized at the time of execution; an app is considered to have malicious behavior if it is unable to clear all of its security checks. Sato et al. [27] suggested a lightweight approach for malware detection that investigates the Android "Manifest.xml" file; it matches the extracted features against the manifest file, achieves an accuracy of 90%, and calculates a score to evaluate whether the app is malware or not.

DroidMat [28] is based on the extraction of data from the "Manifest.xml" file of Android. To strengthen the performance of the machine-learning classifier, the authors applied the K-means clustering algorithm in addition to the K-nearest-neighbors algorithm on the collected data set. Zhou et al. [29] developed DroidMOSS, an approach that evaluates apps on the principle of analogy; a fuzzy hashing technique is utilized to discover modifications made to an app by re-packaging. This framework is restricted to a limited number of malware samples. Huang et al. [30] were able to identify 81% of malicious apps by implementing machine-learning algorithms that work on the rule of labeling.
Aafer et al. [31] present DroidAPIMiner, which combines permissions established on the principle of behavioral footprints and implements a filtering mechanism to discover the existence of malware in Android apps. The authors achieved an accuracy of 99% by using API-level features to distinguish malware and benign apps. ComDroid [32] detects app-communication vulnerabilities.
DREBIN is a lightweight approach proposed by Arp et al. [33] that discovers malicious apps by utilizing a joint vector space. DREBIN achieves a performance of 94% with some false alarms by using machine-learning techniques to discover malware apps. CrowDroid, proposed by Burguera et al. [34], utilizes the behavior of Android apps to discover malware by employing unsupervised machine-learning techniques, with outcomes stored at a server.
Zhao et al. [35] proposed AntiMalDroid, which relies on the behavior of apps to detect whether an app is malware or benign. AntiMalDroid works on the principle of signature comparison to identify whether an app belongs to the benign or malware category. Enck et al. [36] developed TaintDroid, which is based on real-time analysis; it follows various sources of important information and recognizes data leakage. Shabtai et al. [10] proposed Andromaly, which is based on machine-learning techniques to monitor Android devices and identify whether an app belongs to the benign or malware category.
Yan et al. [37] developed DroidScope, which operates on an Android device and assists in custom analysis and the identification of privilege-based attacks. Feng et al. [38] proposed Apposcopy, having the characteristics of static analysis, taint analysis, and inter-component call graphs, which successfully identifies malware apps. Narayanan et al. [39] developed a scalable, context-aware adaptive Android malware detector which is able to identify all kinds of malicious app behavior and is adaptive to evolving malware.
Earlier, researchers presented feature-selection techniques for detecting malware in real-world apps. Table 6.2 highlights the research conducted by distinct authors to choose the best features used to develop a model for malware detection in real-world apps.

Table 6.2 Feature selection techniques used in the literature

Approach/Author | Feature selection technique used
Andromaly [10] | Fisher score, Chi-square, and information gain
Mas'ud et al. [48] | Information gain and Chi-square
MKLDroid [49] | Chi-square
Allix et al. [50] | Information gain
Azmoodeh et al. [51] | Information gain

6.2.1 Research Questions

An experiment is performed to find out the performance of malware detection approaches by utilizing the proposed detection framework. This work also emphasizes identifying the optimal number of features to detect whether an app is benign or malware. We address the following research questions in this chapter:

RQ1: Is it feasible to detect malware in Android apps by utilizing a semi-supervised machine-learning technique?
With the help of this question, we examine the performance of LLGC in detecting malware in Android apps. In this study, the LLGC semi-supervised machine-learning classifier is considered for building a model that takes a set of features as input and is able to detect whether an app is benign or malware.

RQ2: Do the feature sub-set selection approaches have any impact on the performance of the semi-supervised machine-learning classifier?
It is noticed that certain feature sub-set selection approaches work very well with certain classification techniques. Therefore, in this work, four distinct sub-set selection approaches are evaluated by utilizing LLGC as the classifier.

RQ3: Which feature sub-set selection approach works best for the task of detecting malware in Android apps?
This question helps us choose the best features by applying feature sub-set selection methods to our collected data set. Further, the features selected by a feature sub-set selection method are utilized to develop a model to detect whether an app is benign or malware.

RQ4: Does a selected set of features perform better than considering all features for the task of detecting whether an app is benign or malware?
In this research question, our objective is to select the best set of features by applying feature sub-set selection methods that help us differentiate between benign and malware apps.

6.3 Data Set Description

Earlier frameworks/approaches reviewed only a small number of Android apps to investigate the relationships between malware apps and sets of features. Therefore, it is not feasible to make decisions from them that are applicable to all categories of Android apps and systems. In this study, thirty different categories of Android apps are investigated to generalize and strengthen our outcomes. We collected the experimental data set for our study from promise repositories. We collected 300,000 .apk files from Google's play store,1 pandaapp,2 gfan,3 hiapk,4 Android,5 appchina,6 mumayi7 and slideme.8 Among these, 275,000 are distinct. These applications were collected after removing viruses reported by VirusTotal9 and Microsoft Windows Defender.10 VirusTotal assists us in identifying malware via antivirus engines and contains over 60 antivirus products. A total of 35,000 malware samples were collected from three different datasets [52–54]. In [52], Kadir et al. introduced an Android sample set of 1929 botnets, consisting of 14 different botnet families. The Android Malware Genome project [53] contains a collection of 1200 malware samples that cover most of the present Android malware families. We collected about 17,871 samples from AndroMalShare [54] along with their package names. After removing duplicate packages from the collected dataset, we have 25,000 different malware samples left in our study. Both malware and benign applications were collected from the sources mentioned above until March 2019. Table 6.3 gives the category of Android app with the number of samples used in our study.
Next, to develop a model for malware detection, we extract features from the collected .apk files. In our study, we use Android Studio as an emulator, extract features (i.e., permissions and API calls) by writing code in the Java language, and collect features from the collected Android apps. Further, the extracted features are divided into thirty different sets belonging to their categories. Table 6.4 shows the formulation of the different sets, which contain information about the features (i.e., permissions, API calls, number of users downloading the app, and the rating of the app).

1 https://play.google.com/store?hl=en.
2 http://android.pandaapp.com/.
3 http://www.gfan.com/.
4 http://www.hiapk.com/.
5 http://andrdoid.d.cn/.
6 http://www.appchina.com/.
7 http://www.mumayi.com/.
8 http://slideme.org/.
9 https://www.virustotal.com/.
10 https://www.microsoft.com/en-in/windows/comprehensive-security.
Table 6.3 Categories of .apk files belonging to their respective families (.apk)
ID Category N T Ba W Bo S
D1 Arcade and action (AA) 8291 440 100 204 130 600
D2 Books and reference (BR) 8235 200 250 56 150 150
D3 Brain and puzzle (BP) 4928 820 54 28 50 50
D4 Business (BU) 8308 152 150 150 22 22
D5 Cards and casino (CC) 8886 76 65 81 100 44
D6 Casual (CA) 8010 321 69 46 150 140
D7 Comics (CO) 8667 65 95 35 3 0
D8 Communication (COM) 18,414 250 50 500 3 3
D9 Education (ED) 8744 560 68 50 50 68
D10 Entertainment (EN) 19,222 500 500 500 100 42
D11 Finance (FI) 7999 50 200 99 65 92
D12 Health and fitness (HF) 8551 98 65 45 140 140
D13 Libraries and demo (LD) 8655 70 100 100 6 500
D14 Lifestyle (LS) 7650 155 200 100 193 192

D15 Media and video (MV) 8019 100 123 162 450 71
D16 Medical (ME) 6000 12 13 12 24 25
D17 Music and audio (MA) 8621 65 100 65 165 165
D18 News and magazines (NM) 8164 100 100 100 100 32
D19 Personalization (PE) 9334 500 42 500 200 22
D20 Photography (PH) 9133 100 120 50 96 500
D21 Productivity (PR) 9850 100 516 250 250 62
D22 Racing (RA) 9766 50 100 210 100 180
D23 Shopping (SH) 9673 100 100 120 150 50
D24 Social (SO) 6159 100 50 210 150 150
D25 Sports (SP) 9669 100 240 100 450 112
D26 Sports games (SG) 9889 100 145 145 650 198
D27 Tools (TO) 8346 120 500 550 475 563
D28 Transportation (TR) 8796 2 500 100 100 20
D29 Travel and local (TL) 9180 500 220 150 48 100
D30 Weather (WR) 9841 120 23 700 50 25

Here "N" stands for "Normal", "T" for "Trojan", "Ba" for "Backdoor", "W" for "Worm", "Bo" for "Botnet", and "S" for "Spyware"

Table 6.4 Formulation of sets (containing permissions, API calls, number of users downloading the app, and rating of the apps)
Set No. Description Set No. Description
S1 SYNCHRONIZATION _DATA S2 CONTACT_INFORMATION
S3 PHONE_STATE and S4 AUDIO and VIDEO
PHONE_CONNECTION
S5 SYSTEM_SETTINGS S6 BROWSER_INFORMATION
S7 BUNDLE S8 LOG_FILE
S9 LOCATION_INFORMATION S10 WIDGET
S11 CALENDAR_INFORMATION S12 ACCOUNT_SETTINGS
S13 DATABASE_INFORMATION S14 IMAGE
S15 UNIQUE_IDENTIFIER S16 FILE_INFORMATION
S17 SMS_MMS S18 READ
S19 ACCESS_ACTION S20 READ_AND_WRITE
S21 YOUR_ACCOUNTS S22 STORAGE_FILE
S23 SERVICES_THAT_COST_YOU_MONEY S24 PHONE_CALLS
S25 SYSTEM_TOOLS S26 NETWORK_INFORMATION and
BLUETOOTH_INFORMATION
S27 HARDWARE_CONTROLS S28 DEFAULT GROUP
S29 API CALLS S30 RATING and USER DOWNLOADS
THE APP
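To make the formulation concrete, each app can be encoded as a binary vector over a fixed feature vocabulary (the real vectors span the thirty sets of Table 6.4: permissions, API calls, downloads, and rating). The sketch below uses a small hypothetical vocabulary purely for illustration.

```python
# Hypothetical feature vocabulary; real vocabularies come from the extracted
# permissions and API calls described in Sect. 6.3.
FEATURES = [
    "android.permission.INTERNET",
    "android.permission.READ_SMS",
    "android.permission.SEND_SMS",
    "android.permission.ACCESS_FINE_LOCATION",
    "api.sendTextMessage",
]

def encode(app_features):
    """Map an app's extracted permission/API-call strings to a 0/1 vector."""
    present = set(app_features)
    return [1 if f in present else 0 for f in FEATURES]

benign = encode(["android.permission.INTERNET"])
suspicious = encode(["android.permission.INTERNET",
                     "android.permission.READ_SMS",
                     "api.sendTextMessage"])
print(benign)      # [1, 0, 0, 0, 0]
print(suspicious)  # [1, 1, 0, 0, 1]
```

Rows of such vectors, one per app, form the data set on which the feature sub-set selection methods of Sect. 6.4 operate.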

6.4 Feature Sub-set Selection Approaches

It is essential to choose an appropriate set of features for the data-preprocessing task in machine learning. On the basis of Table 6.2, it is seen that in previous studies researchers applied distinct feature-ranking methods to choose the best features to detect malware in Android apps. In this chapter, four distinct kinds of feature sub-set selection approaches are implemented on thirty distinct categories of Android apps to discover the best set of features, which supports us in detecting malware with a better detection rate and also minimizes misclassification errors. The subsequent subsections describe the distinct feature sub-set selection approaches used to discover a small group of features, out of all possible features, that jointly have excellent detection capability.

6.4.1 Consistency Sub-set Evaluation Approach

The consistency sub-set evaluation approach ranks the importance of a sub-set of attributes according to the degree of consistency in the class values once the training samples are projected onto that sub-set of attributes [55]. The consistency rate is measured using the inconsistency rate, where two data points are considered inconsistent if they have the same feature values but distinct class labels (i.e., benign or malware). For this work, the target variable, i.e., the app, has two distinct values (0 for benign apps and 1 for malware apps). Suppose a group of features (GF) has Z samples in total and z distinct instances (feature-value patterns) such that Z = X_1 + X_2 + ... + X_z. Instance X_i appears in A samples in all, of which A_0 samples are labeled 0 and A_1 samples are labeled 1, with A = A_0 + A_1. If A_1 is less than A_0, then the inconsistency count for instance X_i is INC_i = A - A_0. The inconsistency rate (INCR) of a feature set is computed using the following equation:

INCR = ( Σ_{i=1}^{z} INC_i ) / Z                (6.1)
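Equation (6.1) can be computed directly by grouping samples by their feature-value pattern; a minimal sketch on toy data (not the chapter's data set):

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels):
    """Eq. (6.1): for each distinct feature pattern, count its samples minus
    the majority-class count; the rate is the total divided by Z."""
    groups = defaultdict(list)
    for row, lab in zip(rows, labels):
        groups[tuple(row)].append(lab)
    inc = sum(len(labs) - max(Counter(labs).values())
              for labs in groups.values())
    return inc / len(rows)

# Toy data: pattern (1, 0) appears three times with labels 0, 0, 1, so one
# sample is inconsistent; patterns (0, 1) and (1, 1) are fully consistent.
rows = [(1, 0), (1, 0), (1, 0), (0, 1), (1, 1)]
labels = [0, 0, 1, 1, 1]
print(inconsistency_rate(rows, labels))  # 1/5 = 0.2
```

A candidate feature sub-set with a lower inconsistency rate preserves more of the class information.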

6.4.2 Filtered Sub-set Evaluation

Filtered sub-set evaluation is based on the principle of applying a random sub-set evaluator to data that has been passed through an arbitrary filtering approach [56]. The filtering technique does not depend on any learning-induction algorithm. The filtered sub-set evaluation approach is fast and scalable.

6.4.3 Rough Set Analysis Approach

The rough-set analysis (RSA) method is based on approximating a conventional crisp set11 in terms of a pair of sets, which give the upper and the lower approximation of the original set [57]. This formal approximation depicts the upper and lower limits of the original set. The rough-set analysis approach creates an apparent information model by decreasing the "degree of precision" [58]. We use RSA to search for a reduced set of features. RSA uses three distinct kinds of notions: approximations, reduced attributes, and the information system.

i. Approximation: Let A = (C, Z), X ⊆ Z and Y ⊆ C. The X-upper (X̄Y) and X-lower (X̲Y) approximations of X are utilized to estimate Y. The upper approximation includes all the objects which may be part of the set, and the lower approximation includes all objects which certainly are part of the set. X̄Y and X̲Y are computed using the following equations:

11 https://en.wikipedia.org/wiki/Rough_set.

X̄Y = { y_i ∈ U | [y_i]_{Ind(B)} ∩ Y ≠ ∅ }                (6.2)

X̲Y = { y_i ∈ U | [y_i]_{Ind(B)} ⊆ Y }                (6.3)

where [y_i]_{Ind(B)} denotes the equivalence class of y_i under the relation Ind(B).


ii. Reduced attributes: The accuracy of approximation of the set Z (Acc(Z)) with respect to the attribute set B is determined as:

μ_B(Z) = card(B̲Z) / card(B̄Z)                (6.4)

where card() denotes the number of elements in the lower or upper approximation of the set Z. All feasible attribute groups are chosen whose accuracy is equivalent to the precision of the universal set.
iii. Information system: It is defined as Z = (C, B), where C is a universe consisting of a non-empty finite set of objects and B is a finite set of attributes. There exists a corresponding function f_b : C → V_b for every b ∈ B, where V_b is the set of values of attribute b. For every attribute group Z ⊂ B, there exists a related equivalence relation named the B-indiscernibility relation (Ind(Z)). Ind(Z) is determined in the following way:

IND_A(Z) = { (x, y) ∈ A² | ∀a ∈ Z, a(x) = a(y) }.                (6.5)
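The lower and upper approximations of Eqs. (6.2)–(6.3) can be illustrated with a tiny example (toy objects and partition, not the chapter's notation): each indiscernibility class either lies fully inside the target concept (lower) or merely intersects it (upper).

```python
def approximations(ind_classes, target):
    """Lower/upper approximation of a target set under an indiscernibility
    partition: lower = classes fully contained in the target,
    upper = classes that intersect it at all (Eqs. 6.2-6.3)."""
    target = set(target)
    lower, upper = set(), set()
    for cls in ind_classes:
        cls = set(cls)
        if cls <= target:      # certainly in the concept
            lower |= cls
        if cls & target:       # possibly in the concept
            upper |= cls
    return lower, upper

# Toy universe of six apps partitioned by identical feature values.
partition = [{1, 2}, {3, 4}, {5, 6}]
malware = {1, 2, 3}                    # target concept
lo, up = approximations(partition, malware)
print(sorted(lo))   # [1, 2]        -- certainly malware
print(sorted(up))   # [1, 2, 3, 4]  -- possibly malware
```

The accuracy of approximation of Eq. (6.4) for this example is card(lower)/card(upper) = 2/4.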

6.4.4 Feature Sub-set Selection Approach Based on Correlation

The correlation-based feature sub-set selection approach chooses a sub-set of features that are particularly related to the class (i.e., benign or malware). For this work, Pearson's correlation coefficient (R) is studied in order to search for dependence among sets of features. A high correlation coefficient among a group of features signifies a persistent structural connection among those features. Specifically, it implies that, although the features measure distinct aspects of the class structure, there exists a significant statistical reason to believe that classes with lower (or higher) values of one feature also have lower (or higher) values of the other highly correlated features.
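A minimal sketch of the relevance half of this idea, ranking features by the absolute Pearson correlation with the class label on toy data (a full correlation-based selector would also penalize features correlated with each other):

```python
import numpy as np

def pearson_relevance(X, y):
    """Rank features by |Pearson r| with the class label
    (0 = benign, 1 = malware); higher |r| means more class-relevant."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    r = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return np.argsort(r)[::-1], r   # indices from most to least relevant

# Feature 0 tracks the label perfectly; feature 1 is pure noise.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
order, scores = pearson_relevance(X, y)
print(order[0])  # 0 -- the label-aligned feature ranks first
```

In the sub-set variant, features highly correlated with already-selected features would then be dropped as redundant.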

6.5 Machine Learning Classifiers

In our study, we apply LLGC, a semi-supervised machine-learning method, which trains with the help of a few labeled instances and an enormous number of unlabeled instances, and tests on the unlabeled data [12, 59]. LLGC is an iterative algorithm that works on the following assumptions: (1) nearby points are likely to have the same label, and (2) points on the same structure are likely to have the same label.
The LLGC algorithm [12, 59] is described as follows:

1. Build the affinity matrix A, where A_{ij} = exp( -||x_i - x_j||² / (2·σ²) ) if i ≠ j and A_{ii} = 0;
2. Construct the matrix S = D^{-1/2} · A · D^{-1/2}, where D is the diagonal matrix whose (i, i)-th entry equals the sum of the i-th row of A;
3. Iterate F(t + 1) = α · S · F(t) + (1 - α) · Y until convergence, where α ∈ (0, 1) and Y is the initial label matrix;
4. Let F* be the limit of the sequence {F(t)}; label each point x_i as y_i = argmax_{j ≤ c} F*_{ij}.

Algorithm 1: LLGC approach

Let χ = {x_1, x_2, ..., x_l, x_{l+1}, ..., x_n} ⊂ R^m be the set of samples and L = {1, 2, ..., c} be the set of class labels (here the set of labels is composed of two classes, i.e., malware and benign). The first l samples are labeled, and x_a (l + 1 ≤ a ≤ n) denotes the unlabeled instances. The aim of LLGC is to detect the class of the unlabeled samples [12, 59]. Let F denote the set of n × c matrices with non-negative entries; a matrix F = [F_1^T, ..., F_n^T]^T ∈ F corresponds to a classification of the data set χ in which every sample x_i is given the label y_i = argmax_{j ≤ c} F_{i,j}. F can also be viewed as a vectorial function F : χ → R^c that assigns a vector F_i to each sample x_i. Y is an n × c matrix with Y ∈ F such that Y_{i,j} = 1 when x_i is labeled as y_i = j and Y_{i,j} = 0 otherwise.
LLGC first defines a pairwise affinity matrix A on the data set χ and sets its diagonal elements to 0. This corresponds to a graph G = (V, E) defined over χ, in which the vertex set V is the set of samples and the edge set E is weighted by A [12, 59]. LLGC then normalizes the affinity matrix of G symmetrically; this step ensures the convergence of the iteration. After every iteration, every sample gathers knowledge from its neighboring instances while retaining its primary information. The parameter α specifies the relative amount of knowledge coming from the neighboring instances versus the primary knowledge of each instance [12, 60]. Further, knowledge is spread symmetrically, since S is a symmetric matrix. In the last step, the LLGC approach assigns every unlabeled sample to the class from which it has received the most knowledge during the iteration process [12, 59].
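A compact numerical sketch of this iteration (with assumed parameter values and toy data, not the chapter's implementation):

```python
import numpy as np

def llgc(X, y, alpha=0.99, sigma=1.0, iters=200):
    """Learning with Local and Global Consistency (Zhou et al.), minimal
    sketch; y uses -1 for unlabeled points, returns a label for every point."""
    X = np.asarray(X, float)
    n = len(X)
    # Affinity matrix with zero diagonal: A_ij = exp(-||xi - xj||^2 / 2s^2).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalization S = D^(-1/2) W D^(-1/2).
    Dm = np.diag(1.0 / np.sqrt(W.sum(1)))
    S = Dm @ W @ Dm
    # One-hot initial label matrix Y; unlabeled rows stay zero.
    classes = sorted(set(y) - {-1})
    Y = np.zeros((n, len(classes)))
    for i, lab in enumerate(y):
        if lab != -1:
            Y[i, classes.index(lab)] = 1.0
    # Iterate F(t+1) = alpha * S * F(t) + (1 - alpha) * Y.
    F = Y.copy()
    for _ in range(iters):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return [classes[j] for j in F.argmax(1)]

# Two tight clusters, one labeled point each (0 = benign, 1 = malware).
X = [[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]]
y = [0, -1, -1, 1, -1, -1]
print(llgc(X, y))  # [0, 0, 0, 1, 1, 1]
```

Because the cross-cluster affinities are negligible, each labeled point's knowledge propagates only within its own cluster, which is exactly the "local and global consistency" behavior described above.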
Evaluation of a machine learning classifier is typically split into two subsequent
phases, i.e., training and testing. In the training phase, the classifier is trained on the
data set using the features selected by the different feature sub-set selection techniques.
These features are extracted from both benign and malware apps. In the testing
phase, the performance of the classifier is examined by using the selected performance
parameters (i.e., F-measure and Accuracy).
106 A. Mahindru and A. L. Sangal

6.6 Proposed Detection Framework

To examine the effectiveness of our proposed malware detection model, we compare
the results of our proposed model with two different techniques.
a. Comparison with AV scanners: To compare the outcome of our proposed model,
we select five different anti-virus scanners and compare their performance with
our proposed framework.
b. Comparison with previously used classifiers: To check the feasibility of our
proposed model, we compare parameters like F-measure and Accuracy with
other models built in the literature.

6.7 Evaluation of Parameters

This section provides the fundamental descriptions of the performance parameters uti-
lized for malware detection. Each of these performance parameters is calculated by
utilizing the confusion matrix, which includes the actual and detected classification
information produced by the detection approach. Table 6.5 gives the confusion matrix
for the malware detection model. For our work, two performance parameters, F-measure
and Accuracy, are utilized for evaluating the performance of malware detection methods.
F-measure and Accuracy can be measured by using Eqs. (6.6) and (6.7).

Accuracy = (N Benign→Benign + N Malware→Malware ) / N Total     (6.6)

where N Total denotes the total number of classified apps.

F-measure = (2 ∗ Precision ∗ Recall) / (Precision + Recall)
          = (2 ∗ N Malware→Malware ) / (2 ∗ N Malware→Malware + N Benign→Malware + N Malware→Benign )     (6.7)

Table 6.5 Confusion matrix to classify whether an Android app (.apk) is benign or malware
              Detected benign       Detected malware
Benign        Benign → Benign       Benign → Malware
Malware       Malware → Benign      Malware → Malware
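Equations (6.6) and (6.7) can be computed directly from the four cells of Table 6.5. A small sketch (the function and argument names are ours, for illustration):

```python
def metrics_from_confusion(n_bb, n_bm, n_mb, n_mm):
    """Accuracy and F-measure from the confusion matrix of Table 6.5,
    taking malware as the positive class.

    n_bb: benign detected as benign    n_bm: benign detected as malware
    n_mb: malware detected as benign   n_mm: malware detected as malware
    """
    total = n_bb + n_bm + n_mb + n_mm
    accuracy = (n_bb + n_mm) / total                 # Eq. (6.6)
    f_measure = 2 * n_mm / (2 * n_mm + n_bm + n_mb)  # Eq. (6.7)
    return accuracy, f_measure
```

For example, 50 correctly detected benign apps, 40 correctly detected malware apps, and 5 misclassifications on each side give an accuracy of 0.90.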
6 Feature-Based Semi-supervised Learning … 107

6.8 Experimental Setup

In this part of the chapter, the experimental set-up used to discover the efficacy of the
malware detection model built using the proposed detection framework is presented.
LLGC is utilized to build a model that detects whether an app is benign or malware.
These approaches are implemented on thirty different categories of Android apps, as
shown in Table 6.3. All these categories have a different percentage of benign and
malware apps. Figure 6.2 shows the proposed framework for malware detection.

The subsequent steps are followed while selecting a set of features to build the
malware detection model, which helps us to detect whether an app is benign or mal-
ware. Feature sub-set selection approaches are applied to thirty different categories
of Android apps. Consequently, a total of 150 [(4 feature sub-set selection approaches
+ 1 using all features) × 30 distinct Android app data sets × 1 detection
approach] different detection models have been built in this work.

1. In this work, four feature sub-set selection approaches are implemented on thirty
different categories of Android apps to choose the appropriate set of features for
malware detection.

2. The sub-sets of features achieved from the above step are considered as
input to the semi-supervised machine learning algorithm while building a model.
For the effectiveness of the Android malware detection model, we implemented the 20-
fold cross-validation technique. The effectiveness of all built malware detection
models is compared by utilizing two distinct performance parameters, namely
F-measure and Accuracy.

Fig. 6.2 Framework of proposed work



3. The effective model built from the above-mentioned two stages is used for validation
with the proposed malware detection framework.
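The cross-validated evaluation in step 2 can be sketched with scikit-learn, whose LabelSpreading estimator implements the Zhou et al. [12] algorithm. This is an illustrative sketch, not the chapter's MATLAB implementation; the kernel choice and α value are assumptions:

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score

def evaluate_llgc(X, y, n_splits=20):
    """k-fold evaluation of an LLGC-style model: in every fold the
    labels of the held-out apps are hidden (-1) during fitting and
    the transduced labels are scored against the true ones."""
    accs, f1s = [], []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    for _, test_idx in skf.split(X, y):
        y_masked = y.copy()
        y_masked[test_idx] = -1                       # hide test-fold labels
        model = LabelSpreading(kernel="knn", n_neighbors=7, alpha=0.2)
        model.fit(X, y_masked)
        pred = model.transduction_[test_idx]
        accs.append(accuracy_score(y[test_idx], pred))
        f1s.append(f1_score(y[test_idx], pred))
    return float(np.mean(accs)), float(np.mean(f1s))
```

Hiding the held-out labels rather than removing the rows keeps the affinity graph intact, which is the natural way to cross-validate a transductive method.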

6.9 Outcomes of the Experiment

This section of the chapter examines the relationship between distinct sets of features
and malware detection at the Android level. F-measure and Accuracy are used
as performance assessment parameters to compare the performance of the malware
detection models built by utilizing LLGC as the classifier. To depict the outcomes,
we use the abbreviations given in Table 6.6 in place of the authentic names.

6.9.1 Feature Sub-set Selection Approaches

In this work, four distinct kinds of feature sub-set selection approaches are im-
plemented on thirty data sets of Android apps one after another. Feature sub-set
selection approaches work on the hypothesis that selecting the best features from
the available set of features yields models with better accuracy and fewer misclas-
sification errors. Later, these isolated sub-sets of features are used as input for
building a model to detect whether an app is benign or malware. The features selected
by the distinct feature sub-set selection approaches are demonstrated in Fig. 6.3.
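As an illustration of how such approaches operate, a simplified correlation-based selection (in the spirit of FS1, not the exact algorithm used in this chapter) can be sketched as follows; the greedy merit heuristic and the redundancy penalty are our assumptions:

```python
import numpy as np

def correlation_feature_selection(X, y, k, redundancy_penalty=1.0):
    """Greedily pick k features that correlate strongly with the class
    label while correlating weakly with the features already chosen
    (a simplified CFS-style merit heuristic)."""
    n_features = X.shape[1]
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                          for j in range(n_features)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Mean absolute correlation with already-selected features
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                  for s in selected])
            score = relevance[j] - redundancy_penalty * redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

A duplicated feature is heavily penalized after its twin has been selected, so the method prefers a complementary feature over a redundant one.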

Table 6.6 Naming conventions for distinct approaches (.apk)


Abbreviation Corresponding name
DS Data set
FS1 Correlation best feature selection
FS2 Classifier subset evaluation
FS3 Filtered subset evaluation
FS4 Rough set analysis (RSA)
AF All extracted features

Fig. 6.3 Features selected by utilizing feature sub-set selection approaches: (a) classifier, (b) correlation, (c) filtered, (d) RSA

6.9.2 Machine Learning Classifier

In this study, a semi-supervised machine learning classifier has been used to
build a model to detect whether an app is benign or malware. Five sub-sets of features
(4 outcomes collected from the feature selection approaches + 1 using all
features) are considered as input to build a model for detecting malware from Android
apps. The hardware utilized to conduct this experiment is a Core i7 processor with a
1 TB hard disk and 8 GB RAM. Detection models are built in the
MATLAB environment. The effectiveness of each individual detection model is examined
by two performance parameters: F-measure and Accuracy.

Table 6.7 Measured accuracy by using LLGC as the machine learning classifier


Accuracy
ID AF FS1 FS2 FS3 FS4
D1 73.33 85.0 87.67 87.66 90
D2 70 82.08 86.27 82.66 91
D3 72 84.8 84.67 81.06 90.7
D4 72 81.08 82.27 81.60 98
D5 76 81 82 82 96
D6 71.8 82.08 81.27 81.66 94
D7 68 85 87 82 96
D8 61 72 78 81 99
D9 76 88 88 86 97
D10 71 81 82 81 98
D11 62 83 84 83 95
D12 70 81 82 82 95
D13 71 81 81 82 99
D14 61 77 80 82 99
D15 72 83 83 82 97
D16 71 80 87 86 98
D17 63 71 72 80 98
D18 72 84 81 80 89.9
D19 76 87 89 81 96
D20 72 86 86 84 89.8
D21 79 78 80 82 89.7
D22 71 86 87 89.7 90
D23 70 81 80.88 83.76 98.9
D24 71 80 80.8 82 97.2
D25 76 88 86 89 97.8
D26 74 87.7 86.9 86.6 93
D27 62 85 82 81 90.9
D28 76 88 86 82 89
D29 68 82 85 86 94.7
D30 76 82 84 87 94.8

Tables 6.7 and 6.8 show the performance values gained for distinct data sets by
utilizing LLGC as the classifier. On the basis of Tables 6.7 and 6.8, it can be inferred
that:
– In the case of LLGC, the malware detection model built from the set
of features selected by utilizing FS4, i.e., RSA, gained better outcomes compared to the
other feature sub-set selection approaches.

Table 6.8 Measured F-measure by using LLGC as the machine learning classifier


F-measure
ID AF FS1 FS2 FS3 FS4
D1 0.71 0.82 0.83 0.83 0.89
D2 0.68 0.82 0.84 0.83 0.89
D3 0.71 0.80 0.81 0.80 0.87
D4 0.73 0.80 0.82 0.81 0.9
D5 0.71 0.77 0.82 0.81 0.83
D6 0.78 0.82 0.84 0.83 0.94
D7 0.61 0.80 0.82 0.81 0.89
D8 0.61 0.80 0.82 0.80 0.92
D9 0.69 0.81 0.84 0.83 0.87
D10 0.52 0.81 0.80 0.82 0.86
D11 0.60 0.81 0.82 0.82 0.88
D12 0.62 0.80 0.81 0.81 0.89
D13 0.78 0.80 0.81 0.80 0.88
D14 0.72 0.82 0.81 0.80 0.93
D15 0.72 0.84 0.83 0.82 0.89
D16 0.62 0.81 0.80 0.80 0.9
D17 0.72 0.80 0.83 0.85 1
D18 0.73 0.81 0.83 0.86 0.97
D19 0.71 0.82 0.83 0.85 0.96
D20 0.72 0.86 0.83 0.82 0.89
D21 0.62 0.85 0.84 0.85 0.86
D22 0.62 0.83 0.84 0.85 0.89
D23 0.72 0.80 0.81 0.80 0.95
D24 0.70 0.81 0.82 0.86 0.9
D25 0.73 0.81 0.86 0.87 0.98
D26 0.70 0.81 0.82 0.83 0.85
D27 0.60 0.82 0.81 0.86 0.88
D28 0.78 0.82 0.85 0.82 0.89
D29 0.70 0.81 0.84 0.86 0.9
D30 0.68 0.81 0.82 0.85 0.94

In this chapter, one classifier and two evaluation parameters are used to
detect whether an app belongs to the benign or malware class. Figure 6.4 demonstrates
two box-plot diagrams, one for each of the cases, i.e., F-measure and Accuracy.
Every single figure contains five box-plots. The model having the higher median
value and fewer outliers is considered the superior model. On the basis of
these box-plot diagrams, we can analyze that:

Fig. 6.4 Box-plot diagrams of (a) accuracy and (b) F-measure

– Among all feature sub-set approaches, FS4 has achieved the highest median value with
fewer outliers. Based on the box-plots demonstrated in Fig. 6.4, FS4 produced the
best outcome, i.e., feature sub-set selection by utilizing RSA computes the best
set of features for detecting malware and benign apps and gives the best results
compared to the others.
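The two quantities used in this comparison, the median and the number of box-plot outliers (points beyond 1.5 × IQR of the box), can be computed as follows; this is a small illustrative helper, not part of the original experiment:

```python
import numpy as np

def boxplot_summary(scores):
    """Median and outlier count under the standard 1.5 * IQR whisker
    rule used by box-plots."""
    scores = np.asarray(scores)
    q1, median, q3 = np.percentile(scores, [25, 50, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(np.sum((scores < lower) | (scores > upper)))
    return median, n_outliers
```

Applying this helper to each column of Table 6.7 or 6.8 reproduces the ranking read off Fig. 6.4: prefer the approach with the higher median and fewer outliers.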

6.9.3 Comparison of Outcomes

A pair-wise t-test is utilized to identify whether some feature sub-set selection approaches
perform better than the others or all approaches work equally well.

Feature Sub-set Selection Approaches: For this work, we consider four distinct
feature sub-set selection approaches as input to build models on thirty distinct
categories of Android apps and consider two outcome parameters, i.e., F-measure
and Accuracy. Each feature sub-set selection approach thus yields two sets, each with
30 points (1 classifier × 30 data sets). Hence, t-tests among the distinct feature sub-
set selection approaches are carried out, and the respective P-values are examined to
measure the statistical importance. Figure 6.5 demonstrates the outcome of the t-test
study. The P-values are presented by utilizing two distinct symbols: a green circle for
P-value > 0.05 (no significant difference) and a red circle for P-value ≤ 0.05
(significant difference). On the basis of Fig. 6.5, it has been observed that a large
number of cells are filled with a green circle; this means that most of the applied
feature selection approaches do not differ significantly from each other. Still, FS4,
i.e., the set of features selected using RSA, gives better outcomes compared to the other
techniques.
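A pair-wise (paired) t-test of this kind can be sketched with SciPy; the accuracy vectors in the usage example are illustrative, not the chapter's measured values:

```python
import numpy as np
from scipy import stats

def compare_feature_selections(scores_a, scores_b, threshold=0.05):
    """Paired t-test between the per-data-set scores of two feature
    sub-set selection approaches (same 30 data sets for both).
    Returns the p-value and whether the difference is significant
    at the given threshold (a red circle in Fig. 6.5)."""
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    return p_value, p_value <= threshold
```

A paired test is appropriate here because both approaches are scored on the same thirty data sets, so the per-data-set differences are what carry the signal.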

Fig. 6.5 t-test analysis (p-value) for feature subset selection techniques

6.9.4 Evaluation of Proposed Framework Using Proposed Detection Framework

Comparison with AV scanners: To validate that our proposed framework is capable
of detecting malware from Android apps, we use five different anti-virus scanners avail-
able in the market. These anti-virus scanners work on a signature-based approach. For
this, we download 1000 free .apk files from different sources and use the AV scanners
to detect malware in them. The accuracy of the different AV scanners is mentioned in
Table 6.9. On the basis of Table 6.9, it is seen that our proposed approach (FS4 +
LLGC) is able to detect 97.8% of malware apps, whereas the various anti-virus scanners
are able to detect only 39–93% of malware apps.
Comparison with previously used classifiers: In this sub-section of the chapter, we
compare the performance of our built model with previously used classifiers. Table 6.10
shows the comparison with existing classifiers. From Table 6.10, we observe that
our proposed framework is capable of detecting 97.8% of malware from real-world apps,
which is higher than the previously used machine learning classifiers.

Table 6.9 Comparison with AV scanners


AV scanners Accuracy (%)
AV1 (Panda free antivirus) 86
AV2 (Avast free antivirus) 93
AV3 (Adaware antivirus free) 39
AV4 (Comodo antivirus 10) 88
AV5 (AVG antiVirus FREE) 92
Proposed approach (LLGC+ FS4) 97.8

Table 6.10 Comparison with previously used classifiers having full dataset
Name of the machine learning classifier Averaged accuracy (%)
SimpleLogistic [9] 84.08
BayesNet K2 [9] 82
BayesNet TAN [9] 68.51
RandomTree [9] 83.32
Our proposed model (LLGC+FS4) 97.8

6.9.5 Experimental Finding

This part of the chapter contains the overall findings of the empirical work done
so far. The empirical work was conducted on thirty different categories of Android
apps with features selected with the help of four distinct feature sub-set selection
techniques. Further, the selected features are trained with LLGC as the classifier, and
the performance is measured by using two effective performance parameters, i.e.,
F-measure and Accuracy.
On the basis of the empirical studies, this chapter is able to answer the following
research questions.

RQ1: For building a malware detection model, LLGC has been considered to de-
tect whether an app is benign or malware. On the basis of Tables 6.7 and 6.8, it can be
inferred that the model developed using LLGC with the set of features selected
by FS4 as input gives better outcomes compared to the others.

RQ2: To answer RQ2, Fig. 6.4 was examined, and it is noted that the
outcome obtained using the feature sub-set selection approaches varies with LLGC.
It indicates that the performance of LLGC for building a detection model to detect whether
an app is benign or malware is influenced by the feature sub-set selection approaches.

RQ3: In this work, four distinct kinds of feature sub-set selection approaches are
used to select smaller sub-sets of features. On the basis of the t-test study, it has
been analyzed that feature sub-set selection by utilizing FS4, i.e., the RSA approach,
produces the best outcomes compared to the others.

RQ4: To answer RQ4, Figs. 6.4 and 6.5 were analyzed, and we have seen that the
models developed by using the four different feature sub-set selection methods are more
capable of detecting malware than models considering all extracted features from Android
apps.

6.9.6 Conclusion

This study focused on building a malware detection framework and identifying the
efficiency of the malware detection model created by utilizing selected sets of
features. In this chapter, distinct sets of features are utilized to build models using
LLGC. The execution process was conducted on thirty different categories of
Android apps. The experiments were carried out and the outcomes generated in the MATLAB
environment.

The outcomes of this chapter are as follows:

– The outcomes of our empirical investigation indicate that it is possible to determine a
small set of features. The malware detection model built by utilizing this identified
set of features is able to detect benign and malware apps with better accuracy and
a lower value of misclassification error.
– On the basis of the empirical study, we have seen that even after reducing the avail-
able number of features by 60% on average, the outcomes were better.
In this work, the models built for malware detection only detect whether an app is
benign or malware. Further, the study can be extended to determine how many
features are needed to judge which category an app belongs to. Further, this study
can be replicated over other benchmarks based on soft computing models to
achieve better accuracy for malware detection.

References

1. https://www.statista.com/statistics/266136/global-market-share-held-by-smartphone-
operating-systems/
2. https://www.statista.com/statistics/266210/number-of-available-applications-in-the-google-
play-store/
3. https://www.statista.com/statistics/271644/worldwide-free-and-paid-mobile-app-store-
downloads/
4. https://www.mcafee.com/in/resources/reports/rp-mobile-threat-report-2018.pdf
5. https://source.android.com/security/reports/Google Android Security 2017 Report Final.pdf
6. https://thehackernews.com/2018/03/android-botnet-malware.html
7. I.H. Witten, E. Frank, M.A. Hall, C.J. Pal, Data Mining: Practical Machine Learning Tools
and techniques (Morgan Kaufmann, 2016)
8. J. Sahs, L. Khan, A machine learning approach to android malware detection, in 2012 European
Intelligence and Security Informatics Conference (IEEE, 2012), pp. 141–147
9. B. Sanz, I. Santos, C. Laorden, X. Ugarte-Pedrero, P. Garcia Bringas, G. Álvarez, Puma:
permission usage to detect malware in android, in International Joint Conference CISIS’12-
ICEUTE 12-SOCO 12 Special Sessions (Springer, Berlin, Heidelberg, 2013), pp. 289–298
10. A. Shabtai, U. Kanonov, Y. Elovici, C. Glezer, Y. Weiss, Andromaly: a behavioral malware
detection framework for android devices. J. Intell. Inf. Syst. 38(1), 161–190 (2012)
11. A. Mahindru, P. Singh, Dynamic permissions based android malware detection using machine
learning techniques, in Proceedings of the 10th Innovations in Software Engineering Confer-
ence (ACM, 2017), pp. 202–210
12. D. Zhou, O. Bousquet, T.N. Lal, J. Weston, B. Schölkopf, Learning with local and global
consistency, in Advances in Neural Information Processing Systems (2004) pp. 321–328
13. L. Chen, M. Zhang, C. Yang, R. Sahita, POSTER: semi-supervised classification for dynamic
android malware detection, in Proceedings of the 2017 ACM SIGSAC Conference on Computer
and Communications Security (ACM, 2017), pp. 2479–2481
14. C. Cortes, V. Vapnik, Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
15. J. Li, L. Sun, Q. Yan, Z. Li, W. Srisa-an, H. Ye, Significant permission identification for
machine-learning-based android malware detection. IEEE Trans. Ind. Inf. 14(7), 3216–3225
(2018)
16. A. Zulkifli, I.R.A. Hamid, W. Md Shah, Z. Abdullah, Android malware detection based on
network traffic using decision tree algorithm, in International Conference on Soft Computing
and Data Mining (Springer, Cham, 2018), pp. 485–494

17. W. Wang, M. Zhao, J. Wang, Effective android malware detection with a hybrid model based
on deep autoencoder and convolutional neural network. J. Ambient Intell. Humanized Comput.
10(8), 3035–3043 (2019)
18. Z. Aung, W. Zaw, Permission-based android malware detection. Int. J. Sci. Technol. Res. 2(3),
228–234 (2013)
19. L. Cen, C.S. Gates, L. Si, N. Li, A probabilistic discriminative model for android malware
detection with decompiled source code. IEEE Trans. Dependable Secure Comput. 12(4), 400–
412 (2014)
20. L. Weichselbaum, M. Neugschwandtner, M. Lindorfer, Y. Fratantonio, V. van der Veen, C.
Platzer, Andrubis: android malware under the magnifying glass. Vienna University of Tech-
nology, Tech. Rep. TR-ISECLAB-0414-001 (2014)
21. P. Faruki, V. Ganmoor, V. Laxmi, M.S. Gaur, A. Bharmal, AndroSimilar: robust statistical fea-
ture signature for Android malware detection, Proceedings of the 6th International Conference
on Security of Information and Networks (ACM, 2013), pp. 152–159
22. A.P. Felt, E. Chin, S. Hanna, D. Song, D. Wagner, Android permissions demystified, in Pro-
ceedings of the 18th ACM Conference on Computer and Communications Security (ACM,
2011), pp. 627–638
23. W. Tang, G. Jin, J. He, X. Jiang, Extending android security enforcement with a security
distance model, in 2011 International Conference on Internet Technology and Applications
(IEEE, 2011), pp. 1–4
24. M. Zheng, M. Sun, J.C.S. Lui, Droid analytics: a signature based analytic system to collect,
extract, analyze and associate android malware, in 2013 12th IEEE International Conference
on Trust, Security and Privacy in Computing and Communications (IEEE, 2013), pp. 163–171
25. E.R. Wognsen, H.S. Karlsen, M.C. Olesen, R.R. Hansen, Formalisation and analysis of Dalvik
bytecode. Sci. Comput. Program. 92, 25–55 (2014)
26. W. Enck, M. Ongtang, P. McDaniel, On lightweight mobile phone application certification, in
Proceedings of the 16th ACM Conference on Computer and Communications Security (ACM,
2009), pp. 235–245
27. R. Sato, D. Chiba, S. Goto, Detecting android malware by analyzing manifest files. Proc.
Asia-Pac. Adv. Netw. 36, 23–31 (2013)
28. D.J. Wu, C.-H. Mao, T.-E. Wei, H.-M. Lee, K.-P. Wu, Droidmat: android malware detection
through manifest and api calls tracing, in 2012 Seventh Asia Joint Conference on Information
Security (IEEE, 2012), pp. 62–69
29. W. Zhou, Y. Zhou, X. Jiang, P. Ning, Detecting repackaged smartphone applications in third-
party android marketplaces, in Proceedings of the Second ACM Conference on Data and
Application Security and Privacy (ACM, 2012), pp. 317–326
30. C.Y. Huang, Y.-T. Tsai, C.-H. Hsu, Performance evaluation on permission-based detection for
android malware, in Advances in Intelligent Systems and Applications, vol. 2 (Springer, Berlin,
Heidelberg, 2013), pp. 111–120
31. Y. Aafer, W. Du, H. Yin, Droidapiminer: mining api-level features for robust malware detection
in android, in International Conference on Security and Privacy in Communication Systems
(Springer, Cham, 2013), pp. 86–103
32. E. Chin, A.P. Felt, K. Greenwood, D. Wagner, Analyzing inter-application communication in
Android, in Proceedings of the 9th International Conference on Mobile Systems, Applications,
and Services (ACM, 2011), pp. 239–252
33. D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, K. Rieck, C.E.R.T. Siemens, Drebin: ef-
fective and explainable detection of android malware in your pocket. Ndss 14, 23–26 (2014)
34. I. Burguera, U. Zurutuza, S. Nadjm-Tehrani, Crowdroid: behavior-based malware detection
system for android, in Proceedings of the 1st ACM Workshop on Security and Privacy in
Smartphones and Mobile Devices (ACM, 2011), pp. 15–26
35. M. Zhao, F. Ge, T. Zhang, Z. Yuan, AntiMalDroid: an efficient SVM-based malware detection
framework for android, in International Conference on Information Computing and Applica-
tions (Springer, Berlin, Heidelberg, 2011), pp. 158–166

36. W. Enck, P. Gilbert, S. Han, V. Tendulkar, B.-G. Chun, L.P. Cox, J. Jung, P. McDaniel, A.N.
Sheth, TaintDroid: an information-flow tracking system for realtime privacy monitoring on
smartphones. ACM Trans. Comput. Syst. (TOCS) 32(2) (2014)
37. L.K. Yan, H. Yin, DroidScope: seamlessly reconstructing the OS and Dalvik semantic views
for dynamic android malware analysis, in Presented as Part of the 21st USENIX Security
Symposium (USENIX Security 12) (2012), pp. 569–584
38. Y. Feng, S. Anand, I. Dillig, A. Aiken, Apposcopy: semantics-based detection of android
malware through static analysis, in Proceedings of the 22nd ACM SIGSOFT International
Symposium on Foundations of Software Engineering (ACM, 2014), pp. 576–587
39. A. Narayanan, M. Chandramohan, L. Chen, Y. Liu, Context-aware, adaptive and scal-
able android malware detection through online learning (extended version). arXiv preprint
arXiv:1706.00947 (2017)
40. BlackHat, Reverse Engineering with Androguard https://code.google.com/androguard (Online;
Accessed 29 Mar. 2013)
41. H. Kang, J. Jang, A. Mohaisen, H.K. Kim, Detecting and classifying android malware using
static analysis along with creator information. Int. J. Distrib. Sens. Netw. 11(6) (2015)
42. D. Octeau, S. Jha, M. Dering, P. McDaniel, A. Bartel, L. Li, J. Klein, Y.L. Traon, Combin-
ing static analysis with probabilistic models to enable market-scale android inter-component
analysis, in ACM SIGPLAN Notices, vol. 51, no. 1 (ACM, 2016), pp. 469–484
43. B. Amos, H. Turner, J. White, Applying machine learning classifiers to dynamic android
malware detection at scale, in 2013 9th International Wireless Communications and Mobile
Computing Conference (IWCMC) (IEEE, 2013), pp. 1666–1671
44. W.-C. Wu, S.-H. Hung, DroidDolphin: a dynamic Android malware detection framework using
big data and machine learning, in Proceedings of the 2014 Conference on Research in Adaptive
and Convergent Systems (ACM, 2014), pp. 247–252
45. S. Sheen, R. Anitha, V. Natarajan, Android based malware detection using a multifeature
collaborative decision fusion approach. Neurocomputing 151, 905–912 (2015)
46. M. Damshenas, A. Dehghantanha, K.-K. Raymond Choo, R. Mahmud, M0droid: an android
behavioral-based malware detection model. J. Inf. Priv. Secur. 11(3), 141–157 (2015)
47. R. Vinayakumar, K.P. Soman, P. Poornachandran, S. Sachin Kumar, Detecting android malware
using long short-term memory (LSTM). J. Intell. Fuzzy Syst. 34(3), 1277–1288 (2018)
48. M.Z. Mas’ ud, S. Sahib, M.F. Abdollah, S. Rahayu Selamat, R. Yusof, Analysis of features
selection and machine learning classifier in android malware detection, in 2014 International
Conference on Information Science & Applications (ICISA) (IEEE, 2014), pp. 1–5
49. A. Narayanan, M. Chandramohan, L. Chen, Y. Liu, A multi-view context-aware approach
to Android malware detection and malicious code localization. Empirical Softw. Eng. 23(3),
1222–1274 (2018)
50. K. Allix, T.F. Bissyandé, Q. Jérome, J. Klein, Y. Le Traon, Empirical assessment of machine
learning-based malware detectors for Android. Empirical Softw. Eng. 21(1), 183–211 (2016)
51. A. Azmoodeh, A. Dehghantanha, K.-K. Raymond Choo, Robust malware detection for internet
of (battlefield) things devices using deep eigenspace learning. IEEE Trans. Sustain. Comput.
4(1), 88–95 (2018)
52. A.F.A. Kadir, N. Stakhanova, A.A. Ghorbani, Android botnets: what urls are telling us, in
International Conference on Network and System Security (Springer, Cham, 2015), pp. 78–91
53. Y. Zhou, X. Jiang, Dissecting android malware: characterization and evolution, in 2012 IEEE
Symposium on Security and Privacy (IEEE, 2012), pp. 95–109
54. Botnet Research Team. SandDroid: An APK Analysis Sandbox. Xi’an Jiaotong University
(2014)
55. M. Dash, H. Liu, Consistency-based search in feature selection. Artif. Intell. 151(1–2), 155–176
(2003)
56. R. Kohavi, G.H. John, Wrappers for feature subset selection. Artif. Intell. 97(1–2), 273–324
(1997)
57. Z. Pawlak, Rough sets. Int. J. Comput. Inf. Sci. 11(5), 341–356 (1982)

58. C.-Y. Huang, Y.-T. Tsai, C.-H. Hsu, Performance evaluation on permission-based detection
for android malware, in Advances in Intelligent Systems and Applications, vol. 2 (Springer,
Berlin, Heidelberg, 2013), pp. 111–120
59. I. Santos, B. Sanz, C. Laorden, F. Brezo, P.G. Bringas, Opcode-sequence-based semi-supervised
unknown malware detection, in Computational Intelligence in Security for Information Systems
(Springer, Berlin, Heidelberg, 2011), pp. 50–57
60. S. Kokoska, C. Nevison, Critical values for Cochran’s test, in Statistical Tables and Formulae
(Springer, New York, 1989), p. 74
