
A Hybrid Community Based Rough Set Feature Selection Technique in Android Malware Detection

Chapter · January 2018


DOI: 10.1007/978-981-10-6916-1_23





Abhishek Bhattacharya and Radha Tamal Goswami

Abstract Feature selection is the process of identifying the most significant set of features, which reduces dimensionality and yields more analytical results. Choosing relevant attributes is a critical issue both for competitive classifiers and for data reduction. This work proposes a hybrid feature selection technique that combines the Rough Set Quick Reduct algorithm with a community detection scheme. The proposed technique is applied to the Android malware detection domain and compared with the performance of existing feature selectors. It produces the highest average classification accuracy of 97.88% and average ROC values of up to 0.987.

Keywords Feature selection ⋅ Android malware ⋅ Rough set ⋅ Community detection ⋅ Quick reduct

A. Bhattacharya (✉)
Institute of Engineering & Management, Kolkata, West Bengal, India
e-mail: abhishek.bhattacharya@iemcal.com
R. T. Goswami
Techno India College of Technology, Kolkata, West Bengal, India
e-mail: tamal.goswami@gmail.com

© Springer Nature Singapore Pte Ltd. 2018
X.-S. Yang et al. (eds.), Smart Trends in Systems, Security and Sustainability, Lecture Notes in Networks and Systems 18,
https://doi.org/10.1007/978-981-10-6916-1_23

1 Introduction

Android smartphones are vulnerable to threats such as the activation of malevolent services without the user's knowledge and denial of service. Moreover, Android applications are easy targets for reverse engineering, an explicit characteristic of Java applications that is often abused by malicious attackers. Unlike other mobile operating systems, Android maintains openness and does not put much constraint on its users in downloading and uploading apps. Every Android app requires a set of permissions, and these permissions are generally requested by applications during installation on mobile devices. The permission vector of an Android app may contain around 135 features. Some of the unnecessary permissions of an over-privileged app may be leaked to malware apps, so it is quite feasible to identify malware based on the permission sets requested at installation time. Handling such a large volume of data is, however, extraordinarily difficult because of its dimensionality, which may slow down the learning process and degrade learning efficiency. As a common technique in data mining, feature selection has attracted much attention in recent times [1, 2]. Feature reduction techniques are therefore highly required to reduce the dimensionality of the data.

Community detection is the process of grouping data according to certain similarity distances in a weighted graph. It is one of the major tools in social network analysis, with applications such as viral marketing and the sharing of information, sentiments and emotions [3], but its use for feature reduction in the context of malware detection is quite rare. In this paper, a hybrid community based rough set feature selection framework for Android malware detection is proposed, in which community detection and the rough set Quick Reduct algorithm are hybridized to select a more prominent set of features. The reason behind hybridization is that, in many cases, a single technique has certain limitations and cannot always be treated as the most efficient technique. In such a scenario, it is better to hybridize the technique with other proven techniques to bring the solution closer to optimal.

The rest of this paper is organized as follows. Works related to malware detection and feature reduction techniques are discussed in Sect. 2. The proposed methodology for hybrid community based feature reduction is discussed in Sect. 3, followed by results and discussion in Sect. 4. We conclude in Sect. 5 by discussing the future scope of this work.

2 Related Works

Different feature reduction techniques for permission based detection of Android malware have been proposed in recent times. In recent years, rough set theory (RST) has become a subject of great importance to researchers and has been applied to many data mining domains. Using RST, it is possible to find an optimal subset of the attributes of a dataset with discrete attribute values that is most informative and incurs minimal loss of information. One such work [4] presents the application of rough set theory to feature selection in pattern recognition. The authors in [5] propose a customer classification prediction system based on rough sets to reduce complexity. In [6], a two-phase rough set based model for multi-criteria decision making is discussed. The authors in [7] present a feature selection and classification technique based on a rough set approach. A new approach to feature selection based on rough sets is proposed in [8]. The authors in [9] discuss the hybridization of rough sets with a statistical method (Principal Component Analysis) for feature reduction. The work in [10] discusses the hybridization of rough sets with fuzzy sets for feature reduction. Similarly, the hybridization of rough sets with metaheuristic algorithms is proposed in [11].

3 Proposed Methodology

This section describes a static framework for permission driven Android malware detection using a community based feature reduction approach.

3.1 Dataset Preparation

To start with, permission sets are extracted from known malware and benign applications, and permission vectors are created from a large number of apps (3004 benign and 1363 malware). Dataset 1 comprises 504 benign and 213 malware samples. The malware samples were collected from Contagiodump Mobile Dump [12] and from different user agencies. Dataset 2 comprises 2500 benign and 1150 malware samples, downloaded from Wang's repository [13], as the research community is keen on using standardized datasets.
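
The conversion from permission sets to permission vectors can be illustrated with a minimal sketch. The chapter does not describe its extraction tooling, so the helper `build_permission_vectors`, the permission names and the two toy apps below are hypothetical and serve only to show the binary encoding.

```python
from typing import Dict, List, Set, Tuple


def build_permission_vectors(
    apps: Dict[str, Set[str]],
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Map each app's permission set to a 0/1 vector over all seen permissions."""
    # Fixed, sorted column order so every vector uses the same feature index.
    columns = sorted(set().union(*apps.values())) if apps else []
    vectors = {
        name: [1 if perm in perms else 0 for perm in columns]
        for name, perms in apps.items()
    }
    return columns, vectors


# Toy, made-up apps (not taken from Dataset 1 or Dataset 2).
toy_apps = {
    "benign_app": {"INTERNET", "ACCESS_NETWORK_STATE"},
    "malware_app": {"INTERNET", "SEND_SMS", "READ_CONTACTS"},
}
columns, vectors = build_permission_vectors(toy_apps)
```

Each column of the resulting matrix corresponds to one permission, which is the unit that the later feature reduction steps operate on.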

3.2 Feature Selection Technique

High-dimensional data is extraordinarily difficult to handle, so feature reduction techniques are highly required to reduce the dimensionality of the data. The basic reason for feature reduction is that datasets contain superfluous and unimportant attributes, which may affect the performance of machine learning classifiers and generate inaccurate results. Feature selection allows faster model construction by reducing the number of features and hence helps in visualizing the trend of the data. Feature reduction has been applied to permission based malware detection in order to improve its scalability, efficiency and classification accuracy [1]. One popular method of feature selection is the calculation of similarities between feature vectors, and researchers have used various similarity measures to find the similarity between attributes. This work measures similarities between feature vectors through term level similarity, using the correlation coefficient, which is defined as follows.

Correlation Coefficient:
It can be used to measure the degree of similarity between two feature vectors x and y, each with n entries:

\[
r = \frac{n \sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n \sum x^{2} - \left(\sum x\right)^{2}\right]\left[n \sum y^{2} - \left(\sum y\right)^{2}\right]}}
\]

The computed value lies in [−1, 1]. A feature is considered redundant if it is highly correlated with other features.
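
As a worked illustration, the sum-based formula above can be transcribed directly. This is a minimal sketch rather than the authors' implementation; in particular, returning 0.0 for a constant (zero-variance) feature vector is an assumption of this sketch, since the coefficient is undefined in that case.

```python
import math
from typing import Sequence


def correlation_coefficient(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson r computed with the sum-based formula for two feature vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        # Assumption of this sketch: treat an undefined r (constant vector) as 0.
        return 0.0
    return numerator / denominator


# Two identical toy permission columns are perfectly correlated.
r_same = correlation_coefficient([1, 0, 1, 1], [1, 0, 1, 1])
```

Applying this function to every pair of feature columns yields the similarity matrix used in the next step.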

3.3 Feature Similarity Graph (FSG) Generation

The resulting similarity matrix can be considered as a weighted complete graph. This graph is referred to as the Feature Similarity Graph (FSG), an ordered pair G = (V, E) comprising a set V of nodes together with a set E of edges. Every permission vector is considered a node, and a weighted edge is placed between every pair of nodes, where the weight denotes the similarity between the two permission vectors. Since the FSG contains an edge between every pair of nodes, if the number of permission vectors is n, the number of edges is n(n − 1)/2.
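
The construction of the FSG as a complete weighted graph can be sketched as follows. The dictionary-of-edges representation, the helper `build_fsg` and the toy data are hypothetical choices for illustration. To keep the block self-contained, a simple agreement ratio (`overlap`) stands in for the similarity measure; in the proposed method the correlation coefficient plays this role.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

Edge = Tuple[str, str]


def build_fsg(
    features: Dict[str, Sequence[float]],
    similarity: Callable[[Sequence[float], Sequence[float]], float],
) -> Dict[Edge, float]:
    """Return the complete weighted graph as {(u, v): weight} with u < v."""
    names = sorted(features)
    return {
        (u, v): similarity(features[u], features[v])
        for u, v in combinations(names, 2)
    }


def overlap(x: Sequence[float], y: Sequence[float]) -> float:
    """Toy stand-in similarity: fraction of positions where x and y agree."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)


# Made-up feature columns over four apps.
toy_features = {
    "INTERNET": [1, 1, 1, 0],
    "SEND_SMS": [0, 1, 0, 0],
    "READ_CONTACTS": [0, 1, 1, 0],
    "CAMERA": [1, 0, 0, 1],
}
fsg = build_fsg(toy_features, overlap)
n = len(toy_features)
```

With n = 4 nodes the graph has 4 · 3 / 2 = 6 edges, matching the edge count stated above.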

3.4 Community Detection

A complex network is said to have community structure if its nodes can be grouped into sets such that every set is tightly connected internally. For non-overlapping community detection, this implies that the network splits naturally into sets of nodes with dense relations internally and sparser relations between sets. The idea is based on the principle that a pair of nodes is more likely to be connected if both are members of the same community, and less likely to be connected if they belong to different communities. The Infomap [14] community detection technique is presented as follows.

3.4.1 Community Detection Through Infomap [14]

In the proposed methodology, the similarity of permission vectors is the basis for finding communities within the feature similarity graph (FSG). Infomap is a community detection tool made available by the Infomap project [14]. Each community is represented by a representative node (permission vector), and these community representatives are considered in the formation of the reduct using Algorithm 1. The numbers of communities detected by Infomap for Dataset 1 and Dataset 2 are 1 and 3, respectively.

3.5 Threshold Feature Similarity Graph (FSG) Generation

As the number of communities identified by Infomap is significantly low for the FSG generated from the correlation coefficient, a threshold version of the Feature Similarity Graph (FSG) has been considered. The threshold is computed as the average weight of all edges in the graph, and the edges whose weights are less than the threshold are removed. The threshold FSG therefore contains fewer lightweight edges than the original FSG. After creating the threshold FSG, Algorithm 1 is applied to it to generate communities. The numbers of communities detected by Infomap on the threshold FSG based on the mean correlation coefficient are 6 for Dataset 1 and 25 for Dataset 2. The community representatives are finally considered in the formation of the intermediate reduct.
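
The thresholding step can be sketched as below, assuming the FSG is stored as a dictionary from node pairs to weights (a hypothetical representation chosen for this example). The toy weights are made up. The community detection itself is not shown: the pruned graph would be handed to Infomap, whose interface is outside the scope of this sketch.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]


def threshold_fsg(fsg: Dict[Edge, float]) -> Tuple[float, Dict[Edge, float]]:
    """Drop every edge whose weight is below the mean edge weight."""
    if not fsg:
        return 0.0, {}
    threshold = sum(fsg.values()) / len(fsg)
    # Edges with weight less than the threshold are removed; the rest are kept.
    kept = {edge: w for edge, w in fsg.items() if w >= threshold}
    return threshold, kept


# Made-up edge weights for a four-node graph.
toy_fsg = {
    ("a", "b"): 0.9,
    ("a", "c"): 0.1,
    ("b", "c"): 0.2,
    ("c", "d"): 0.8,
}
threshold, pruned = threshold_fsg(toy_fsg)
```

Removing the weak edges leaves a sparser graph, which is consistent with the larger number of communities reported for the threshold FSG.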

3.6 Hybridization Using Supervised Rough Set Quick Reduct (SRSQR) Algorithm

Feature selection methods are generally classified into three types: filter, wrapper and hybrid approaches. The major difficulty of the filter method is that the interaction between features is not considered. The wrapper method uses a given learning algorithm to assess the feature sets. The disadvantages of these two methods can be overcome by hybrid approaches. Hybrid models work mainly in two stages. In the first phase, a filter method is applied to remove irrelevant features. In the second phase, a wrapper method is used on the filtered dataset to obtain better classification accuracy. In hybrid models, the classification result is generally the same as or better than that of filter or wrapper methods [15]. Rough set theory is used as a tool to discover data dependencies and to decrease the number of attributes contained in a dataset [16, 17]. Like neural networks and fuzzy sets, rough set theory has also been used in hybridization with conventional techniques. Some well-known hybridizations of rough sets with other techniques are rough sets with fuzzy sets, rough sets with neural networks and rough sets with metaheuristic algorithms [18].

In this work, the Infomap based community detection technique has been hybridized with the supervised rough set Quick Reduct (SRSQR) algorithm. In the first phase, Infomap based community detection is applied to the datasets to generate intermediate reducts, and in the second phase, SRSQR is executed on those intermediate reducts to produce more concrete reducts. Finally, machine learning classifiers are applied to those reducts to generate classification results. In the Quick Reduct algorithm, the reduction of features is determined by comparing the equivalence relations generated by sets of attributes. SRSQR tries to calculate a minimal reduct without exhaustively generating all possible subsets. As a result, it is not guaranteed to find a minimal reduct; in practice it generates a nearly minimal reduct. Table 1 shows the number of features present in the reducts generated by the proposed method and by SRSQR for both Dataset 1 and Dataset 2.

Table 1 Reduct details of community detection using Infomap

Datasets    Objects   Features   Reduct size (no. of features)
                                 SRSQR   Proposed method
Dataset 1   717       82         38      28
Dataset 2   3650      88         37      13
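
The greedy search described in Sect. 3.6 can be sketched with the textbook Quick Reduct procedure, which repeatedly adds the attribute that most increases the rough set dependency degree until the dependency of the full candidate set is reached. This is a minimal sketch and not necessarily the authors' SRSQR implementation; the function names and the toy data are hypothetical. In the proposed method, the `candidates` argument would hold the community representatives (the intermediate reduct).

```python
from collections import defaultdict
from typing import Hashable, List, Sequence

Row = Sequence[Hashable]


def dependency(rows: List[Row], labels: Sequence[Hashable],
               attrs: Sequence[int]) -> float:
    """Share of objects whose equivalence class under attrs has one label."""
    seen_labels = defaultdict(set)
    sizes = defaultdict(int)
    for row, label in zip(rows, labels):
        key = tuple(row[a] for a in attrs)  # equivalence class of this object
        seen_labels[key].add(label)
        sizes[key] += 1
    # Positive region: classes that map to exactly one decision label.
    positive = sum(sizes[k] for k, labs in seen_labels.items() if len(labs) == 1)
    return positive / len(rows)


def quick_reduct(rows: List[Row], labels: Sequence[Hashable],
                 candidates: Sequence[int]) -> List[int]:
    """Greedily add the attribute that raises the dependency degree most."""
    reduct: List[int] = []
    target = dependency(rows, labels, list(candidates))
    best = dependency(rows, labels, reduct)
    while best < target:
        scored = [(dependency(rows, labels, reduct + [a]), a)
                  for a in candidates if a not in reduct]
        score, attr = max(scored, key=lambda pair: pair[0])
        if score <= best:
            break  # no single attribute improves the dependency any further
        reduct.append(attr)
        best = score
    return reduct


# Made-up data: three binary permission features, attribute 0 decides the class.
toy_rows = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 0)]
toy_labels = ["malware", "malware", "benign", "benign"]
toy_reduct = quick_reduct(toy_rows, toy_labels, candidates=[0, 1, 2])
```

Because the search is greedy, the result is a reduct that preserves the dependency degree of the candidate set but is not guaranteed to be the smallest such subset, in line with the remark in Sect. 3.6.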

4 Result and Discussions

One of the useful properties of machine learning algorithms is that they improve their ability to discriminate normal behavior from anomalous behavior with experience. The performance of the proposed hybrid rough set community based feature selection approach has been compared with six existing feature selection tools of the Weka toolkit [19], which are used frequently in the literature. For each existing feature selector (Pearson coefficient, Information Gain, Gain Ratio, Chi Square, One R and Relief), the top ten ranked features have been considered. The performance of the Supervised Rough Set Quick Reduct algorithm (SRSQR) is also compared with the proposed hybrid method. Experiments were carried out to evaluate the effectiveness of the proposed method with eight machine learning classifiers (BayesNet, Naïve Bayes, SMO, Decision Table, Random Tree, Random Forest, J48 and Multilayer Perceptron), and Figs. 1–4 depict the results. Figure 1 reveals that the proposed method produces the highest average TPR % on Dataset 1 (97.13%) and on Dataset 2 (99.28%), greater than that of the six conventional feature rankers on both datasets.

The proposed hybrid method generates the highest average F1 score on Dataset 1 (0.916) and on Dataset 2 (0.985), as presented in Fig. 2, which shows that it outperforms the conventional feature selectors. Figure 3 shows that the proposed hybrid method achieves the highest average ROC value on Dataset 1 (0.869) and likewise on Dataset 2 (0.987). The comparison of average accuracy percentages on Dataset 1 and Dataset 2 is shown in Fig. 4: the proposed hybrid method obtains the highest average accuracy on Dataset 1 (87.80%) and on Dataset 2 (97.88%). This shows the effectiveness of the proposed feature selection strategy over conventional feature rankers.
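
For reference, the reported metrics follow the standard confusion-matrix definitions, sketched below for a single classifier. Treating malware as the positive class (label 1) is an assumption of this sketch, as the chapter does not state it, and the toy labels are made up. The averages reported above would then be means of such per-classifier values over the eight classifiers.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """TPR, precision, F1 and accuracy with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * tpr / (precision + tpr)) if (precision + tpr) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"tpr": tpr, "precision": precision, "f1": f1, "accuracy": accuracy}


# Made-up ground truth and predictions for eight apps.
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 0, 1, 0, 1]
toy_metrics = binary_metrics(toy_true, toy_pred)
```

The ROC value reported in Fig. 3 additionally requires classifier scores rather than hard predictions and is therefore not covered by this sketch.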

Fig. 1 Average TPR % of different reducts for Dataset 1 and Dataset 2 (y-axis: Avg. TPR %; reducts compared: Pearson, IG, Gain Ratio, Chi Sq, One R, Relief, Quick Reduct, Proposed Method; series: Dataset 1, Dataset 2)

Fig. 2 Average F1 score of different reducts for Dataset 1 and Dataset 2 (y-axis: Avg. F1 Score; reducts compared: Pearson, IG, Gain Ratio, Chi Sq, One R, Relief, Quick Reduct, Proposed Method; series: Dataset 1, Dataset 2)

Fig. 3 Average ROC values of different reducts for Dataset 1 and Dataset 2 (y-axis: Avg. AUC; reducts compared: Pearson, IG, Gain Ratio, Chi Sq, One R, Relief, Quick Reduct, Proposed Method; series: Dataset 1, Dataset 2)

Fig. 4 Average accuracy % of different reducts for Dataset 1 and Dataset 2 (y-axis: Avg. Accuracy %; reducts compared: Pearson, IG, Gain Ratio, Chi Sq, One R, Relief, Quick Reduct, Proposed Method; series: Dataset 1, Dataset 2)

5 Conclusion and Future Scope

In this work, a hybrid technique for permission based detection of Android malware through community based rough set feature selection has been proposed. It yields better classification performance than existing feature rankers such as IG, Gain Ratio, Pearson's correlation coefficient, OneR, Chi Square and Relief for most of the machine learning classifiers. The main contribution of this work is therefore to show that a hybrid community based rough set feature selection technique can be used for filtering out malware in Android. From the experimental results, it can be concluded that the proposed approach is competitive with well-known feature selection algorithms. Even when a larger dataset is used, the feature selection techniques developed in this work can be applied to make classification more efficient. In future, rough set theory hybridized with metaheuristic algorithms can be explored to design more efficient feature selection methods for permission based Android malware detection.

References

1. Hassanien, A.E., Tolba, M., Azar, A.T.: Advanced machine learning technologies and applications. Communications in Computer and Information Science, vol. 488. Springer GmbH, Berlin/Heidelberg (2014). ISBN: 978-3-319-13460-4
2. Hu, Q.H., Yu, D.R., Xie, Z.X.: Information-preserving hybrid data reduction based on fuzzy-rough techniques. Pattern Recogn. Lett. 27(5), 414–423 (2006)
3. Baumgarten, M.M.D., Mulvenna, M.D., Rooney, N., Reid, J.: Keyword-based sentiment
mining using twitter. Int. J. Ambient Comput. Intell. (IJACI) 5(2), 56–69 (2013)
4. Swiniarski, R.W., Skowron, A.: Rough set methods in feature selection and recognition.
Pattern Recogn. Lett. 24, 833–849 (2003)
5. Li, J., Wang, X., Xu, S.: Prediction of customer classification based on rough set theory.
Procedia Eng. 7, 366–370 (2010)
6. Chakhar, S., Saad, I.: Dominance-based rough set approach for groups in multicriteria
classification problems. Decis. Support Syst. 54, 372–380 (2012)
7. Kadzinski, M., Greco, S., Slowinski, R.: Robust ordinal regression for dominance-based
rough set approach to multiple criteria sorting. Inf. Sci. 83, 211–228 (2014)
8. Zhang, N., Yao, J.T.: A rough sets based approach to feature selection. In: Proceedings of the
International Conference of the North American Fuzzy Information Processing Society,
pp. 434–439 (2004)
9. Zhong, N., Dong, J.: Using rough sets with heuristics for feature selection. J. Intell. Inf. Syst.
16, 199–214 (2001)
10. Meher, S.K.: Explicit rough-fuzzy pattern classification model. Pattern Recogn. Lett. 36,
54–61 (2014)
11. Inbarani, H.H., Azar, T.A., Jothi, G.: Supervised hybrid feature selection based on PSO and
rough sets for medical diagnosis. Comput. Methods Programs Biomed. 113, 175–185 (2014)
12. Contagiodump mobile dump. http://contagiodump.blogspot.in/ (2016). Accessed 31 October 2016
13. Wei Wang’s Home Page. http://infosec.bjtu.edu.cn/wangwei/?page_id=85 (2015). Accessed 31 October 2015
14. Infomap Project. http://www.mapequation.org
15. Kishor, D.R., Venkateswarlu, N.B.: Novel hybridization of expectation-maximization and K-means algorithms for better clustering performance. Int. J. Ambient Comput. Intell. (IJACI) 7(2), 47–74 (2016)
16. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer (1991)
17. Pawlak, Z.: Rough set approach to knowledge-based decision support. Eur. J. Oper. Res. 99,
48–57 (1997)
18. Dorigo, M., Stutzle, T.: Ant colony optimization. A Bradford Book (2004)
19. Weka Toolkit. http://www.cs.waikato.ac.nz/ml/weka/ (2016). Accessed 2016
Lecture Notes in Networks and Systems 18. Xin-She Yang, Atulya K. Nagar, Amit Joshi (Editors): Smart Trends in Systems, Security and Sustainability. Proceedings of WS4 2017
