Abstract Feature selection is the process of identifying the most significant
subset of features, which reduces dimensionality and produces the most
analytically useful result. Choosing relevant attributes is a critical issue both
for competitive classifiers and for data reduction. This work proposes a hybrid
feature selection technique based on the Rough Set Quick Reduct algorithm combined
with a community detection scheme. The proposed technique is applied to the
Android malware detection domain and compared with the performance of existing
feature selectors. It produces the highest average classification accuracy,
97.88%, and average ROC values up to 0.987.
1 Introduction
Android smartphones are now vulnerable to threats such as malevolent services
activated without the user’s knowledge, denial of service, etc. Moreover, Android
applications are easy targets for reverse engineering, an explicit characteristic
of Java applications that is often abused by malicious attackers. Unlike other
mobile operating systems, Android maintains openness and places few constraints
on its users in downloading and uploading apps. Every Android app requires
a set of permissions and these permissions are generally requested by applications
A. Bhattacharya (✉)
Institute of Engineering & Management, Kolkata, West Bengal, India
e-mail: abhishek.bhattacharya@iemcal.com
R. T. Goswami
Techno India College of Technology, Kolkata, West Bengal, India
e-mail: tamal.goswami@gmail.com
2 Related Works
discussed the hybridization of rough sets and a statistical method (Principal
Component Analysis) in feature reduction. The work in [10] discusses the
hybridization of rough sets with fuzzy sets for feature reduction. Similarly, the
hybridization of rough sets with metaheuristic algorithms is proposed in [11].
3 Proposed Methodology
This section describes a static framework for permission-driven Android malware
detection using a community-based feature reduction approach.
To start with, permission sets are extracted from known malware and benign
applications, and permission vectors are created from a large number of apps
(3004 benign and 1363 malware). Dataset 1 comprises 504 benign and 213 malware
samples. The malware samples are collected from Contagiodump Mobile Dump [12] and
from different user agencies. Dataset 2 comprises 2500 benign and 1150 malware
samples, which are downloaded from Wang’s repository [13], as the research
community is keen on using standardized datasets.
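The step of turning permission sets into permission vectors can be sketched as
follows. This is a minimal illustration, assuming a plain binary encoding (1 if an
app requests a permission, 0 otherwise); the function name
build_permission_vectors and the two sample apps are hypothetical and not taken
from the datasets above.

```python
from typing import Dict, List, Set, Tuple


def build_permission_vectors(
    apps: Dict[str, Set[str]],
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Encode each app's requested-permission set as a binary vector."""
    # The sorted union of all observed permissions fixes a stable feature order.
    vocabulary = sorted(set().union(*apps.values()))
    vectors = {
        name: [1 if perm in perms else 0 for perm in vocabulary]
        for name, perms in apps.items()
    }
    return vocabulary, vectors


# Hypothetical apps; real inputs would come from each app's manifest.
apps = {
    "benign_app": {
        "android.permission.INTERNET",
        "android.permission.CAMERA",
    },
    "malware_app": {
        "android.permission.INTERNET",
        "android.permission.SEND_SMS",
        "android.permission.READ_CONTACTS",
    },
}
vocabulary, vectors = build_permission_vectors(apps)
```

Each position in a vector then corresponds to one permission, so a feature
selector can treat the permissions as candidate attributes.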
Correlation Coefficient:
It can be used to measure the degree of similarity between feature vectors:
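The text does not state which correlation coefficient is meant. As a sketch,
assuming the Pearson coefficient (which the evaluation later lists among the
compared feature rankers), the similarity of two feature vectors could be computed
as below. The function name pearson and the zero-variance convention are
assumptions made for illustration.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant vector is treated as uncorrelated (r = 0).
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Two permissions observed over four hypothetical apps.
identical = pearson([1, 0, 1, 0], [1, 0, 1, 0])
opposite = pearson([1, 0, 1, 0], [0, 1, 0, 1])
unrelated = pearson([1, 1, 0, 0], [1, 0, 1, 0])
```

Values near 1 or -1 indicate strongly related features, and values near 0
indicate little linear relation.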
252 A. Bhattacharya and R. T. Goswami
Feature selection methods are generally classified into three types: filter,
wrapper and hybrid approaches. The major difficulty with filter methods is that
the interaction among features is not considered. Wrapper methods use a given
learning algorithm to assess the feature sets. The disadvantages of these two
methods can be mitigated by hybrid approaches. Hybrid models work mainly in
two stages. In the first phase, a filter method is applied to remove irrelevant
features. In the second phase, a wrapper method is applied to the filtered dataset
to obtain better classification accuracy. In hybrid models, the classification
result is generally the same as or better than that of filter or wrapper methods
alone [15]. Rough Set Theory is used as a tool to discover data dependencies and
to reduce the number of attributes contained in a dataset [16, 17]. Like neural
networks and fuzzy sets, rough set theory has also been hybridized with
conventional techniques. Some notable hybridizations of rough sets with other
techniques are rough sets with fuzzy sets, rough sets with neural networks and
rough sets with metaheuristic algorithms [18].
In this work, an Infomap-based community detection technique has been hybridized
with the supervised rough set Quick Reduct (SRSQR) algorithm. In the first phase,
the Infomap-based community detection technique is applied to the datasets to
generate intermediate reducts, and in the second phase, SRSQR is executed on those
intermediate reducts to produce more concrete reducts. Finally, machine learning
classifiers are applied to those reducts to generate classification results. In
the Quick Reduct algorithm, the reduction of features is determined by comparing
the equivalence relations generated by sets of attributes. SRSQR tries to
calculate a minimal reduct without exhaustively generating all possible subsets.
As a result, it is not guaranteed to find a minimal reduct; in practice it
generates a nearly minimal reduct.
Table 1 shows the number of features present in the reducts generated by the
proposed method and by SRSQR for both Dataset 1 and Dataset 2.
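The greedy search behind Quick Reduct can be sketched as follows. This is a
minimal illustration of the commonly described algorithm, not the authors'
implementation: it grows a subset one attribute at a time until the rough set
dependency degree of the subset equals that of the full attribute set. The names
dependency_degree and quick_reduct, the tie-breaking rule and the toy data are
assumptions. The Infomap phase is not shown.

```python
from collections import defaultdict
from typing import Hashable, Iterable, List, Sequence, Set


def dependency_degree(
    rows: Sequence[Sequence[Hashable]],
    labels: Sequence[Hashable],
    attrs: Iterable[int],
) -> float:
    """Fraction of objects whose equivalence class under attrs has one label."""
    if not rows:
        return 0.0
    cols = sorted(attrs)
    classes = defaultdict(list)
    for row, label in zip(rows, labels):
        # Objects with equal values on every chosen attribute are indiscernible.
        classes[tuple(row[c] for c in cols)].append(label)
    positive = sum(len(m) for m in classes.values() if len(set(m)) == 1)
    return positive / len(rows)


def quick_reduct(
    rows: Sequence[Sequence[Hashable]], labels: Sequence[Hashable]
) -> Set[int]:
    """Greedy forward selection of attribute indices (a near-minimal reduct)."""
    all_attrs = set(range(len(rows[0])))
    target = dependency_degree(rows, labels, all_attrs)
    reduct: Set[int] = set()
    current = dependency_degree(rows, labels, reduct)
    while current < target:
        best_attr, best_gamma = -1, -1.0
        for attr in sorted(all_attrs - reduct):
            gamma = dependency_degree(rows, labels, reduct | {attr})
            if gamma > best_gamma:
                best_attr, best_gamma = attr, gamma
        # Assumption: when no attribute raises the dependency degree, the
        # lowest-index candidate is still added so the search terminates.
        reduct.add(best_attr)
        current = best_gamma
    return reduct


# Toy decision table: three binary permissions; the label equals attribute 0.
rows: List[Sequence[int]] = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)]
labels = [1, 1, 0, 0, 1]
selected = quick_reduct(rows, labels)
```

Because each step keeps only the locally best attribute, the result depends on
the order in which ties are broken, which is why the search yields a nearly
minimal rather than a guaranteed minimal reduct.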
One of the useful properties of machine learning algorithms is that they improve
their ability to discriminate normal behavior from anomalous behavior with
experience. The performance of the proposed hybrid rough set community-based
feature selection approach has been compared with six existing feature selection
tools of the Weka toolkit [19], which are used frequently in the literature. For
each existing feature selector (Pearson coefficient, Information Gain, Gain Ratio,
Chi Square, OneR and Relief), the top ten ranked features have been considered.
The performance of the Supervised Rough Set Quick Reduct algorithm (SRSQR) is also
compared with the proposed hybrid method. Experiments were carried out to evaluate
the effectiveness of the proposed method with eight machine learning classifiers
(BayesNet, Naïve Bayes, SMO, Decision Table, Random Tree, Random Forest, J48 and
Multilayer Perceptron); Figures 1–4 depict the results. Figure 1 reveals that the
highest average TPR is produced by the proposed method on Dataset 1 (97.13%) and
on Dataset 2 (99.28%), which is greater than that of the six conventional feature
rankers on both datasets.
The proposed hybrid method generates the highest average F1 score on Dataset 1
(0.916) and on Dataset 2 (0.985), as presented in Fig. 2. This shows that the
proposed hybrid feature selection method outperforms the conventional feature
selectors. Figure 3 shows that the highest average ROC on Dataset 1 is 0.869,
obtained by the proposed hybrid method, and similarly the highest average ROC on
Dataset 2 is 0.987. The comparison of average accuracy on Dataset 1 and Dataset 2
is shown in Fig. 4. The proposed hybrid method obtains the highest average
accuracy on Dataset 1 (87.80%) and on Dataset 2 (97.88%). This demonstrates the
effectiveness of the proposed feature selection strategy over conventional feature
rankers.
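For reference, the averaged measures reported above can be computed from
per-classifier confusion counts as sketched below. This assumes the standard
definitions of TPR, F1 score and accuracy with malware as the positive class and
a plain arithmetic mean over classifiers; the counts shown are hypothetical, and
the exact averaging used to produce the reported figures is not stated in the
text.

```python
from statistics import mean
from typing import Dict, NamedTuple


class Confusion(NamedTuple):
    """Confusion-matrix counts, with malware as the positive class."""
    tp: int
    fp: int
    tn: int
    fn: int


def tpr(c: Confusion) -> float:
    """True positive rate: TP / (TP + FN)."""
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def f1_score(c: Confusion) -> float:
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def accuracy(c: Confusion) -> float:
    """Share of correctly classified samples."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def average_metrics(per_classifier: Dict[str, Confusion]) -> Dict[str, float]:
    """Arithmetic mean of each measure over all classifiers."""
    counts = list(per_classifier.values())
    return {
        "tpr": mean([tpr(c) for c in counts]),
        "f1": mean([f1_score(c) for c in counts]),
        "accuracy": mean([accuracy(c) for c in counts]),
    }


# Hypothetical counts for two of the eight classifiers on one reduct.
results = {
    "J48": Confusion(tp=80, fp=20, tn=80, fn=20),
    "Random Forest": Confusion(tp=90, fp=10, tn=90, fn=10),
}
averages = average_metrics(results)
```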
[Fig. 1: average TPR (%) by feature selector (Pearson, IG, Gain Ratio, OneR, Relief, Chi Sq, Quick Reduct, Proposed Method), Dataset 1 and Dataset 2]
[Fig. 2: average F1 score by feature selector, Dataset 1 and Dataset 2]
[Fig. 3: average ROC by feature selector, Dataset 1 and Dataset 2]
[Fig. 4: average accuracy (%) by feature selector, Dataset 1 and Dataset 2]
A Hybrid Community Based Rough Set Feature … 257
In this work, a hybrid technique for permission-based detection of Android malware
through community-based rough set feature selection has been proposed. It yields
better classification performance than existing feature rankers such as IG, Gain
Ratio, Pearson’s correlation coefficient, OneR, Chi Square and Relief for most of
the machine learning classifiers. The main contribution of this work is therefore
to show that a hybrid community-based rough set feature selection technique can be
used for filtering out malware on Android. From the experimental results, it can
be concluded that the proposed approach is comparable with well-known feature
selection algorithms. Even when a larger dataset is used, the feature selection
techniques developed in this work can be applied to make classification more
efficient. In future, rough set theory hybridized with metaheuristic algorithms
could be used to design more efficient feature selection methods for
permission-based Android malware detection.
References
1. Hassanien, A.E., Tolba, M., Azar, A.T.: Advanced machine learning technologies and
applications. Communications in Computer and Information Science, vol. 488. Springer
GmbH, Berlin/Heidelberg (2014). ISBN: 978-3-319-13460-4
2. Hu, Q.H., Yu, D.R., Xie, Z.X.: Information-preserving hybrid data reduction based on
fuzzy-rough techniques. Pattern Recogn. Lett. 27(5), 414–423 (2006)
3. Baumgarten, M.M.D., Mulvenna, M.D., Rooney, N., Reid, J.: Keyword-based sentiment
mining using twitter. Int. J. Ambient Comput. Intell. (IJACI) 5(2), 56–69 (2013)
4. Swiniarski, R.W., Skowron, A.: Rough set methods in feature selection and recognition.
Pattern Recogn. Lett. 24, 833–849 (2003)
5. Li, J., Wang, X., Xu, S.: Prediction of customer classification based on rough set theory.
Procedia Eng. 7, 366–370 (2010)
6. Chakhar, S., Saad, I.: Dominance-based rough set approach for groups in multicriteria
classification problems. Decis. Support Syst. 54, 372–380 (2012)
7. Kadzinski, M., Greco, S., Slowinski, R.: Robust ordinal regression for dominance-based
rough set approach to multiple criteria sorting. Inf. Sci. 83, 211–228 (2014)
8. Zhang, N., Yao, J.T.: A rough sets based approach to feature selection. In: Proceedings of the
International Conference of the North American Fuzzy Information Processing Society,
pp. 434–439 (2004)
9. Zhong, N., Dong, J.: Using rough sets with heuristics for feature selection. J. Intell. Inf. Syst.
16, 199–214 (2001)
10. Meher, S.K.: Explicit rough-fuzzy pattern classification model. Pattern Recogn. Lett. 36,
54–61 (2014)
11. Inbarani, H.H., Azar, T.A., Jothi, G.: Supervised hybrid feature selection based on PSO and
rough sets for medical diagnosis. Comput. Methods Programs Biomed. 113, 175–185 (2014)
12. Contagiodump mobile dump. http://contagiodump.blogspot.in/ (2016). Accessed
31 October 2016
13. Wei Wang’s Home Page. http://infosec.bjtu.edu.cn/wangwei/?page_id=85 (2015). Accessed
31 October 2015
14. Infomap Project. http://www.mapequation.org
Xin-She Yang, Atulya K. Nagar, Amit Joshi (Editors)
Smart Trends in Systems, Security and Sustainability
Proceedings of WS4 2017