
Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 170 (2020) 917–922
www.elsevier.com/locate/procedia

International Workshop on Data-Driven Security (DDSW 2020),
April 6-9, 2020, Warsaw, Poland
The Use of Machine Learning Techniques to Advance the Detection and Classification of Unknown Malware
Ihab Shhadat a, Bara’ Bataineh a, Amena Hayajneh a, Ziad A. Al-Sharif b,∗

a Department of Computer Science, Jordan University of Science and Technology, Irbid, 22110, Jordan
b Department of Software Engineering, Jordan University of Science and Technology, Irbid, 22110, Jordan

Abstract

Reliance on technology has grown significantly over the last decade. This motivates attackers to develop new malware that can perform malicious acts, which may cause destruction or gather intelligence and critical information. Thus, malware detection is a crucial factor in the security of systems, including smart and portable devices. Often, an automated malware detection system is one of the first steps that aim to recognize abnormal activities and identify malicious programs. This detection is needed to protect devices from hackers and prevent information from getting compromised. However, currently applied standard methods, such as signature-based and dynamic-based detection, do not provide reliable detection of unknown or unaddressed attacks, mainly for malware that can change its form, such as polymorphic viruses. As a result, the demand for a new detection technique emerges. The purpose of this work is to investigate the machine learning techniques that are used in the detection of unknown malware. This work presents a more enhanced feature set, using Random Forest to decrease the number of features. Several machine learning algorithms were applied on a benchmark dataset in our experiments. Our results achieved accuracy improvements over all binary and multi-class classifiers. The highest accuracy was achieved by Decision Tree with 98.2% for binary classification and by Random Forest with 95.8% for multi-class classification. The lowest accuracy was achieved by Bernoulli Naïve Bayes, with 91% and 81.8% for binary classification and multi-class classification, respectively.
© 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the Conference Program Chairs.
Keywords: Malware Detection; Malware Classification; Dynamic Analysis; Machine Learning; Computer Security.

1. INTRODUCTION
Malware is a universal concept for all types of software attacks. With the fast growth of technology, malware is one of the most significant security threats [1].

∗ Corresponding author. Tel.: +0-000-000-0000; fax: +0-000-000-0000.
E-mail address: zasharif@just.edu.jo
1877-0509 © 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the Conference Program Chairs.
10.1016/j.procs.2020.03.110

Any program designed to infiltrate or harm a computer system by infecting a legitimate user’s computer, such as stealing information or spying, is considered malware. Malicious software can be categorized into different classes depending on how it attempts to harm or behave, such as Trojans, Viruses, Rootkits, Worms, and Spyware [13]. While malware families keep developing, anti-virus detectors seem unable to satisfy critical requirements, leaving millions of computer programs threatened. According to Kaspersky Labs 1 , about 5,638,828 different hosts were attacked in 2018. A further report by Juniper Research 2 predicts that more than 33 billion records will be stolen by cybercriminals in 2023 alone. Nowadays, detection of malware is challenging due to the high accessibility of attacking methods on the Internet. Additionally, the high availability of anti-detection methods gives everyone the chance to originate an attack or malicious software without a certain level of experience. Moreover, attackers use methods to quickly upgrade their malware to newer versions in a short period of time in order to avoid detection.
Therefore, malware protection of computer systems is one of the most fundamental tasks for users and organizations, since even a single attack can result in serious damage to data and severe losses. Huge losses and frequent attacks impose the need for reliable and accurate detection techniques. Malware detection is divided into static analysis, which analyzes a compiled file or program, and dynamic analysis, which analyzes run-time behavior such as battery consumption, memory reads and writes, and network utilization of the device [2]. Static analysis is concerned with reading the source code of malware without executing the program file; it tries to find attributes of the file through file format inspection, string extraction, AV scanning, fingerprinting, and disassembly of the binary format. Dynamic analysis depends on real-time monitoring of the file while it is being executed, running in a virtual environment. Malware detection techniques can also be divided into signature-based and heuristics-based [6]. However, the accuracy of these methods is not always adequate for detection, resulting in many false positives and false negatives.
The need for a new detection method is becoming urgent. For this reason, machine learning-based techniques can be invaluable to the safety of a system. Machine learning models can be trained to classify executions as infected or legitimate. Malicious software examination requires powerful detection abilities for deciding whether a suspicious file is malicious or not. It can also be used to determine which family a malware sample is likely to belong to. Machine learning methods can first learn what constitutes normal behavior and then search for anything that deviates far from it. Therefore, they can protect users by keeping systems safe and stopping attacks much faster than in the past. Numerous machine learning methods have been observed to be helpful in malware detection and classification; some of these techniques are Random Forest (RF), Support Vector Machine (SVM), Naïve Bayes (NB), Logistic Regression (LR), and AdaBoost [3].
In our study, we follow the methodology proposed in the reference paper [4]. In that paper, the authors’ experimentations are based on 306 features extracted from the actual behavior of the files (heuristics) observed by a sandbox. They applied machine learning algorithms on 984 malicious files and 172 benign files to perform binary and multi-class classification. The highest accuracy was achieved with the Random Forest model: 95.69% for multi-class classification and 96.8% for binary classification. In our work, we use the Random Forest classifier to select the most important features and ignore irrelevant ones. Thus, we obtained better accuracy than the referenced paper [4]. Our experiments show that the highest accuracy is achieved by Decision Tree with 98.2% for binary classification, while Random Forest achieves 95.8% for multi-class classification. The rest of this paper is organized as follows. Section 2 presents the related work. Section 3 presents our methodology and dataset, followed by the results in Section 4. Finally, Section 5 presents our conclusion and highlights our future work.

2. Related Works

The study of automated malware detection is not new; however, the accuracy of the results has been the goal of most researchers. In particular, Rieck et al. [10] collected 10,072 unique samples and divided them into 14 malware families. They trained a Support Vector Machine (SVM) and managed to classify 88% of the provided testing binaries to their correct malware family. Singh et al. [14] studied malware detection using SVM. They investigated the

1 Kaspersky Security Bulletin, https://go.kaspersky.com/rs/802-IJN-240/images/KSB_statistics_2018_eng_final.pdf


2 Cybersecurity breaches to triple by 2023, https://www.governmenteuropa.eu/cybersecurity-breaches-triple-2023/

performance of malware scoring techniques, namely Hidden Markov Models (HMM), Simple Substitution Distance (SSD), and Opcode Graph-based (OGS) detection, while applying morphing strategies to the binaries. As morphing increased, these scoring techniques failed to predict malware families. Firdausi et al. [5] studied the dynamic analysis of both malware and benign files. They collected 220 unique malware samples and 250 unique benign software samples. Five classifiers were trained on their dataset, namely k-Nearest Neighbor, Naïve Bayes, Support Vector Machine (SVM), J48 decision tree, and Multilayer Perceptron (MLP) neural network, using Weka. The best performance was achieved by J48 with an accuracy of 96.8%, while Naïve Bayes achieved the poorest performance. Santos et al. [12] gathered 1,000 malicious files and 1,000 benign files. They performed feature extraction to extract static and dynamic features, then feature selection to reduce the number of features, and ended with 1,000 features. They trained Decision Trees, KNNs, and SVMs.
Similarly, Rehman et al. [9] proposed a new hybrid malware detection model for Android applications that uses static and dynamic analysis for malware detection. Additionally, Sahs and Khan [11] investigated malware detection on Android devices. Their dataset contains 2081 benign and 91 malicious Android applications. They performed dynamic analysis with a one-class support vector machine because of the predominance of benign files in the dataset. This approach is suitable for detecting zero-day and unknown malware. Furthermore, Wang et al. [15] combined machine learning with static analysis and dynamic analysis. They proposed a system with a misuse detector and an anomaly detector. Rana and Sung [8] investigated supervised machine learning algorithms to detect malware on Android based on access permissions. Their best results were achieved by K-Nearest Neighbors (KNN) with an average performance of 96%. SVM obtained similar results with 93%, and a neural network algorithm obtained average results of 89%. The lowest performance was achieved by Discriminant Analysis with 86%, followed by Naïve Bayes with 84%. Jung et al. [7] used the Random Forest (RF) classifier with the Android APIs called by each app as features for Android malware detection. They used RF with 5-fold cross-validation to classify each app in the dataset. For benign files, the false positive rate was 0.016% and the false negative rate was 0%, which means only 5 apps were inaccurately classified as malicious. For malicious files, the false positive rate was 0% and the false negative rate was 3.16%, as 1,305 were classified inaccurately. The classifier achieved a high accuracy of up to 99%. Thus, this work has motivated us to apply the RF classifier to improve the accuracy of the work presented in [4].

3. Methodology

Our work aims to investigate and extend the work achieved in [4] by using the RF classifier for feature selection and by using cross-validation for data splitting instead of the traditional train/test split, leading to sensible performance improvements. We used the data proposed in [4]. They collected a total of 1156 files, consisting of 984 malicious files and 172 benign files. The files in the dataset come in several formats such as .exe, .pdf, and .docx. Moreover, they considered the following malware in their study: Dridex, Locky, TeslaCrypt, Vawtrak, Zeus, DarkComet, CyberGate, CTB-Locker, and Xtreme. We estimate the performance of the classifiers using different metrics: Accuracy, Precision, Recall, and F1-score. These are calculated from the confusion matrix, which represents the numbers of actual and predicted cases obtained from the classifier.
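
To make the link between the confusion matrix and the reported metrics concrete, the following is a minimal sketch using scikit-learn; the label vectors y_true and y_pred are illustrative placeholders, not data from the study.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 1 = malicious, 0 = benign (placeholders only).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1])

# Confusion matrix: rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("Accuracy :", accuracy_score(y_true, y_pred))    # (tp + tn) / total
print("Precision:", precision_score(y_true, y_pred))   # tp / (tp + fp)
print("Recall   :", recall_score(y_true, y_pred))      # tp / (tp + fn)
print("F1-score :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```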

3.1. Feature Selection

In [4], they considered the heuristic strategy for feature extraction. A heuristic strategy is a dynamic analysis that runs the malicious file by executing it in a virtual environment and gathering information about its characteristics. However, in our work, we investigate the effect of including only the relevant features that can achieve higher prediction accuracy and rejecting those features that are unnecessary and can reduce the model’s efficiency and accuracy. RF is used as a way of rating features and is often used to decide which features are the most important. It relies on tree-based strategies that rank nodes by Gini impurity. The highest impurity appears at the root, while the lowest impurity appears at the leaves of the trees. Moreover, once a tree is trained, it can calculate how much each feature reduces the measured impurity in the tree. Therefore, by cutting trees below a specific node, we can generate a subset of the most relevant features.
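
A minimal sketch of this impurity-based ranking, assuming scikit-learn and a feature matrix X with labels y already loaded; the function name, estimator settings, and importance cut-off shown here are illustrative, not the paper's actual code.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def select_important_features(X: pd.DataFrame, y, threshold: float = 0.015) -> pd.DataFrame:
    """Keep only the features whose Random Forest importance exceeds `threshold`."""
    # Impurity-based importances: how much each feature reduces Gini impurity,
    # averaged over all trees in the forest.
    rf = RandomForestClassifier(n_estimators=200, random_state=42)
    rf.fit(X, y)

    importances = pd.Series(rf.feature_importances_, index=X.columns)
    kept = importances[importances > threshold].index
    return X[kept]
```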
Calculating the correlation between the features in the dataset is a crucial step in the feature selection phase. A positive correlation conveys the existence of a clear relationship. Hence, features with strong correlation are

reliant on each other and have a large influence when developing a machine learning model. The inverse relationship is called negative correlation. Correlation is quantified by the correlation coefficient, denoted by r, which ranges from −1.0 to +1.0. An average positive correlation coefficient is around 0.5 to 0.7, and a strong positive correlation is around 0.9 to 1. The closer the value of the correlation coefficient is to −1, the stronger the negative correlation it represents, while 0 indicates no correlation. We plotted the correlation among the features proposed in [4] and observed a clear connection among features that makes the algorithm biased and therefore provides inaccurate results. Figure 1a shows the correlation between the features, which indicates negative correlations between some features. Figure 1b shows the correlation matrix after eliminating features whose correlation is lower than 0.015; any feature exceeding that threshold is kept. As a result, we ended up with 256 features.

(a) Correlation between features proposed in [4]. (b) Correlation between features after applying Random Forest (RF).

Fig. 1: The correlation matrix before and after eliminating the features that have lower correlation than 0.015 threshold.
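
One plausible reconstruction of the correlation analysis and the 0.015 cut-off described above is sketched below; since the paper does not publish its code, the interpretation of the threshold as the absolute correlation with a numerically encoded class label, as well as the function and variable names, is an assumption.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_and_filter(features: pd.DataFrame, labels: pd.Series, threshold: float = 0.015):
    """Plot the feature correlation matrix and drop weakly correlated features."""
    # Correlation matrix among the features (as visualized in Fig. 1).
    corr = features.corr()
    plt.figure(figsize=(8, 6))
    plt.imshow(corr, cmap="coolwarm", vmin=-1.0, vmax=1.0)
    plt.colorbar(label="correlation coefficient r")
    plt.title("Feature correlation matrix")
    plt.show()

    # `labels` is assumed to be numeric (e.g. 0 = benign, 1 = malicious).
    # Drop features whose absolute correlation with the label is below the threshold.
    corr_with_label = features.corrwith(labels).abs()
    kept = corr_with_label[corr_with_label >= threshold].index
    return features[kept]
```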

3.2. Cross Validation

Cross-validation is a better way to evaluate the ability of a machine learning model to classify data. K-fold cross-validation divides the initial set into k subsets of similar size and then performs different training and validation repetitions such that in each repetition a different fold of the data is held out for validation, whereas the remaining subsets are used for training. Hence, we average the estimation results of the iterations into a single estimate that gives us a more precise representation of the model’s performance. In the referenced paper [4], we noticed a lack of balance between malicious and benign files. Also, they divided the dataset into 69% for training and 31% for testing to estimate the model performance. This splitting technique may lower the model performance since it does not ensure that the model trains on all samples of the data; it trains on only a random subset. Machine learning algorithms may not work very well with imbalanced data. The first limitation in our work is that the result of the classifier may be biased, as it tends to output the class with the largest probability of occurrence (majority) and to misclassify the less common class (minority). The other limitation is that accuracy may not be a good reflection of the true performance, since it measures correct predictions over total predictions, and a classifier could label all data in a test set with the majority class in the case of an imbalanced dataset. We used 15-fold cross-validation in our work to mitigate the imbalanced dataset problem, ensuring that all data points appear in the training phase instead of only a random subset, as in the traditional splitting.
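
A minimal sketch of the 15-fold evaluation, assuming scikit-learn; StratifiedKFold is used here so that every fold keeps roughly the original class ratio, which is one common way to cope with imbalance, although the original setup does not state which fold strategy was used.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate_15_fold(X, y):
    """Return mean and standard deviation of accuracy over 15 stratified folds."""
    clf = RandomForestClassifier(n_estimators=200, random_state=42)

    # 15 folds; stratification keeps the malicious/benign ratio similar in every fold.
    cv = StratifiedKFold(n_splits=15, shuffle=True, random_state=42)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

    # Average the per-fold estimates into a single accuracy figure.
    return scores.mean(), scores.std()
```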

4. Experiments and Results

As demonstrated in the previous sections, we specified the important features through Random Forest and evaluated the models using k-fold cross-validation to give higher performance. We applied various models for malware detection and classification using the implementations provided by sklearn, namely: K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Bernoulli Naïve Bayes (Bernoulli NB), J48 Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), and Hard Voting (HV) over a particular set of classification algorithms: LR, SVM, Bernoulli NB, and DT.
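
These models could be assembled roughly as follows with scikit-learn; the J48 decision tree is approximated by sklearn's DecisionTreeClassifier, and all hyperparameters are illustrative defaults rather than the settings used in the experiments.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Individual classifiers (hyperparameters are illustrative, not the paper's).
models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Bernoulli NB": BernoulliNB(),
    "DT": DecisionTreeClassifier(),           # stands in for J48
    "RF": RandomForestClassifier(n_estimators=200),
    "LR": LogisticRegression(max_iter=1000),
}

# Hard voting over LR, SVM, Bernoulli NB, and DT: each base model casts one
# vote per sample and the majority class wins.
models["HV"] = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC()),
        ("bnb", BernoulliNB()),
        ("dt", DecisionTreeClassifier()),
    ],
    voting="hard",
)
```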

4.1. Binary Classification

The highest accuracy was achieved by the DT, RF, and HV models with 98%, 97.8%, and 97%, respectively, followed by KNN, SVM, and LR with 96.1%, 96.1%, and 95%, respectively. The lowest accuracy was achieved by Bernoulli NB with 91%, as shown in Table 1a.
Table 1: Binary classification vs. multi-class classification.

(a) Results from binary classification.

Classifier      Accuracy   Recall   Precision   F1
KNN             96.1%      85.1%    89.2%       86.6%
SVM             96.1%      86.2%    88.5%       87.1%
Bernoulli NB    91%        89.6%    66.8%       76.2%
RF              97.8%      87.3%    96.4%       91.2%
HV              97%        92%      91%         92%
LR              95%        85%      87%         86%
DT              97.8%      87.3%    96.4%       91.2%

(b) Results from multi-class classification.

Classifier      Accuracy
KNN             88%
SVM             88.6%
Bernoulli NB    81.8%
RF              95.8%
HV              92%
LR              90%
DT              92%

4.2. Multi-classification

The highest accuracy was achieved by the RF, DT, and HV models with 95.8%, 92%, and 92%, respectively, followed by LR, SVM, and KNN with 90%, 88.6%, and 88%, respectively. The lowest accuracy was achieved by Bernoulli NB with 81.8%, as demonstrated in Table 1b.

(a) Comparison of the performance of binary classification between the reference paper [4] and our work. (b) Comparison of the performance of multi-class classification between the reference paper [4] and our work.

Fig. 2: The accuracy achieved by our experiments against the referenced paper for binary and multi-class classification.

Although we obtained higher accuracy over all classifiers in binary and multi-class classification, due to the enhancements applied by restricting the number of features and using cross-validation, the recall and precision were slightly lower in our experiments. This may be due to the dataset being highly biased towards malicious files. Figure 2a and Figure 2b show the accuracy achieved by our work compared to the reference paper for binary and multi-class classification.

(a) Precision Measurements (b) Recall Measurements

Fig. 3: The difference in precision and recall between the referenced paper and our experiments.

In the reference paper, the highest accuracy was achieved by RF with 96.8% and 95.69% for binary and multi-class classification, respectively. The lowest accuracy was obtained by Bernoulli NB with 55% and 72.34%. However, in our work, the highest accuracy was achieved by DT with 98.2% for binary classification and by RF with 95.8% for multi-class classification. The lowest accuracy was achieved by Bernoulli NB with 91% and 81.8% for binary and multi-class classification, respectively. The accuracy of Bernoulli NB in our experiments is thus much higher compared to their experiments. This is because the feature selection phase eliminated the correlated features that cause the performance of this classifier to degrade; correlated features may appear as replicated in the model, thus wrongly over-inflating their importance. Figure 3a and Figure 3b show the difference in precision and recall between the reference paper and our work.

5. CONCLUSION

To conclude, this research studied the performance of machine learning algorithms for malware detection and classification. We applied various machine learning algorithms on a pre-existing benchmark dataset. Our experiments show higher accuracy across all binary and multi-class classifiers; the achieved accuracy using Decision Trees for binary classification is 98.2% and using Random Forests for multi-class classification is 95.8%. However, the precision and recall are slightly lower; this may be due to the data skewness and the low frequency of benign files in the original dataset. The Naïve Bayes classifier accuracy improved significantly, from 55% to 91% for binary classification and from 72.34% to 81.8% for multi-class classification; these improvements were due to the alteration of the feature set. In future work, we will generalize our work using a bigger and more balanced dataset.

References

[1] Aijaz, U.N., Patra, A., Siddiq, A.S., Chatterjee, B., Ghiyas Khan, M., 2018. Malware Detection on Server using Distributed Machine Learning. Proceedings of Knowledge Discovery in Information Technology and Communication Engineering (KITE-2018) 2, 172–175. URL: http://www.pices-journal.com/downloads/V2I7-PICES0046.pdf.
[2] Amos, B., Turner, H., White, J., 2013. Applying machine learning classifiers to dynamic android malware detection at scale. 2013 9th International Wireless Communications and Mobile Computing Conference, IWCMC 2013, 1666–1671. doi:10.1109/IWCMC.2013.6583806.
[3] Baychev, Y., Bilge, L., 2018. Spearphishing malware: Do we really know the unknown?, in: International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, Springer. pp. 46–66.
[4] Chumachenko, K., et al., 2017. Machine learning methods for malware detection and classification.
[5] Firdausi, I., Lim, C., Erwin, A., Nugroho, A.S., 2010. Analysis of machine learning techniques used in behavior-based malware detection. Proceedings - 2010 2nd International Conference on Advances in Computing, Control and Telecommunication Technologies, ACT 2010, 201–203. doi:10.1109/ACT.2010.33.
[6] Green, J.P., Chandnani, A.D., Christensen, S.D., 2019. Detecting script-based malware using emulation and heuristics. US Patent 10,387,647.
[7] Jung, J., Kim, H., Shin, D., Lee, M., Lee, H., Cho, S., Suh, K., 2018. Android malware detection based on useful api calls and machine learning, in: 2018 IEEE First International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), pp. 175–178. doi:10.1109/AIKE.2018.00041.
[8] Rana, M.S., Gudla, C., Sung, A.H., 2018. Evaluating machine learning models for android malware detection: A comparison study, in: Proceedings of the 2018 VII International Conference on Network, Communication and Computing, ACM, New York, NY, USA. pp. 17–21. URL: http://doi.acm.org/10.1145/3301326.3301390, doi:10.1145/3301326.3301390.
[9] Rehman, Z.U., Khan, S.N., Muhammad, K., Lee, J.W., Lv, Z., Baik, S.W., Shah, P.A., Awan, K., Mehmood, I., 2018. Machine learning-assisted signature and heuristic-based detection of malwares in Android devices. Computers and Electrical Engineering 69, 828–841. doi:10.1016/j.compeleceng.2017.11.028.
[10] Rieck, K., Holz, T., Willems, C., Düssel, P., Laskov, P., 2008. Learning and classification of malware behavior, in: International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, Springer. pp. 108–125.
[11] Sahs, J., Khan, L., 2012. A machine learning approach to android malware detection, in: 2012 European Intelligence and Security Informatics Conference, pp. 141–147. doi:10.1109/EISIC.2012.34.
[12] Santos, I., Devesa, J., Brezo, F., Nieves, J., Bringas, P.G., 2013. OPEM: A static-dynamic approach for machine-learning-based malware detection. Advances in Intelligent Systems and Computing 189 AISC, 271–280. doi:10.1007/978-3-642-33018-6_28.
[13] Singh, A., Handa, A., Kumar, N., Shukla, S.K., 2019. Malware classification using image representation, in: International Symposium on Cyber Security Cryptography and Machine Learning, Springer. pp. 75–92.
[14] Singh, T., Di Troia, F., Corrado, V.A., Austin, T.H., Stamp, M., 2016. Support vector machines and malware detection. Journal of Computer Virology and Hacking Techniques 12, 203–212. doi:10.1007/s11416-015-0252-0.
[15] Wang, X., Yang, Y., Zeng, Y., Tang, C., Shi, J., Xu, K., 2015. A novel hybrid mobile malware detection system integrating anomaly detection with misuse detection, pp. 15–22. doi:10.1145/2802130.2802132.
