
IET Information Security

Research Article

Malware classification using compact image features and multiclass support vector machines

ISSN 1751-8709
Received on 14th April 2019
Revised 9th December 2019
Accepted on 17th January 2020
E-First on 15th May 2020
doi: 10.1049/iet-ifs.2019.0189
www.ietdl.org

Lahouari Ghouti1, Muhammad Imam2


1Department of Computer Science, Prince Sultan University, Riyadh, Saudi Arabia
2Department of Computer Engineering, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia
E-mail: lghouti@psu.edu.sa

Abstract: Malware and malicious code not only incur considerable costs and losses but also damage the reputation of
the targeted organisations. Malware developers, hackers, and information security specialists are continuously improving their
strategies to defeat each other. Unfortunately, there is no one-size-fits-all solution to detect and eradicate any malware. This
situation is aggravated more by the undetected vulnerabilities that usually impair computer software and internet tools. Such
vulnerabilities will remain undetected until fully exploited by malware developers, which will eventually cause considerable
financial and reputation losses. In this paper, we propose a novel scheme to detect and classify malware using only image
representations of the malware binaries. Highly discriminative features of the malware category and structure are extracted in a
compact subspace using principal component analysis. Then, an optimised support vector machine model classifies the
extracted features into malware categories. Unlike existing classification models, our solution requires simple algebraic dot
products to classify malware based on representative digital images. To assess its performance, publicly-available image
datasets, Malimg, Ember and BIG 2015, are considered. Our performance analysis indicates that our classifier outperforms
state-of-the-art models and attains classification accuracies of 0.998, 0.911, and 0.997 using Malimg, Ember and BIG 2015
malware datasets, respectively.

1 Introduction

In 2017, American companies spent US$ 3.82 million to resolve malware attacks [1]. In Western Europe and Japan, costs ranged from US$ 2.56 to 1.28 million. In addition, resolving each malware takes 6 days on average [1]. At the same time, cyber-attacks are gaining momentum with the recent developments in information and communication technologies. The availability of software development tools across the deep web has greatly contributed to the proliferation of malware around the world. Several efforts have been exerted to mitigate the devastating effects of malicious software and malware. Unlike before, malware attacks are currently initiated by individuals, organisations, and countries as an integral part of what is commonly known as cyberwarfare. For instance, the computer network of a Taiwan-based company, TSMC, was infected by malware in August 2018. This malware attack caused 3 days of delay in the production of iPhone and iPad chips [2]. To minimise the devastating costs of malware, companies are dedicating considerable budgets to secure their computing infrastructures and resources from various types of attacks including malware. Also, information security companies are sparing no efforts to detect and identify any emerging malware and malicious code. However, these companies are continuously challenged as malware developers keep exploiting unknown vulnerabilities in operating systems and internet software. Given the evolving nature of deployed software and, therefore, malware, there will be no one-size-fits-all approach to analyse malware and detect malicious code. New malware strategies are developed and deployed to compromise existing operating systems and internet resources all the time. These strategies are purposefully designed to defeat any existing protection mechanism. Over several years, as the struggle between malware developers and information security specialists rages, novel protection mechanisms are promptly devised in response to the emergence of unknown malware code. Anti-malware tools and software are the Swiss Army knife of information security specialists to neutralise malware and limit its potential damage. Anti-malware techniques deal with the detection and/or removal of malware code. These techniques encompass a variety of approaches that have been widely accepted by the information security community [3].
This study proposes a new machine learning (ML) model to detect and classify malware categories using greyscale image representations of the malware binaries. Unlike most existing models, our model does not rely on direct cybersecurity analysis of malware code. To alleviate the need for static and dynamic malware analysis, our model represents each malware category using rich and compact image features. Feature-based image classification has gained wide popularity thanks to the recent advances in deep learning and computer vision frameworks. To accurately classify malware into distinct categories, our classifier is designed such that the compact features extracted from two different malware categories are highly uncorrelated. On the other hand, image features extracted from different instances of the same malware class should be deemed identical by our classifier. Finally, the combination of the discriminative power of the retained image features with the optimality of the proposed support vector machine (SVM) model leads to superior classification performance as demonstrated later in this study.

1.1 Paper outline

The remainder of the paper is organised as follows. Section 2 gives an overview of existing solutions for the detection and classification of malware and malicious code. Then, the proposed malware classification model is introduced in Section 3 where the compact feature extraction and classification modules are discussed in detail. The performance analysis of the proposed classifier is presented in Section 4. The performance of our classifier is compared to state-of-the-art deep and ML models in the literature where the classification performance is contrasted to highlight the superiority of the proposed malware classifier model. To optimise our malware classifier, the effects of several design parameters on the classification performance are thoroughly investigated at the end of this section. Section 5 concludes the paper where future work directions are given and conclusions drawn.

IET Inf. Secur., 2020, Vol. 14 Iss. 4, pp. 419-429 419


© The Institution of Engineering and Technology 2020
2 Review of malware classification techniques

2.1 Cybersecurity approaches

Malicious software (malware) has been actively investigated by the cybersecurity community. Idika and Mathur surveyed 45 malware detection techniques and compared their performance [3]. They gave an overview of current visualisation techniques and malware detection methods. To study malware, static and dynamic malware analysis approaches are commonly used [4]. While basic static malware analysis examines the malware executable code irrespective of the code behaviour, reverse engineering is used by advanced static analysis methods. Source code, obtained by reverse-engineering the executable code, is thoroughly analysed to reveal the concealed malware behaviour. On the other hand, dynamic analysis approaches monitor the malware behaviour through code execution in virtual machine (VM) environments to prevent the malware from spreading and regenerating. During malware execution, the main VM memory, the software processes spawned, the system changes and the Internet queries are constantly monitored [4]. Furthermore, in [5], the authors studied visualisation methods used to detect potential malware behaviour. Then, they provided a technique using a similarity matrix to devise an accurate malware classification. They visualised extended x86 IA-32 similarity patterns [6], where static and dynamic analysis are used alongside the visualisation of similarity matrices to detect zero-day malware. On the other hand, in [7], ML is used to analyse the occurrence of opcodes. The authors used the top 20 features and found that 5 of them can facilitate malware detection with accuracy close to 100%. As the number of malware samples grows, it is becoming difficult to perform real-time malware detection. Ding et al. [8] introduced a two-level parallelism approach to ensure real-time malware detection. Their approach exploited parallelism at the different levels of detail present in malware data to achieve high efficiency with large malware data sets.

2.2 Machine learning approaches

Artificial intelligence and its active field of ML have made considerable progress over the last decade. Many successful ML-based solutions for malware classification have been reported in the literature. In [9], Schultz et al. generated binary profiles from malware code. Then, features, extracted from these profiles, are fed to an ML-based classifier for malware classification. The work in [9] is extended by Kolter and Maloof using a data mining approach where malware code is detected given n-gram features [10]. These n-gram features are classified using various classifiers including naive Bayes, SVM, and boosted decision trees (DTs). Boosted DTs achieved the highest classification performance in terms of the area-under-the-curve (AUC) measure, with an AUC score of 0.996.
The first automated malware classifier is attributed to Kong and Yan [11]. Kong and Yan proposed a malware classifier based on features extracted from the function call graph in each malware code. Similar malware samples are identified using similarity metrics. The classification of Trojan malware is investigated by Tian et al. [12]. In [12], the malware byte size defines the function length frequency. Tian et al. found that the function frequency and range have a direct impact on the malware classification performance.
Unlike the previous solutions, Santos et al. proposed semi-supervised learning for the classification of malware code [13]. Using the local and global consistency algorithm, a semi-supervised learning strategy, Santos et al. successfully detected and identified zero-day exploits. As it is practically difficult to label large instances of malware binaries, semi-supervised learning mixes and classifies unlabelled samples using the model trained based on the few available labelled ones while attaining acceptable accuracy levels. Siddiqui et al. inspected traffic packets to detect worms using DTs and random forests [14]. Thanks to the dynamic analysis approach, malware behaviour is monitored by securely executing code fragments inside isolated VMs as suggested by Zolkipli and Jantan [15]. Then, features, resulting from malware analysis, are fed into an artificial neural network model to classify the executed code fragments into one of the sets of malware categories. Similar to the work in [15], Rieck et al. applied dynamic analysis on collected malware samples and clustered samples with similar behaviour into the same cluster [16]. Then, zero-day vulnerabilities were correctly identified from the resulting clusters. In [17], Anderson et al. constructed graphs using a collection of instruction traces. In their solution, Anderson et al. expressed the likelihood of a Markov chain graph using 2-gram features. Malware samples are compared using two different similarity metrics. However, this technique suffers from high computational complexity, which prevents its real-time deployment. Bayer et al. proposed a classification technique using the locality-sensitive hashing algorithm where malware samples in large datasets are automatically labelled into various classes [18]. The connections between the malware activities and the system changes are investigated by Barhoom and Qeshta [19]. Barhoom and Qeshta defined an identity for malicious behaviour using the software processes spawned and the network connections initiated by the analysed code fragment. Kalash et al. [20] proposed a deep learning malware classifier. A convolutional neural network (CNN), called malware CNN (MCNN), is trained using two different malware datasets: Malimg [21] and BIG 2015 [22]. In addition to the MCNN classifier, a baseline model is also provided using GIST features [23] and multi-class SVM. Park et al. suggested a new malware classification technique based on the maximal common subgraph [24]. Park et al. captured malware system calls thanks to a software sandbox model. Then, behaviour graphs are generated from the captured system calls to enable malware classification. The proposed method successfully classified new malware with low false-positive rates. In [25], a new one-class SVM model is attributed to Burnaev and Smolyakov. Their model determines normal code behaviour by taking into account privileged information during the training process. Then, anomalies are detected by quantifying the distance to the normal class. The new approach showed a slight improvement over the traditional one-class SVM approach. To evade detection, malware designers introduce polymorphism to malware code. To overcome this challenge, Narayanan et al. visualised malware as images to preserve malware's general features [26]. Malware belonging to different families showed different patterns that are distinguishable across families. The method of Narayanan et al. showed a slight improvement in detection accuracy while drastically reducing the computational time. In [27], Sahay and Sharma proposed an algorithm to detect malware based on file sizes. They used optimal k-means clustering to detect malware polymorphism and new malware. In [28], Zhao et al. developed a new malware intermediate representation to improve malware detection. This representation consists of an absorbing functional sequence of malware features. Then, long short-term memory units learn regularities in malware data. Zhao et al. evaluated the performance of their system based on almost 300,000 malware samples. Unlike other classification models, Zhao et al. successfully classified unknown malware. Thanks to a deep learning framework, Lee et al. detected several malware classes [29]. The proposed deep learning model provided a variation in the analysis of obfuscated malware. In [30], Sewak et al. compared deep and ML malware detection algorithms. Two malware classifiers based on a deep neural network (DNN) and random forests are assessed. Surprisingly, Sewak et al. reported higher performance scores for the random forest model for various input features. On the other hand, Cakir and Dogdu claimed that some deep learning models can have better performance in analysing long sequences of system calls [31]. To represent malware, Cakir and Dogdu extracted features thanks to a shallow DNN model. Finally, Dey et al. introduced a byte-level malware classification technique [32]. The model of Dey et al. represents an improvement over current image-based malware detection algorithms. This improvement consists of capturing hidden malware patterns. Reported performance results indicated higher accuracy and lower false positive (FP) rates compared to current methods.

2.3 From executable code to malware images

To highlight the usefulness of image features for the classification of malware code, Nataraj et al. developed a very rich malware dataset, where each malware code sample is converted into a digital image [21]. The main purpose of this publicly available dataset is to allow cybersecurity and ML researchers to evaluate and assess their malware detection and classification algorithms. Fig. 1 illustrates the malware image creation process.
The greyscale images resulting from the process proposed by Nataraj et al. showed specific texture patterns that were highly informative about the underlying malware class. Then, image features were extracted based on the GIST descriptor [23]. These features are grouped into clusters corresponding to different malware classes using the K-means algorithm. Another malware dataset is compiled by Microsoft Corporation and freely provided to ML experts through the Microsoft Malware Classification Challenge [22]. However, this dataset is not represented in the form of images. Therefore, it is not considered in our study.

Fig. 1 Malware image formation process [21]

3 Proposed malware classifier

To classify malware images into a specific malware category, an ML model is proposed in this study. As malware images could be of considerable sizes, image features will be extracted from an image subspace using the principal component analysis (PCA) technique. Then, these extracted features are fed into an SVM-based multi-classification model. The overall architecture of the proposed malware classifier is depicted in Fig. 2.

Fig. 2 Proposed ML-based malware classifier

3.1 Model properties

Following the approach of Nataraj et al. [21], greyscale malware images are extracted from unpacked malware binaries. Then, compact features are extracted from these highly redundant image representations. Unlike convolutional layers and autoencoders, the proposed model extracts compact features from meaningful subspace representations in the eigenspace domain using the PCA feature extraction approach. As corroborated by the simulation results, malware images represent challenging raw image data to convolutional layers, which usually rely on the presence of meaningful image textures and activities. Finally, a multiclass SVM module is efficiently trained using these highly compact PCA features. Other possible classification models such as softmax can also be envisaged [33].

3.2 Feature extraction using PCA

As the average size of the malware images is 510 × 410, the performance of any classification model will suffer from the curse-of-dimensionality. Therefore, we first need to reduce the size of the extracted feature vectors to boost the performance of the proposed malware classifier. Selecting the appropriate size of the feature vector is a challenging ML problem that has received considerable attention in the literature [33]. Fig. 3 illustrates the effect of feature vector size on the classification performance. The curse-of-dimensionality is clearly visible in Fig. 3.

Fig. 3 Classification performance sensitivity to feature vector size

Therefore, an informed selection of the optimal size for the feature vectors will greatly impact the overall classification performance. Lower-rank, p-dimensional vectors, $\{z^{(i)}\}_{i=1}^{m}$, are obtained by projecting the m n-dimensional feature vectors, $\{x^{(i)}\}_{i=1}^{m}$, onto the top p eigenvectors of the covariance matrix Σ [33]. An illustration of this concept is provided in Fig. 4. The sample two-dimensional data, shown in Fig. 4a, has a well-structured representation using the two eigenvectors as depicted in Fig. 4b. Fig. 4d clearly indicates that the data projected onto the second eigenvector can be safely discarded without affecting the overall data representation. Retaining only the projections onto the first eigenvector will ensure that most of the noise inherent in the data is removed [33].

Fig. 4 Illustration of 2D data projection onto principal components
(a) Sample data with main eigenvectors, (b) Rotated data with respect to main eigenvectors, (c) Projected data onto the first eigenvector, (d) Projected data onto the second eigenvector

A sample estimate of Σ is given by

$$\Sigma = \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)} - \mu\right)^{T} \left(x^{(i)} - \mu\right) \qquad (1)$$

where μ is the sample mean of the m vectors $x^{(i)}$, represented as a row vector. The left and right eigenvectors, U and V, of Σ can be efficiently computed using the singular value decomposition (SVD) algorithm as follows [33]:
$$\Sigma = U \cdot S \cdot V \qquad (2)$$

where $U \in \Re^{n \times n}$, $S \in \Re^{n \times m}$, and $V \in \Re^{m \times m}$. S is a diagonal matrix whose entries are the singular values associated with the resulting singular vectors. The corresponding eigenvalues are obtained by squaring these singular values.
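The projection described in this section can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: it assumes the samples are stored as rows of a data matrix, and it applies the SVD to the mean-normalised data matrix rather than to Σ itself, so that the squared singular values divided by m equal the eigenvalues of the sample covariance in (1). The function name `pca_features` and the random test data are hypothetical.

```python
import numpy as np

def pca_features(X, p):
    """Project row-wise samples X (m x n) onto the top-p principal directions."""
    mu = X.mean(axis=0)            # sample mean, a row vector of length n
    Xc = X - mu                    # mean-normalised data
    # Thin SVD of the centred data: the rows of Vt are the eigenvectors of
    # the sample covariance matrix, ordered by decreasing singular value.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = s ** 2 / X.shape[0]  # eigenvalues of (1/m) * Xc^T Xc
    W = Vt[:p].T                   # n x p projection matrix
    Z = Xc @ W                     # m x p lower-rank feature vectors
    return Z, W, mu, eigvals

# Hypothetical data: 50 samples with 8 correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8)) @ rng.normal(size=(8, 8))
Z, W, mu, eigvals = pca_features(X, p=3)
```

A new sample x would be mapped to its compact feature vector with `(x - mu) @ W`, which is a single matrix-vector product.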

3.3 Feature reduction using image energies


The size of the low-rank feature vectors can be decided using the singular values or eigenvalues obtained by (2). First, these values are sorted in decreasing order. Then, a user-defined threshold is selected to retain a specific percentage of the total energy explained by all left eigenvectors in U as follows:

i. Let the total energy of the eigenvalues be

$$\text{tot\_energy} = \sum_{i=1}^{n} s_i^{2} \qquad (3)$$

where $s_i$ is the singular value associated with the ith eigenvector $u_i$.
ii. Pick the first p eigenvectors such that

$$\text{p\_energy} = \sum_{i=1}^{p} s_i^{2} \le \tau \times \text{tot\_energy} \qquad (4)$$

where τ is a user-defined threshold to control the dimension reduction strength.

The value of p is usually much smaller than n as depicted in Fig. 5. We notice a clear drop in the energy associated with the eigenvalues for low and moderate values of p.

Fig. 5 Cumulative energy of eigenvalues

Fig. 6 ‘Optimal’ separating hyperplanes in SVM models
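The energy-based selection of p in (3) and (4) can be illustrated with the following minimal sketch. It assumes the singular values are already available, reads (4) literally as "the largest p whose cumulative energy does not exceed τ × tot_energy", and falls back to p = 1 when even the first eigenvector exceeds the budget. The helper name `select_rank` and that fallback are assumptions made for illustration only.

```python
import numpy as np

def select_rank(singular_values, tau):
    """Return the largest p with sum_{i<=p} s_i^2 <= tau * sum_i s_i^2."""
    # Sort in decreasing order and square to obtain the energies.
    s2 = np.sort(np.asarray(singular_values, dtype=float))[::-1] ** 2
    cumulative = np.cumsum(s2)          # p_energy for p = 1, ..., n
    tot_energy = cumulative[-1]         # total energy, as in (3)
    # Count how many cumulative energies stay within the budget of (4).
    p = int(np.searchsorted(cumulative, tau * tot_energy, side="right"))
    return max(p, 1)                    # assumption: keep at least one vector

p = select_rank([3.0, 2.0, 1.0, 1.0], tau=0.9)
```

With these toy singular values the energies are 9, 4, 1 and 1, so a threshold of 0.9 keeps the first two eigenvectors (cumulative energy 13 out of 15).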

3.4 SVM-based classifier


A multiclass SVM model is trained using the lower-rank feature vectors resulting from the SVD-based feature extractor described in Sections 3.2–3.3. In this way, the images associated with the malware categories are classified into their respective class. The original maximum-margin SVM model is attributed to Vapnik and Chervonenkis [33]. Using an optimal hyperplane, this model defines a discriminative classifier. The discriminative power of SVMs is summarised in Fig. 6. Given the m p-dimensional feature vectors, $z^{(i)}$, and their associated class labels (0 and 1), the two-class SVM model is defined as [33]

$$\begin{aligned} \min \quad & E(w, b, \xi) = \frac{1}{2} \lVert w \rVert^{2} + c \cdot \Phi(\xi) \\ \text{subject to:} \quad & y^{(i)} \left( \langle w, v^{(i)} \rangle + b \right) \ge 1 - \xi^{(i)}, \quad \xi^{(i)} \ge 0, \quad \forall i = 1, 2, \ldots, m \end{aligned} \qquad (5)$$

w and b define the optimal hyperplane estimated by the SVM model [33].
The inner product, $\langle w, v^{(i)} \rangle$, is carried out on the feature vectors $v^{(i)}$. The $v^{(i)}$ vectors represent the kernelised version of the p-dimensional feature vectors $z^{(i)}$. The slack variables, $\xi^{(i)}$, allow some classification errors during training to ensure that the separating margin, shown in Fig. 6, will be wider. This strategy will result in smaller errors during the testing phase. The term c controls the amount of classification errors allowed and the penalty cost is defined as

$$\Phi(\xi) = \sum_{i=1}^{m} \xi^{(i)} \qquad (6)$$

The support vectors are represented by the data points located on the separating hyperplanes as shown in Fig. 6. In general, (5) is formulated and solved as a Lagrangian optimisation problem where non-linear kernels are considered in the dual space solution [33]. In most instances, kernel-based derivative-free cost functions and grid search optimisation routines are recommended to solve (5). Fig. 7 shows the benefits of kernel-based SVM solutions [33].

Fig. 7 SVM solution. Without kernels (left). Kernel-based (right)

3.5 Model efficiency

Once trained, the proposed malware classifier requires two basic inner product operations to compute the extracted features and categorise the malware images into malware classes. These inner products are efficiently implemented in data processing and machine learning libraries including Python Scikit-Learn, NumPy and Pandas.

4 Performance evaluation of proposed malware classifier

All computer experiments are carried out on a computer workstation running the Ubuntu Linux 16.04 operating system. An NVIDIA graphics processing unit, the GeForce GT 740M with 2 GB memory, is used to speed up the learning computations. The malware classification model is implemented in the Python 3 programming language using OpenCV and Scikit-Learn libraries
(Python code to reproduce the results of our work is available at
https://github.com/lahouari2018/Malware_Classification).
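As a rough illustration of the two-stage model of Section 3, the sketch below chains a PCA projection and an RBF-kernel multiclass SVM using Scikit-Learn. It is not the authors' released code: the synthetic three-class data, the number of retained components and the train/test split are assumptions, and mapping the spread factor of 0.001 and the penalty term c = 10 quoted in Section 4.4 onto the `gamma` and `C` arguments of `SVC` is also an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for flattened malware images: 3 classes, 64 "pixels".
rng = np.random.default_rng(0)
n_classes, per_class, n_pixels = 3, 40, 64
centres = rng.normal(scale=5.0, size=(n_classes, n_pixels))
X = np.vstack([c + rng.normal(size=(per_class, n_pixels)) for c in centres])
y = np.repeat(np.arange(n_classes), per_class)

# Random split into training and test subsets (sizes chosen arbitrarily).
idx = rng.permutation(len(y))
train, test = idx[:90], idx[90:]

# PCA feature extraction followed by a multiclass RBF-kernel SVM.
model = make_pipeline(
    PCA(n_components=10),
    SVC(kernel="rbf", gamma=0.001, C=10),
)
model.fit(X[train], y[train])
accuracy = model.score(X[test], y[test])
print(f"test accuracy: {accuracy:.3f}")
```

Wrapping both stages in one pipeline keeps the PCA basis fitted on the training subset only, so no information from the test subset leaks into the feature extractor.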

4.1 Malware datasets


To evaluate the performance of the proposed malware classification model, three different malware datasets are considered:

i. Malimg malware image dataset [21].
ii. Endgame Malware BEnchmark for Research (Ember dataset) [34].
iii. Microsoft Malware Classification Challenge (BIG 2015) [22].

4.1.1 Malimg dataset: The Malimg malware image dataset is freely provided by the Vision Research Group at the University of California at Santa Barbara [21]. The Malimg dataset is primarily intended to train machine and deep learning models to correctly classify malware categories. A total of 9339 images are extracted from various code samples pertaining to 25 different malware types. Sample malware images are shown in Fig. 8. However, this dataset suffers from high imbalance as shown in Table 1. This imbalance will impact the performance of any ML-based classifier if not properly handled [33].
Table 1 summarises the malware types available in the Malimg dataset.

Fig. 8 Sample labelled malware images [13]. Adialer.C malware (left). Skintrim.N malware (right)

Table 1 Summary of Malimg dataset [21]

Malware class     Malware type             Number of samples   Number of samples per type
Agent.FYI         Backdoor                 116                 274
Rbot!gen          Backdoor                 158
Adialer.C         Dialer                   122                 730
Dialplatform.B    Dialer                   177
Instantaccess     Dialer                   431
Lolyda.AA1        Password stealer (PWS)   213                 679
Lolyda.AA2        PWS                      184
Lolyda.AA3        PWS                      123
Lolyda.AT         PWS                      159
Fakerean          Rogue                    381                 381
Alueron.gen!J     Trojan                   198                 760
C2LOP.P           Trojan                   146
C2LOP.gen!g       Trojan                   200
Malex.gen!J       Trojan                   136
Skintrim.N        Trojan                   80
Dontovo.A         Trojan Downloader        162                 661
Obfuscator.AD     Trojan Downloader        142
Swizzor.gen!E     Trojan Downloader        128
Swizzor.gen!I     Trojan Downloader        132
Wintrim.BX        Trojan Downloader        97
Allaple.A         Worm                     2949                5748
Allaple.L         Worm                     1591
VB.AT             Worm                     408
Yuner.A           Worm                     800
Autorun.K         Worm.AutoIT              106                 106
Total                                                          9339

4.1.2 Ember dataset: The Endgame malware benchmark for research (Ember) dataset is provided by Endgame Company to advance the academic research on malware detection and classification. It consists of basic 2380-dimensional malware features extracted from portable executable files. In this study, we will consider the 2018 version only. 5000 and 2000 samples are randomly selected to form the training and test subsets.

4.1.3 BIG 2015 dataset: This dataset is provided by Microsoft Corporation for research purposes. It contains 10,868 labelled samples pertaining to nine different malware categories [22]. As each malware record consists of the malware binary code and its assembly code version extracted using the IDA Pro disassembler [22], malware images were constructed from the binary representation using the approach of Nataraj et al. [21]. Similar to the Ember dataset, the training and test subsets consist of 5000 and 2000 randomly selected samples, respectively.

4.2 Malware classification performance measures

The efficiency of the malware classification model is assessed using the precision, recall, accuracy, and F1-score metrics. Given the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts, these metrics are defined below [33]

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (7)$$

$$\text{recall} = \frac{TP}{TP + FN} \qquad (8)$$

$$\text{Accuracy} = \frac{TP + TN}{\#\,\text{Samples}} \qquad (9)$$

$$\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{recall}}{\text{Precision} + \text{recall}} \qquad (10)$$

4.3 Efficiency of extracted compact image features

To maintain the aspect ratio of the malware images, their size is reduced to 100 × 128 prior to feature extraction. Then, each of the 9342 malware images is flattened as a column in a 12800 × 9342 design matrix X. The sample mean image is then estimated in order to mean-normalise the columns of X. The resulting sample mean image is shown in the left side of Fig. 9. The sample covariance matrix, Σ, is estimated using (1). Finally, left and right eigenvectors of Σ are computed using (2). The first 16 left eigenvectors (or eigenmalwares), $\{u^{(i)}\}_{i=1}^{16}$, are displayed on the right side of Fig. 9.
The energy associated with each of the left eigenvectors is reported in Fig. 10. The drastic drop in the energy associated with the resulting eigenvectors clearly hints that only a small subset of eigenvectors carry most of the energy in the malware images. In fact, projecting the malware images onto this small subset of eigenvectors will lead to very compact image features that will be efficiently used to train our proposed malware classifier. More specifically, it is expected, as reported in Fig. 10, that the extracted image features will not contain more than 100 elements. Exceeding this number will trigger the overfitting problem, which may result in poor classification performance during the testing phase [33]. The behaviour of the malware eigenvectors, reported in Fig. 10, confirms the compaction properties of the PCA algorithm shown in Fig. 5. The effect of the number of retained left eigenvectors on the malware classification performance will be investigated in Section 4.4.
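The measures (7)–(10) of Section 4.2 can be computed directly from the four confusion counts, as in the short sketch below. The function name `classification_metrics` and the example counts are hypothetical and do not correspond to any result reported in this study. For a multiclass problem, the counts would be gathered per class (one class against the rest) before any averaging, which is an assumption about usage rather than a description of the reported experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, accuracy and F1-score from confusion counts."""
    precision = tp / (tp + fp)                          # eq. (7)
    recall = tp / (tp + fn)                             # eq. (8)
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # eq. (9)
    f1 = 2 * precision * recall / (precision + recall)  # eq. (10)
    return precision, recall, accuracy, f1

# Made-up counts for a single class, for illustration only.
precision, recall, accuracy, f1 = classification_metrics(tp=90, tn=85, fp=10, fn=15)
print(precision, recall, accuracy, f1)
```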
4.4 Malware classification performance results

A multiclass SVM classifier model is trained using the lower-rank feature vectors. To assess the effects of the various design parameters on the classification performance, we will not attempt to optimise the SVM model parameters. A standard 25-class SVM classifier is used with the following settings:

• Kernel used: radial basis functions with a spread factor of 0.001.
• The penalty term control parameter c, defined in (5), is fixed to 10.

We will first evaluate the effect of the number of retained eigenvectors on the classification performance. This effect is shown in Fig. 11. Retaining only 40–60 eigenvectors yielded the highest classification performance in terms of accuracy, precision, recall, and F1-score metrics. This subset of eigenvectors has the highest discriminative power among the available eigenvectors. In addition, projecting the malware images onto a larger subset of eigenvectors will negatively impact the SVM classifier performance as these eigenvectors carry more noise than meaningful information. Furthermore, feature descriptors with larger sizes would require more training data.
We now turn our attention to the effect of the training data size on the classification performance. As the malware classes are not evenly represented in the Malimg malware dataset, we will implement the SVD decomposition, defined in (2), using a balanced number of images per malware class. Fig. 12 shows the effect of the number of samples per malware category on the performance of the SVM-based malware classifier. It should be noted that for malware categories with a small number of image samples, image augmentation is used to create the required number of samples by altering the available image samples. It is interesting to note that the proposed classifier achieves perfect malware classification when the number of samples per malware category is

Fig. 9 Sample mean malware image (left). Top 16 left eigenmalware images (right)

Fig. 10 Image energy distribution among eigenvalues (left). Corresponding cumulative energy (right)

424 IET Inf. Secur., 2020, Vol. 14 Iss. 4, pp. 419-429


© The Institution of Engineering and Technology 2020
17518717, 2020, 4, Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/iet-ifs.2019.0189 by Nat Prov Indonesia, Wiley Online Library on [19/11/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
Fig. 11 Classification performance sensitivity to the number of retained eigenvectors

Fig. 12 Effect of number of samples per malware category on the classification performance

around 400. With this considerable number of samples, the baseline model by Kalash et al. [20]. Table 2 summarises the
classifier is able to capture most of the variations in the malware classification performance attained by these classifiers. The
categories thanks to the image augmentation trick that is commonly performance of the existing classifiers is assessed using the
used in deep learning to boost the performance of deep learning accuracy scores only as reported in [20, 21]. Our proposed
models. classifier is trained using 40-dimensional lower-rank features
Unit-variance lower-rank features are preferred over non- assuming 400 sample images per malware category. The M-CNN
normalised features. This type of normalisation, commonly known model is trained for 25 epochs with a batch size of six images [20].
as the whitening process, involves a scaling step of the extracted Thanks to the proposed feature reduction and parameter tuning, our
features by the eigenvalues of the retained eigenvectors. Fig. 13 proposed classifier outperforms state-of-the-art existing malware
illustrates the effect of the feature whitening step on the classifiers (i.e. the M-CNN model). However, it should be noted
classification accuracy of the proposed malware classifier. It is that our classifier suffers from scalability and should be retrained
clear that the whitening process provides a drastic boost to the from scratch to accommodate new malware categories. This
classification performance which corroborates the common represents the main limitation that would prevent our model from
practice of preferring normalised features over non-normalised online learning to adapt to new emerging malware instances.
ones. A similar effect is observed with the remaining performance Table 3 summarises the performance of the SVM-based and
metrics. softmax-based malware classifiers using Ember and BIG2015
Finally, we compare the performance of our proposed malware datasets.
classifier to the CNN-based models proposed by Kalash et al. [20],
the image-based solution proposed by Nataraj et al. and the
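The SVM settings listed in this section rely on a radial basis function kernel, which can be evaluated with inner products only. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that the "spread factor" of 0.001 corresponds to the usual `gamma` coefficient in exp(-gamma·||x − z||²); the support vectors, dual coefficients and bias in the example are hypothetical placeholders rather than trained values.

```python
import numpy as np


def rbf_kernel(A, B, gamma=0.001):
    """RBF kernel matrix between the rows of A (m x d) and the rows of B (p x d).

    Uses ||a - b||^2 = a.a - 2 a.b + b.b, so only inner products are needed.
    gamma=0.001 is an assumed reading of the "spread factor" in the text.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    sq_a = np.sum(A * A, axis=1)[:, None]  # (m, 1) squared norms
    sq_b = np.sum(B * B, axis=1)[None, :]  # (1, p) squared norms
    # Clip tiny negative values caused by floating-point round-off.
    sq_dist = np.maximum(sq_a - 2.0 * (A @ B.T) + sq_b, 0.0)
    return np.exp(-gamma * sq_dist)


def decision_value(x, support_vectors, dual_coefs, bias, gamma=0.001):
    """Binary SVM decision value f(x) = sum_i coef_i * K(sv_i, x) + bias."""
    k = rbf_kernel(support_vectors, np.atleast_2d(x), gamma)  # (n_sv, 1)
    return float(np.asarray(dual_coefs, dtype=float) @ k[:, 0] + bias)


# Hypothetical 40-dimensional feature vectors (random, for illustration only).
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 40))
K = rbf_kernel(features, features)

# Hypothetical "trained" quantities, used only to exercise decision_value.
value = decision_value(features[0], features[:3], [1.0, -1.0, 0.5], bias=0.1)
```

A multiclass classifier would combine several such binary decision values (for example one-vs-one or one-vs-rest); the text does not state which scheme is used, so none is assumed here.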

Fig. 13 Effect of feature whitening on the classification accuracy measure
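The whitening step whose effect is shown in Fig. 13 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: it uses the standard PCA whitening convention of dividing each projected feature by the square root of its eigenvalue, which yields unit-variance features, and the function name, the `eps` guard and the random demo data are hypothetical.

```python
import numpy as np


def pca_whitened_features(X, k=40, eps=1e-12):
    """Project images onto the top-k eigenvectors and whiten the result.

    X has shape (pixels, samples): one flattened image per column,
    matching the design-matrix layout described in the text.
    """
    n_samples = X.shape[1]
    mean_image = X.mean(axis=1, keepdims=True)
    Xc = X - mean_image  # mean-normalised columns

    # Thin SVD of the centred data: the columns of U are the eigenvectors
    # of the sample covariance matrix, with eigenvalues s^2 / (n - 1).
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    eigvals = (s ** 2) / (n_samples - 1)

    U_k = U[:, :k]
    projected = U_k.T @ Xc  # (k, samples) lower-rank features
    # Assumed whitening convention: divide by sqrt(eigenvalue).
    whitened = projected / np.sqrt(eigvals[:k] + eps)[:, None]
    return mean_image, U_k, eigvals, projected, whitened


# Small random stand-in for the image data (the paper uses 12800-pixel images).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 60)) * rng.uniform(0.5, 5.0, size=(200, 1))
mean_image, U_k, eigvals, projected, whitened = pca_whitened_features(X_demo, k=10)
```

Before whitening, the variance of each projected feature equals its eigenvalue, so the leading components dominate distance-based kernels; after whitening every retained component has unit variance.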

Table 2 Classification performance of proposed and existing malware classifiers using Malimg dataset
Classifier                 Accuracy  Precision  Recall  F1-score
Nataraj et al. [21]        0.9718    -          -       -
GIST + SVM [20]            0.9323    -          -       -
M-CNN [20]                 0.9852    -          -       -
proposed (PCA with SVM)    0.998     0.999      0.999   0.999

Table 3 Classification performance of proposed classifier using Ember and BIG2015 malware datasets
Classifier                                     Accuracy  Precision  Recall  F1-score
autoencoder with Softmax (Ember dataset)       0.772     0.770      0.771   0.7705
PCA with SVM (Ember dataset)                   0.911     0.910      0.912   0.914
autoencoder with Softmax (BIG2015 dataset)     0.953     0.951      0.952   0.9515
PCA with SVM (BIG2015 dataset)                 0.997     0.996      0.997   0.9965
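For readers who wish to reproduce the four metrics reported in Tables 2 and 3, the sketch below computes accuracy, precision, recall and F1-score from predicted labels. The text does not state how the per-class scores are averaged, so macro-averaging (an unweighted mean over classes) is an assumption here; the function name and the toy labels are hypothetical.

```python
import numpy as np


def macro_metrics(y_true, y_pred, n_classes):
    """Accuracy plus macro-averaged precision, recall and F1-score."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Confusion matrix: rows are true classes, columns are predicted classes.
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)

    tp = np.diag(cm).astype(float)
    predicted_count = cm.sum(axis=0)
    true_count = cm.sum(axis=1)

    # Classes with no predictions (or no samples) score 0 instead of dividing by 0.
    precision = np.divide(tp, predicted_count, out=np.zeros_like(tp),
                          where=predicted_count > 0)
    recall = np.divide(tp, true_count, out=np.zeros_like(tp),
                       where=true_count > 0)
    denom = precision + recall
    f1 = np.divide(2.0 * precision * recall, denom, out=np.zeros_like(tp),
                   where=denom > 0)

    return {
        "accuracy": tp.sum() / cm.sum(),
        "precision": precision.mean(),
        "recall": recall.mean(),
        "f1": f1.mean(),
    }


# Toy three-class example (labels are illustrative, not taken from the datasets).
scores = macro_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], n_classes=3)
```

Macro-averaging treats every malware family equally, which matters for imbalanced datasets; a sample-weighted average would give different values.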

4.5 Non-PCA-based feature extraction

Malware image features can also be extracted using autoencoder models. Self-extracted features have been used efficiently in several computer vision problems. Fig. 14 depicts features extracted using a vanilla autoencoder. It is interesting to note that the extracted features capture some of the textures present in malware images, similar to those shown in Fig. 8. However, our PCA-based feature extractor is capable of retaining most of the image activity present in the malware images, as depicted in Fig. 9. This contrast in feature extraction capability is borne out by the classification performance of the proposed malware classifier model. The basic autoencoder model consists of single-layered encoder and decoder components, as shown in Fig. 15. Using the self-extracted features, malware samples are classified using a 25-dimensional softmax layer. The combined autoencoder-softmax model is shown in Fig. 16. The performance of the softmax-based classifier can be further improved using self-extracted features from stacked autoencoders, as commonly used in deep learning solutions [35]. The performance of basic and stacked autoencoders is reported in Table 4. It is clearly demonstrated that the SVM model can accurately discriminate the PCA-based features, as the data reduction is drastic: a compression ratio of 12,800/100 is achieved. On the other hand, a deep stack of autoencoders is required to achieve similar feature reduction rates without affecting the classification performance. Such stacking usually comes at the cost of increased training and test times [35].

Fig. 14 Auto-extracted features using a basic single-layered autoencoder with 1000 hidden nodes

4.6 Computational efficiency

The proposed and existing malware classifiers are trained off-line using the same training and test subsets. However, the computational complexity of the malware classifier during deployment plays an important role in selecting the appropriate algorithm. Table 5 summarises the computational complexity of the proposed and existing malware classifiers, assuming that malware samples are classified based on n-dimensional feature vectors extracted from 128 × 100 malware images. More specifically, the algorithm of Nataraj et al. [21] requires the computation of GIST features from steerable pyramid image representations followed by a clustering routine based on the K-nearest neighbour algorithm. While the first step requires $O(n \cdot \log_2(n))$ operations, the latter has a linear complexity [33]. The classifier proposed by Kalash et al. [20] classifies the same features as Nataraj et al. [21] using an SVM classifier, which would require $O(n)$ operations as the SVM is efficiently implemented using basic inner products [33]. Unlike the previous models, the CNN-based model is burdened with a quadratic complexity due to the underlying matrix and vector multiplications associated with neural networks [33]. Finally, the proposed model has a linear complexity as it involves only inner products in the feature extraction (PCA) and classification (SVM) phases. Therefore, the classification performance and computational complexity of the proposed malware classifier give it an edge over existing malware classifiers.

5 Conclusions

In this study, we have presented a novel ML solution for malware classification. Thanks to the informed feature reduction and parameter tuning during the training process, our malware classifier requires simple algebraic dot products to classify malware based on representative digital images. To assess its performance, publicly-available malware datasets, Malimg, Ember, and BIG2015, are considered. The reported performance analysis indicates that our classifier outperforms state-of-the-art models and attains classification accuracies of 0.998, 0.911, and 0.997 using Malimg, Ember, and BIG2015 malware datasets, respectively. However, as our malware classifier is trained to learn malware behaviour, it requires re-training to enable the detection and classification of emerging malware categories. Future work includes the investigation of the effect of malware packing on the classification performance using various standard packing tools at variable compression rates.

6 Acknowledgments

The authors would like to acknowledge the support provided by Prince Sultan University (PSU) and King Fahd University of Petroleum and Minerals (KFUPM).

Fig. 15 Architecture of basic single-layered autoencoder with 1000 hidden nodes

Fig. 16 Architecture of softmax-based malware classifier
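The architectures of Figs. 15 and 16 can be sketched as a single forward pass: an encoder produces the self-extracted features, a decoder reconstructs the input, and a softmax layer maps the features to class probabilities. This is a shape-level sketch only, with untrained random weights. The sigmoid activations, the weight initialisation and all function names are assumptions, since the text does not specify them; the demo uses reduced sizes, whereas the paper's model has 12800 inputs, 1000 hidden nodes and 25 classes.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    # Subtract the row-wise maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def init_params(n_in, n_hidden, n_classes, rng, scale=0.01):
    """Random (untrained) weights for encoder, decoder and softmax layer."""
    return {
        "W_enc": rng.normal(0.0, scale, (n_in, n_hidden)),
        "b_enc": np.zeros(n_hidden),
        "W_dec": rng.normal(0.0, scale, (n_hidden, n_in)),
        "b_dec": np.zeros(n_in),
        "W_out": rng.normal(0.0, scale, (n_hidden, n_classes)),
        "b_out": np.zeros(n_classes),
    }


def forward(params, x):
    """x has shape (batch, n_in); returns features, reconstruction, class probabilities."""
    hidden = sigmoid(x @ params["W_enc"] + params["b_enc"])            # self-extracted features
    reconstruction = sigmoid(hidden @ params["W_dec"] + params["b_dec"])
    probs = softmax(hidden @ params["W_out"] + params["b_out"])        # malware class scores
    return hidden, reconstruction, probs


# Reduced sizes for a quick check (paper: n_in=12800, n_hidden=1000, n_classes=25).
rng = np.random.default_rng(2)
params = init_params(n_in=128, n_hidden=10, n_classes=25, rng=rng)
batch = rng.uniform(0.0, 1.0, size=(4, 128))  # stand-in for normalised pixel values
hidden, reconstruction, probs = forward(params, batch)
```

In practice the encoder and decoder would first be trained to minimise the reconstruction error, after which the softmax layer is trained on the hidden features; training is omitted here to keep the sketch short.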

Table 4 Classification performance of proposed malware classifier using Malimg dataset and different classification models
Classifier                  Accuracy  Precision  Recall  F1-score
autoencoder with Softmax    0.888     0.887      0.889   0.8875
autoencoder with SVM        0.933     0.931      0.929   0.929
PCA with Softmax            0.961     0.960      0.960   0.960
PCA with SVM                0.998     0.997      0.998   0.9975

Table 5 Computational complexity during deployment
Classifier                  Computational complexity
Nataraj et al. [21]         $O(n \cdot \log_2(n)) + O(320 \cdot n)$
GIST + SVM [20]             $O(n \cdot \log_2(n)) + O(n)$
M-CNN [20]                  $O(n^2)$
proposed (PCA with SVM)     $O(n)$

7 References

[1] Accenture, Institute, P.: ‘Cost of cyber crime study: insights on the security investments that make a difference’. Ponemon Institute LLC, 2017
[2] ‘TSMC: Outbreak of malware that triggered delays losses caused by software for new tool’. Available at https://www.anandtech.com/show/13193/tsmc-outbreak-of-malware-that-triggereddelays-losses-caused-by-software-for-new-tool (accessed 15 March 2019)
[3] Idika, N., Mathur, A.: ‘A survey of malware detection techniques’ (Purdue University, West Lafayette, Indiana, USA, 2007)
[4] Sikorski, M., Honig, A.: ‘Practical malware analysis: the hands-on guide to dissecting malicious software’ (No Starch Press, San Francisco, California, USA, 2012, 1st edn.)
[5] Venkatraman, S., Alazab, M.: ‘Use of data visualisation for zero-day malware detection’, Secur. Commun. Netw., 2018, 2018, pp. 1728303:1–1728303:13
[6] Venkatraman, S., Alazab, M.: ‘Classification of malware using visualisation of similarity matrices’. 2017 Cybersecurity and Cyberforensics Conf. (CCC), London, UK, 2017, pp. 3–8
[7] Sharma, S., Challa, R., Sahay, S.: ‘Detection of advanced malware by machine learning techniques’, CoRR, 2019, pp. 333–342
[8] Ding, H., Sun, W., Chen, Y., et al.: ‘Malware detection and classification based on parallel sequence comparison’. 5th Int. Conf. on Systems and Informatics, ICSAI 2018, Nanjing, China, 10–12 November 2018, pp. 670–675
[9] Schultz, M.G., Eskin, E., Zadok, E., et al.: ‘Data mining methods for detection of new malicious executables’. Proc. 2001 IEEE Symp. on Security and Privacy. SP'01, Oakland, California, USA, 2001, pp. 38–49
[10] Kolter, J.Z., Maloof, M.A.: ‘Learning to detect and classify malicious executables in the wild’, J. Mach. Learn. Res., 2006, 7, pp. 2721–2744
[11] Kong, D., Yan, G.: ‘Discriminant malware distance learning on structural information for automated malware classification’. Proc. 19th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining. KDD'13, Chicago, Illinois, USA, 2013, pp. 1357–1365
[12] Tian, R., Islam, R., Batten, L., et al.: ‘Differentiating malware from cleanware using behavioural analysis’. 2010 5th Int. Conf. on Malicious and Unwanted Software, Nancy, Lorraine, France, 2010, pp. 23–30
[13] Santos, I., Devesa, J., Brezo, F., et al.: ‘OPEM: A static-dynamic approach for machine-learning-based malware detection’. Int. Joint Conf. CISIS'12–ICEUTE'12–SOCO'12 Special Sessions, Ostrava, Czech Republic, 5th–7th September 2012, pp. 271–280
[14] Siddiqui, M., Wang, M.C., Lee, J.: ‘Detecting internet worms using data mining techniques’, J. Syst. Cybern. Inf., 2009, 6, (6), pp. 48–53
[15] Zolkipli, M., Jantan, A.: ‘An approach for malware behavior identification and classification’. 2011 3rd Int. Conf. on Computer Research and Development, Shanghai, China, 2011, vol. 1, pp. 191–194
[16] Rieck, K., Trinius, P., Willems, C., et al.: ‘Automatic analysis of malware behavior using machine learning’, J. Comput. Secur., 2011, 19, (4), pp. 639–668
[17] Anderson, B., Quist, D., Neil, J., et al.: ‘Graph-based malware detection using dynamic analysis’, J. Comput. Virol., 2011, 7, (4), pp. 247–258
[18] Bayer, U., Comparetti, P.M., Hlauschek, C., et al.: ‘Scalable, behavior-based malware clustering’. Network and Distributed System Security Symp., San Diego, California, USA, 2009, pp. 8–11
[19] Barhoom, T., Qeshta, H.: ‘Worm detection by combination of classification with neural networks’, Int. Arab J. e-Technol., 2013, 3, (2), pp. 57–65
[20] Kalash, M., Rochan, M., Mohammed, N., et al.: ‘Malware classification with deep convolutional neural networks’. 2018 9th IFIP Int. Conf. on New Technologies, Mobility and Security (NTMS), Paris, France, 2018, pp. 1–5
[21] Nataraj, L., Karthikeyan, S., Jacob, G., et al.: ‘Malware images: visualization and automatic classification’. Proc. 8th Int. Symp. on Visualization for Cyber Security. VizSec'11, Pittsburgh, Pennsylvania, USA, 2011, pp. 4:1–4:7
[22] Competition, K.: ‘Microsoft malware classification challenge (big 2015)’, 2017. Available at https://www.kaggle.com/c/malwareclassification
[23] Oliva, A., Torralba, A.: ‘Modeling the shape of the scene: A holistic representation of the spatial envelope’, Int. J. Comput. Vision, 2001, 42, (3), pp. 145–175
[24] Park, Y., Reeves, D., Mulukutla, V., et al.: ‘Fast malware classification by automated behavioral graph matching’. Proc. Sixth Annual Workshop on Cyber Security and Information Intelligence Research. CSIIRW'10, Oak Ridge, Tennessee, USA, 2010, pp. 45:1–45:4
[25] Burnaev, E., Smolyakov, D.: ‘One-class SVM with privileged information and its application to malware detection’. 2016 IEEE 16th Int. Conf. on Data Mining Workshops (ICDMW), Barcelona, Spain, 2016, pp. 273–280
[26] Narayanan, B.N., Djaneye-Boundjou, O., Kebede, T.M.: ‘Performance analysis of machine learning and pattern recognition algorithms for malware classification’. 2016 IEEE National Aerospace and Electronics Conf. (NAECON) and Ohio Innovation Summit (OIS), Dayton, Ohio, USA, 2016, pp. 338–342
[27] Sahay, S., Sharma, A.: ‘Grouping the executables to detect malware with high accuracy’, Procedia Comput. Sci., 2016, 78, pp. 667–674
[28] Zhao, B., Han, J., Meng, X.: ‘A malware detection system based on intermediate language’. 2017 4th Int. Conf. on Systems and Informatics (ICSAI), Hangzhou, Zhejiang, China, 2017, pp. 824–830
[29] Lee, Y.S., Lee, J.U., Soh, W.Y.: ‘Trend of malware detection using deep learning’. Proc. 2nd Int. Conf. on Education and Multimedia Technology. ICEMT 2018, Okinawa, Japan, 2018, pp. 102–106
[30] Sewak, M., Sahay, S.K., Rathore, H.: ‘Comparison of deep learning and the classical machine learning algorithm for the malware detection’. 2018 19th IEEE/ACIS Int. Conf. on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), Busan, South Korea, 2018, vol. abs/1809.05889, pp. 293–296
[31] Cakir, B., Dogdu, E.: ‘Malware classification using deep learning methods’. Proc. ACMSE 2018 Conf. (ACMSE'18), Richmond, Kentucky, USA, 2018, pp. 10:1–10:5
[32] Dey, A., Bhattacharya, S., Chaki, N.: ‘Byte label malware classification using image entropy’, in Chaki, R., Cortesi, A., Saeed, K., Chaki, N. (Eds.): ‘Advanced computing and systems for security’ (Springer, Singapore, 2019), pp. 17–29

[33] Hastie, T., Tibshirani, R., Friedman, J.: ‘The elements of statistical learning: data mining, inference and prediction’ (Springer, New York, USA, 2009, 2nd edn.)
[34] Anderson, H.S., Roth, P.: ‘EMBER: an open dataset for training static PE malware machine learning models’, ArXiv e-prints, 2018
[35] Goodfellow, I., Bengio, Y., Courville, A.: ‘Deep learning’ (MIT Press, Cambridge, Massachusetts, USA, 2016)
