Research Article
Abstract: Malware and malicious code not only incur considerable costs and losses but also damage the reputation of the targeted organisations. Malware developers, hackers, and information security specialists are continuously improving their strategies to defeat each other. Unfortunately, there is no one-size-fits-all solution to detect and eradicate all malware. This situation is further aggravated by the undetected vulnerabilities that usually impair computer software and internet tools. Such vulnerabilities remain undetected until they are fully exploited by malware developers, which eventually causes considerable financial and reputation losses. In this paper, we propose a novel scheme to detect and classify malware using only image representations of the malware binaries. Highly discriminative features of the malware category and structure are extracted in a compact subspace using principal component analysis. Then, an optimised support vector machine model classifies the extracted features into malware categories. Unlike existing classification models, our solution requires only simple algebraic dot products to classify malware based on representative digital images. To assess its performance, the publicly available image datasets Malimg, Ember and BIG2015 are considered. Our performance analysis indicates that our classifier outperforms state-of-the-art models and attains classification accuracies of 0.998, 0.911, and 0.997 using the Malimg, Ember and BIG2015 malware datasets, respectively.
Fig. 9 Sample mean malware image (left). Top 16 left eigenmalware images (right)
Fig. 10 Image energy distribution among eigenvalues (left). Corresponding cumulative energy (right)
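The cumulative energy curve referred to in Fig. 10 can be illustrated with a short sketch. It assumes that "energy" means the share of the total eigenvalue sum carried by the leading eigenvalues; the eigenvalues below and the helper names `cumulative_energy` and `components_for_energy` are hypothetical and are not taken from the paper.

```python
from itertools import accumulate


def cumulative_energy(eigenvalues):
    """Fraction of the total energy retained by the k largest eigenvalues, for each k."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [partial / total for partial in accumulate(ordered)]


def components_for_energy(eigenvalues, target):
    """Smallest number of components whose cumulative energy reaches `target`."""
    for k, fraction in enumerate(cumulative_energy(eigenvalues), start=1):
        if fraction >= target:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues, chosen only to make the arithmetic easy to follow.
eigs = [8.0, 4.0, 2.0, 1.0, 1.0]
curve = cumulative_energy(eigs)          # [0.5, 0.75, 0.875, 0.9375, 1.0]
k90 = components_for_energy(eigs, 0.90)  # 4 components reach 90 % of the energy
```

A curve of this kind is one common way to choose how many eigenvectors to retain, although the selection rule actually used by the authors is not stated in this section.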
Fig. 12 Effect of number of samples per malware category on the classification performance
around 400. With this considerable number of samples, the classifier is able to capture most of the variations in the malware categories thanks to image augmentation, a technique commonly used to boost the performance of deep learning models.
Unit-variance lower-rank features are preferred over non-normalised features. This type of normalisation, commonly known as whitening, scales the extracted features using the eigenvalues of the retained eigenvectors. Fig. 13 illustrates the effect of the feature whitening step on the classification accuracy of the proposed malware classifier. The whitening process clearly provides a drastic boost to the classification performance, which corroborates the common practice of preferring normalised features over non-normalised ones. A similar effect is observed with the remaining performance metrics.
Finally, we compare the performance of our proposed malware classifier to the CNN-based models proposed by Kalash et al. [20], the image-based solution proposed by Nataraj et al. [21] and the baseline model by Kalash et al. [20]. Table 2 summarises the classification performance attained by these classifiers. The performance of the existing classifiers is assessed using the accuracy scores only, as reported in [20, 21]. Our proposed classifier is trained using 40-dimensional lower-rank features assuming 400 sample images per malware category. The M-CNN model is trained for 25 epochs with a batch size of six images [20]. Thanks to the proposed feature reduction and parameter tuning, our proposed classifier outperforms state-of-the-art existing malware classifiers (i.e. the M-CNN model). However, it should be noted that our classifier suffers from limited scalability and must be retrained from scratch to accommodate new malware categories. This is the main limitation of our model, as it prevents online learning to adapt to newly emerging malware instances. Table 3 summarises the performance of the SVM-based and softmax-based malware classifiers using the Ember and BIG2015 datasets.
Table 2 Classification performance of proposed and existing malware classifiers using Malimg dataset
Classifier Accuracy Precision Recall F1-score
Nataraj et al. [21] 0.9718 — — —
GIST + SVM [20] 0.9323 — — —
M-CNN [20] 0.9852 — — —
proposed (PCA with SVM) 0.998 0.999 0.999 0.999
Table 3 Classification performance of proposed classifier using Ember and BIG2015 malware datasets
Classifier Accuracy Precision Recall F1-score
autoencoder with Softmax (Ember dataset) 0.772 0.770 0.771 0.7705
PCA with SVM (Ember dataset) 0.911 0.910 0.912 0.914
autoencoder with Softmax (BIG2015 dataset) 0.953 0.951 0.952 0.9515
PCA with SVM (BIG2015 dataset) 0.997 0.996 0.997 0.9965
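The feature whitening step and the dot-product classification discussed above can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes the usual whitening convention, in which each projected coordinate is divided by the square root of its eigenvalue so that the features have unit variance. The synthetic data, the sizes, and the one-vs-rest weights `W` and `b` are placeholders standing in for flattened malware images and a trained linear SVM.

```python
import numpy as np


def fit_pca(X, k):
    """Return the mean, the top-k eigenvectors and the eigenvalues of the sample covariance."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centred data: the rows of Vt are the covariance eigenvectors.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = s[:k] ** 2 / (X.shape[0] - 1)
    return mean, Vt[:k], eigvals


def project(X, mean, components, eigvals, whiten=True, eps=1e-12):
    """Project samples onto the retained eigenvectors, optionally whitening them."""
    Z = (X - mean) @ components.T          # one dot product per retained eigenvector
    if whiten:
        Z = Z / np.sqrt(eigvals + eps)     # unit variance along every retained axis
    return Z


def linear_predict(Z, W, b):
    """One-vs-rest linear decision: one dot product per class, then argmax."""
    return np.argmax(Z @ W.T + b, axis=1)


rng = np.random.default_rng(0)
# Synthetic stand-in for flattened images: 200 samples, 64 "pixels", decaying variance.
X = rng.normal(size=(200, 64)) * np.linspace(5.0, 0.1, 64)
mean, comps, vals = fit_pca(X, k=8)
Z = project(X, mean, comps, vals)

# Placeholder weights for 3 classes; a real model would learn these from labelled data.
W = rng.normal(size=(3, 8))
b = np.zeros(3)
labels = linear_predict(Z, W, b)
```

At deployment time only `project` and `linear_predict` run, so classifying a sample reduces to a handful of inner products, which is consistent with the dot-product claim made for the proposed classifier.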
4.5 Non-PCA-based feature extraction
Malware image features can also be extracted using autoencoder models. Self-extracted features have been efficiently used in several computer vision problems. Fig. 14 depicts features extracted using a vanilla autoencoder. It is interesting to note that the extracted features capture some of the textures present in malware images, similar to those shown in Fig. 8. However, our PCA-based feature extractor is capable of retaining most of the image activity present in the malware images, as depicted in Fig. 9. This contrast in feature extraction capability is borne out by the classification performance of the proposed malware classifier model. The basic autoencoder model consists of single-layered encoder and decoder components, as shown in Fig. 15. Using the self-extracted features, malware samples are classified using a 25-dimensional softmax layer. The combined autoencoder-softmax model is shown in Fig. 16. The performance of the softmax-based classifier can be further improved using self-extracted features from stacked autoencoders, as commonly used in deep learning solutions [35]. The performance of basic and stacked autoencoders is reported in Table 4. The results show that the SVM model can accurately discriminate the PCA-based features even though the data reduction is drastic, with a compression ratio of 12,800/100. On the other hand, a deep stack of autoencoders is required to achieve similar feature reduction rates without affecting the classification performance. Such stacking usually comes at the cost of increased training and test times [35].
4.6 Computational efficiency
The proposed and existing malware classifiers are trained off-line using the same training and test subsets. However, the computational complexity of the malware classifier during deployment plays an important role in selecting the appropriate algorithm. Table 5 summarises the computational complexity of the proposed and existing malware classifiers, assuming that malware samples are classified based on n-dimensional feature vectors extracted from 128 × 100 malware images. More specifically, the algorithm of Nataraj et al. [21] requires the computation of GIST features from steerable pyramid image representations, followed by a clustering routine based on the K-nearest neighbour algorithm. While the first step requires O(n ⋅ log2(n)) operations, the latter has a linear complexity [33]. The classifier proposed by Kalash et al. [20] classifies the same features as Nataraj et al. using an SVM classifier, which requires O(n) operations as the SVM is efficiently implemented using basic inner products [33]. Unlike the previous models, the CNN-based model incurs a quadratic complexity due to the underlying matrix and vector multiplications associated with neural networks [33]. Finally, the proposed model has a linear complexity as it involves only inner products in the feature extraction (PCA) and classification (SVM) phases. Therefore, the classification performance and computational complexity of the proposed malware classifier give it an edge over existing malware classifiers.
5 Conclusions
In this study, we have presented a novel ML solution for malware classification. Thanks to the informed feature reduction and parameter tuning during the training process, our malware classifier requires only simple algebraic dot products to classify malware based on representative digital images. To assess its performance, the publicly available malware datasets Malimg, Ember, and BIG2015 are considered. The reported performance analysis indicates that our classifier outperforms state-of-the-art models and
Fig. 14 Auto-extracted features using a basic single-layered autoencoder with 1000 hidden nodes
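A basic single-layered autoencoder of the kind referred to in Fig. 14 can be sketched in a few lines. This is an illustrative toy rather than the authors' model: the sigmoid activation, the mean squared reconstruction loss, plain gradient descent, the learning rate, and the tiny sizes (16 inputs, 8 hidden nodes instead of 1000) are all assumptions, and the random data merely stands in for flattened malware images.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def train_autoencoder(X, hidden, epochs=300, lr=0.5, seed=0):
    """Train a one-hidden-layer autoencoder with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=(hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)           # encoder: self-extracted features
        X_hat = H @ W2 + b2                # linear decoder: reconstruction
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared reconstruction error.
        g_out = 2.0 * err / err.size
        g_hid = (g_out @ W2.T) * H * (1.0 - H)
        W2 -= lr * (H.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_hid)
        b1 -= lr * g_hid.sum(axis=0)
    return (W1, b1, W2, b2), losses


def encode(X, params):
    """Map samples to the hidden-layer features used by a downstream classifier."""
    W1, b1, _, _ = params
    return sigmoid(X @ W1 + b1)


rng = np.random.default_rng(1)
X = rng.uniform(size=(64, 16))             # stand-in for flattened, normalised images
params, losses = train_autoencoder(X, hidden=8)
features = encode(X, params)
```

The hidden activations returned by `encode` play the role of the self-extracted features, which the paper then passes to a softmax layer or an SVM. Unlike PCA, obtaining them requires iterative training, which is part of the cost trade-off noted for stacked autoencoders.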
Table 4 Classification performance of proposed malware classifier using Malimg dataset and different classification models
Classifier Accuracy Precision Recall F1-score
autoencoder with Softmax 0.888 0.887 0.889 0.8875
autoencoder with SVM 0.933 0.931 0.929 0.929
PCA with Softmax 0.961 0.960 0.960 0.960
PCA with SVM 0.998 0.997 0.998 0.9975
Table 5 Computational complexity during deployment
Classifier Computational complexity
Nataraj et al. [21] O(n ⋅ log2(n)) + O(320 ⋅ n)
GIST + SVM [20] O(n ⋅ log2(n)) + O(n)
M-CNN [20] O(n²)
proposed (PCA with SVM) O(n)
7 References
[1] Accenture, Institute, P.: ‘Cost of cyber crime study: insights on the security investments that make a difference’. Ponemon Institute LLC, 2017
[2] ‘TSMC: Outbreak of malware that triggered delays losses caused by software for new tool’. Available at https://www.anandtech.com/show/13193/tsmc-outbreak-of-malware-that-triggereddelays-losses-caused-by-software-for-new-tool (accessed 15 March 2019)
[3] Idika, N., Mathur, A.: ‘A survey of malware detection techniques’ (Purdue University, West Lafayette, Indiana, USA, 2007)
[4] Sikorski, M., Honig, A.: ‘Practical malware analysis: the hands-on guide to dissecting malicious software’ (No Starch Press, San Francisco, California, USA, 2012, 1st edn.)
[5] Venkatraman, S., Alazab, M.: ‘Use of data visualisation for zero-day malware detection’, Secur. Commun. Netw., 2018, 2018, pp. 1728303:1–1728303:13
[6] Venkatraman, S., Alazab, M.: ‘Classification of malware using visualisation of similarity matrices’. 2017 Cybersecurity and Cyberforensics Conf. (CCC), London, UK, 2017, pp. 3–8
[7] Sharma, S., Challa, R., Sahay, S.: ‘Detection of advanced malware by machine learning techniques’, CoRR, 2019, pp. 333–342
[8] Ding, H., Sun, W., Chen, Y., et al.: ‘Malware detection and classification based on parallel sequence comparison’. 5th Int. Conf. on Systems and Informatics, ICSAI 2018, Nanjing, China, 10–12 November 2018, pp. 670–675
[9] Schultz, M.G., Eskin, E., Zadok, E., et al.: ‘Data mining methods for detection of new malicious executables’. Proc. 2001 IEEE Symp. on Security and Privacy. SP'01, Oakland, California, USA, 2001, pp. 38–49
[10] Kolter, J.Z., Maloof, M.A.: ‘Learning to detect and classify malicious executables in the wild’, J. Mach. Learn. Res., 2006, 7, pp. 2721–2744
[11] Kong, D., Yan, G.: ‘Discriminant malware distance learning on structural information for automated malware classification’. Proc. 19th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining. KDD ‘13, Chicago, Illinois, USA, 2013, pp. 1357–1365
[12] Tian, R., Islam, R., Batten, L., et al.: ‘Differentiating malware from cleanware using behavioural analysis’. 2010 5th Int. Conf. on Malicious and Unwanted Software, Nancy, Lorraine, France, 2010, pp. 23–30
[13] Santos, I., Devesa, J., Brezo, F., et al.: ‘OPEM: A static-dynamic approach for machine-learning-based malware detection’. Int. Joint Conf. CISIS'12–ICEUTE'12–SOCO'12 Special Sessions, Ostrava, Czech Republic, 5th–7th September 2012, pp. 271–280
[14] Siddiqui, M., Wang, M.C., Lee, J.: ‘Detecting internet worms using data mining techniques’, J. Syst. Cybern. Inf., 2009, 6, (6), pp. 48–53
[15] Zolkipli, M., Jantan, A.: ‘An approach for malware behavior identification and classification’. 2011 3rd Int. Conf. on Computer Research and Development, Shanghai, China, 2011, vol. 1, pp. 191–194
[16] Rieck, K., Trinius, P., Willems, C., et al.: ‘Automatic analysis of malware behavior using machine learning’, J. Comput. Secur., 2011, 19, (4), pp. 639–668
[17] Anderson, B., Quist, D., Neil, J., et al.: ‘Graph-based malware detection using dynamic analysis’, J. Comput. Virol., 2011, 7, (4), pp. 247–258
[18] Bayer, U., Comparetti, P.M., Hlauschek, C., et al.: ‘Scalable, behavior-based malware clustering’. Network and Distributed System Security Symp., San Diego, California, USA, 2009, pp. 8–11
[19] Barhoom, T., Qeshta, H.: ‘Worm detection by combination of classification with neural networks’, Int. Arab J. e-Technol., 2013, 3, (2), pp. 57–65
[20] Kalash, M., Rochan, M., Mohammed, N., et al.: ‘Malware classification with deep convolutional neural networks’. 2018 9th IFIP Int. Conf. on New Technologies, Mobility and Security (NTMS), Paris, France, 2018, pp. 1–5
[21] Nataraj, L., Karthikeyan, S., Jacob, G., et al.: ‘Malware images: visualization and automatic classification’. Proc. 8th Int. Symp. on Visualization for Cyber Security. VizSec'11, Pittsburgh, Pennsylvania, USA, 2011, pp. 4:1–4:7
[22] Competition, K.: ‘Microsoft malware classification challenge (big 2015)’, 2017. Available at https://www.kaggle.com/c/malwareclassification
[23] Oliva, A., Torralba, A.: ‘Modeling the shape of the scene: A holistic representation of the spatial envelope’, Int. J. Comput. Vision, 2001, 42, (3), pp. 145–175
[24] Park, Y., Reeves, D., Mulukutla, V., et al.: ‘Fast malware classification by automated behavioral graph matching’. Proc. Sixth Annual Workshop on Cyber Security and Information Intelligence Research. CSIIRW'10, Oak Ridge, Tennessee, USA, 2010, pp. 45:1–45:4
[25] Burnaev, E., Smolyakov, D.: ‘One-class SVM with privileged information and its application to malware detection’. 2016 IEEE 16th Int. Conf. on Data Mining Workshops (ICDMW), Barcelona, Spain, 2016, pp. 273–280
[26] Narayanan, B.N., Djaneye-Boundjou, O., Kebede, T.M.: ‘Performance analysis of machine learning and pattern recognition algorithms for malware classification’. 2016 IEEE National Aerospace and Electronics Conf. (NAECON) and Ohio Innovation Summit (OIS), Dayton, Ohio, USA, 2016, pp. 338–342
[27] Sahay, S., Sharma, A.: ‘Grouping the executables to detect malware with high accuracy’, Procedia Comput. Sci., 2016, 78, pp. 667–674
[28] Zhao, B., Han, J., Meng, X.: ‘A malware detection system based on intermediate language’. 2017 4th Int. Conf. on Systems and Informatics (ICSAI), Hangzhou, Zhejiang, China, 2017, pp. 824–830
[29] Lee, Y.S., Lee, J.U., Soh, W.Y.: ‘Trend of malware detection using deep learning’. Proc. 2nd Int. Conf. on Education and Multimedia Technology. ICEMT 2018, Okinawa, Japan, 2018, pp. 102–106
[30] Sewak, M., Sahay, S.K., Rathore, H.: ‘Comparison of deep learning and the classical machine learning algorithm for the malware detection’. 2018 19th IEEE/ACIS Int. Conf. on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), Busan, South Korea, 2018, vol. abs/1809.05889, pp. 293–296
[31] Cakir, B., Dogdu, E.: ‘Malware classification using deep learning methods’. Proc. ACMSE 2018 Conf. (ACMSE'18), Richmond, Kentucky, USA, 2018, pp. 10:1–10:5
[32] Dey, A., Bhattacharya, S., Chaki, N.: ‘Byte label malware classification using image entropy’, in Chaki, R., Cortesi, A., Saeed, K., Chaki, N. (Eds.): ‘Advanced computing and systems for security’ (Springer, Singapore, 2019), pp. 17–29