Static Malware Analysis Using Multiple Machine Learning Algorithm for Malware Detection
Abhinav Parkash Sharma, BTech III Year, Vellore Institute of Technology, Chennai, India, abhinavparkash.2016@gmail.com
Mayank Ranjan, BTech III Year, Vellore Institute of Technology, Chennai, India, mayank.ranjan2016@vitstudent.ac.in
Udai Agarwal, BTech III Year, Vellore Institute of Technology, Chennai, India, udai.agarwal2016@vitstudent.ac.in

Abstract: The aim of malware analysis is to detect whether a file is infected in order to avoid any kind of system intrusion. The goal of this research is to find the optimal machine learning algorithm for predicting whether a file is malicious by applying different machine learning models to a given dataset. For this purpose, the implementation and accuracy comparisons are done with the help of Python libraries, and a summary analysis is then used to suggest the best machine learning model for detecting malware-infected files. This model can then be used as a layer in a bigger neural network for dynamic malware analysis and for attack detection and prevention.

Keywords: malware, machine learning, malware analysis, Decision Tree, Support Vector Machines, Random Forest, Logistic Regression

I. INTRODUCTION

The use of the Internet and its widespread resources has been increasing exponentially over the past few decades. This trend has led to multiple services being made available to users over secure connections, including banking, purchases and even data exchange. This is one of the main reasons hackers distribute different kinds of malware to naive users. When malware infects a particular computing device, be it a mobile device or a desktop, every transaction in which that device is involved becomes compromised. Most malware is spyware or viruses intended for stealing confidential information or money in any form. Such malware then empowers hackers to commit various cyber-crimes like fake e-payments, denial of service attacks, illegal hacking etc. Despite so many anti-malware measures, many practical attacks have occurred in the past. Due to the ever-changing Internet and new technologies that have increased the ease of network and computer penetration, the task of keeping users safe has become really difficult for companies that provide anti-malware solutions. Avoiding attacks and providing users with security updates within a limited period of time is another challenge that such companies face.

Malware protection, and removal once infected, is one of the main tasks of any anti-malware company, as even a single attack could lead to loss of confidential data, money and the system's privacy. Recent advancements in computer science and hardware have made it possible for machine learning algorithms and models to be applied effectively and efficiently to extract useful information from large datasets. This has made it possible to analyze the data extracted from the attributes of malicious portable executables.

A. Detection Methods

Static Analysis:
It is done by analyzing the program code of the malware to gain knowledge of how the malware works. Reverse engineering, in the form of decompiler and disassembler tools, is used for understanding the structure of the malware [7]. It includes the following techniques:
1. String Extraction (error messages)
2. Fingerprinting (in the form of hashes, and detection of hardcoded usernames and files)
3. File Metadata (PE headers)

Dynamic Analysis:
When a file is executed, its behavior is noted along with other information related to the file, such as its properties and the intentions of the creator of the executable. It is faster as compared to static analysis of the malware.

B. Why Machine Learning?

Detectors need to handle polymorphic malware that changes its signatures, as well as new malware for which signatures have not been created yet. Due to the inaccuracy of heuristic-based detectors when detecting such malware, we need to switch to machine learning algorithms combined with a heuristic approach to offer a high accuracy rate. When relying on a heuristics-based approach, there has to be a threshold for malware triggers, defining the number of heuristics needed for the software to be called malicious. For example, we can define a set of suspicious features, such as “registry key changed”, “connection established”, “permission changed”, etc. Any software that shows at least five features from that set will be termed malicious. Although this approach provides some level of effectiveness, it is not always accurate, since some features can have greater “weight” than others; for example, “permission changed” usually has a more severe impact on the device than “registry key changed”. In addition to that,
some feature combinations [8] may be more suspicious than the features by themselves. To take these correlations into account and provide more accurate detection, machine learning techniques can be used.

This research was conducted using a dataset with 4900 records of portable executable attributes of malicious files, which is used to predict whether the test files are malicious or not by using different machine learning algorithms. The models are trained using this dataset, and the accuracy was then tested on another 100 records extracted from a mix of malicious and non-malicious files. The dataset contains more than 50 attributes describing file properties of different malicious files. Visualizations were then made of various attributes of this dataset and of how they affect whether a file is malicious or not. The machine learning algorithms used for static detection are Logistic Regression, Decision Tree, Support Vector Machines and Random Forest. On the basis of the detection and prediction summary analysis, conclusions were drawn about which model is better than the others and which attribute is the best for identifying malware.
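
To make the contrast concrete, the following minimal Python sketch compares a plain count threshold with a weighted score for the heuristic rule discussed in this section. The feature names beyond the three quoted in the text, all weights, and the score threshold of 6.0 are hypothetical values chosen only for illustration; they are not taken from any real detector.

```python
# Hypothetical behaviour flags and severity weights (illustrative only).
WEIGHTS = {
    "registry key changed": 1.0,
    "connection established": 1.5,
    "permission changed": 3.0,
    "file hidden": 2.0,
    "process injected": 3.5,
    "autostart entry added": 2.0,
}


def flagged_by_count(observed, min_hits=5):
    """Plain heuristic: malicious if at least `min_hits` known features fire."""
    hits = [f for f in observed if f in WEIGHTS]
    return len(hits) >= min_hits


def flagged_by_weight(observed, min_score=6.0):
    """Weighted variant: each feature contributes its own severity."""
    score = sum(WEIGHTS.get(f, 0.0) for f in observed)
    return score >= min_score


# Two severe features: too few for the count rule, enough for the weighted rule.
sample = {"permission changed", "process injected"}
print(flagged_by_count(sample))   # False
print(flagged_by_weight(sample))  # True (3.0 + 3.5 = 6.5)
```

A learned model goes one step further by fitting such weights, and the interactions between features, from labelled data instead of fixing them by hand.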
II. RELATED WORK

Previously, static and dynamic analysis was used for malware analysis by Distler [1]. Meanwhile, Ari [2] performed malware analysis using reverse engineering techniques and methodologies, using Biscuit apt1 as a malware sample. Another malware analysis study was done by Flores [3] with win32.Kryptic. Daoud [4] researched the techniques used by malware to avoid detection by antivirus software. Research conducted by Uppal [5] focused more on the techniques and tools used in malware analysis. Most of the literature we came across during our research was focused either on static analysis methods or on techniques used for analyzing malware; only a few works made a substantial comparison or drew conclusions about the best possible machine learning algorithm for predicting malicious files. Our work, in contrast, tests four different machine learning algorithms for malware analysis in an in-depth comparison, using static analysis of portable executable header attributes to get more detailed information about the characteristics of malware.

Analysis Type | Purpose | Tools
STATIC | Use as many antivirus detection engines as possible to assist classification. | Antivirus
STATIC | Search the body of the malware for strings. | Strings (Microsoft, 2008c)
DYNAMIC | File integrity check to record baseline configuration. | Winalysis (Winalysis.com, 2008)
DYNAMIC | File monitoring. Find which tools are opening, reading and writing files. | Filemon (Microsoft, 2008c)
DYNAMIC | Process monitoring. Determine resources that are | Process explorer (Microsoft, 2008c)
DYNAMIC | Network monitoring. Uncover which ports are open, collect network traffic and find vulnerabilities. | Fport (Foundstone, 2008), tcpview (Microsoft, 2008c), nessus (Tenable Network Security, 2008), nmap (Insecure.org, 2008), wireshark (Combs, 2008), and snort (Sourcefire, 2008)
DYNAMIC | Registry monitoring. Monitor registry activities as they occur. | Regmon (Microsoft, 2008c)
CODE | Disassembly, debugging | IDA Pro, OllyDbg (Yuschuk, 2008)

Table I: Summary of malware analysis tools, showing analysis type, purpose, and commonly used tools.

Serial No. | Title | Author | Contribution
1 | Survey on malware detection methods | Vinod, Gaur | The detection method is related to signature-based detection and reverse engineering of complex code to detect malicious behavior of the malware. [15]
2 | A Threat to Cyber Resilience: A Malware Rebirthing Botnet | Brand, Valli, Woodward (2011) | Presented a threat to cyber resilience as a theoretical model of a malware rebirthing botnet. [16]
3 | Lessons learned from an Investigation into the Analysis Avoidance Techniques of Malicious Software | Brand, Valli, Woodward (2011) | An examiner must understand the anti-analysis techniques that can be used and how to mitigate them, the limitations of existing tools, and how to use a suitable analysis system to reveal the purpose of malware. [17]
4 | SPiKE: Engineering Malware Analysis Tools using Unobtrusive Binary-Instrumentation | Vasudevan, Yerraballi | Developed a new dynamic coarse-grained binary instrumentation framework, codenamed SPiKE, that aids in the construction of powerful malware analysis tools to combat malware that is becoming increasingly hard to analyse. [13]
5 | The Malware Analysis Body of Knowledge (MABOK) | Valli, Brand (2008) | Presented a foundation for a Malware Analysis Body of Knowledge (MABOK) required to analyze malware forensically and effectively. [14]
6 | Static analysis of executables to detect malicious patterns | Christodorescu, Jha (2003) | Presented an architecture for detecting malicious patterns in executables that is resilient to common obfuscation transformations. [11]
7 | TTAnalyze: A tool for analyzing malware | Bayer, Kruegel, Kirda (2006) | Presented a tool, TTAnalyze, for dynamically analyzing the behavior of Windows executables. [12]

Table II: Related work referred to throughout the paper.


III. METHODOLOGY

For the malware analysis we use Python libraries along with a Python script to extract the data for the given dataset. This script extracts all the PE file values used in the dataset for classifying files as malicious or normal and appends them to a CSV file for analysis.

1. Gathering Malware Samples:

The malware chosen for the analysis can be found at https://github.com/urwithajit9/ClaMP/blob/master/dataset/ClaMP_Raw-5184.arff. This dataset contains 54 attributes for various malicious and non-malicious files (4385). We used Python through the Jupyter Notebook interface to analyze this dataset.

2. Literature Survey:

There has been a lot of past work on malware analysis of different binary malware datasets using various machine learning algorithms. These techniques range from Support Vector Machines (SVM) to Neural Networks to Classification Trees. In order to produce more refined results than the prior projects, we try to refine our dataset by discarding unnecessary results from the output.

Algorithms:
A. Logistic Regression

Logistic Regression is used for categorical data. In the given malware dataset, the class is binary, taking the values 1 and 0, which represent malicious and normal files. The logistic function, also known as the sigmoid function, is an S-shaped curve that maps any real-valued number to a value between 0 and 1, but never exactly at those limits:

1 / (1 + e^(-value))

where e is the base of the natural logarithm. [18]

B. Decision Tree

The Decision Tree algorithm can also be used for classification of a dataset. It uses a tree representation in which each internal node represents an attribute while each leaf node represents a class label. Here the target attribute is the class, and the leaf nodes are malicious and non-malicious, depending on the other attributes.
It also helps us to determine the most important attribute in declaring a file as malicious or not.
Pseudocode:
While splitting the dataset into smaller subsets, the entropy changes. The resulting Information Gain is measured by the change in entropy, where entropy is given by the formula below:

Entropy(a, b) = -[p(a) * log(p(a)) + p(b) * log(p(b))]

where p(a) and p(b) are the proportions of instances of classes a and b in the subset. Entropy measures how mixed the class labels of a subset are; a split that lowers it yields a positive Information Gain. [18]

C. Support Vector Machine

The Support Vector Machine is another discriminative classification algorithm, formally defined by a separating hyperplane. The algorithm works by plotting the data in n-dimensional space. Hyperplanes are then used to distinguish between the different classes. Out of all the hyperplanes in the given space, the hyperplane that is at the maximum distance from both classes is chosen. This is a supervised learning model which outputs an optimal hyperplane. [18]

D. Random Forest

Random Forest is basically a combination of many decision trees, resulting in the formation of a forest. It gives more stable and accurate results. Instead of searching for the most important feature while splitting a node, it searches for the best feature among a subset of features. [18]

E. Static characteristics of PE files

The PE file format was introduced in Windows NT 3.1 as PE32 and further evolved into the PE32+ format for 64-bit Windows operating systems. The Common Object File Format (COFF) header, standard COFF fields such as the header, data directories, section table, and Import Address Table (IAT) are the main contents of a PE file.

A finite set of quantitative (integer, real or binary), categorical and labelled numerals can be derived from the different types of features. An example of a numerical feature is CPU (in %) or RAM (in megabytes) usage, while a nominal feature can be a file type (like *.dll or *.exe) or an Application Program Interface (API) function call (like write() or read()). [6]

On Windows NT operating systems, PE currently supports the IA-32, IA-64, x86-64 (AMD64/Intel 64), ARM and ARM64 instruction set architectures (ISAs). Prior to Windows 2000, Windows NT (and thus PE) supported the MIPS, Alpha, and PowerPC ISAs. Because PE is used on Windows CE, it continues to support several variants of the MIPS, ARM (including Thumb), and SuperH ISAs. [10]

IV. MODULES

The project implementation has six modules: dataset collection, data pre-processing, feature selection, model selection, a classifier model for predicting whether a file is malicious or normal, and comparison of accuracy across the Logistic Regression, Decision Tree, Random Forest and Support Vector Machine algorithms.

A. Dataset Collection

The collection of data is based on the different PE header attributes of the files present in the dataset, as these are strongly related to the probability of malware being involved in those files. The dataset can be collected from any online repository.

B. Data Pre-Processing

Pre-processing can take a considerable amount of time, as it requires the removal of NULL values in different attributes along with unreliable data present in the dataset. Data pre-processing includes cleaning, normalization, and transformation of attributes according to the criteria for the analysis of malware. In order to perform feature selection on particular test data, the data must be run through a Python script that performs PE header extraction.
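
The cleaning and normalization steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the script used in this work: the records are made-up values, the column names only mimic PE header fields, and min-max scaling is assumed as the normalization method.

```python
def preprocess(rows):
    """Drop all-empty columns, drop incomplete rows, then min-max scale to [0, 1]."""
    columns = list(rows[0].keys())
    # Keep only columns that hold at least one non-null value.
    kept = [c for c in columns if any(r[c] is not None for r in rows)]
    # Discard records that still have a missing value in a kept column.
    complete = [r for r in rows if all(r[c] is not None for c in kept)]
    scaled = [dict() for _ in complete]
    for c in kept:
        values = [float(r[c]) for r in complete]
        lo, hi = min(values), max(values)
        span = hi - lo
        for out, v in zip(scaled, values):
            # Constant columns carry no information; map them to 0.0.
            out[c] = (v - lo) / span if span else 0.0
    return scaled


# Hypothetical records: names resemble PE header fields, values are invented.
rows = [
    {"e_res": None, "ImageBase": 4194304, "SizeOfStackReserve": 1048576},
    {"e_res": None, "ImageBase": 65536, "SizeOfStackReserve": None},
    {"e_res": None, "ImageBase": 268435456, "SizeOfStackReserve": 262144},
]
clean = preprocess(rows)
print(len(clean), sorted(clean[0]))  # 2 ['ImageBase', 'SizeOfStackReserve']
```

The all-null column is removed first so that it does not cause every record to be discarded in the row-filtering step.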

C. Feature Selection

In machine learning and statistics, feature selection, also known as variable selection, feature reduction, attribute selection or variable subset selection, is the technique of selecting a subset of relevant features for building robust learning models. Feature selection is a particularly important step in analyzing data from many experimental techniques. Feature selection decreases the time needed to classify and makes classification more efficient.

D. Model Selection

The model is selected based on the accuracy and time consumption of various machine learning algorithms in predicting whether a given set of data is malicious or not. In this module the dataset is tested on various machine learning algorithms. The test dataset is used for analyzing the efficiency of the models, while the training dataset is used explicitly for measuring the accuracy of the algorithms. The models used in this module are classification techniques and thus produce a binary output of 0 or 1, for the absence or presence of malware respectively, for the files in a given binary dataset. The algorithms used are: Logistic Regression, Decision Tree, Random Forest, and Support Vector Machine.

E. Classifier Model

The data mining technique of classification is applied to the dataset. Classification is done with the purpose of identifying the category of a new observation on the basis of a training dataset containing observations with known properties. These properties can be ordinal, continuous, categorical or discrete. Some algorithms require continuous input, while others require discrete input for classification and analysis. A PCA Python library is used to reduce the total number of attributes from 54 to 40, by identifying that 14 attributes do not affect the detection of malware in the given dataset. This increases the efficiency of a model by reducing the processing time of the binary classification.

F. Accuracy Comparison (Performance Evaluation)

Classifier performance depends greatly on the characteristics of the data to be classified. Various empirical tests have been performed to compare classifier performance and to find the characteristics of data that determine classifier performance. In this module the comparison is done on the accuracy of the models. In the project, the models' performance was increased by the use of PCA, which eliminates unnecessary attributes from the dataset; the Random Forest algorithm emerged as the most accurate model.

IV. IMPLEMENTATION

A. Data Preprocessing

We selected the ClaMP_Raw-5184 dataset for classification of PE files and for comparing the models. We also developed a script using PEV to extract the header values on which the model was trained, in order to further test the models. The dataset had null values for the e_res and e_res2 header values, which needed to be removed. The dataset was left with 54 features to train on, and the script was altered accordingly to get the same 54 header values from any executable. The dataset was split into train (80% of total data) and test (20% of total data) sets. There was a total of 4386 records in the train data, with an equal number of malicious and non-malicious records. The test data had 800 records, with equal numbers of malicious and non-malicious file records.

B. Feature Selection

Feature selection is a significant part of analyzing the data and achieving the best results using the minimum number of features. It also helps to prevent overfitting. Reducing the data being used can help the model work faster by lowering the number of dimensions. For this we calculated the eigenvalues for the 54 features and identified the features with the highest eigenvalues, i.e. those with eigenvalues greater than 1. This came out to be 37. We then used PCA to reduce the number of components to 37, which selects the features based on feature importance. The data was then used to train all the models and compare the accuracy [9].

C. Model Training

Classification models were chosen for this purpose. The models were trained for binary classification of files using the portable executable file's header values. The models compared for accuracy on the given dataset are logistic regression, decision tree classifier, SVM and random forest. The models were compared with each other on the accuracy achieved with the test data. SVM gave the lowest accuracy, followed by logistic regression, while decision tree and random forest had almost similar scores. Random forest was used for classification because of its bagging approach, which improves the results by splitting the data into subsets that are sent to different nodes. The decision tree models are trained on those subsets of features. Results from those trees are then averaged to give a prediction from all decision trees.

D. Classification of dataset

Malware detection in portable executable files is done using classification techniques. Classification is a data mining technique used to assign objects to separable classes already defined in the dataset. For the purpose of malware detection in PE files, the dataset has two classes, 0 (non-malicious) and 1 (malicious). Binary classification is done on the data to find whether the header values of a file indicate it to be malware or not. Values like the image base and the size of stack reserve from the file header are used for classifying the file as malicious or clean.
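
As an illustration of the evaluation setup, the sketch below shows one way to make an 80/20 split that keeps the two classes balanced, together with the accuracy measure used to compare models. It is a sketch in plain Python under stated assumptions: the helper names stratified_split and accuracy, the seed, and the toy labels are hypothetical and do not reproduce the experiment.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class gives the same share to the test set."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(round(len(idx) * test_fraction))
        test_idx.extend(idx[:cut])
        train_idx.extend(idx[cut:])
    return train_idx, test_idx


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels only: 50 clean (0) and 50 malicious (1) placeholder records.
labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))            # 80 20
print(sum(labels[i] for i in test_idx))         # 10
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))     # 0.75
```

Splitting per class keeps the share of malicious and clean records the same in both parts, which matches the balanced train and test sets described under Data Preprocessing.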

E. Malware detection using ensemble learning

The goal of using a bagging technique (ensemble model) for classification is to combine the results of multiple models and get a combined result with better accuracy. Bagging uses bootstrapping to train multiple models and then obtains the final model from their collective result. Multiple subsets are created from the original data by sampling with replacement. A base model is specified and is then trained on each of these subsets independently.
Random forest is one such model, which uses the bagging technique for training a classification model. Bootstrapping is done to create the subsets from the data. The base model in the case of random forest is the decision tree, which here splits the records into two categories (clean and malicious) based on one feature; these are further split into two based on the next feature, selected in order of feature importance. This continues until all the leaf nodes are classified (contain the result malicious/non-malicious). The results of the decision trees trained on the different subsets are averaged to finally produce the output. In this system all malware is labelled as malicious while the rest are labelled as normal files.

V. RESULTS AND COMPARISON

Files are classified as malicious or not using different machine learning algorithms. The numbers of correctly and incorrectly classified instances are counted, and the classification rate is measured for every model.

MACHINE LEARNING MODEL | CORRECTLY CLASSIFIED | INCORRECTLY CLASSIFIED | CLASSIFICATION RATE
Logistic Regression | 3903 | 482 | 89%
Decision Tree | 4210 | 175 | 96%
Support Vector Machine | 2850 | 1535 | 65%
Random Forest | 4289 | 96 | 97.8%

Table III gives the classification rate for the various models, which varies from 65% to 97.8%. The model which gives the best result is the Random Forest algorithm, with a classification rate of 97.8%.

MACHINE LEARNING MODEL | TEST DATASET ACCURACY | TRAIN DATASET ACCURACY
Logistic Regression | 0.89 | 0.89
Decision Tree | 1.00 | 0.96
Support Vector Machine | 1.00 | 0.65
Random Forest | 1.00 | 0.978

Table IV gives the accuracy of the various models on a predefined test dataset along with the actual dataset. The accuracy varies from 89% to 100% for the test dataset and from 65% to 97.8% for the train dataset.

The best feature for detecting whether a file is malicious or not is IMAGEBASE, as depicted by the above graph. The graph also conveys that 14 attributes in the given dataset do not affect the malicious nature of the files, so their exclusion can help to speed up the measurement of the accuracy of the various models. Adding this feature extraction constraint helps to increase the accuracy of the prediction for the given malware detection.

The graph below shows the comparison of the different machine learning algorithms on the test and train datasets.

The graph gives the following predictions:

1. Random Forest > Decision Tree > Logistic Regression > SVM
2. The Random Forest algorithm is the best as compared to the other algorithms (97.9974968711% accuracy).

Despite using PCA in both the random forest and decision tree approaches, we are able to identify that even with fewer attributes random forest is the better approach for analysis due to its higher accuracy. According to the outputs obtained, the best algorithm can now be implemented as part of a bigger neural network, which can increase the prediction accuracy of the neural network substantially.

The Random Forest obtained from the experimentation is given below:

VI. CONCLUSION

The project is based on malware detection and analysis of the accuracy of different machine learning models. Malware analysis is a multi-step process providing insight into malware structure and functionality. Behavior monitoring, an important step in the analysis process, is used to observe malware interactions with the system and is achieved by employing dynamic coarse-grained binary instrumentation on the target system. For this project we used logistic regression, decision tree, random forest and support vector machine approaches to analyze which model gives the best detection rate for malware in the given binary dataset. Out of all the models, the best machine learning algorithm is the Random Forest approach. Future work on the project includes implementing the models for a multi-class dataset and using other approaches similar to PCA to reduce the time spent due to calculation overload on the various models.

REFERENCES

[1] Distler, D., “Malware Analysis: An Introduction”, 2007.

[2] Ari N, “Penerapan Analisa Malware Pada Biscuit apt1 Menggunakan Teknik Reverse Engineering”, Journal of KNSI, February 2015.

[3] Flores, “Malware Reverse Engineering part1 of 2. Static analysis”, Technical Report.

[4] Daoud, E. Al, Jebril, I. H., & Zaqaibeh, B., “Computer Virus Strategies and Detection Methods”, Int. J. Open Problems Compt. Math., Vol. 1, No. 2, September 2008.

[5] Uppal, D., Mehra, V., & Verma, “Basic survey on Malware Analysis, Tools and Techniques”, International Journal on Computational Sciences & Applications (IJCSA), Vol. 4, No. 1, February 2014.

[6] Andrii Shalaginov, Sergii Banin, Ali Dehghantanha, Katrin Franke, “Machine Learning Aided Static Malware Analysis: A Survey and Tutorial”, arXiv:1808.01201v1 [cs.CR], 3 Aug 2018.

[7] M. Schultz, E. Eskin, F. Zadok, and S. Stolfo, “Data Mining Methods for Detection of New Malicious Executables”, Proceedings of 2001 IEEE Symposium on Security and Privacy.

[8] C. Bishop, “Pattern Recognition and Machine Learning”, New York: Springer, 2006.

[9] T. Subbulakshmi and A. F. Afroze, “Multiple learning based classifiers using layered approach and Feature Selection for attack detection”, 2013 IEEE International Conference ON Emerging Trends in Computing, Communication and Nanotechnology (ICECCN), Tirunelveli, 2013, pp. 308-314. doi: 10.1109/ICE-CCN.2013.6528514.

[10] Microsoft Documentations.

[11] Christodorescu M, Jha S., “Static analysis of executables to detect malicious patterns”, Proceedings of the 12th USENIX Security Symposium; 12–12; 2003.

[12] Bayer U, Kruegel C, Kirda E., “TTAnalyze: A tool for analyzing malware”, Proceedings of EICAR; April 2006.

[13] Vasudevan and R. Yerraballi, “SPiKE: Engineering malware analysis tools using unobtrusive binary-instrumentation”, Proceedings of the 29th Australasian Computer Science Conference, pages 311–320, 2006.

[14] Valli C, Brand M., “Malware Analysis Body of Knowledge”, Paper presented at the 6th Australian Digital Forensics Conference; Edith Cowan University; Mount Lawley Campus; Western Australia; 2008.

[15] Vinod P, Laxmi V, Gaur M S., “Survey on Malware Detection Methods”.

[16] Brand M, Valli C, Woodward A, “A Threat to cyber resilience: A malware rebirthing Botnet”, The Proceedings of the 2nd International Cyber Resilience Conference; 2011.

[17] Brand, M., “Analysis avoidance techniques of malicious software”, [Online]. Available: https://ro.ecu.edu.au/theses/138. [Accessed: 10-Oct-2018]

[18] Business Analytics, “Essentials of Machine Learning Algorithms (with Python and R Codes)”, [Online]. Available: https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/. [Accessed: 10-Oct-2018]