In Computer Science
By
Naveen Donepudi
December 2022
The thesis of Naveen Donepudi is approved:

_______________________________________        ____________
Dr. Robert McIlhenny                           Date

_______________________________________        ____________
Dr. Jeff Wiegley                               Date

_______________________________________        ____________
Dr. Wonjun Lee, Chair                          Date
Acknowledgement

I would like to thank my loving parents and my hard-working mentors, who supported me throughout the completion of this thesis. I also owe special thanks to my committee, a panel of professors from California State University, Northridge's College of Engineering and Computer Science. I was able to complete this thesis through my own hard work together with the diligent support of my teachers, who gave me the confidence to meet my goals over the course of the program. My mentors and friends remained committed to my goal and made sure that every part of the thesis could be completed without difficulty; they deserve a large share of the credit for its success. Finally, I thank my parents, who stood by me sincerely at all times. I was able to complete this thesis because of their efforts too. My parents are my true pillars of support.
Table of Contents

Acknowledgement ................................................................ iii
Abstract ...................................................................... viii
1.1. Introduction ................................................................ 1
2.3.6. Comparative Analysis of Research Papers .................................. 12
3.8. Algorithms ................................................................. 20
4.2. Eliminating Noisy Features ................................................. 29
4.3. Techniques for Cleaning the Data Used in Machine Learning .................. 31
References ...................................................................... 44

List of Figures
Abstract
It is highly important to detect whether malware is present in a file. As malware grows more prevalent, companies are losing important data and facing serious problems: malware can damage a system by slowing it down and can encrypt many files on a personal computer. This report presents comprehensive information about a versatile framework built on machine learning algorithms. These algorithms make it possible to detect malware because they can distinguish between clean and malicious files, with the aim of minimizing the number of false positives in the data. The report therefore presents the ideas behind the framework, which builds on a one-sided perceptron. For this work I applied a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). These algorithms are tested on medium-sized datasets of clean and malicious files, and all files are submitted to a scaling-up process so that large datasets of clean and malicious files can also be handled. The report also briefly reviews the research conducted over the last decade, which shows that malware poses a serious threat to files and causes major problems for organizations. Two further methods, the decision tree and the random forest, are also used for malware detection.

Keywords: Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), medium-sized dataset, decision tree, random forest.
Chapter 1: Introduction

1.1. Introduction
In today's networked world, every system is at risk of attack by malicious cyber actors. Attackers now have access to more powerful automated technologies, and new dangers surface virtually every second, which makes maintaining adequate defenses a constant challenge.

In today's digital world, one of the most important challenges is malicious software, and unfortunately the situation is only getting worse. Malware is software that is intentionally designed to damage a computer system or network for the purpose of espionage or financial gain. Malware has lately begun to target embedded computational platforms such as those found in IoT devices, medical equipment, and E&I systems. Malware today is characterized by its complexity, and many strains can change not only their code but also their behavior in order to avoid detection. Because there is such a wide variety of threats, it is no longer enough to rely on protections that are based on signatures; a more complete toolbox is necessary.
Malware families share common traits that can be observed across their many manifestations, either statically or at run time, and these motifs can be used to isolate malware families. Static analysis refers to methods that examine a harmful file's contents without running it, whereas dynamic analysis considers the behavioral features of malicious files during activities such as information flow tracing, function call tracking, and dynamic binary fingerprinting. These static and behavioral artifacts can be employed to characterize contemporary malware, which allows for the identification of more complex attacks that would otherwise defy signature-based detection methods. When it comes to identifying newly published malware, often known as zero-day malware, strategies based on machine learning outperform those based on signatures. In addition, deep learning algorithms, which perform feature engineering implicitly, may further improve feature extraction and representation.
The state of cyber security today calls for an ongoing process of collecting and comparing millions of data points across all of the people and infrastructure that make up a firm. It is quite clear that relying on humans alone will not be sufficient in this context; the assistance of machine learning is needed to see patterns and anticipate potential security concerns in large data sets. In this thesis, I will discuss the significance of machine learning (ML) and demonstrate its application to malware detection.
1.2. Problem Statement
As demand for the internet has increased, so has the risk of cyberattack, and many organizations and individuals now face problems with malware files on their systems. These malicious files make it extremely difficult to protect important data on personal computer systems. Many techniques exist for detecting malware, but some traditional techniques cannot take the proper actions to detect it, so malware that appears in varied forms can enter a system undetected. In other words, traditional malware detectors fail to detect such malware. Given these problems, there is a need for a detector that can reliably identify any malware present in a system. Machine learning algorithms make this possible, and in the future they will be increasingly helpful for detecting malware. Many kinds of machine learning algorithms are in use; for this research, only a few important ones will be applied. Among these are the convolutional neural network and the recurrent neural network; two further important algorithms are the decision tree and the random forest. All of these algorithms will be tested on the chosen data to measure the accuracy of each algorithm in detecting malware.
Aim

The main aim of this research is to detect malware by using various machine learning algorithms.

Objectives:

The objective of this research is to analyse different malware detection machine learning methods in detail:

• To explain and analyse the use of the Convolutional Neural Network and the Recurrent Neural Network for malware detection.
1.4. Background
In the past, malware was detected using standard signature-based techniques, which make it difficult to detect every kind of malware, because a malware application may contain multiple polymorphic layers. These layers make detection difficult, and such malware can also use side mechanisms to automatically update itself to a new version in a short time. Malware of this kind was extremely difficult for any antivirus software to detect, so the classic methods of malware detection are no longer good enough.
1.5. Research Question

How can malware be detected effectively using a Convolutional Neural Network and a Recurrent Neural Network?
1.6. Research Framework
Every company and organization is now linked to the internet, and internet use has grown with the accessibility of smartphones and personal computers. The internet also supports convenient applications across industries such as marketing, transactions, automation, and development. At the same time, the rate of cybercrime has increased because of weak internet security and the miscellaneous software applications companies use, so cyber-attacks on systems have risen. Malware detection software plays an important role in overcoming this problem: it helps a system stay away from suspicious sites so that cyber-attacks can be avoided, and it prevents the PC from sharing information that contains a virus. This framework therefore begins with an understanding of how such detection works before building on it.
1.7. Research Significance
The significance of this research is that the various perspectives linked with behavioural parameters must be considered by the chosen malware detection method. For this purpose, it is important to identify the file attached to a source destination. The researchers D'Angelo and Palmieri (2019) presented important results on malware detection through the systematic collection of information about virus attacks, which makes it simpler to characterize an attack with various techniques. In addition, data gathered through statistical observation has produced a rich structure of normal profiles for selected processes in computer systems. The description of the selected data is induced through the process of security determination and is used for identifying profiles.
1.8. Summary
Malware detection is extremely important for organizations because it enables them to identify viruses present in their data. Many malware detection techniques exist; this report will detect malware using several machine learning algorithms, namely the convolutional neural network, the recurrent neural network, the decision tree, and the random forest.
Chapter 2: Literature Review

2.1. Introduction
On the internet, malware and malicious code are among the most important security problems; millions of websites launch drive-by download exploits against vulnerable hosts (Kolbitsch et al., 2009). An exploit against a vulnerable host, the victim machine, is typically used to download and execute malware programs. Often these programs are bots that join forces and turn into a botnet, which miscreants then use to launch denial-of-service (DoS) attacks, send spam emails, and host scam pages.
Mitigation and detection capabilities permit one to remove malicious code running on a system. One class of detection models is network-based: such models often rely on manually crafted signatures fed into intrusion detection systems or bot identifiers, while others are derived from the analysis of network traffic; for example, a few frameworks attempt to distinguish bots by searching for similar connection patterns. Though network-based detectors are helpful in practice, they suffer from various limitations. First, a malware program has numerous ways to hide its communication, so such detectors cannot observe the activity of a malicious program directly but must infer it from traffic.

Host-based malware detectors have the advantage that they can observe the complete set of actions a malware program performs, and it is even possible to identify malicious code before it executes. Unfortunately, current host-based detection approaches have significant shortcomings. A major problem is that many techniques rely on ineffective models that do not capture the intrinsic properties of a malicious program, depending instead on actions that are merely artifacts of a specific malware instance. Also, the analysis is costly, and therefore not suitable for replacing AV scanners. An alternative is to model malware behavior directly: a few frameworks depend on such generic designs. Although the detection results are promising, these frameworks are useful in specialized settings (such as at antivirus organizations or by security analysts), yet they are too inefficient to be deployed as detectors on end hosts (Li & Stolfo, 2007).
According to Dan Lo et al. (2016), effective and efficient detection of malware must contend with different kinds of cyber-attacks, such as botnets, DoS attacks, malware, and search poisoning. Cyber-attacks have grown exponentially, compromising stolen information and damaging critical structures, and producing losses averaging $345,000 per incident. The growth of the internet is a main reason malware proliferation has increased and the development of new malicious programs has become easier: 317 million new pieces of malware were created in the last year alone, which means roughly one million new threats were released every day. Malware is a growing trend and remains one of the greatest security threats faced by computer users. There is therefore a need for automatic malware detection and classification, which has led to tools such as the Norman Sandbox and CWSandbox (Dan Lo et al., 2016).
According to Shin et al. (2012), effective and efficient bot malware detection is possible. For detecting bots, many malware detection approaches have been proposed at the host level and the network level, and both kinds of approaches have clear advantages and disadvantages. A cooperative host-network detection framework attempts to overcome the shortcomings of both approaches while keeping their advantages, and it is effective and efficient because it is based on the intrinsic characteristics of bots. Because of these attributes, the authors locate a few helpful features at the host and system level. For efficiency, the authors achieve lightweight monitoring of connections: they can filter out the large majority of benign programs and focus on the few suspicious programs that contact DNS servers and start connections.
According to Samantray and Tripathy (2018), malware may differ in its method of propagation and infection but can produce similar symptoms. Computers infected by malware may exhibit symptoms that vary depending on the kind of malware, its creator, and its concealment. Typical symptoms of an infected PC include a noticeably slower computer, problems establishing connectivity with a network, crashes, modification of documents, the random appearance of strange files and programs, programs running or reconfiguring themselves on their own, and the unauthorized termination of processes.
In the realm of cyber security, the analysis of harmful software is among the top priorities. Malware analysts investigate and analyze dangerous software in order to better protect against it in the future. This includes identifying the malware's characteristics and how it operates, as well as assessing the virus's effect. "Practical Malware Analysis" outlines the two primary methods for dealing with malware: static analysis and dynamic analysis. Static analysis attempts to determine a program's function without actually running the program itself, by importing the program's executable code into a disassembler to see what it is accomplishing. This requires proficiency in assembly code and a solid understanding of program structure.
For dynamic analysis of malware, either the virus is observed in real time or the system is evaluated after execution has finished. More advanced analysis of malicious software makes use of a debugger, that is, any tool, digital or physical, that enables one program to monitor and explore the present state of another.

A signature can be a sequence of bytes at specific locations inside an executable, a regular expression, a hash value of binary data, or some other pattern created by a malware expert to precisely identify malware instances. Signature-based systems are generally used to supplement the dictionary-based methods described earlier. For unknown malware, or malware that frequently changes its behavior, dictionary-based methods fall short or produce many false positive and false negative detection results. Signature-based techniques can complement dictionary-based strategies, since signatures are extracted frequently and added through dedicated tools that obtain up-to-date signatures for the most recent malware samples. Behavior-based malware detection, in turn, works by attending to the behavior of malicious code: indicators include the files in which the code is embedded, the port through which it accesses a system, or statistical abnormalities in its activity, though such analysis may require more time in comparison with static or dictionary-based methods.
When an infection is found, IT staff often try to re-image the computer to better understand the malware and the ways it can be prevented. Although cost-effective, this may not be the best possible solution: some malicious software attaches directly to the system BIOS and can remain on the device even after re-imaging. A software company by the name of the "Hacking Team" developed a tool that attaches itself to a computer's UEFI and reinstalls itself even if the hard drive is wiped clean. Because of this, malware can be present even though the end user or IT staff is unaware of it, and some viruses can change the firmware when being installed, thereby evading detection by any anti-virus software. Re-imaging may be cost-effective in the short term but does not work well in the long term, as it can expose organizations to fines for not being HIPAA compliant. For malware detection and prevention in an organization, proper documentation is necessary so that the same incident does not happen to anyone else, and so that if it does happen, staff know what measures to take. One way to combat malware is for software companies to provide bounties for finding software flaws; the drawback is that software development takes more time than finding bugs (Kuntz et al., 2017).
6. 2009, "N-Grams based file signatures for malware detection". Method: KNN algorithm. Limitation: a good detection ratio is achieved only for higher values of N.

7. 2008, "Learning and Classification of Malware Behavior". Method: Support Vector Machine. Limitation: relies on a single program execution of a malware binary.

8. 2006, "Learning to Detect and Classify Malicious Executables in the Wild". Methods: Naive Bayes, Support Vector Machine. Limitation: the relative performance of the methods used in this paper was not as good as the previous one.

9. 2008, "Metamorphic Virus: Analysis and Detection". Method: random decryption algorithm (RDA). Limitation: some viruses cannot be detected even in an emulated environment.

10. 2006, "Machine Learning for Computer Security". Method: adaptive statistical compression algorithms. Limitation: the adversary can defeat the computer that learns how to extract signatures for detecting computer worms.

11. 2019, "Malware Detection using Machine Learning and Deep Learning". Methods: random forest and KNN. Limitation: does not apply any recurrent neural networks for malware detection.

12. 2017, "Malware detection using Machine Learning Algorithms". Methods: SVM, Decision Tree, Naive Bayes and Multi-Naive Bayes algorithm. Limitation: uses the traditional signature-based method.

13. 2017, "Malware Detection and Evasion with Machine Learning Techniques: A Survey". Methods: heuristic, artificial intelligence, behavior, and signature-based methods. Limitation: uses traditional methods for malware detection.

14. 2012, "Malware Detection Module using Machine Learning Algorithms to Assist in Centralized Security in Enterprise Networks". Methods: Decision Tree, Random Forest, Naive Bayes. Limitation: some machine learning methods are not appropriate due to heavy processing demands.
Chapter 3: Methodology
3.1. Introduction
For detecting malware, a machine learning method with a one-sided perceptron will be used. It will be applied to the chosen dataset to detect malware among the different files present on the systems.

The methodology applies various machine learning algorithms to address the problem of malware residing in files on computer systems. These algorithms are applied by designing a database according to the dataset; after this, analysis and design are carried out.
Although "malware" and "virus" are today almost universally thought to be identical, this was not always the case. The word "malware" refers to a broad range of dangerous programs, of which viruses are only one part. Computer viruses are the most common form of malware: harmful programs designed to secretly enter a computer system or network and then damage it once they have gained access. Viruses were previously better known than more recent threats such as Trojan horses, keyloggers, and worms. In contrast to viruses, such malicious software is not designed to replicate itself but rather to accomplish one or more predetermined goals. Malware is now used as a catch-all term for all forms of malicious software, including malvertising (advertising that is used to spread malicious code). Because of the severe repercussions that malware may have, several measures have been developed to curb its propagation. Each strategy has advantages and drawbacks depending on how it is implemented. The most successful strategies are hybrid ones, which combine elements of many different techniques; however, this is not always a realistic expectation. In this section, we present a brief summary of the various anti-malware approaches currently in use and explain why we believe machine learning to be the most effective course of action.
The AMSE (antimalware service executable), which protects PCs running Microsoft operating systems by default, also goes by the name Windows Defender. The AMSE scans every executable file on a system, and antimalware services rely on AMSE files to carry out their functions. AMSE files may either host malware so it can be investigated in action or block it from infecting the host system, and antimalware software will often kick off the AMSE procedure upon system startup. More generally, such software protects information technology (IT) infrastructure and personal computers (PCs) from malicious software: in order to protect users against malicious software, detect it, and remove it, antimalware software performs a range of scanning and removal tasks.
In the past, signature-based approaches were considered the gold standard for determining whether or not software was harmful. A signature-based system uses a database specifically designed to hold the unique identifiers of known harmful applications, and the findings of each signature analysis performed by a malware analyst are added to this database. This approach has significant limitations. One is that contemporary malware may exhibit polymorphism, which leads it to vary its signature. The most significant limitation is time, an essential factor in malware research, since each hour following the propagation of a virus may mean hundreds of additional compromised machines. Methods that rely on signatures can often detect user-initiated attacks, but they struggle against previously unseen threats.
Behavior-based methodologies allow malware to be discovered in an interactive fashion. Identifying atypical actions requires rule-based algorithms: if a file performs a predetermined set of activities, such as looking for a sandbox or virtual machine to run in, or disabling security protections, then it is considered malicious. Examining the file takes much more time and resources than signature-based approaches, which may affect the performance of the end user's computer. This method is helpful for spotting improved variants of existing malware that follow comparable processes or belong to the same family; however, it is ineffective against genuine zero-day malware that does not resemble any known family.
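As a minimal illustration of the rule-based checks just described, the sketch below flags a sample when it exhibits a threshold number of known-suspicious behaviors. The behavior names and the threshold are hypothetical examples, not taken from any particular product.

```python
# Illustrative rule-based behavior check. Behavior names are hypothetical.
SUSPICIOUS_BEHAVIORS = {
    "queries_vm_artifacts",      # looks for a sandbox/virtual machine
    "disables_security_tools",   # turns off security protections
    "modifies_autorun_keys",     # persistence via registry autorun
}

def is_suspicious(observed_behaviors, threshold=2):
    """Flag a sample if it exhibits at least `threshold` known-bad behaviors."""
    hits = SUSPICIOUS_BEHAVIORS & set(observed_behaviors)
    return len(hits) >= threshold, sorted(hits)

flagged, hits = is_suspicious(
    ["queries_vm_artifacts", "disables_security_tools", "writes_temp_file"]
)
print(flagged, hits)  # True ['disables_security_tools', 'queries_vm_artifacts']
```

The rule set is exactly what an attacker evades by inventing a behavior not on the list, which is the zero-day weakness noted above.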
Under a whitelisting approach, the execution of all programs is prevented except for those expressly allowed by the system administrator. Although from a privacy standpoint it appears safe, its availability is subject to considerable constraints. This strategy suits organizations that limit employee access to certain software; however, it is not suited to end consumers, who often use software that is safe but not "trustworthy" enough. A further cause for worry arises when a reliable piece of software is discovered to have a security hole, since whitelisted software can itself be exploited.
Machine learning is the study of using algorithms for data analysis and pattern detection, and of using those patterns for subsequent prediction and decision making on data samples. Because they rely on rules and the advice of experts, behavior-based solutions may not help when the malware in question uses an unexpected method or comes from an unrelated family. Machine learning has the advantage of being able to detect zero-day malware. This is accomplished by analyzing a large number of benign and malicious files, after which the algorithm learns the pattern that differentiates the two types of files. In this scenario, the involvement of professionals is restricted to the selection of the most important features; after this, the machine replicates the specialists' work. If we look at PE files, we can essentially categorize ML approaches into three groups: static, dynamic, and hybrid.

Static approaches extract features without running the executable file. According to [4], there are some people who believe that
static features aren't all that useful. However, we believe that these features are well suited to our purpose, because our primary objective is to identify malicious code, and fundamental static analysis can determine whether or not a file poses a threat. A great number of useful properties can be extracted with relative ease and at low cost. The first component is the PE header, which is outlined in full in the aforementioned paper [8]. Its data directories include references to a variety of tables, including export and import tables, resource tables, exception tables, debugging statistics, certificate tables, and relocation tables. Beyond the immediately apparent machine type, the header also exposes the section count, symbol count, code size, initialized data size, entry point position, and more. One study used entropy to measure how unpredictable the code was, since obfuscation is applied both to protect intellectual property such as software and "to make malware more difficult to interpret." In [10], the authors made use of suggestive strings such as URLs, IP addresses, names of special file systems, and names of items in the Windows registry. This wide variety of easily available attributes makes static features attractive in comparison with other approaches.
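The entropy measurement mentioned above can be sketched as follows. This computes the Shannon entropy of a byte string in bits per byte; it is an illustrative calculation, not the cited study's exact procedure.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0).
    High entropy over a PE section often indicates packing or encryption."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(byte_entropy(b"\x00" * 64))                  # uniform run: 0.0
print(byte_entropy(bytes(range(256))))             # every byte value once: 8.0
```

A section whose entropy approaches 8.0 bits per byte is nearly indistinguishable from random data, which is why entropy serves as a cheap static indicator of packing.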
The authors of [11] conducted trials with eight distinct machine learning algorithms using Windows API calls, with the SVM (normalized poly kernel) approach proving the most effective. Dynamic analysis requires more work and time in comparison with static models and calls for a virtual environment in order to execute the file accurately. Several studies [12] have made use of a broad array of dynamic features, such as API calls, system calls, instruction traces, registry changes, and writes to memory, among many others; the research in [12] was conducted with the intention of assessing the value of static feature models, dynamic feature models, and hybrid models. Opcode sequences and API call invocations are the two primary components that have been used for training Hidden Markov Models. Either one may be determined statically, by analyzing the code as a whole while a program trace is performed, or dynamically, by amassing data on the actual execution path taken. IDA Pro, a combined disassembler and debugger, was used to create the assembly file from which the opcode sequences and API calls can be retrieved. After the operands, labels, directives, and other components of the opcode sequences have been eliminated, all that remains is the mnemonic opcode sequence, such as "call, push, call, add". For API calls, the authors compiled a list of API call names and removed the parameters.
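As a rough sketch of the opcode preprocessing described above, the function below reduces assembly-style lines to bare mnemonics. The line format and the directive list are simplifying assumptions for illustration, not IDA Pro's actual output format.

```python
# Hypothetical sketch: keep only mnemonic opcodes from disassembly lines,
# discarding comments, labels, directives, and operands.
def extract_mnemonics(asm_lines):
    DIRECTIVES = {"db", "dd", "dw", "align", "proc", "endp", "segment"}
    mnemonics = []
    for line in asm_lines:
        line = line.split(";")[0].strip()       # drop trailing comments
        if not line or line.endswith(":"):      # skip blanks and labels
            continue
        op = line.split()[0].lower()
        if op in DIRECTIVES:                    # skip assembler directives
            continue
        mnemonics.append(op)
    return mnemonics

listing = [
    "start:",
    "push ebp            ; save frame",
    "mov  ebp, esp",
    "call sub_401000",
    "add  esp, 8",
]
print(extract_mnemonics(listing))   # ['push', 'mov', 'call', 'add']
```

The resulting mnemonic stream is exactly the kind of symbol sequence that a Hidden Markov Model, or an RNN, can be trained on.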
Neural networks are a kind of ML algorithm in widespread use, with applications ranging from Arabic handwriting recognition [15] to the protection of wireless ad-hoc networks [14]. One of the most recent studies, released on October 25, 2017, is titled "Malware Detection by Eating a Whole EXE": a featureless model leveraging an end-to-end deep learning neural network. The approach earned this name because it relies on raw byte sequences rather than static or dynamic features, in effect detecting malicious software by consuming the whole executable. The authors employed a massive amount of data, from a minimum of around 0.5M up to 2,011,786 binary samples, of which more than half are examples of malicious software. The model's emphasis on bytes as separate units within a stream led to the first network architecture able to process raw byte sequences of more than two million steps. The major purpose of the investigation was to see how far the problem could be solved without using any specific domain knowledge. The experiment achieved 94% accuracy and a 98.1% AUC. Because of the significant memory demands of training, this strategy's computational requirements are one of its most significant drawbacks.
Model training on those two million observations takes around two months to complete.
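The raw-byte representation described above can be sketched as a simple tokenization step: every executable is truncated or zero-padded to a fixed length, and each byte becomes one integer token. The tiny length and the padding value of 256 here are illustrative choices, not the paper's exact configuration.

```python
# Illustrative raw-byte tokenization for a whole-EXE model. Values 0-255 are
# literal bytes; 256 is assumed here as a distinct padding token.
def bytes_to_tokens(data: bytes, max_len: int, pad_token: int = 256):
    tokens = list(data[:max_len])               # truncate to max_len bytes
    tokens += [pad_token] * (max_len - len(tokens))  # pad short files
    return tokens

sample = b"MZ\x90\x00"                          # start of a PE file
print(bytes_to_tokens(sample, 8))  # [77, 90, 144, 0, 256, 256, 256, 256]
```

Using a separate padding token (rather than reusing byte value 0) lets the network distinguish genuine zero bytes from absent data.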
3.8. Algorithms:
RNN Algorithm:
The RNN algorithm is used to detect malware because it is able to identify patterns in
data that may be indicative of malware. This can be done by training the algorithm on a
dataset of known malware samples and then using it to analyze new data. If the algorithm
detects patterns that are similar to those in the training data, it can be assumed that the new
By analyzing a large dataset of known malware, the algorithm can learn to recognize
the characteristics that are often associated with malicious software. This allows it to flag
new, unknown programs that exhibit similar behavior, which may be indicative of malware.
An RNN can be configured to detect several different behaviors. Common examples include
unusual system activity, unexpected network traffic, and unusual file access patterns.
RNNs can detect these behaviors by looking for patterns in the data that indicate unusual
activity. For example, if there is a sudden increase in network traffic, this could be indicative
of an attack. Alternatively, unusual file access patterns could be indicative of malware tampering with the file system.
One way to incorporate these three types of behaviors into an RNN model is to create
separate RNNs for each behavior. Each RNN would be trained on data that represents that
behavior. For example, the RNN for detecting unusual system activity could be trained on
data that includes system logs and activity data. The RNN for detecting unusual network
traffic could be trained on data that includes network traffic data. And the RNN for detecting
unusual file access patterns could be trained on data that includes file access data.
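As a sketch of the core idea (not the thesis's actual implementation), a single recurrent cell can be rolled over a sequence of logged events to produce a malware score. The helper name `rnn_score` and the random weights, standing in for trained parameters, are illustrative assumptions:

```python
import numpy as np

def rnn_score(event_seq, W_x, W_h, w_out, b=0.0):
    """Run a single-layer recurrent cell over a sequence of event
    feature vectors and return a malware score in (0, 1)."""
    h = np.zeros(W_h.shape[0])
    for x in event_seq:                  # one step per logged event
        h = np.tanh(W_x @ x + W_h @ h)   # hidden state carries history
    logit = w_out @ h + b
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid -> probability

# Toy dimensions: 4-dim event features, 8-dim hidden state.
rng = np.random.default_rng(0)
W_x = rng.normal(size=(8, 4)) * 0.5
W_h = rng.normal(size=(8, 8)) * 0.5
w_out = rng.normal(size=8)

seq = rng.normal(size=(10, 4))           # 10 events, e.g. system-log features
score = rnn_score(seq, W_x, W_h, w_out)
```

In a real detector the weights would be learned from labeled traces, and one such cell could be trained per behavior category (system activity, network traffic, file access) as described above.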
CNN Algorithm:
CNNs are becoming increasingly popular for malware detection because they can learn discriminative features automatically; malware is constantly changing and evolving, and manual feature engineering may not be able to keep up.
Furthermore, CNNs are capable of processing images and other representations of malware,
which may provide more information than traditional methods of malware analysis.
Detecting malware using CNN algorithms can be done in several ways. First, the CNN
algorithm can be used to analyze the behavior of a system, such as system calls, network
traffic, and other system events, to detect anomalies that could indicate the presence of
malware. Second, the CNN algorithm can be used to examine the content of files and
applications to detect suspicious patterns that could indicate the presence of malware. Third,
the CNN algorithm can be used to analyze the patterns of malicious activity over time to
detect new, previously unknown threats. Finally, the CNN algorithm can be used to compare previously known malware signatures against newly detected samples and measure how closely they match.
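One common way to let a CNN process malware "as an image," as mentioned above, is to render a binary's raw bytes as a 2-D grayscale array. A minimal sketch; the helper name `bytes_to_image` and the row width of 16 are illustrative choices, not part of the thesis pipeline:

```python
import numpy as np

def bytes_to_image(raw: bytes, width: int = 16) -> np.ndarray:
    """Reshape a binary's raw bytes into a 2-D grayscale array,
    zero-padding the last row, so a CNN can treat it as an image."""
    buf = np.frombuffer(raw, dtype=np.uint8)
    rows = -(-len(buf) // width)            # ceiling division
    padded = np.zeros(rows * width, dtype=np.uint8)
    padded[: len(buf)] = buf
    return padded.reshape(rows, width)

img = bytes_to_image(bytes(range(40)), width=16)   # 40 bytes -> 3 x 16 image
```

Structurally similar malware families tend to produce visually similar byte images, which is what the convolutional filters then pick up on.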
Decision Tree:
Decision trees can be used to detect malware files. The goal of a decision tree is to identify
the presence of malicious code in a file. The process starts with a set of labeled data. This data
is used to create a decision tree that can be used to classify new data points. The decision tree
can be used to make predictions about the presence of malicious code in a file. The decision
tree can also be used to identify areas of the code that are more likely to contain malicious
code.
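The process just described can be sketched with scikit-learn in a few lines. The three header-style features and their values below are made up purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy, made-up feature rows: [file_size_kb, num_imports, entropy].
X = [[120, 30, 4.1], [95, 25, 3.8], [400, 5, 7.6], [350, 3, 7.9]]
y = [0, 0, 1, 1]                       # 0 = benign, 1 = malware

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
pred = clf.predict([[380, 4, 7.7]])    # high-entropy, few imports -> likely packed
```

The fitted tree's split thresholds also show which features (here, most likely entropy) separate malicious from benign samples, which is how a tree can point at the "areas more likely to contain malicious code."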
Random Forest:
Random Forest is a powerful machine learning algorithm that can be used for malware
detection. The algorithm works by training a set of decision trees on a set of labeled malware
samples. Each tree will learn to detect different characteristics of a malware sample, such as
code size, system calls, and code complexity. Once the trees have been trained, they can be
used to detect new samples of malware. This can be done by feeding new samples into the
Random Forest and observing how the trees react. If the reaction is consistent with a known malware profile, the sample is flagged as malicious.
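A toy sketch of that voting procedure with scikit-learn; the feature columns (code size, system calls, complexity) mirror the text, but the values are invented for illustration:

```python
from sklearn.ensemble import RandomForestClassifier

# Made-up rows: [code_size_kb, num_system_calls, cyclomatic_complexity].
X = [[50, 12, 3], [60, 10, 4], [55, 14, 2],           # benign samples
     [300, 220, 40], [280, 200, 35], [310, 240, 42]]  # malware samples
y = [0, 0, 0, 1, 1, 1]

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
votes = forest.predict([[290, 210, 38]])   # each tree votes; majority wins
```

Because each tree sees a different bootstrap sample of the data, the ensemble's majority vote is typically more robust than any single decision tree.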
The use of models that are trained by machine learning enables specialists in the field
of cybersecurity to rapidly detect threats and categorize them for further investigation.
Machine learning evaluates clusters of requests or traffic that share similar
characteristics; this technique may be used to find anomalies in the network. By continually
analyzing data, machine learning algorithms may detect malware in transit and proactively
block actions identified as potentially harmful.
The appropriate machine learning model has the ability to recognize previously
unknown forms of malware that are attempting to run on endpoints. It is able to detect
unknown dangerous files and events by comparing them to the features and behaviors of
known malware. Methods of machine learning include dimensionality reduction (decreasing a
high number of dimensions to a smaller number), clustering (grouping data with similar
characteristics), and statistical sampling. These methods also help establish statistical
baselines, which shed light on which kinds of behavior are considered "normal" and which are
considered "abnormal." Such baselines make it possible to identify data anomalies and
investigate them.
Malware is a kind of malicious software that, without the knowledge or consent of the
system's user, compromises the system's security, integrity, or functionality in order to fulfill
an objective that is harmful to the attacker. Malware may take many forms and some common
examples include viruses, worms, trojan horses, rootkits, backdoors, botnets, spyware, and
adware. Defenders have traditionally relied on anti-virus software in order to recognize
previously encountered dangers and inhibit the execution of such threats.
The anti-virus software makes use of a database that is based on signatures in order to
detect malicious software. Every piece of known malware is assigned a signature that is
entirely unique to it. Before a particular piece of malicious software can be
identified by the program, it is necessary to create a signature for that virus. Each piece of
malware has its own unique signature, which consists of a very short string of bytes. This
signature may be used to identify the piece of malware. After the files have been analyzed by
the anti-virus software, a signature is generated, and the system checks to see whether that
signature is already present in the database. It is determined to be harmful when the signature
of a virus is discovered inside a file. This approach correctly detects previously discovered
malware, but it will be unable to detect any new or previously unknown malware, since its
signature will not be stored in the database. In addition, even when using known malware,
attackers may use a variety of strategies to prevent detection by security measures such as
firewalls, gateways, and antivirus software. These strategies may include obfuscation,
polymorphism, and encryption, among others. The insertion of dead code, the remapping of
code, and combining many obfuscation tactics into a single one are all examples of common
obfuscation strategies [1]. These safeguards might be readily circumvented if the malicious
actors who created them just created variants of the original virus. According to [17],
hundreds of new dangerous samples are released every day, which makes it impractical for
signature databases to keep pace with new malware. A signature-dependent approach is
likewise unlikely to protect against zero-day assaults such as novel worms [22].
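The signature lookup described above can be sketched with a hash database. The `SIGNATURES` set and the helper `is_known_malware` are hypothetical, but the one-byte-variant failure mode is exactly the weakness discussed:

```python
import hashlib

# Hypothetical signature database: MD5 digests of known-bad files.
SIGNATURES = {
    hashlib.md5(b"known malicious payload").hexdigest(),
}

def is_known_malware(contents: bytes) -> bool:
    """Signature check: hash the file and look it up in the database.
    A single flipped byte (polymorphism) already defeats this."""
    return hashlib.md5(contents).hexdigest() in SIGNATURES

hit = is_known_malware(b"known malicious payload")       # exact match
miss = is_known_malware(b"known malicious payload!")     # one-byte variant
```

The variant misses even though the file is functionally identical, which is why obfuscation and polymorphism evade signature-based anti-virus so easily.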
Therefore, in order to solve this issue, both static and dynamic analytic approaches are
applied. The latter of these may detect a novel angle on an existing threat, which the former
did not. The classification of unknown malware into preexisting families is accomplished via
the use of analysis characteristics. In dynamic analysis, malicious code is executed whereas it
Finding patterns in data may be accomplished via the use of static analysis, and the
types of data that can be retrieved include strings, n-grams, byte sequences, opcodes, and call
graphs. Through the use of disassembler tools, it is possible to generate assembly instructions
for a Windows executable. Memory dumper programs are used to extract protected code and
evaluate packaged executables, both of which may be difficult to deconstruct without the
assistance of specialized expertise or tools. Protected code is often kept in the memory of the
system. When doing an analysis on an executable, it is possible that you will first need to
unpack and decrypt it. Although disassembly can be made more difficult by techniques such as
packing and encryption, it is considered superior to reverse compilation [3], [16], [18], [19].
When source code is converted to binary format, important specifics such as the sizes of data
structures and the names of variables are lost [7]. Dynamic analysis is employed in order to
circumvent these limits.
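Of the static features listed above, byte n-grams are among the simplest to extract. A minimal sketch; the helper name `byte_ngrams` is illustrative:

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams, a common static feature."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

grams = byte_ngrams(b"\x4d\x5a\x90\x4d\x5a", n=2)   # starts with the PE 'MZ' magic
```

The resulting counts can be turned into a fixed-length feature vector over a chosen n-gram vocabulary and fed to any of the classifiers above.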
Through a technique known as dynamic analysis, the possibly harmful code is run in
an environment that is under strict supervision. Execution analysis is carried out with the use
of many instruments, including Process Monitor, Process Explorer, Wireshark, and Regshot
[1]. Tracking occurs for a variety of actions, including function calls, parameters, data
transfers, traced instructions, and more [7]. Not only is it possible to record data such as
network traffic, execution time, and file system changes, but the process also yields a great
deal of information that may be used to derive a variety of helpful metrics [3]. Performing a
dynamic analysis does not involve the executable being decompiled. This procedure, on the
other hand, is time-consuming and expensive. Because some viruses behave differently in a
simulated environment than in the real world, they become harder to hunt
down [1]. In addition, the operations of the virus might be triggered by a variety of
additional factors, one of which is the current system date. Malware that is aware of the
execution settings and computing environment may thus easily defy dynamic analysis tools.
When the files are executed, the activities they perform are stored in a feature vector space
that later serves to depict the files' behaviors dynamically.
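Turning a logged execution trace into that feature vector can be as simple as counting monitored API calls. The vocabulary of call names below is a hypothetical choice, not the thesis's actual monitor list:

```python
from collections import Counter

# A fixed vocabulary of monitored API calls (hypothetical choice).
VOCAB = ["CreateFile", "RegSetValue", "connect", "WriteProcessMemory"]

def trace_to_vector(trace):
    """Turn a logged sequence of API calls into a fixed-length count
    vector: the feature-vector representation of the file's behavior."""
    counts = Counter(trace)
    return [counts.get(name, 0) for name in VOCAB]

vec = trace_to_vector(["CreateFile", "connect", "connect", "RegSetValue"])
```

Every traced file then maps to a vector of the same length, regardless of how long it ran, which is what downstream classifiers require.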
Despite the fact that either static or dynamic methods may be used for analysis,
multiple machine-learning algorithms have been developed in order to automate the analysis
and classification stages of malware analysis. As a result, only a limited number of samples
still need to be manually examined by specialists. Machine learning algorithms such as
clustering and classification are applied to the problem of grouping unknown malware into
families. These
methods employ the findings of static and/or dynamic analysis as their data source. The
classification of malware that has not been classified before relies heavily on this method.
The detection of malware is a classification problem whose goal is to label a given file as
malicious or benign. A number of different machine-learning techniques, such as Naive Bayes, Decision Tree,
and Support Vector Machine, amongst others, are used in the process of classifying malware.
This kind of machine learning utilizes the files themselves as the dataset; the label, on the
other hand, signals whether or not the file in question is harmful. Separate datasets are used
for training and for evaluation. The training dataset is used to fit the model, and
cross-validation is performed to obtain more reliable results. After training, the model is
applied to the testing dataset; at this stage the labels are hidden from the model, which
assigns a label to every file automatically. Accuracy is then computed as the proportion of
files that have been correctly labeled. These machine learning approaches are successful;
nevertheless, they are based on shallow learning architectures, which may be improved
further using deep learning's higher capacity for automated feature learning [2].
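The train / cross-validate / test workflow described above can be sketched end to end with scikit-learn. The Gaussian clusters below are synthetic stand-ins for labeled file features, not real malware data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for labeled file features (two separated clusters).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)      # 0 = benign, 1 = malware

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
cv_scores = cross_val_score(GaussianNB(), X_tr, y_tr, cv=5)   # validation step
model = GaussianNB().fit(X_tr, y_tr)
test_acc = model.score(X_te, y_te)     # fraction of correctly labeled files
```

Swapping `GaussianNB` for `DecisionTreeClassifier` or `SVC` reproduces the other shallow learners mentioned in the text without changing the rest of the pipeline.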
In deep learning, features are extracted from the original data in a hierarchical fashion,
starting at the bottom and working their way up via a series of layers [20]. Each "layer" in this
setup is responsible for classifying characteristics and relaying that information to the next
layer up the hierarchy. Each succeeding layer attempts to construct a more complicated
feature using the data from the preceding layer as building blocks, and so on through the
whole stack. The last layer of the model is able to determine whether the file in question
contains malicious code since these features get somewhat aggregated as they are passed
down. In contrast to typical machine learning approaches, deep learning models may extract
characteristics on their own without any further input. In this research, we evaluate the
performance of machine learning and deep learning on the dataset and draw comparisons
between them.
The main features are extracted from each imported file and encoded into a feature vector,
together with features taken from previously analyzed files. The resulting dataset can then be
fed to the machine learning algorithms, which produce the malware-detection results
analyzed in the following chapters.
Chapter 4 Dataset
Without a significant amount of data, it is not feasible to develop a model using machine
learning. In the context of malware, the data is both vital and, because the samples themselves
are malicious, potentially harmful to handle.
Malware in binary format can be collected, but since it is in executable form, doing so
carries some inherent risk. When dealing with executable files, it is necessary for the analyst
to set up a virtual computer and carefully check or extract features from the virus, even
though there are now 30,386,102 virus samples that may be downloaded from public malware
repositories.
A sizeable dataset was made accessible to the general public by Microsoft in 2015 as
part of the "Microsoft Malware Classification Challenge" hosted on Kaggle [5]. The dataset
contains 20,000 malicious samples that come from 9 different families. These samples are
provided in binary form as well as in the disassembled assembly format (.asm) using the IDA
Pro disassembler. Although a great number of research papers have made use of this dataset,
we were unable to include it in our investigation because of its enormous size (400 GB) and
the absence of any benign files within it. This publication aggregates a list of
citations to over fifty unique research publications and theses that all make use of the dataset.
The majority of publicly accessible malware datasets are somewhat small, despite the
fact that there are various static and dynamically extracted feature datasets available. Because
it comprises 5,210 samples, of which 2,722 are malicious and the rest benign, ClaMP
(Classification of Malware using PE Headers) [7] served as a good starting point
for our testing. ClaMP was released in 2016 and has 69 extracted
features.
These features include things like md5, size, entropy, fileInfo, VirusTotal report, file type, and
more.
This choice was reexamined following the publication of EMBER (Endgame Malware
Benchmark for Research) on April 16, 2018, the latest version of the benchmark. EMBER
provides vectorized characteristics, all of which were extracted from PE files. The dataset
contains 1.1 million records, of which 400,000 are malicious, 400,000 are benign, and 300,000
are unlabeled [8]. The unlabeled records may be utilized to create a semi-supervised model or
for other studies. Further information on the characteristics and the experimental setup is
provided in the experiment section.
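Before supervised training on EMBER-style data, the unlabeled portion has to be set aside. A small sketch with stand-in arrays in place of the real vectorized features (EMBER marks labels as 1 = malicious, 0 = benign, -1 = unlabeled):

```python
import numpy as np

# Tiny stand-in arrays in place of the real vectorized dataset.
y = np.array([1, 0, -1, 1, -1, 0])                 # -1 means unlabeled
X = np.arange(len(y) * 2).reshape(len(y), 2).astype(float)

labeled = y != -1                       # supervised training uses these rows
X_sup, y_sup = X[labeled], y[labeled]
X_unsup = X[~labeled]                   # reserved for semi-supervised methods
```

The same boolean-mask split scales unchanged to the full feature matrix.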
Data mining algorithms take as input information from the real world, which may be noisy or
incomplete, and this causes problems. The issue will always exist, but any data-driven effort
must find a way to cope with it. Human error and the fallibility of data-gathering instruments
both contribute to inaccuracies in collected data; noise refers to these unintended
fluctuations. Data noise can be problematic for
machine learning algorithms if it is not handled properly, as the algorithm may mistake it for
a meaningful signal. The effects of various forms of noise on datasets are shown in the
following figure.
Because of this, the quality of the analysis process as a whole may be jeopardized if
the dataset in question was noisy. The signal-to-noise ratio is the primary metric that analysts
and data scientists use to measure the quality of data. The following diagram illustrates the
idea (source: machinecurve.com).
Therefore, it is necessary for every data scientist to handle the noise in the dataset before
training.
De-noising Auto-Encoders:
De-noising auto-encoders, a variation of auto-encoders, are of great use here. The fact that they can be taught to recognize
certain noise in a signal or collection of data enables them to be used as de-noisers. In this
application, the noisy data serves as the input, and the de-noised data is created as the output.
Encoders and decoders are both necessary parts of auto-encoders. The encoder is responsible
for converting incoming data into an encoded form, while the decoder is responsible for
reverting the data to its original condition. De-noising auto-encoders are built with the
intention of forcing the hidden layer to pick up more robust features by deliberately
corrupting the input. The auto-encoder is then trained to recover the original data from the
damaged version, learning features that are insensitive to the noise along the way.
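Constructing the (corrupted input, clean target) training pairs is the first step of that procedure. A minimal sketch; the helper name `corrupt` and the Gaussian noise level are illustrative assumptions, and the actual encoder/decoder training is omitted:

```python
import numpy as np

def corrupt(x: np.ndarray, noise_std: float, seed: int = 0) -> np.ndarray:
    """Produce the corrupted input for a de-noising auto-encoder.
    The network is trained to map corrupt(x) back to the clean x."""
    rng = np.random.default_rng(seed)
    return x + rng.normal(0.0, noise_std, size=x.shape)

clean = np.linspace(0, 1, 8)            # stand-in for one data record
noisy = corrupt(clean, noise_std=0.1)   # (noisy, clean) is one training pair
```

Because the network never sees the clean input directly, its hidden layer cannot simply copy the data and is pushed toward the more robust features the text describes.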
Principal Component Analysis (PCA):
Principal components analysis (PCA) is a statistical technique that uses the orthogonal
property to divide a set of potentially connected variables (linked variables) into a set of
variables that are not related to one another (uncorrelated). A new group of independent
variables is thereby created, removing noise while maintaining the integrity of the key
information. The principal
component analysis (PCA) is a geometric and statistical approach that projects an input signal
or data set along numerous axes in order to minimize the dimensionality of the signal or data.
To have a better understanding of the notion, see it as the projection along the X-axis of a
point that is located in the XY plane; the noise along the Y-axis can then be disregarded.
Because it removes the axes that contain the noisy data, principal component analysis is a
technique that may be used to clean up noisy input data. In this investigation, the principal
component
analysis (PCA) is used to carry out a two-stage noise reduction technique: the PCA takes in the noisy feature data and retains only its dominant principal components.
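The project-and-reconstruct step can be sketched with scikit-learn. The rank-1 synthetic data below is an illustrative stand-in, assumed to have one dominant structural axis plus additive noise:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
signal = np.outer(rng.normal(size=200), rng.normal(size=10))  # rank-1 structure
noisy = signal + rng.normal(0.0, 0.05, size=signal.shape)     # additive noise

pca = PCA(n_components=1).fit(noisy)          # keep the dominant axis only
denoised = pca.inverse_transform(pca.transform(noisy))

err_before = np.mean((noisy - signal) ** 2)   # noise level before PCA
err_after = np.mean((denoised - signal) ** 2) # residual after reconstruction
```

Dropping the minor components discards most of the noise while the dominant axis preserves the underlying signal, which is exactly the geometric picture of projecting away the noisy Y-axis.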
Separating signal from noise is a major concern for data scientists because of the
performance problems noise can cause, including overfitting: an algorithm may mistakenly
generalize from noise as if it were a pattern.
Since noise degrades the quality of your signal or dataset, getting rid of it or reducing it is
the best way to improve results. The problem of noisy data has been addressed with a variety
of potential solutions, such as the de-noising auto-encoder and PCA techniques described
above.
Chapter 5 Testing
5.1. Introduction
We used the EMBER dataset to train deep learning and machine learning algorithms. In deep
learning, we used Convolutional Neural Network (CNN) and RNN (Recurrent Neural
Network), and in machine learning, we used Random Forest and Decision Tree algorithms.
After training the model, we tested it by using NORMAL and Malware files which we
downloaded from the ‘VIRUS TOTAL’ website. The model was able to accurately predict
whether each file was normal or malware.
In the screen above, the first two selected files are the testing files: "antialias.exe" is the
normal file, while "trojan.exe" and "bot" are virus files. The screen below shows the project
output.
In the above screen, the highlighted text shows the Python packages being imported.
We read data from the ember folder, split the dataset into train and test parts, and then find the
total malware classes available in the dataset. After executing the code, we get the screen
below.
We can see the total dataset size and then the size of the train and test data. The dataset
contains 3 different types of classes where 0 is the normal class and the others are the
malware classes. In the graph, the x-axis represents class names and the y-axis represents
their count in the dataset. In the next screen, we will train the dataset using a RNN model.
We check to see if the model is trained, and if it is, we load the RNN model. If not, the model
is built from scratch. After the model is trained, we get predictions and model accuracy from
the test data.
On the above screen, we print the RNN model summary with feature details and then print the
training model accuracy and accuracy on test data. Below, the screen shows the RNN LOSS
and ACCURACY for each epoch.
The x-axis in the graph above represents epochs and the y-axis represents accuracy and loss
values. It is evident from the graph that as accuracy increases, loss values decrease. The
screen below shows the training dataset with machine learning algorithms.
We are training the ember dataset with random forest and decision tree machine learning
algorithms and then calculating accuracy on test data.
Looking at the screen, we can see that both algorithms have the same accuracy of 60%. The
graph below shows the accuracy comparison between RNN, Random Forest, and a decision
tree. We took test PE files and then performed predictions.
We check to see if the model is trained, and if it is, we load the CNN model. If not, the model
is built from scratch. After the model is trained, we get predictions and model accuracy from the test data.
On the above screen, we print the CNN model summary with feature details and then print the
training model accuracy and accuracy on test data. Below, the screen shows the CNN LOSS
and ACCURACY for each epoch.
The x-axis in the graph above represents epochs and the y-axis represents accuracy and loss
values. It is evident from the graph that as accuracy increases, loss values decrease. The
screen below shows the training dataset with machine learning algorithms.
Looking at the screen, we can see that both algorithms have the same accuracy of 60%. The
graph below shows the accuracy comparison between CNN, Random Forest, and a decision
tree. We took test PE files and then performed predictions.
In the screen above, you can see that I took the 'antialias.exe' file as input and then used a
classifier to predict the file. The output showed that the file was 'NORMAL'. In the screen
below, I took a virus PE file and the model gave the prediction result shown.
I tested the model with a virus file, and it reported that the file was a MALWARE file.
Comparison of Accuracy Results Between CNN, RNN, Decision Tree and Random Forest:
We obtained higher accuracy with CNN (approximately 86%) compared to the remaining
algorithms, namely RNN, Decision Tree, and Random Forest.

Algorithm        Accuracy
CNN              86%
RNN              70%
Decision Tree    60%
Random Forest    60%
Chapter 6: Conclusion
6.1. Conclusion
Summing up the discussion above, it is concluded that malware attacks can create many
problems for a personal computer. It is therefore important to detect malware and remove it
from the system so that the whole system is protected. The main aim of this research is to use
machine learning techniques to detect malware samples present in the system. If malware
detection is applied properly under strict constraints, the false positive rate should be zero.
The results show that this goal was not fully achieved: the system comes extremely close, but
a non-zero false positive rate remains. With this framework, detecting malware in files using
machine learning approaches becomes straightforward. Decision Tree and Random Forest
algorithms are also appealing because they can easily be applied to large datasets; if these
algorithms were improved, the required malware detection results could be obtained.
Comparing the four algorithms CNN, RNN, Decision Tree, and Random Forest, I obtained
the highest accuracy with CNN, so CNN is the best algorithm among the four for malware
detection.
References
Journal of Information Security, vol. 5, no. 2, pp. 56-64, 2014.
2. W. Hardy, L. Chen, S. Hou, Y. Ye and X. Li, "DL4MD: A deep learning framework for intelligent malware detection".
Dehghantanha and K. Franke, "Machine learning aided static malware analysis: A survey and tutorial".
4. M. Sikorski and A. Honig. “Practical Malware Analysis”. Published by No Starch Press,
https://www.kaggle.com/c/malware-classification.
Malware Analysis Techniques and Tools", ACM Computing Surveys, vol. 44, 2012.
entropy/.
of static, dynamic, and hybrid analysis for malware detection”, Journal of Computer
Virology and Hacking Techniques, vol. 13, no. 1, pp. 1–12, Dec. 2015.
13. R. Tian, L. Batten and S. Versteeg, "Function Length as a Tool for Malware
Classification", Proceedings of the 3rd International Conference on Malicious and Unwanted Software,
14. Manal Abdullah, Afnan Agal, Mariam Alharthi and Mariam Alrashidi. Arabic
16. A. Sung, J. Xu, P. Chavez and S. Mukkamala, "Static analyzer of vicious executables
(save)", Proceedings of the 20th Annual Computer Security Applications Conference (ACSAC
17. Y. Ye, T. Li, S. Zhu, W. Zhuang, E. Tas, U. Gupta, et al., "Combining File Content and
File Relations for Cloud Based Malware Detection", Proceedings of ACM International Conference on
Knowledge Discovery and Data Mining (ACM SIGKDD), pp. 222-230, 2011.
19. P. Beaucamps and E. Filiol, "On the possibility of practically obfuscating programs -
deep learning approach", ITS IEEE Transactions on, vol. 16, no. 2, pp. 865-873, 2015.
21. B. Kolosnjaji, A. Zarras, G. Webster and C. Eckert, "Deep learning for classification of
22. M. Siddiqui, M. C. Wang and J. Lee, "Detecting Internet Worms Using Data Mining
Techniques", Journal of Systemics Cybernetics and Informatics, vol. 6, pp. 48-53, 2009