A Survey on Different Approaches for Malware Detection Using Machine Learning Techniques

Chapter · January 2020


DOI: 10.1007/978-3-030-34515-0_42


S. Soja Rani(&) and S. R. Reeja

Dayananda Sagar University, Bangalore, India


soja.naveen@gmail.com, reeja-cse@dsu.edu.in

Abstract. Malware is increasing in volume and variety, poses a major threat to the
digital world, and has been one of the main security concerns for industry over the
past few years. Malware can penetrate networks, steal confidential information from
computers, bring down servers and cripple infrastructure. Traditional intrusion
detection/prevention systems and antivirus software follow signature-based methods,
which makes the detection of unknown or zero-day malware almost impossible. This
issue can be addressed by more sophisticated mechanisms in which static and dynamic
malware analysis are used together with machine learning algorithms for classifying
and detecting malware. In this paper we present a survey of the different concealment
and obfuscation techniques used to build sophisticated malware, as well as the
different approaches used in malware detection and analysis.

Keywords: Malware analysis · CyberSecurity · Machine learning

1 Introduction

Data breaches, ransomware attacks, targeted botnets and malware attacks are in
the daily news these days. From Brain, the first PC virus, in 1986 to the highly
obfuscated WannaCry in 2017, the malware epidemic has continued its alarming advance.
Owing to advances in technology and its increased use, we can observe a continuous
evolution of malware in volume, variety and velocity. Malicious software, or malware
for short, can be code, scripts or any other content designed to interrupt normal
operation, gather information illegally (leading to loss of privacy), gain
unauthorized access to system resources, or exhibit other abusive behaviour. Computer
viruses, spyware, Trojan horses, worms, adware, botnets, rootkits and so on come under
the broad umbrella of malware, which is an integral component of almost all data
breaches. Attackers are also using more tools, such as polymorphic and zero-day
malware, to evade current malware detection tools. The widespread use of the World
Wide Web is another reason behind the increase in the threat from malware.
There are multiple doors through which an adversary can enter the enterprise network.
They are guarded by perimeter security tools such as firewalls, antivirus software, and
network-based and host-based intrusion detection/prevention tools, which at times may
not be able to distinguish between a genuine user and an adversary. Once adversaries
enter the enterprise network and gain knowledge about it, they unfold their plan of
action and attack the target. During their course of action in the network, they leave
behind some changes in the data or signals, which can be detected using
data-science-based tools that raise alerts.

© Springer Nature Switzerland AG 2020
P. Karrupusamy et al. (Eds.): ICSCN 2019, LNDECT 39, pp. 389–398, 2020.
https://doi.org/10.1007/978-3-030-34515-0_42
pkarrupusamyphd@gmail.com

2 Malware Camouflage Evolutions

Advances in malware code that conceal its appearance have become a serious challenge
for antivirus companies. On the basis of the concealment technology used, malware can
be classified as encrypted, oligomorphic, polymorphic and metamorphic.

2.1 Encrypted Malware


Encrypted malware has two basic sections in its structure: a decryption loop and the
main body. The decryption loop, or decryptor, is responsible for encrypting and
decrypting the main body, the actual malware, which is meaningless in its encrypted
form. After getting into the host, the virus begins its action: the decryptor loop
executes first and decodes the main body into machine-executable code. A different key
is used for each infection to hide the signature of the body and make detection harder.
However, since the decryption module remains the same, these viruses are detectable by
analyzing the decryption module.
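The structure described above can be illustrated with a harmless toy model. The sketch
below assumes a repeating-key XOR cipher and uses placeholder byte strings
(DECRYPTOR_STUB and PAYLOAD are hypothetical names, not taken from any real sample); it
only shows why a per-infection key changes the body while the constant decryptor
remains a usable signature.

```python
# Hypothetical stand-in for the constant decryptor; real stubs are machine code.
DECRYPTOR_STUB = b"\x90\x90DECRYPT-LOOP\x90\x90"
# Harmless placeholder standing in for the main body.
PAYLOAD = b"harmless placeholder standing in for the main body"
KEY_LEN = 4


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating key; applying it twice restores the input."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def make_variant(payload: bytes, key: bytes) -> bytes:
    """Build one 'infection': constant stub + key + encrypted body."""
    return DECRYPTOR_STUB + key + xor_bytes(payload, key)


def run_decryptor(variant: bytes) -> bytes:
    """Mimic the decryption loop: read the key, then recover the body."""
    start = len(DECRYPTOR_STUB)
    key = variant[start:start + KEY_LEN]
    return xor_bytes(variant[start + KEY_LEN:], key)


variant_a = make_variant(PAYLOAD, b"\x11\x22\x33\x44")
variant_b = make_variant(PAYLOAD, b"\xa5\x5a\xc3\x3c")

body_a = variant_a[len(DECRYPTOR_STUB) + KEY_LEN:]
body_b = variant_b[len(DECRYPTOR_STUB) + KEY_LEN:]

# The encrypted bodies differ, so a signature taken over the body misses variant_b.
bodies_differ = body_a != body_b
# The unchanged stub appears in both variants and can serve as the signature.
stub_in_both = DECRYPTOR_STUB in variant_a and DECRYPTOR_STUB in variant_b
```

Both variants decrypt to the same body, which is why later generations of malware
moved on to mutating the decryptor itself.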

2.2 Oligomorphic Malwares


The next step in concealment tactics, addressing the shortcomings of encrypted
malware, was the development of oligomorphic malware, in which the decryptor is
mutated from one variant to the next. This type of malware can generate no more than a
few hundred decryptors, one of which is chosen at random for each new victim, so it
remains detectable with signatures. Once signatures for all the decryptors have been
created, this malware cannot evade signature-based detection techniques.

2.3 Polymorphic Malwares


To overcome the limitations of oligomorphic malware, the next concealment technique to
become prevalent was polymorphic malware, which can create a countless number of
distinct decryptors. Here the appearance of the code changes constantly from one
variant to the next. When the malware executes, a new decryptor is generated, which is
joined with the encrypted malware body to construct a new malware variant [1].
Polymorphic malware uses code obfuscation approaches such as instruction substitution
or junk-code insertion to mutate its decryptor and build a new variant for each new
victim; this is done by a mutation engine (or obfuscation engine). Even though this
malware can hide from signature-matching techniques, its body after decryption, as well
as its behaviour, remains the same and can be used as the basis for detection.
Detection tools therefore adopt an emulation technique: the malware is executed in an
emulator, after which signatures can be constructed efficiently and matched using
conventional detection mechanisms.

2.4 Metamorphic Malwares


Metamorphic malware is the most novel approach in the second generation of malware.
Unlike the previous camouflage generations, a metamorphic virus has no encrypted part
and therefore needs no decryptor; instead it employs a mutation engine that mutates the
whole malware body rather than just a decryptor. Each new copy may have a different
size, code sequence, structure and syntactic properties, but its behaviour remains the
same. Since well-written metamorphic malware leaves no single pattern by which it can
be detected, the defence software must be highly sophisticated and built on heuristic
and behaviour-based analysis and detection techniques.

2.5 ZeroDay Malware


Zero-day malware exploits a software vulnerability for which there is currently no
available defence or fix. The malware uses the vulnerability present in the system to
perform adversarial actions that can compromise the confidentiality, integrity or
availability of the system. Zero-day malware causes significant damage by exploiting
the delay between the delivery of the malware and the development of countermeasures.

3 Approaches for Malware Analysis

Malware can be analyzed through various methods, which can be broadly categorized into
two: static and dynamic analysis. Malware analysis gives a detailed understanding of
how the malware functions and of what can be done to eliminate the threat it poses.

3.1 Static Analysis


Static analysis is the preliminary malware analysis technique, in which the malware
binary is examined without actually running it. In its basic form the binary is
inspected without even viewing the actual code or instructions; in more advanced forms
the code is decompiled and examined. Different techniques and tools can be used to
learn about its functionality and to collect information for producing simple
signatures, which serve as a unique identification for the binary file. Commonly used
signatures include the file name, DLLs called by the malware, URLs accessed, MD5
checksums or other hashes, file type, file size, etc.
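As a rough illustration of how such simple signatures can be collected without
executing the file, the sketch below computes an MD5 checksum and the file size and
pulls printable strings, URLs and DLL names out of raw bytes. The function name
static_features, the regular expressions and the sample bytes are assumptions made for
this example; real tools parse the executable format rather than scanning for strings.

```python
import hashlib
import re

# Runs of at least four printable ASCII bytes, similar to the Unix `strings` tool.
STRING_PATTERN = re.compile(rb"[\x20-\x7e]{4,}")
# Very loose URL matcher, good enough for a demonstration.
URL_PATTERN = re.compile(rb"https?://[\x21-\x7e]+")


def static_features(data: bytes) -> dict:
    """Collect simple static indicators from raw file bytes without running them."""
    strings = [s.decode("ascii") for s in STRING_PATTERN.findall(data)]
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "size": len(data),
        "is_pe": data[:2] == b"MZ",  # Windows executables start with 'MZ'
        "urls": [u.decode("ascii") for u in URL_PATTERN.findall(data)],
        "dlls": [s for s in strings if s.lower().endswith(".dll")],
    }


# Made-up bytes that merely resemble fragments of a Windows binary.
sample = (
    b"MZ\x00\x00kernel32.dll\x00\x01\x02"
    b"http://example.invalid/c2\x00ws2_32.dll\x00"
)
features = static_features(sample)
print(features["dlls"])
print(features["urls"])
```

Because every value is derived from the bytes alone, changing a single byte of the file
changes the MD5 checksum, which is one reason static signatures are easy to evade.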

3.2 Dynamic Analysis


As the complexity and sophistication of malware increase, it becomes hard to analyse
malware using static signatures. Dynamic analysis (also known as behavior analysis)
executes malware in a controlled, isolated and monitored virtual environment, such as
Cuckoo Sandbox, to observe its behavior and understand its functionality without
damaging the analyst's system. After execution, features or indicators that can be used
in detection are extracted. The features revealed by basic dynamic analysis can include
IP addresses, domain names, registry keys, file path locations, and any additional
files located on the system or network. Many tools are available to make dynamic
analysis efficient and safe, such as Netcat, Wireshark, Regshot, INetSim, ApateDNS and
Procmon.

4 Approaches for Malware Detection

The available techniques for malware analysis and detection, including both those
adopted by industry and those that are not, can be categorized into four approaches:
static signature-based, static behaviour-based, dynamic signature-based and dynamic
behaviour-based.

4.1 Signature Based Detection


Once a malware sample has been identified, unique sequences of bytes called signatures
are extracted and added to a database, which may contain hundreds of millions of
signatures of objects already identified as malicious. The technique scans the files in
the system for the defined malware signatures; if one is found, an alert about the
presence of malware is raised. Signatures can be scanned and matched quickly and
efficiently by algorithms. Most commercially available anti-malware products use this
method as the primary technique for identifying malicious objects. Signature-based
malware detection systems that employ machine learning can learn from either assembly
features or binary features.
Signature-based detection can generally identify only previously known malware. Even
though it is easy to use, scanning and pattern matching become costly as the malware
signature database grows exponentially [2]. Another major limitation is that today’s
advanced malware can alter its signature to avoid detection. The technique is reactive
in nature and is thus unable to identify threats or attacks from new malware. The Cisco
2017 Annual CyberSecurity Report states that 95% of the malware files analyzed were not
even 24 hours old, indicating the prevalence of zero-day attacks.
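The scanning step described in this section can be sketched in a few lines. The
signature names, byte patterns and hash set below are invented for illustration, and
the linear search is only a stand-in for the optimized multi-pattern matching that
production scanners rely on.

```python
import hashlib

# Hypothetical signature database: name -> byte sequence seen in known samples.
BYTE_SIGNATURES = {
    "Demo.Family.A": bytes.fromhex("deadbeef4142"),
    "Demo.Family.B": b"EVIL_MARKER",
}
# Hypothetical set of MD5 hashes of whole files already labelled malicious.
KNOWN_BAD_MD5 = {hashlib.md5(b"known bad file").hexdigest()}


def scan(data: bytes) -> list:
    """Return the names of all signatures that match the given file content."""
    hits = []
    if hashlib.md5(data).hexdigest() in KNOWN_BAD_MD5:
        hits.append("md5-match")
    for name, pattern in BYTE_SIGNATURES.items():
        if pattern in data:
            hits.append(name)
    return hits


infected = b"header" + bytes.fromhex("deadbeef4142") + b"trailer"
mutated = b"header" + bytes.fromhex("deadbeef4143") + b"trailer"  # one byte changed

print(scan(infected))  # the known byte sequence is found
print(scan(mutated))   # the altered variant slips through
```

The second call shows the reactive nature of the approach: a variant that differs from
the stored pattern by a single byte is not reported until a new signature is added.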

4.2 Behaviour-Based Malware Detection


Behaviour-based malware detection analyses a given sample based on its behaviour or
actions. Abnormal or unauthorised actions and run-time activities of the sample in a
sandboxed environment mark it as malicious or at least suspicious. Installing rootkits,
registering for auto-start, disabling security controls and attempting to discover a
sandbox environment are all behaviours that can point towards malicious intent.
Behaviour-based detection can use either static or dynamic analysis. If the behaviour
is evaluated as the sample executes, it is called dynamic analysis. Malicious intent
can also be evaluated by static analysis, in which the system is concerned only with
the structure and code of the malware. Behaviour-based detection is the leading
technology today, as it can detect metamorphic and zero-day malware in near real time.
Machine learning based behaviour detection systems use either the API calls made by the
malware or its assembly features for learning.
There are still a few important limitations of behaviour-based detection. First, if
malware detects that it is running in a sandbox or virtual environment, it will try to
hide its malicious intentions by curtailing its malicious activity. Second, malware may
delay its malicious activity beyond the limited observation time available in a virtual
environment, which adds latency to dynamic analysis. Third, behaviour-based systems are
prone to false alarms [3], which may make the system more vulnerable because real
malware may be dismissed as just another false alarm. Finally, it has been demonstrated
that anomaly-detection-based techniques are susceptible to mimicry attacks.
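A very small rule-based scorer makes the idea, and the false-alarm trade-off, concrete.
The behaviour names, weights and threshold below are assumptions chosen for this
example; the systems reviewed later learn such weights from data rather than fixing
them by hand.

```python
# Hypothetical weights for the behaviours named in this section.
BEHAVIOUR_WEIGHTS = {
    "install_rootkit": 5,
    "register_autostart": 2,
    "disable_security_controls": 4,
    "probe_for_sandbox": 3,
}


def behaviour_score(observed: set) -> int:
    """Sum the weights of all known-risky behaviours seen during execution."""
    return sum(w for name, w in BEHAVIOUR_WEIGHTS.items() if name in observed)


def classify(observed: set, threshold: int = 5) -> str:
    """Label a sample from its observed behaviours.

    Lowering the threshold catches more malware but raises more false alarms.
    """
    score = behaviour_score(observed)
    if score >= threshold:
        return "malicious"
    return "suspicious" if score > 0 else "benign"


print(classify({"register_autostart", "probe_for_sandbox"}))  # score 5
print(classify({"register_autostart"}))                       # score 2
print(classify({"open_file"}))                                # score 0
```

A legitimate updater that registers itself for auto-start would be labelled suspicious
here, which is the kind of false alarm the text warns about; a sample that stays idle
inside the sandbox would produce an empty behaviour set and be labelled benign.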

4.3 Hybrid Malware Detection Approach


Signature-based and behaviour-based malware detection each have their own pros and
cons. Even though static and dynamic analysis are strong mechanisms that can stand
alone for malware analysis, the hybrid approach reaps the benefits of both by using
static as well as dynamic features for analysis. The best security comes from utilizing
both technologies. This technique is especially beneficial for reverse engineering
complex malware, whose true intentions and capabilities can be analyzed better using a
hybrid approach.
The information security industry is often misled by vendors promising next-generation
firewalls and other sophisticated security software. Most of these sophisticated
anti-malware systems rely on the decades-old signature-based approach to malware
detection, which is not capable of detecting today’s metamorphic malware and zero-day
attacks. Behaviour-based malware detection strategies should be used to augment
signature-based systems so that the sensitive data and critical operations of
organizations remain protected. Researchers propose the use of a hybrid approach to
augment the results of standalone detection techniques and thereby reduce the high
probability of false alarms [4].

5 Review of the Malware Detection Approaches

Here a few recent studies in malware detection are analyzed together with the machine
learning algorithms they use. Automating the cybersecurity industry by leveraging data
science and machine learning algorithms has been gaining popularity in recent years. By
using learning algorithms in the anti-malware industry, we can not only detect known
malware but also build knowledge for detecting new variants of malware, including
polymorphic and zero-day malware. This technique does not have to replace the existing
standard detection methods, but can act as an add-on that improves the detection
probability. Since machine learning techniques are more computationally demanding than
the standard existing systems, they may not be suitable for end users, but they can be
implemented at the enterprise gateway level to act as a central anti-malware engine.
Even though the infrastructure can turn out to be costly, it can protect valuable
enterprise data from security threats and prevent immense financial damage.

5.1 Review of the Selected Signature-Based Detection Approaches


Signature-based detection approaches reduce the overhead on the system as well as the
execution time taken for malware detection or prediction. This approach can be used in
all three target environments: Windows-based systems, embedded systems and smartphones.
Wang and Wang [5] presented a machine learning based automatic malware recognition
framework using support vector models (SVMs). Behavioural signatures were used to train
the SVM classifier. The system performed well, and the classification error decreased
as the size of the test data increased. For different numbers (N) of malware samples,
the prediction accuracy of malware detection reached 98.7% with N = 100.
Another strategy for identifying malware files was proposed by Santos et al. [6]. The
frequency of occurrence of opcode sequences was the feature used in this study. The
relevance and the frequency of each opcode sequence were evaluated. The paper also
presented results and validation showing that the system was able to detect unknown
malware files as well.
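The frequency feature can be sketched as follows. This is only a minimal illustration
of counting opcode sequences (n-grams) in a disassembled trace; the opcode list is made
up, and the relevance weighting used by Santos et al. is not reproduced here.

```python
from collections import Counter


def opcode_ngrams(opcodes: list, n: int = 2) -> Counter:
    """Count every run of n consecutive opcodes in the trace."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def relative_frequencies(counts: Counter) -> dict:
    """Normalize counts so that traces of different lengths are comparable."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


# Made-up opcode trace standing in for the disassembly of an executable.
trace = ["push", "mov", "call", "push", "mov", "ret"]
counts = opcode_ngrams(trace, n=2)
freqs = relative_frequencies(counts)

print(counts[("push", "mov")])  # this pair occurs twice among five bigrams
```

Each executable then becomes a vector of such frequencies over a fixed vocabulary of
opcode sequences, which is the form a classifier can be trained on.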
Martín et al. [7] illustrated the use of third-party calls to bypass the effects of the
different concealment strategies used by different malware families for obfuscation.
They combined clustering and multi-objective optimisation to create a classifier based
on specific behaviours defined by groups of third-party calls. This classifier, named
MOCDroid, achieves a precision of 95.15% in testing with 1.69% false positives, which
is much better than the commercial antivirus engines from VirusTotal.
Hellal and Ben Romdhane [8] used static analysis to develop a graph mining technique
for recognizing variants of malware. They proposed MCFSM (minimal contrast frequent
subgraph miner), a novel algorithm for extracting malicious behavioural patterns that
distinguish malicious from benign programs. The proposed method displayed high malware
detection rates combined with low false positive rates, and it required only a limited
number of behavioural signatures.
Fan et al. [9] proposed a system for detecting malicious files based on instruction
sequences extracted from the data samples, using a sequence mining algorithm to
discover patterns. The identified patterns were used to train an All-Nearest-Neighbor
(ANN) classifier that separates malicious from benign files. Pattern mining followed by
the ANN classifier proved successful even for polymorphic and zero-day malware samples.
Boujnouni et al. [10] proposed a new approach that is effective against polymorphism
and metamorphism, the techniques used by malware creators to obfuscate code. The
approach used N-grams to train an improved version of Support Vector Domain
Description. Experimental results, evaluated on the classification of several hundred
malware and benign files, confirmed the accuracy and feasibility of the proposed
method.

5.2 Review of the Selected Behavior-Based Detection Approaches


Dynamic analysis can be performed by monitoring various runtime execution features,
using techniques that include API call hooking, binary hooking, data flow analysis,
instruction tracing, multiple path execution, running in a sandbox or virtual machine,
and machine learning. Even though the approach can be used in all three target
environments, most research studies have been conducted in the smartphone environment.
Ye et al. [11] put forward a heterogeneous deep learning framework for autonomous
malware detection. They monitored portable executable files at runtime and extracted
the Windows API calls, which served as the features for the learning system. The deep
learning framework comprised an AutoEncoder driven by multilayer Boltzmann machines
together with a layer of associative memory. The method took advantage of both
supervised and unsupervised learning so as to make use of unlabeled data in the
training phase of the deep neural network. Experiments were carried out on data
collected from the Comodo Cloud Security Center, and the results showed better
precision and fewer false positives than shallow learning systems.
Bayer et al. [12] put forward TTAnalyze, a tool for analyzing Windows executables. The
approach monitors behaviour by running the binary in an open source PC emulator, QEMU,
and uses Windows native system calls and Windows API function calls as features for the
learning system.
Willems et al. [13] made use of API hooking and dynamic-link library (DLL) injection
techniques in their anti-malware tool, CWSandbox. The software was built for malware
detection on the Win32 family of operating systems. Behavioural analysis was performed
on features such as registry manipulation, network communication, file system changes
and operating system interactions. To implement a hook, the first five bytes of the API
function were replaced with an unconditional jump.
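Five bytes is exactly the size of an x86 near jump with a 32-bit relative offset
(opcode 0xE9 followed by four offset bytes), so the patch can be simulated on a plain
byte buffer. The sketch below assumes that encoding; the addresses, the function
prologue bytes and the helper names are invented, and nothing here touches a real
process or reproduces CWSandbox's actual code.

```python
import struct

JMP_REL32 = 0xE9  # x86 opcode for a near jump with a 32-bit relative offset
HOOK_SIZE = 5     # 1 opcode byte + 4 offset bytes


def build_jump(src_addr: int, dst_addr: int) -> bytes:
    """Encode 'jmp dst_addr' as it would be placed at src_addr."""
    # The offset is measured from the instruction that follows the jump.
    rel = dst_addr - (src_addr + HOOK_SIZE)
    return bytes([JMP_REL32]) + struct.pack("<i", rel)


def install_hook(image: bytearray, base: int, func_addr: int, hook_addr: int) -> bytes:
    """Overwrite the first five bytes of a function in a simulated memory image.

    Returns the saved original bytes, which a monitor needs in order to
    call the real function after logging the call.
    """
    off = func_addr - base
    saved = bytes(image[off:off + HOOK_SIZE])
    image[off:off + HOOK_SIZE] = build_jump(func_addr, hook_addr)
    return saved


# Simulated memory: a made-up function prologue loaded at address 0x1000.
memory = bytearray(bytes.fromhex("8bff558bec83ec10"))
saved_bytes = install_hook(memory, base=0x1000, func_addr=0x1000, hook_addr=0x2000)

print(memory[:HOOK_SIZE].hex())  # e9fb0f0000, i.e. jmp +0x0FFB
print(saved_bytes.hex())         # 8bff558bec, the overwritten prologue
```

The saved bytes matter because the monitoring code must still run the original
instructions before returning control, otherwise the hooked program would crash.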
Mohaisen et al. [14] addressed the shortcomings of the then-existing systems by
introducing AMAL, an automated, behavior-based malware analysis and labeling system.
The system has two parts. The first subsystem, AutoMal, runs files in virtualized
environments and uses a collection of tools to gather the behavioural features that
characterize the malware, such as file, memory, network and registry usage. The second
subsystem, MaLabel, uses the collected features to build classifiers, trained on data
samples, that classify files into groups of similar behaviour. Experiments were
conducted on a medium-sized set of 4,000 samples as well as on a large-scale set of
115,000 samples over 13 months. The results were promising: a precision of 99.5% and a
recall of 99.6% for the classification of certain families, and more than 98% precision
and recall for unsupervised clustering.
Norouzi et al. [15] presented a data mining classification approach for malware
detection that relies on behavioural features of malicious software obtained through
dynamic analysis. Features are extracted from an XML file describing the behaviour of
the executable and provided as input to the WEKA tool. The approach is evaluated with
WEKA on a real case study data set to illustrate its performance on both training and
test data.

5.3 Review of the Selected Hybrid Detection Approaches


Eskandari et al. [16] introduced a hybrid approach called HDM-Analyzer, which uses both
major approaches to analyze an executable file. While static analysis examines the file
at the program source code level, dynamic analysis extracts the features of the file by
observing the program’s activities during execution. The proposed HDM-Analyzer takes
into account the advantages of both types of analysis, thus reducing the analysis and
detection time while maintaining the required level of precision and accuracy of
classification using Bayesian network, Naive Bayes and lazy K-Star classifiers.
Yuan et al. [17] proposed combining the static and dynamic analysis features of Android
apps and using deep learning to automatically detect malicious ones. Their detection
engine, DroidDetector, was successful in classifying apps as malicious or benign. The
experimental results showed that deep learning is effective for detecting and
classifying malware and that the effectiveness of the software increases with the
variety and volume of data. A detection accuracy of 96.76% was achieved, from which the
authors concluded that DroidDetector is more effective than traditional machine
learning techniques.
Dali et al. [18] used machine learning algorithms to learn the differences in features
between malicious and non-malicious apps automatically. They proposed a deep learning
based application, DeepFlow, for detecting malware by analysing the data flows in
Android applications. The experimental results showed a high detection F1 score of
95.05%, which outperformed the classic machine learning based approaches.
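Since the reviewed studies report precision, recall, accuracy, false positive rate and
F1 score, it may help to recall how these are derived from a confusion matrix. The
counts in the example are invented and do not correspond to any of the cited
experiments.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of samples flagged as malware that really are malware."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of actual malware samples that were flagged."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of benign samples that were wrongly flagged."""
    return fp / (fp + tn) if fp + tn else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Invented counts: 90 malware caught, 30 missed, 10 benign flagged, 870 benign passed.
tp, fn, fp, tn = 90, 30, 10, 870
print(round(precision(tp, fp), 3))            # 0.9
print(round(recall(tp, fn), 3))               # 0.75
print(round(f1_score(tp, fp, fn), 3))         # 0.818
print(round(false_positive_rate(fp, tn), 4))  # 0.0114
print(round(accuracy(tp, tn, fp, fn), 3))     # 0.96
```

With these counts accuracy is 96% even though a quarter of the malware is missed, which
is why accuracy alone says little when benign samples far outnumber malicious ones.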
Ding et al. [19] proposed an application programming interface (API) based association
rule mining method for identifying malicious files, building on objective-oriented
association (OOA) mining. One of their strategies is to improve rule quality by
removing APIs that cannot become frequent items. Finding association rules with strong
discrimination ability is another. The results showed that these strategies made OOA
mining run faster: the time cost of data mining was reduced by 32% and the time cost of
classification by 50%.
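The two ideas, pruning APIs that cannot be frequent and keeping rules that discriminate
well, can be sketched with the standard support and confidence measures. The samples,
API sets and thresholds below are invented, and this is a generic association-rule
illustration rather than the algorithm of Ding et al.

```python
from collections import Counter

# Invented data: each sample is (set of imported API names, label).
SAMPLES = [
    ({"CreateRemoteThread", "WriteProcessMemory", "OpenProcess"}, "malicious"),
    ({"CreateRemoteThread", "WriteProcessMemory"}, "malicious"),
    ({"OpenProcess", "ReadFile"}, "benign"),
    ({"ReadFile", "CreateFileW"}, "benign"),
]


def prune_infrequent(samples: list, min_support: float) -> list:
    """Drop APIs whose own support is below min_support.

    An API that is infrequent alone cannot be part of any frequent API set,
    so removing it early shrinks the search space.
    """
    counts = Counter(api for apis, _ in samples for api in apis)
    keep = {api for api, c in counts.items() if c / len(samples) >= min_support}
    return [(apis & keep, label) for apis, label in samples]


def rule_stats(samples: list, antecedent: set, label: str = "malicious") -> tuple:
    """Return (support, confidence) of the rule 'antecedent -> label'."""
    covered = [lab for apis, lab in samples if antecedent <= apis]
    hits = sum(lab == label for lab in covered)
    support = hits / len(samples)
    confidence = hits / len(covered) if covered else 0.0
    return support, confidence


pruned = prune_infrequent(SAMPLES, min_support=0.5)
strong = rule_stats(pruned, {"CreateRemoteThread", "WriteProcessMemory"})
weak = rule_stats(pruned, {"OpenProcess"})
print(strong)  # (0.5, 1.0): always malicious when both APIs appear
print(weak)    # (0.25, 0.5): no better than a coin flip
```

Only the first rule would be kept as discriminating; the second covers malicious and
benign samples equally and adds cost without helping classification.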
A hybrid approach that combines signature-based and behaviour-based analysis, driven by
machine learning algorithms, for detecting malicious apps was introduced by Rehman et
al. [20]. Android apps are reverse engineered to extract manifest files and binaries.
Classification algorithms including decision trees, SVM, KNN and W-J48 were used in the
classification experiments. Results showed that SVM was the better classifier for
binaries and KNN for .xml files. The proposed system was tested on benchmark datasets,
and the results showed better accuracy in malware detection.


6 Conclusion

This paper presents a thorough study of the evolution of the concealment techniques and
obfuscation methods employed across the generations of malware. It also covers the two
methods of malware analysis, static analysis and dynamic analysis, and the two
detection strategies prevailing in the industry: signature-based malware detection and
heuristic or behaviour-based malware detection. Owing to the advantages and
disadvantages of both detection strategies, a combined approach has also evolved, the
hybrid detection strategy. Finally, the paper reviews selected literature on malware
detection approaches using machine learning. The reviewed papers are classified into
the three categories mentioned above: (1) signature-based, (2) behaviour-based and
(3) hybrid approaches.

References
1. Digital Object Identifier: The effects of traditional anti-virus labels on malware detection
using dynamic runtime opcode. https://doi.org/10.1109/ACCESS.2017.2749538
2. Beaucamps, P.: Advanced polymorphic techniques. Int. J. Comput. Sci. 2(3), 194–205
(2007)
3. Wong, W., Stamp, M.: Hunting for metamorphic engines. J. Comput. Virol. 2, 211–229
(2006)
4. Govindaraju, A.: Exhaustive statistical analysis for detection of metamorphic malware. [MS
Project], San Jose State University, US (2010)
5. Wang, P., Wang, Y.-S.: Malware behavioural detection and vaccine development by using a
support vector model classifier. J. Comput. Syst. Sci. 81, 1012–1026 (2015)
6. Santos, I., Brezo, F., Ugarte-Pedrero, X., Bringas, P.G.: Opcode sequences as representation
of executables for datamining-based unknown malware detection. Inf. Sci. 231, 64–82
(2013)
7. Martín, A., Menéndez, H.D., Camacho, D.: MOCDroid: multi-objective evolutionary
classifier for Android malware detection. Soft. Comput. 21, 7405–7415 (2017)
8. Hellal, A., Romdhane, L.B.: Minimal contrast frequent pattern mining for malware
detection. Comput. Secur. 62, 19–32 (2016)
9. Fan, Y., Ye, Y., Chen, L.: Malicious sequential pattern mining for automatic malware
detection. Expert Syst. Appl. 52, 16–25 (2016)
10. Boujnouni, M.E., Jedra, M., Zahid, N.: New malware detection framework based on N-
grams and support vector domain description. In: 2015 11th International Conference on
Information Assurance and Security (IAS), pp. 123–128 (2015)
11. Ye, Y., Chen, L., Hou, S., Hardy, W., Li, X.: DeepAM: a heterogeneous deep learning
framework for intelligent malware detection. Knowl. Inf. Syst. 54, 265–285 (2017)
12. Bayer, U., Moser, A., Krugel, C., Kirda, E.: Dynamic analysis of malicious code. J. Comput.
Virol. 2(1), 67–77 (2006)
13. Willems, C., Holz, T., Freiling, F.: Toward automated dynamic malware analysis using
CWSandbox. IEEE Secur. Priv. 5(2), 32–39 (2007)
14. Mohaisen, A., Alrawi, O., Mohaisen, M.: AMAL: high-fidelity, behavior-based automated
malware analysis and classification. Comput. Secur. 52, 251–266 (2015)
15. Norouzi, M., Souri, A., Samad Zamini, M.: A data mining classification approach for
behavioral malware detection. J. Comput. Netw. Commun. 2016, 9 (2016)

16. Eskandari, M., Khorshidpour, Z., Hashemi, S.: HDM-analyser: a hybrid analysis approach
based on data mining techniques for malware detection. J. Comput. Virol. Hacking Tech. 9,
77–93 (2013)
17. Yuan, Z., Lu, Y., Xue, Y.: DroidDetector: android malware characterization and detection
using deep learning. Tsinghua Sci. Technol. 21, 114–123 (2016)
18. Dali, Z., Hao, J., Ying, Y., Wu, D., Weiyi, C.: DeepFlow: deep learning-based malware
detection by mining Android application for abnormal usage of sensitive data. In: 2017 IEEE
Symposium on Computers and Communications (ISCC), pp. 438–443 (2017)
19. Ding, Y., Yuan, X., Tang, K., Xiao, X., Zhang, Y.: A fast malware detection algorithm based
on objective-oriented association mining. Comput. Secur. 39(Part B), 315–324 (2013)
20. Rehman, Z.-U., Khan, S.N., Muhammad, K., Lee, J.W., Lv, Z., Baik, S.W., Shah, P.A.,
Awan, K., Mehmood, I.: Machine learning assisted signature and heuristic-based detection
of malwares in Android devices. Comput. Electr. Eng. 69, 828–841 (2017)
