Phase 1 Report Group ID CSE19-G58 Malware Detection Using ML

A PROJECT REPORT (CS321)
ON
MALWARE DETECTION USING MACHINE LEARNING

A report submitted in partial fulfilment of the requirement for the award of
the degree of
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING
Submitted to:
Dr. Atul Kumar Srivastava
Assistant Professor
Submitted by:
Anagh Sharma, CSE A
Kartikey Sharma, CSE A
Vishal Bora, CSE ML
Harendra Singh Bisht, CSE A
SCHOOL OF COMPUTING
DIT UNIVERSITY, DEHRADUN
(State Private University through State Legislature Act No. 10 of 2013 of Uttarakhand and approved by UGC)
Mussoorie Diversion Road, Dehradun, Uttarakhand - 248009, India.

2021
DECLARATION
We hereby certify that the work presented in this Phase 1 report, entitled
Malware Detection Using Machine Learning, in partial fulfilment of the requirement for the
award of the Degree of Bachelor of Technology and submitted to DIT University, is an
authentic record of our work carried out under the guidance of Dr. Atul Kumar Srivastava.
Date: 23-10-2021
Signature of the Candidates
1. Anagh Sharma 2. Kartikey Sharma 3. Vishal Bora 4. Harendra Singh Bisht
Signature of Guide
ACKNOWLEDGEMENT
This project is a culmination of invaluable guidance and encouragement from various people at
DIT University.
We would first like to thank our guide, Dr. Atul Kumar Srivastava, for his encouragement and
guidance throughout the project. We are wholeheartedly thankful to him for giving us his valuable
time and attention, and for providing regular feedback that helped us progress the project
on time. We would also like to thank our friends and family for their support.
Anagh Sharma, Roll Number: 190102260

Kartikey Sharma, Roll Number: 190102261
Vishal Bora, Roll Number: 190178033
Harendra Singh Bisht, Roll Number: 190102041
Group ID: CSE19-G58
ABSTRACT
With the rise in complexity of cybersecurity techniques and a simultaneous growth in the
sophistication with which cybercriminals write and obfuscate malware, research and industry
investment in exploring machine learning techniques to develop more robust security solutions
has also increased substantially.
This project has the following two goals: (1) to explore the effectiveness of traditional machine
learning methods along with deep learning approaches to identify malicious software, and (2) to
identify lightweight model(s) that impose minimal computational load on user devices.
The static and dynamic features of a file are to be used to analyse its nature using several
configurations of feature selection. The files are to be either classified as malware or safe, or further
subclassified into various malware types.
TABLE OF CONTENTS
CHAPTER
Declaration
Acknowledgement
Abstract
Chapter 1 – Introduction
1.1. Malicious software
1.2. Machine learning
1.3. Malware detection using machine learning
Chapter 2 – Phase 1: Description
2.1. Study of datasets
2.2. Portable Executable file format
2.3. Feature engineering
Chapter 3 – Tools and Technologies
Chapter 4 – Summary: Phase 1
Chapter 5 – Project Pathway
Bibliography
Annexure – Implementation and Code
Plagiarism Report: 9% Only
List of Figures
FIGURE NAME
Fig 1: Malware classes in MMCC dataset
Fig 2: Malware sample files
Fig 3: Malware classes in Malimg dataset
Fig 4: PE file structure
Fig 5: Feature engineering in machine learning pathway
Fig 6: Binary file to PNG representation
Fig 7: Generation of tensor image data using Keras
Fig 8: A sample of PNG malware image representation
Fig 9: Splitting the dataset into test and train subsets
Fig 10: Phase 1: Summary
Fig 11: Project pathway
Chapter 1 – Introduction
In recent years, the increased usage and widespread understanding of computer systems has led to
a growth in attempts to exploit these systems for data and money, or simply with the intent of
vandalism. According to Cybersecurity Ventures, ransomware attacks alone are projected to cost
over twenty billion US dollars in 2021 [1]. This threat has led to growth in the malware analysis
services sector, which offers solutions that help classify software as malicious or benign.
A study by MarketsandMarkets Research Pvt. Ltd. projected that between 2019 and 2024, the market
for malware analysis would grow from three to twelve billion US dollars [2].
1.1. Malicious software
Malicious software, or malware for short, is a program that is designed to compromise a digital
system. Malware specimens were initially made for experimentation purposes to study the
vulnerabilities in computer system architecture and programming. Malware has rapidly evolved
from targeting personal computers to almost every device used today, from mobile phones to
ATMs. The diversity of malware and the sophistication with which it is deployed are expected to
keep increasing.
Malware is broadly classified into the following categories:
1. Worms: Malware that self-replicates and spreads to other computers through a
computer network. Worms can corrupt and steal data, install backdoors for hackers, and
consume memory and bandwidth.
2. Virus: The term “virus” was formally defined by Frederick B. Cohen in his Ph.D. thesis in 1986. He
defined the term as: “A program that can infect other programs by modifying them to
include a, possibly evolved, version of itself.”
3. Trojan horse: A software which disguises itself as a legitimate program and when
downloaded and installed, is used to disrupt or damage data or a network.
4. Spyware: A malware that installs itself on a computer to monitor the behaviour and to steal
sensitive information from a user. Spyware can also be used to grant remote access of the
compromised system to predators.
5. Adware: This type of software is designed to help companies generate more revenue by
automatically displaying advertisement banners and pop-ups while another program is
running.
6. Ransomware: A program which threatens to perpetually block or publish personal user data
unless a ransom is paid. The attacker does so by using a disguised link to trick the user into
downloading the malware file which then encrypts the user data. The data can then only be
unlocked through a secret key which is usually promised to be revealed upon payment.
1.2. Machine learning
Machine learning is a part of the broader subject of artificial intelligence and enables
a machine to use real-world data to solve a problem. It draws on a
combination of statistics, applied mathematics, and computer science. Machine learning allows
computers to automatically improve their performance from experience, without any
additional programming to make those improvements. A training algorithm is used to
generate a model which can generalise well to unseen data.
Machine learning is classified into the following three paradigms:
1. Supervised Learning: This methodology uses real-world datasets which consist of training
features and associated labels to generate an inferred function which can then be used to
label unseen and unlabelled data.
2. Unsupervised Learning: In the unsupervised learning approach, an unlabelled and
uncategorized dataset is used to train a machine learning model, which must discover
patterns and structure in the data by itself.
3. Reinforcement Learning: This approach involves training through reward and punishment
of the behaviour of an intelligent agent, which takes actions with the goal of maximizing
cumulative reward.
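To make the supervised paradigm concrete, the following is a minimal sketch of a nearest-centroid classifier written in plain Python. It is only an illustration and not a model used in this project: the function names fit_centroids and predict, the two features (byte entropy and file size in kilobytes) and the toy values are all hypothetical.

```python
def fit_centroids(features, labels):
    """Learn one mean feature vector (centroid) per label from labelled data."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Label an unseen sample with the class of the closest centroid."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, x))
    return min(centroids, key=lambda y: squared_distance(centroids[y]))


# Hypothetical toy data: [byte entropy, file size in KB]
train_features = [[4.1, 120.0], [4.5, 200.0], [7.6, 90.0], [7.9, 60.0]]
train_labels = ["safe", "safe", "malware", "malware"]

centroids = fit_centroids(train_features, train_labels)
print(predict(centroids, [7.7, 80.0]))   # malware
print(predict(centroids, [4.0, 150.0]))  # safe
```

The training step plays the role of the inferred function described above: it is built from labelled samples and then applied to samples whose labels are unknown.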
Deep learning is a subfield of machine learning wherein features are extracted from raw data
progressively, from lower to higher degrees of abstraction. Deep learning models are composed
of multiple layers arranged in a hierarchical manner of increasing complexity.
1.3. Malware detection using machine learning
Earlier, the methodology used for malware detection involved manual configuration of malware
fingerprints by analysts and security experts. This signature-based malware detection was used
extensively throughout the industries and involved detection through manually configured and
regularly updated pre-execution rules. These rules were based on the features of files such as
fragments of code and several other properties of a file.
But this methodology eventually became obsolete due to the substantial increase of the volume of
malware produced that made manual configuration of fingerprints unrealistic since even a small
change in a file’s properties could render these rules ineffective.
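The brittleness of fingerprint matching can be illustrated with a minimal sketch. It assumes the simplest possible signature, a SHA-256 hash of the whole file, which is only one of several kinds of signature; the sample bytes and the names signature, signature_db and is_flagged are hypothetical.

```python
import hashlib


def signature(data: bytes) -> str:
    """Return a SHA-256 fingerprint of the raw file content."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical known sample and a signature database built from it
known_sample = b"MZ\x90\x00" + b"malicious payload" * 4
signature_db = {signature(known_sample)}

# A variant that differs from the known sample by one appended byte
variant = known_sample + b"\x00"


def is_flagged(data: bytes) -> bool:
    """Flag a file only if its fingerprint is already in the database."""
    return signature(data) in signature_db


print(is_flagged(known_sample))  # True
print(is_flagged(variant))       # False
```

A single changed byte yields a completely different fingerprint, so every variant would need its own manually added entry.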
This has led to the exploration of machine learning and deep learning-based malware detection
approaches which are able to use intelligent solutions in order to keep up with the evolution of
malware. The following methods are used in malware detection:
1. Static analysis: Features of a program are extracted and used to predict its nature without
executing the code. Such features include but are not limited to file format, binary data of
the file, text strings, etc. This method is the least computationally expensive but could fail to
detect malware that uses effective code obfuscation.
2. Dynamic analysis: This method takes a behaviour-based approach to examining an
executable file. The executable file is run and its behaviour is studied on an air-gapped
machine, a virtual machine or a sandbox. The behaviour includes API calls,
memory writes, registry changes, etc. This method is typically used after static methods have
been exhausted.
3. Hybrid malware detection: This method combines both static and dynamic methodologies.
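As a minimal sketch of static analysis, the following Python code derives a few simple features from a file's bytes without executing it. The chosen features (size, Shannon entropy of the byte distribution, and printable ASCII strings) and the function names are assumptions made for illustration; they are not the feature set selected in this project.

```python
import math
import re
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def printable_strings(data: bytes, min_len: int = 4) -> list:
    """Extract runs of printable ASCII characters of at least min_len bytes."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]


def static_features(data: bytes) -> dict:
    """Collect the features above without running the file."""
    return {
        "size": len(data),
        "entropy": byte_entropy(data),
        "strings": printable_strings(data),
    }


# Hypothetical byte content standing in for an executable file
sample = b"\x00\x01kernel32.dll\x00ab\x00LoadLibraryA\xff"
print(static_features(sample)["strings"])  # ['kernel32.dll', 'LoadLibraryA']
```

High entropy can hint at packed or encrypted content, which is one way obfuscation shows up even in static features.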
Deep learning-based approaches do not require expertly selected feature configurations based on
domain knowledge. Instead, deep learning involves approaches that include:
1. Extracting a feature vector to represent the executable.

2. Examining grayscale image pixel intensity data that is produced using the binary content
of a file.
3. Studying API call traces.
4. Examining binary representation of a file, etc.
Chapter 2 – Phase 1: Description
Following the machine learning pathway, the first step in the first phase of this project is to explore
various datasets suitable for this study. The datasets explored in this project include executable
files, binary representations of those files, and metadata information. The executable files are
studied without their execution as this study focuses on static malware detection.
This is followed by an examination of features from a dataset to assess which features could be
most useful to train a generalized machine learning model. At this stage, this is done by learning
from the results of related work in this field. In the further stages, features will be selected based
upon usefulness measured by training models and observing their performance on unseen data.
2.1. Study of datasets
This is a fundamental stage of the machine learning pathway and involves identifying appropriate
data sources and available datasets. The following are some of the attributes of a dataset which are
used to assess the usefulness of a dataset:
1. The size of the dataset
2. The source of the dataset
3. The age of the dataset
4. Feature representation
5. The number of unlabelled samples
The following datasets were studied in this project:
1. Microsoft Malware Classification Challenge (BIG 2015) [3]: This dataset was published
by Microsoft as part of a malware classification challenge hosted on Kaggle in 2015. When
uncompressed, this dataset contains nearly half a terabyte of data. The dataset consists of
bytecode and disassembly code from over twenty thousand malware files.
The malware samples which are represented in this dataset belong to nine malware
families. These malware families, or classes, are shown in Fig 1.
[Figure: the nine malware classes of the MMCC dataset: Ramnit, Lollipop, Kelihos_ver3, Vundo, Simda, Tracur, Kelihos_ver1, Obfuscator.ACY and Gatak]
Fig 1: Malware classes in MMCC dataset
The MMCC malware samples have the following specifications:
a. A name: A 20-character hash value that uniquely identifies the file
b. A label: Integer representation of one of the nine malware families
Each malware sample has two files associated with it which are described in Fig 2.
.bytes file: Hexadecimal representation of a sample’s binary content. The PE header is removed to ensure sterility.
.asm file: Metadata information of the sample. Includes function calls, strings, etc.
Fig 2: Malware sample files
2. Sophos-ReversingLabs 20 million dataset [4]: This dataset is hosted on the Amazon S3 storage
service and contains approximately eight terabytes of data. It is a large-scale dataset
comprising almost twenty million files with pre-extracted features and metadata, labels
derived from various sources, and “tags” for every malware file which serve
as additional targets. Additionally, the dataset contains over ten million disarmed malware
samples.
The SoReL-20M dataset contains the following data for each malware sample:
a. Features of the samples that are derived in accordance with the format of the EMBER
2.0 dataset.
b. Labels for each sample which are obtained by using both external as well as internal
Sophos sources.
c. PE metadata of malware files obtained using the pefile library.
d. Binary files of malware samples.
3. Malimg (Nataraj et al., 2011) dataset [5] [6]: This dataset contains PNG image
representations of over nine thousand malware files. A total of 25 malware families are
represented in this dataset.
The malware families represented in this dataset are shown in Fig 3.
Fig 3: Malware classes in Malimg dataset
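Returning to the .bytes files of the MMCC dataset described above, the following minimal sketch parses one line of such a file into integer byte values. It assumes that each line holds an address followed by sixteen hexadecimal byte values and that ?? marks an unknown byte, which is how the annexure code treats the files; the function name parse_bytes_line and the sample line are hypothetical.

```python
def parse_bytes_line(line: str):
    """Convert one line of a .bytes file into a list of 16 integers.

    Returns None for lines that do not have an address plus 16 byte tokens.
    Unknown bytes, written as '??', are mapped to 0.
    """
    tokens = line.split()
    if len(tokens) != 17:
        return None
    return [0 if token == "??" else int(token, 16) for token in tokens[1:]]


# Hypothetical line in the assumed layout: address, then 16 byte values
sample_line = "00401000 56 8D 44 24 08 50 8B F1 E8 1C 1B 00 00 C7 06 ??"
print(parse_bytes_line(sample_line))
# [86, 141, 68, 36, 8, 80, 139, 241, 232, 28, 27, 0, 0, 199, 6, 0]
```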
2.2. Portable Executable file format
The Portable Executable, or PE, format is used for executable and DLL files in Windows
environments and was first used in the Windows NT operating system. The contents of the file are
laid out in a linear manner. Features are extracted from this file to train a machine learning
model which is used in static malware analysis.
The structure of a PE file is described in Fig 4.
Fig 4: PE file structure [7]
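The linear layout of a PE file can be sketched with the standard struct module. The example below reads only the first few header fields: the MZ signature, the offset of the PE signature stored at 0x3C, and the Machine, NumberOfSections and TimeDateStamp fields of the COFF header. The function names and the synthetic header are hypothetical, and a full parser such as the pefile library handles many more fields and edge cases.

```python
import struct


def parse_pe_header(data: bytes) -> dict:
    """Read the first header fields of a PE file from its raw bytes."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew: little-endian 32-bit offset of the PE signature, stored at 0x3C
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The COFF file header starts right after the 4-byte PE signature
    machine, num_sections, timestamp = struct.unpack_from("<HHI", data, pe_offset + 4)
    return {
        "pe_offset": pe_offset,
        "machine": machine,
        "num_sections": num_sections,
        "timestamp": timestamp,
    }


def build_minimal_header(pe_offset=0x80, machine=0x14C, num_sections=3, timestamp=0):
    """Build a synthetic header for demonstration; it is not a valid executable."""
    dos = bytearray(pe_offset)
    dos[0:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, pe_offset)
    coff = struct.pack("<HHI", machine, num_sections, timestamp)
    return bytes(dos) + b"PE\x00\x00" + coff


info = parse_pe_header(build_minimal_header())
print(hex(info["machine"]), info["num_sections"])  # 0x14c 3
```

Fields such as the number of sections or the timestamp are examples of the header values that can later serve as static features.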
2.3. Feature engineering
At this stage, the raw data of the samples is transformed into features which can be used to
effectively train a machine learning model.
[Figure: Raw Data -> (Feature Engineering) -> Features -> (Fitting) -> Model -> (Inference) -> Output]
Fig 5: Feature engineering in machine learning pathway
In this instance of feature engineering, the hexadecimal content of the malware file samples is
transformed to produce a PNG image representation of the malware file. These PNG files will be
used in the later phases of this project to train a deep learning model.
[Figure: Malware Binary Data -> Hexadecimal Representation -> PNG Image]
Fig 6: Binary file to PNG representation
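The transformation can be sketched without any imaging library by computing the image dimensions and arranging the bytes into rows of pixel intensities. The sketch follows the sizing rule used in the annexure code (a power-of-two width just above the square root of the byte count), but uses math.isqrt and math.log2 in place of the annexure's expressions; the function names are hypothetical.

```python
import math


def image_shape(num_bytes: int) -> tuple:
    """Return (height, width) for a grayscale image holding num_bytes pixels."""
    if num_bytes < 1:
        raise ValueError("at least one byte is required")
    side = math.isqrt(num_bytes)
    # Smallest power of two strictly greater than 2**floor(log2(side))
    width = 2 ** (int(math.log2(side)) + 1)
    height = num_bytes // width
    return height, width


def to_pixel_rows(data: bytes) -> list:
    """Arrange bytes into rows of pixel intensities, dropping the leftover tail."""
    height, width = image_shape(len(data))
    return [list(data[r * width:(r + 1) * width]) for r in range(height)]


print(image_shape(65536))  # (128, 512)
print(image_shape(1000))   # (31, 32)
print(to_pixel_rows(bytes(range(16))))
# [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]]
```

Each byte value from 0 to 255 maps directly to one grayscale pixel, so no information other than the dropped tail is lost.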
Feature engineering is a continuous process in the machine learning pathway and newer features
will be explored throughout the later stages as this project evolves. The image representation of
malware files is only an example of possible features.
The PNG images as such cannot be directly fed to the model since there is a great variance in sizes
of malware images. These images need to be processed first so that they are of a common scale.
This helps a deep learning model to converge faster.
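A minimal sketch of this preprocessing is given below. It assumes nearest-neighbour resampling and scaling of pixel values to the range 0 to 1; the interpolation actually applied by the library may differ, and the function names are hypothetical.

```python
def resize_nearest(img, out_h, out_w):
    """Resize a 2-D list of pixels to out_h x out_w by nearest-neighbour sampling."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def normalize(img):
    """Scale 0-255 pixel intensities to the range [0, 1]."""
    return [[pixel / 255.0 for pixel in row] for row in img]


# Hypothetical 2 x 2 image brought to a common 4 x 4 scale
small = [[0, 255], [128, 64]]
resized = resize_nearest(small, 4, 4)
print(resized)
# [[0, 0, 255, 255], [0, 0, 255, 255], [128, 128, 64, 64], [128, 128, 64, 64]]
print(normalize([[0, 255]]))  # [[0.0, 1.0]]
```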
The ImageDataGenerator.flow_from_directory() method of the Keras API in the TensorFlow Python library is used
to generate this tensor image data as shown in Fig 7. This demonstration uses only ten samples
from each of the malware classes. In later stages of this project, thousands of samples will be
processed to train a deep learning model.
Fig 7: Generation of tensor image data using Keras
A sample of normalized tensor image data created using the TensorFlow python library is shown
in Fig 8. This data is used to train a deep learning model.
Fig 8: A sample of PNG malware image representation
Feature engineering is the most time-consuming stage of the machine learning pathway and
involves deep statistical analysis of the data. An exhaustive study of the features on the basis of
domain knowledge and with respect to specific machine learning models is required.
Finally, after feature engineering the resultant data is split into the following two randomly
generated subsets using the train_test_split() method of the Scikit-learn python library:
1. Training set: Used to train a machine learning model to help it understand the relation
between the features and labels.
2. Testing set: The performance of a model is assessed using this subset by the use of several
evaluation metrics.
Fig 9: Splitting the dataset into test and train subsets
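The idea behind such a random split can be sketched in plain Python. This is an illustration of the technique and not the Scikit-learn implementation: the function name split_dataset, the fixed seed and the rounding of the test-set size are assumptions.

```python
import random


def split_dataset(samples, labels, test_size=0.3, seed=42):
    """Shuffle indices once and cut them into disjoint train and test subsets."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(samples) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [samples[i] for i in train_idx],
        [samples[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


# Hypothetical data: 90 samples spread over 9 classes, as in the demonstration
samples = list(range(90))
labels = [i % 9 for i in samples]
X_train, X_test, y_train, y_test = split_dataset(samples, labels)
print(len(X_train), len(X_test))  # 63 27
```

Because the split is purely random, class proportions in the two subsets may differ slightly; a stratified split would preserve them.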
Chapter 3 – Tools and Technologies
Hardware Requirement:
1. CPU : Intel Core i5 8th generation or better
2. GPU : Preferred
3. RAM : 8 GB or better
Software Requirements:
1. Anaconda distribution, including all essential machine learning tools.
Tools Used:
1. Anaconda: A Python and R distribution which is primarily used for data science and
machine learning applications. The primary objective of this tool is to make package
management simple. It includes data-science packages adapted to Windows, Linux and
macOS.
2. Keras: An open-source Python library that works as an interface for the TensorFlow Python
library. Keras provides high-level APIs for machine learning.
3. JupyterLab: An interactive environment which helps in integrating code, data and notebooks
in a single interface.
4. Amazon S3: Amazon Simple Storage Service, or Amazon S3, is a cloud service that is used
for object storage and is provided by Amazon Web Services. A practically unlimited amount
of data can be stored and accessed from anywhere through Amazon S3.
Chapter 4 – Summary: Phase 1
The first phase of this project follows the machine learning procedure as described in Fig 10.
Fig 10: Phase 1: Summary
Chapter 5 – Project Pathway
The project on malware detection using machine learning is divided into three phases as described
in Fig 11.
Fig 11: Project pathway
Bibliography
[1] D. Braue, "Global Ransomware Damage Costs Predicted To Exceed $265 Billion By 2031," 3 June 2021. [Online]. Available: https://cybersecurityventures.com/global-ransomware-damage-costs-predicted-to-reach-250-billion-usd-by-2031/.
[2] businesswire, "Global Malware Analysis Market Expected to Grow with a CAGR of 31% During the Forecast Period, 2019-2024 - ResearchAndMarkets.com," businesswire, 13 December 2019. [Online]. Available: https://www.businesswire.com/news/home/20191213005123/en/Global-Malware-Analysis-Market-Expected-to-Grow-with-a-CAGR-of-31-During-the-Forecast-Period-2019-2024---ResearchAndMarkets.com.
[3] Microsoft, "Microsoft Malware Classification Challenge (BIG 2015)," Microsoft, 2015. [Online]. Available: http://arxiv.org/abs/1802.10135.
[4] Sophos, "Sophos-ReversingLabs (SOREL) 20 Million sample malware dataset," Sophos-ReversingLabs, 14 December 2020. [Online]. Available: https://ai.sophos.com/2020/12/14/sophos-reversinglabs-sorel-20-million-sample-malware-dataset/.
[5] L. Nataraj, S. Karthikeyan, G. Jacob and B. S. Manjunath, "Malware Images: Visualization and Automatic Classification," 2011.
[6] H. Mallet, "Malware Classification using Convolutional Neural Networks — Step by Step Tutorial," 27 May 2020. [Online]. Available: https://towardsdatascience.com/malware-classification-using-convolutional-neural-networks-step-by-step-tutorial-a3e8d97122f.
[7] Wikipedia, "Portable Executable," [Online]. Available: https://en.wikipedia.org/wiki/Portable_Executable.
[8] Kaspersky, "Machine Learning for Malware Detection," [Online]. Available: https://media.kaspersky.com/en/enterprise-security/Kaspersky-Lab-Whitepaper-Machine-Learning.pdf.
Annexure – Implementation and Code
Malware Detection Using Machine Learning, Phase 1
October 23, 2021
1 A PROJECT ON MALWARE DETECTION USING MACHINE LEARNING
Group Members: Anagh Sharma | Kartikey Sharma | Vishal Bora | Harendra Singh Bisht
Group ID: CSE19-G58
Importing relevant libraries

[1]: import os
from math import log
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import random
Defining the path to the dataset folder

[2]: root = 'C:\\Users\\ANAGH\\Documents\\College\\Group_Project\\Datasets\\Microsoft Malware Classification Challenge\\MMCC'
Displaying malware classes of the dataset

[3]: for subdir, dirs, files in os.walk(root):
         for d in dirs:
             print(f'Malware Class: {d}')
Malware Class: 1_Ramnit

Malware Class: 2_Lollipop
Malware Class: 3_Kelihos_ver3
Malware Class: 4_Vundo
Malware Class: 5_Simda
Malware Class: 6_Tracur
Malware Class: 7_Kelihos_ver1
Malware Class: 8_Obfuscator.ACY
Malware Class: 9_Gatak
Converting hexadecimal representation of the file’s binary content to PNG image
[4]: def convert(array, name, out_dir):
         """Reshape an (n, 16) array of byte values into a grayscale image and save it as PNG."""
         print('Converting: ' + name)
         # Choose a power-of-two width just above the square root of the byte count
         b = int((array.shape[0] * 16) ** 0.5)
         b = 2 ** (int(log(b) / log(2)) + 1)
         a = int(array.shape[0] * 16 / b)
         # Drop trailing rows so that the data fills exactly a x b pixels
         array = array[:a * b // 16, :]
         array = np.reshape(array, (a, b))
         img = Image.fromarray(np.uint8(array))
         img.save(os.path.join(out_dir, name + '.png'), 'PNG')
         return img

     for subdir, dirs, files in os.walk(root):
         for d in dirs:
             print(f'Iterating through folder of class: {d}')
             class_dir = os.path.join(subdir, d)
             for name in os.listdir(class_dir):
                 if not name.endswith('.bytes'):
                     continue
                 array = []
                 with open(os.path.join(class_dir, name)) as f:
                     for line in f:
                         c = line.split()
                         # A valid line is an address followed by 16 byte values
                         if len(c) != 17:
                             continue
                         # Unknown bytes ('??') are replaced with 0
                         array.append([int(i, 16) if i != '??' else 0 for i in c[1:]])
                 convert(np.array(array), name, class_dir)
Iterating through folder of class: 1_Ramnit

Converting: 01kcPWA9K2BOxQeS5Rju.bytes
Converting: 04EjIdbPV5e1XroFOpiN.bytes
Converting: 05EeG39MTRrI6VY21DPd.bytes
Converting: 05rJTUWYAKNegBk2wE8X.bytes
Converting: 0AnoOZDNbPXIr2MRBSCJ.bytes
Converting: 0AwWs42SUQ19mI7eDcTC.bytes
Converting: 0cH8YeO15ZywEhPrJvmj.bytes
Converting: 0DNVFKwYlcjO7bTfJ5p1.bytes
Converting: 0DqUX5rkg3IbMY6BLGCE.bytes
Converting: 0eaNKwluUmkYdIvZ923c.bytes
Iterating through folder of class: 2_Lollipop
Converting: 01IsoiSMh5gxyDYTl4CB.bytes
Converting: 02JqQ7H3yEoD8viYWlmS.bytes
Converting: 02K5GMYITj7bBoAisEmD.bytes
Converting: 02MRILoE6rNhmt7FUi45.bytes
Converting: 02zcUmKV16Lya5xqnPGB.bytes
Converting: 05aiMRw13bYWqZ8OHvjl.bytes
Converting: 05IXcWGxvnkto4sq17zZ.bytes
Converting: 05Kps4iFw8mOLJZQrb1H.bytes
Converting: 065EZhxgbLRSHsB87uIF.bytes
Converting: 06aLOj8EUXMByS423sum.bytes
Iterating through folder of class: 3_Kelihos_ver3
Converting: 04BfoQRA6XEshiNuI7pF.bytes
Converting: 04cvLCVPqBMs6yn5xGlE.bytes
Converting: 04QzZ3DVdPsEp9elLR65.bytes
Converting: 04sJnMaORYc1SV5pKjrP.bytes
Converting: 06arUi9q3wHS2C8RZxeB.bytes
Converting: 06KfrF7ltESna2ZHPVp5.bytes
Converting: 06osXqPUVM1HbvBGNncT.bytes
Converting: 07nrG1cLKUPxjOlWMFiV.bytes
Converting: 09bfacpUzuBN5W3S8KTo.bytes
Converting: 0aSTGBVRXeJhx5OcpsgC.bytes
Iterating through folder of class: 4_Vundo
Converting: 0qPGt4cRVk9NoiJgubf2.bytes
Converting: 1bL4yiwCUvSOg7tBJudf.bytes
Converting: 1eOaAY4fpV38LIdhxl95.bytes
Converting: 1FacC02JPfxSdXeD7MEw.bytes
Converting: 1gx83bLB4PSsYIKCTlZt.bytes
Converting: 1PQFYMSBLAO9TmKk2Zhj.bytes
Converting: 1S9ui2XqltCJAOGUPw7v.bytes
Converting: 1yC7BzWHgtI2FibhQ0km.bytes
Converting: 2CfJMa5HIn6D1d9EXbpe.bytes
Converting: 2g4C03AeqoZR6ctiF1Qr.bytes
Iterating through folder of class: 5_Simda
Converting: 0qjuDC7Rhx9rHkLlItAp.bytes
Converting: 1IpWLz6eyhVxDAfQMKEd.bytes
Converting: 1KB3Z7gd5aN4Xmx8W0sf.bytes
Converting: 2aHfrLhcPTj5GnFZXUCN.bytes
Converting: 2pwjzv6eGEb8QmHPfxSc.bytes
Converting: 2qAtoGOuMQZdmH3y7bEY.bytes
Converting: 3m8kb5ILPrHcMC1o9Nht.bytes
Converting: 3zZpqyclD9B2v5Qas18m.bytes
Converting: 40KRbGeQZ8PwcUgt5joa.bytes
Converting: 4UTMdcZkxzLvwygO8EuK.bytes
Iterating through folder of class: 6_Tracur
Converting: 02IOCvYEy8mjiuAQHax3.bytes
Converting: 02mlBLHZTDFXGa7Nt6cr.bytes
Converting: 03nJaQV6K2ObICUmyWoR.bytes
Converting: 05LHG8fR3iPn6agIo9z7.bytes
Converting: 08BX5Slp2I1FraZWbc6j.bytes
Converting: 09CPNMYyQjSguFrE8UOf.bytes
Converting: 09sXMJUHwQWVanrhzAoT.bytes
Converting: 0BZQIJak6Pu2tyAXfrzR.bytes
Converting: 0Cq4wfhLrKBJiut1lYAZ.bytes
Converting: 0df4cbsTBCn1VGW8lQRv.bytes
Iterating through folder of class: 7_Kelihos_ver1
Converting: 09LXtWxm1EbK5uVqcQS3.bytes
Converting: 0ACDbR5M3ZhBJajygTuf.bytes
Converting: 0b5LqcWix3J4fGIEhXQu.bytes
Converting: 0BIdbVDEgmPwjYF4xzir.bytes
Converting: 0eN9lyQfwmTVk7C2ZoYp.bytes
Converting: 0hZEqJ5eMVjU21HAG7Ii.bytes
Converting: 0KgE6ksUeytoHfl2cT4r.bytes
Converting: 0LAXajqhQy7po16dw8Tx.bytes
Converting: 0M7aSiE9csDzkmfKheVt.bytes
Converting: 0PlfqyKM1JtYZx2me5FU.bytes
Iterating through folder of class: 8_Obfuscator.ACY
Converting: 01SuzwMJEIXsK7A8dQbl.bytes
Converting: 04hSzLv5s2TDYPlcgpHB.bytes
Converting: 0aKlH1MRxLmv34QGhEJP.bytes
Converting: 0aVNj3qFgEZI6Akf4Kuv.bytes
Converting: 0aVxkvmflEizUBG2rMT4.bytes
Converting: 0BFIPv1rO83whtpMYyAs.bytes
Converting: 0BY2iPso3bEmudlUzpfq.bytes
Converting: 0C4aVbN58O1nAigFJt9z.bytes
Converting: 0cTu2bkefOAJqIhYUWFK.bytes
Converting: 0fhnXI9ESr4jgWmkiaTe.bytes
Iterating through folder of class: 9_Gatak
Converting: 01azqd4InC7m9JpocGv5.bytes
Converting: 01jsnpXSAlgw6aPeDxrU.bytes
Converting: 04mcPSei852tgIKUwTJr.bytes
Converting: 07ECKjDTyQLnabNoxrIH.bytes
Converting: 0AV6MPlrTWG4fYI7NBtQ.bytes
Converting: 0bjN3Kgw5OATSreRmEdi.bytes
Converting: 0co46B8IkPt2UN3HSaw7.bytes
Converting: 0CPaAXtyswrBq83D6VEg.bytes
Converting: 0dauMIK4ATfybzqUgNLc.bytes
Converting: 0dhL8Jvcswa7U1qHiDS5.bytes
Generating normalized tensor image data using Keras

[5]: from keras.preprocessing.image import ImageDataGenerator
     # Generate batches of image tensors; every image is resized to 512 x 512
     the_data = ImageDataGenerator().flow_from_directory(directory=root, target_size=(512, 512), batch_size=100)
     imgs, labels = next(the_data)
Found 90 images belonging to 9 classes.
Train-Test Split
[6]: from sklearn.model_selection import train_test_split
     # Scale pixel values to [0, 1] and hold out 30% of the samples for testing
     X_train, X_test, y_train, y_test = train_test_split(imgs/255., labels, test_size=0.3)
[7]: print(f'Shape of the tuple holding image data: {imgs.shape}')
print(f'Shape of the tuple holding labels of the image data: {labels.shape}')
Shape of the tuple holding image data: (90, 512, 512, 3)

Shape of the tuple holding labels of the image data: (90, 9)
Plotting randomly selected PNG images from the generated dataset

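[8]: images = imgs
[9]: if isinstance(images[0], np.ndarray):
         # Cast to uint8 so that imshow renders the 0-255 pixel values correctly
         images = np.array(images).astype(np.uint8)
[10]: fig = plt.figure(figsize=(14, 14))
      rows = 3
      columns = 3
      class_names = list(the_data.class_indices.keys())
      for i in range(rows * columns):
          fig.add_subplot(rows, columns, i + 1)
          plt.axis('off')
          # Pick a random sample and title it with its decoded one-hot label
          k = random.randint(0, len(images) - 1)
          plt.title(class_names[np.argmax(labels[k])])
          plt.imshow(images[k])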
MALWARE DETECTION USING MACHINE LEARNING
ORIGINALITY REPORT
9 %
SIMILARITY INDEX
6%
INTERNET SOURCES
2%
PUBLICATIONS
6%
STUDENT PAPERS
PRIMARY SOURCES
1
www.coursehero.com
Internet Source 3%
2
Submitted to DIT university
Student Paper 2%
3
Submitted to South Bank University
Student Paper 1%
4
Submitted to The Robert Gordon University
Student Paper 1%
5
scholarworks.sjsu.edu
Internet Source <1 %
6
link.springer.com
7
speakerdeck.com
8
Yixuan Ma, Shuang Liu, Jiajun Jiang, Guanhong Chen, Keqiu Li. "A comprehensive study on learning-based PE malware family classification methods", Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021
Publication <1 %
9
Submitted to Metropolitan Community College
Student Paper <1 %
10
ai.sophos.com
11
arun-aiml.blogspot.com
12
repositori.udl.cat
13
"Big Data Analytics", Springer Science and Business Media LLC, 2018
Publication <1 %
14
Sumit S. Lad, Amol C. Adamuthe. "Malware Classification with Improved Convolutional Neural Network Model", International Journal of Computer Network and Information Security, 2020
Publication <1 %
15
journals.riverpublishers.com
16
www.hindawi.com
Exclude quotes On Exclude matches Off
Exclude bibliography On
