
PHASE 1, MIDTERM

Data Collection and Feature Engineering

Under the guidance of: Dr. Atul K. Srivastava, Assistant Professor, DIT University

A PROJECT ON
MALWARE DETECTION USING MACHINE LEARNING

Presented by:
Anagh Sharma
Kartikey Sharma
Vishal Bora
Harendra Singh Bisht
INTRODUCTION
With the recent growth in the complexity of cybersecurity solutions, and the
simultaneous growth in the sophistication with which cybercriminals write and
obfuscate malware, research and industry investment in machine learning techniques
for building more robust security solutions has also increased sharply.
This project has the following two goals:
1. To explore the effectiveness of traditional machine learning methods along with
deep learning approaches to identify malicious software.
2. To identify lightweight model(s).
PROJECT PATHWAY

STAGE 01: Study and compilation of various datasets.
STAGE 02: Data wrangling, visualization and feature engineering.
STAGE 03: Model training, collection of results and additional feature engineering.
STAGE 04: Ensemble model training, collection of results and additional feature engineering.
STAGE 05: Model selection, compilation of results, final analysis of the study and report writing.

The five stages are spread across PHASE 1, PHASE 2 and PHASE 3.


PHASE ONE OBJECTIVE 1: DATA COLLECTION
This is a fundamental stage of the machine learning pathway and involves identifying
appropriate data sources and datasets. Several datasets suitable for the exploration
of the effectiveness of machine learning algorithms in malware identification have
been studied in this process. The following attributes, among others, are used to
assess a dataset's usefulness:
1. the size of the dataset
2. the source and age of the dataset
3. feature representation
4. the number of unlabeled samples
The dataset being studied at this stage of the project is from the Microsoft Malware
Classification Challenge (BIG 2015), abbreviated here as MMCC. Other datasets may
also be explored in later phases of this project. The MMCC dataset contains over
twenty thousand malware samples. The following malware families are represented in
this dataset:

- Ramnit
- Lollipop
- Kelihos_ver3
- Vundo
- Simda
- Tracur
- Kelihos_ver1
- Obfuscator.ACY
- Gatak
The MMCC malware samples have the following specifications:
- A name: MD5 hash value to uniquely identify the file
- A label: Integer representation of one of the nine malware families

Each malware sample has the following two associated files in the dataset:

- .bytes file: Hexadecimal representation of a sample's binary content. The PE
  header is removed to ensure sterility.
- .asm file: Metadata information of the sample. Includes function calls,
  strings, etc.
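As a rough illustration of what a .bytes file holds, the sketch below parses a few lines of hexadecimal text into integer byte values. The line layout (an address token followed by byte tokens, with `??` marking bytes that are not available) and the helper name `parse_bytes_lines` are assumptions made for this example; the slides do not show the file format or the code used in the project.

```python
def parse_bytes_lines(lines, unknown_value=0):
    """Turn text lines of a .bytes file into a flat list of ints in 0..255."""
    values = []
    for line in lines:
        tokens = line.split()
        # Assumption: the first token is an address, the rest are byte tokens.
        for token in tokens[1:]:
            if token == "??":
                # Assumption: "??" marks an unavailable byte; map it to a filler value.
                values.append(unknown_value)
            else:
                values.append(int(token, 16))
    return values


# Hypothetical sample lines, made up for illustration only.
sample = [
    "00401000 56 8D 44 24 08 50 8B F1",
    "00401008 E8 1C 1B 00 00 ?? ?? C7",
]
byte_values = parse_bytes_lines(sample)
print(len(byte_values), byte_values[:4])
```

Mapping unavailable bytes to zero is only one possible choice. Dropping them, or using a separate marker value, would be equally reasonable depending on the feature being built.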
However, this raw data is not ready to be used in machine learning. Feature engineering is essential!
PHASE ONE OBJECTIVE 2: FEATURE ENGINEERING
In this process, the raw data of the samples is transformed into features that can be
used to effectively train a machine learning algorithm.

Raw Data -> (Feature Engineering) -> Features -> (Fitting) -> Model -> (Inference) -> Output

Feature engineering is the most time-consuming stage of the machine learning
pathway and involves deep statistical analysis of the data. An exhaustive study of
the features, on the basis of domain knowledge and with respect to specific machine
learning models, is required.

https://xkcd.com/1838/
In this instance of feature engineering, the hexadecimal content of the malware file
samples is transformed to produce a PNG image representation of the malware.
These PNG files will be used in the later phases of this project to train a deep
learning model.

Malware Binary Data -> Hexadecimal Representation -> PNG Image
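The transformation from byte values to a grayscale image can be sketched as follows, using only the Python standard library. This is a minimal sketch under stated assumptions, not the project's actual code: the fixed row width, the zero padding of the last row, and the helper names `bytes_to_rows` and `encode_grayscale_png` are all choices made for this example. In practice an imaging library would usually handle the PNG encoding.

```python
import struct
import zlib


def bytes_to_rows(values, width):
    """Arrange byte values into rows of `width` pixels, zero-padding the last row."""
    padded = list(values) + [0] * (-len(values) % width)
    return [padded[i:i + width] for i in range(0, len(padded), width)]


def _chunk(tag, data):
    """Build one PNG chunk: length, tag, data, then CRC-32 over tag + data."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def encode_grayscale_png(rows):
    """Encode rows of 0..255 ints as an 8-bit grayscale PNG byte string."""
    height, width = len(rows), len(rows[0])
    # IHDR fields: width, height, bit depth 8, color type 0 (grayscale),
    # compression 0, filter 0, interlace 0.
    header = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Every scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + bytes(row) for row in rows)
    return (b"\x89PNG\r\n\x1a\n"
            + _chunk(b"IHDR", header)
            + _chunk(b"IDAT", zlib.compress(raw))
            + _chunk(b"IEND", b""))


# Hypothetical byte values standing in for the content of one sample.
values = [0x56, 0x8D, 0x44, 0x24, 0x08, 0x50, 0x8B, 0xF1, 0xE8, 0x1C]
rows = bytes_to_rows(values, width=4)
png_bytes = encode_grayscale_png(rows)
# To inspect the result, write it to disk:
# with open("sample.png", "wb") as f:
#     f.write(png_bytes)
```

Each byte becomes one pixel whose brightness equals the byte value, so files of different lengths produce images of different heights for the same width.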
Feature engineering is a continuous process in the machine learning pathway and
newer features will be explored throughout the later stages as this project evolves.
The image representation of malware files is only an example of possible features.
The PNG images cannot be fed directly to the model, since the sizes of the malware
images vary greatly. The images first need to be processed so that they share a
common scale, which helps a deep learning model converge faster.
The ImageDataGenerator.flow_from_directory() method of the TensorFlow Python
library is used to generate this tensor image data.
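What bringing images to a common scale amounts to can be sketched in plain Python: resize every image to one fixed shape and divide pixel values so they fall in the range 0 to 1. This is an illustration of the idea only, not TensorFlow's implementation and not the project's code. The nearest-neighbour resizing, the 4 x 4 target shape, and the helper names `resize_nearest` and `rescale` are assumptions made for this example. In TensorFlow the corresponding settings would presumably be the `target_size` argument of flow_from_directory() and the `rescale` argument of ImageDataGenerator.

```python
def resize_nearest(rows, target_height, target_width):
    """Resize a 2D grid of pixels with nearest-neighbour sampling."""
    src_height, src_width = len(rows), len(rows[0])
    return [
        [rows[y * src_height // target_height][x * src_width // target_width]
         for x in range(target_width)]
        for y in range(target_height)
    ]


def rescale(rows, max_value=255.0):
    """Map 0..255 pixel values to floats in the range 0.0..1.0."""
    return [[pixel / max_value for pixel in row] for row in rows]


# A hypothetical 2 x 4 grayscale image.
image = [
    [0, 64, 128, 255],
    [255, 128, 64, 0],
]
resized = resize_nearest(image, target_height=4, target_width=4)
normalized = rescale(resized)
print(len(normalized), len(normalized[0]))
```

After this step every image has the same shape and value range, whatever the size of the original malware file.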
This demonstration uses only ten samples from each of the malware classes. In later
stages of this project, thousands of samples will be processed to train a deep
learning model.
The following is a sample of normalized tensor image data created using the
TensorFlow Python library. This data is used to train a deep learning model.
Finally, after feature engineering, the resultant data is split into the following
two randomly generated subsets using the train_test_split() method of the
scikit-learn Python library:
- Training set: Used to train a machine learning model to help it learn
the relation between the features and labels.
- Testing set: Used to evaluate the performance of a model by the use of
several evaluation metrics.
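The idea behind a random train and test split can be sketched with the standard library alone. This is not scikit-learn's implementation, only a minimal hold-out split for illustration. The helper name `random_split`, the 30% test fraction, the seed, and the sample file names are assumptions made for this example.

```python
import random


def random_split(samples, labels, test_fraction=0.25, seed=0):
    """Shuffle sample indices and split them into training and testing subsets."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # a seeded shuffle keeps the split reproducible
    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(sequence, chosen):
        return [sequence[i] for i in chosen]

    return (pick(samples, train_idx), pick(samples, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Hypothetical file names and labels, made up for illustration.
samples = [f"img_{i}.png" for i in range(10)]
labels = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = random_split(
    samples, labels, test_fraction=0.3, seed=42
)
print(len(X_train), len(X_test))
```

Samples and labels are indexed with the same shuffled positions, so each image keeps its label in both subsets.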
SUMMARY: PHASE 1, MIDTERM

- STUDY OF DATASETS: SOREL-20M, MMCC, Malimg, and other datasets were studied.
- MMCC DATASET SELECTED: The dataset contains hexadecimal malware data and metadata.
- PNG MALWARE IMAGES: Hexadecimal malware data is converted to PNG images to train a deep learning model.
- TENSOR IMAGE DATA: The TensorFlow library is used to normalize the PNG image malware data.
- TRAIN - TEST SPLIT: The train set is used to train a model and the test set to evaluate it.
REQUIREMENTS: PHASE 1, MIDTERM
SOFTWARE
- Python 3.0 or higher
- Scikit-learn, Tensorflow and other libraries
- JupyterLab

HARDWARE
- Processor: i3 or higher
- RAM: 8GB or more
- Storage: 500GB or more
BIBLIOGRAPHY
Datasets:
1) Microsoft Malware Classification Challenge (BIG 2015)
https://www.kaggle.com/c/malware-classification/overview
2) Sophos-ReversingLabs (SOREL) 20 Million sample malware dataset
https://github.com/sophos-ai/SOREL-20M#a-note-on-dataset-size
3) Malimg (Nataraj et al., 2011)
https://www.dropbox.com/s/ep8qjakfwh1rzk4/malimg_dataset.zip?dl=0

Research Papers and Articles:


1) Malware images: visualization and automatic classification
https://doi.org/10.1145/2016904.2016908
2) The rise of machine learning for detection and classification of malware: Research developments, trends and challenges
https://doi.org/10.1016/j.jnca.2019.102526
3) Malware Classification using Convolutional Neural Networks
https://tinyurl.com/s996a59w
GROUP INFORMATION
Group ID : CSE19-G58
Project Title : Malware Detection using Machine Learning

Group Members:

Name                  SAP ID      Roll Number  Class     Email ID
Anagh Sharma          1000013506  190102260    CSE – A   1000013506@dit.edu.in
Kartikey Sharma       1000012691  190102261    CSE – A   1000012691@dit.edu.in
Vishal Bora           1000013353  190178033    CSE – ML  1000013353@dit.edu.in
Harendra Singh Bisht  1000013129  190178033    CSE – A   1000013129@dit.edu.in
THE END
