Under the guidance of: Dr. Atul K. Srivastava, Assistant Professor, DIT University
MACHINE LEARNING
Vishal Bora
Harendra Singh Bisht
INTRODUCTION
With the recent growth in the complexity of cybersecurity solutions and a simultaneous
growth in the sophistication with which cybercriminals write and obfuscate malware,
research and industry investment in machine learning techniques for developing more
robust security solutions has also increased rapidly.
This project has the following two goals:
1. To explore the effectiveness of traditional machine learning methods along with
deep learning approaches to identify malicious software.
2. To identify lightweight model(s).
PROJECT PATHWAY
1. Study and compilation of various datasets.
2. Data wrangling, visualization and feature engineering.
3. Model training, collection of results and additional feature engineering.
4. Ensemble model training, collection of results and additional feature engineering.
5. Model selection, compilation of results, final analysis of the study and report
writing.
Malware families shown: Ramnit, Lollipop, Kelihos_ver3, Gatak, Kelihos_ver1, Simda,
Tracur
The MMCC malware samples have the following specifications:
- A name: MD5 hash value to uniquely identify the file
- A label: Integer representation of one of the nine malware families
Each malware sample has the following two associated files in the dataset:
- .bytes file: Hexadecimal representation of a sample's binary content. The PE
header is removed to ensure sterility.
- .asm file: Metadata information of the sample. Includes function calls,
strings, etc.
However, this raw data is not ready to be used in machine learning; feature engineering is essential.
PHASE ONE OBJECTIVE 2: FEATURE ENGINEERING
In this process, the raw data of the samples is transformed into features that can be
used to effectively train a machine learning algorithm.
Feature Engineering -> Fitting -> Inference
Feature engineering is the most time-consuming stage of the machine learning
pathway and involves deep statistical analysis of the data.
An exhaustive study of the features, on the basis of domain knowledge and with
respect to specific machine learning models, is required.
https://xkcd.com/1838/
In this instance of feature engineering, the hexadecimal content of the malware file
samples is transformed to produce a PNG image representation of the malware.
These PNG files will be used in the later phases of this project to train a deep
learning model.
Malware Binary Data -> Hexadecimal Representation -> PNG Image
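The conversion described above can be sketched in plain Python. This is a minimal
illustration under stated assumptions, not the project's actual script: it assumes
that each line of a .bytes file holds an address followed by two-digit hexadecimal
byte values, that "??" marks an unreadable byte (mapped to 0 here), and that the
image width is a fixed value chosen by the user. The helper names parse_bytes_text,
to_rows and encode_png are hypothetical, and in practice an imaging library would
normally replace the hand-written PNG encoder.

```python
import struct
import zlib


def parse_bytes_text(text):
    """Collect byte values from .bytes-style lines: an address, then hex pairs."""
    values = bytearray()
    for line in text.splitlines():
        tokens = line.split()[1:]  # drop the leading address column
        for token in tokens:
            # Assumption: "??" marks an unreadable byte; it is mapped to 0.
            values.append(0 if token == "??" else int(token, 16))
    return bytes(values)


def to_rows(data, width):
    """Cut the byte stream into fixed-width rows, zero-padding the last row."""
    rows = [data[i:i + width] for i in range(0, len(data), width)]
    if rows and len(rows[-1]) < width:
        rows[-1] = rows[-1].ljust(width, b"\x00")
    return rows


def encode_png(rows, width):
    """Encode rows of 8-bit grayscale pixels as the bytes of a PNG file."""
    def chunk(kind, payload):
        crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
        return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)

    # IHDR: width, height, bit depth 8, colour type 0 (grayscale), no interlace.
    header = struct.pack(">IIBBBBB", width, len(rows), 8, 0, 0, 0, 0)
    # Each scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + row for row in rows)
    return (b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", header)
            + chunk(b"IDAT", zlib.compress(raw)) + chunk(b"IEND", b""))


# Made-up sample text in the assumed .bytes layout.
sample = "00401000 56 8D 44 24\n00401004 ?? ?? FF 00\n00401008 C7 06\n"
pixels = parse_bytes_text(sample)
rows = to_rows(pixels, width=4)
png_bytes = encode_png(rows, width=4)  # write with open(path, "wb") to save
```

Each byte becomes one grayscale pixel, so the image height depends on the file size.
This is why images from different samples differ in size.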
Feature engineering is a continuous process in the machine learning pathway and
newer features will be explored throughout the later stages as this project evolves.
The image representation of malware files is only an example of possible features.
The PNG images cannot be fed directly to the model because the malware images vary
greatly in size. These images must first be processed so that they share a common
scale, which helps a deep learning model converge faster.
The ImageDataGenerator.flow_from_directory() method of the TensorFlow Python
library is used to generate this tensor image data.
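To illustrate what bringing images to a common scale involves, the sketch below
resizes a grayscale image to a fixed size and rescales its pixel values to the range
0 to 1. It is a plain-Python stand-in for illustration only and does not reproduce
TensorFlow's implementation. It assumes that "common scale" covers both a common
image size and a common pixel range, and that nearest-neighbour sampling is an
acceptable resizing method. The names resize_nearest and rescale are hypothetical.

```python
def resize_nearest(image, out_h, out_w):
    """Resize a 2D list of pixels with nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def rescale(image):
    """Map 8-bit pixel values (0..255) to floats in the range 0..1."""
    return [[pixel / 255.0 for pixel in row] for row in image]


# Two made-up images of different sizes end up with the same shape.
small = [[0, 255], [128, 64]]
wide = [[10, 20, 30, 40, 50, 60, 70, 80]]
resized_small = resize_nearest(small, 4, 4)
resized_wide = resize_nearest(wide, 4, 4)
normalized = rescale(resized_small)
```

After this step every image has the same shape and value range, so the images can be
stacked into batches for training.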
This demonstration uses only ten samples from each of the malware classes. In a later
stage of this project, thousands of samples will be processed to train a deep
learning model.
The following is a sample of normalized tensor image data created using the TensorFlow
Python library. This data is used to train a deep learning model.
Finally, after feature engineering, the resultant data is split into the following two
randomly generated subsets using the train_test_split() method of the scikit-learn
Python library:
- Training set: Used to train a machine learning model to help it understand
the relation between the features and labels.
- Testing set: Used to evaluate the performance of a model by the use of
several evaluation metrics.
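The idea behind a random train/test split can be sketched in plain Python as follows.
This is an illustrative stand-in, not the scikit-learn implementation: the function
name split_train_test, the 20% test fraction, the seed and the sample data are all
assumptions made for the example. The return order mirrors the order that
train_test_split() uses (training features, testing features, training labels,
testing labels).

```python
import random


def split_train_test(samples, labels, test_fraction=0.2, seed=0):
    """Shuffle indices once, then cut them into a test part and a train part."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(sequence, chosen):
        return [sequence[i] for i in chosen]

    return (pick(samples, train_idx), pick(samples, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Made-up sample names with integer family labels in the range 1..9.
names = ["sample_%d" % i for i in range(10)]
families = [i % 9 + 1 for i in range(10)]
x_train, x_test, y_train, y_test = split_train_test(names, families)
```

Because the same shuffled indices are applied to samples and labels, each sample
stays paired with its own label in both subsets.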
SUMMARY: PHASE 1, MIDTERM
HARDWARE
- Storage: 500GB or more
SOFTWARE
- TensorFlow and other libraries
- JupyterLab
BIBLIOGRAPHY
Datasets:
1) Microsoft Malware Classification Challenge (BIG 2015)
https://www.kaggle.com/c/malware-classification/overview
2) Sophos-ReversingLabs (SOREL) 20 Million sample malware dataset
https://github.com/sophos-ai/SOREL-20M#a-note-on-dataset-size
3) Malimg (Nataraj et al., 2011)
https://www.dropbox.com/s/ep8qjakfwh1rzk4/malimg_dataset.zip?dl=0
Group Members: