
Botnet Detection Using Machine Learning

Submitted in partial fulfillment of the requirements


of the degree of
Bachelor of Engineering
by
Mr. Loukik Houzwala (Roll no. 13)
Mr. Pranav Kulkarni (Roll no. 18)
Mr. Yash Shivade (Roll no. 44)

Under the guidance of


Prof. Anil Hingmire

DEPARTMENT OF COMPUTER ENGINEERING


VIDYAVARDHINI’S COLLEGE OF ENGINEERING AND
TECHNOLOGY
K. T. MARG, VASAI ROAD (W.) DIST-THANE, PIN: 401202
(Affiliated to University of Mumbai)

2021-2022

CERTIFICATE

This is to certify that the project entitled “Botnet Detection Using Machine Learning” is a
bonafide work of “Loukik Houzwala (Roll No. 13), Pranav Kulkarni (Roll No. 18) and
Yash Shivade (Roll No. 44)” submitted to the University of Mumbai in partial fulfillment of
the requirement for the award of the degree of “Bachelor of Engineering” in “Computer
Engineering”.

_________________
Prof. Anil Hingmire
(Guide)

_________________ _________________
Dr. Megha Trivedi Dr. Harish Vankudre
(Head of Department) (Principal)

Project Report Approval for B.E.

This project report entitled ‘Botnet Detection Using Machine Learning’ by ‘Loukik
Houzwala, Pranav Kulkarni and Yash Shivade’ is approved for the degree of ‘Bachelor
of Engineering’ in ‘Computer Engineering’.

Examiners

1. __________________________________________

2. __________________________________________

Date:

Place:

Declaration

We declare that this written submission represents our ideas in our own words and
where others’ ideas or words have been included, we have adequately cited and referenced the
original sources. We also declare that we have adhered to all principles of academic honesty
and integrity and have not misrepresented, fabricated or falsified any idea/data/fact/source
in our submission. We understand that any violation of the above will be cause for
disciplinary action by the Institute and can also evoke penal action from the sources which
have thus not been properly cited or from whom proper permission has not been taken when
needed.

------------------------------
Loukik Houzwala (13)

------------------------------
Pranav Kulkarni (18)

------------------------------
Yash Shivade (44)

Date:

Acknowledgement
It is said that “learning is a never-ending process.” While working on the project we
have undergone the same experience of learning new things as we proceeded towards our goal
of building a botnet detection system using machine learning.
Working on the project was a new experience for us, as it opened a new gateway
wherein we had an opportunity to work on a totally new concept as far as the engineering
syllabus is concerned, where most of the concepts are to be learned by rote.
The joy of working in a new domain and learning new things was a welcome
experience for the three of us, and all we have to say is that we have cherished all the
moments as they came by, right from working on the project to the making of this report.
We would like to thank our Principal, Dr. Harish Vankudre, for constant motivation
and support to excel and for having faith in our ability. We would also like to thank our
professor Dr. Megha Trivedi (Head of the Department of Computer Engineering) for providing
her views on the subject.
We would like to thank Prof. Anil Hingmire, who guided us, shared their
knowledge and invaluable experience about the topic, and gave their precious time towards
solving our difficulties. We would also like to thank our college management for providing us
with the facilities and infrastructure for working on the project.

------------------------------
Loukik Houzwala (13)

------------------------------
Pranav Kulkarni (18)

------------------------------
Yash Shivade (44)

Date:

Abstract
The diversity and dynamism of botnets challenge detection and classification algorithms, which
depend heavily on the botnet's protocol and can therefore quickly become easy to evade. A more
general detection method was needed. Traffic from different botnets and from normal computers
was taken, and a time-based approach was used to separate them successfully. Results show that
botnet and normal computer traffic can be accurately detected by our approach, which enhances
detection effectiveness. Moreover, advances in machine learning algorithms and access to
better botnet datasets are starting to show promising results in this project.

Researchers have worked hard on creating detection algorithms for botnet network traffic.
The shift towards detection techniques based on behavioral botnet models has proved to be one
of the better approaches to the analysis of botnet patterns. We propose a system based on a
detailed analysis of the most distinctive characteristics of botnets, such as synchronism and
network load. Without relying on any specific botnet protocol, our classification approach
seeks to detect synchronized behavioral patterns in network traffic flows and clusters the
flows based on botnet characteristics. The data-set is varied, large, public, real and has
Background, Normal and Botnet labels. The tools, data-set and algorithms were released as free
software. Our algorithms give a new high-level interface to identify, visualize and block
botnet behaviors in the network.

Table of Contents

CHAPTER CONTENT

1 Introduction
1.1 Problem Definition
1.2 Aim and Objective
1.3 Motivation
2 Literature Review
2.1 Existing System
2.2 Proposed System
2.2.1 Logistic Regression Model
2.2.2 Decision Tree Model
2.2.3 Naive Bayes Model
2.2.4 Support Vector Machine
3 Project Description
3.1 Modules
3.1.1 Object Detection
3.1.2 Botnet Detection
3.1.3 Pre-processing of Data
4 Analysis
4.1 H/W and S/W Requirements
5 System Design
5.1 Flowchart
5.2 Flowchart
5.3 Data Flow Diagram
5.4 Flowchart
6 Methodology
6.1 Implementation Methodology
6.2 Sample Code
7 Result
8 Conclusion
References
Plagiarism Report

List of Figures

FIGURE NO. CONTENT

3.1 Object Detection
3.2 Traffic Distribution
3.3 Protocol Frequency Distribution
5.1 Flowchart
5.2 Flowchart
5.3 Data Flow Diagram
5.4 Flowchart
6.1 Implementation Methodology
7.1 Execution Input
7.2 Execution
7.3 Execution

Vidyavardhini’s College of Engineering and Technology Computer Engineering

Chapter 1

Introduction

1.1 Problem Definition


In the era of the Internet, botnets are on the rise, which has prompted many studies on botnet
detection and measurement. Their results show that the chance of a computer being attacked by
a bot-master is high. The contribution of the prediction technique is further analyzed in
terms of features and time, as per the requirements of the module. In this study we aim to
predict botnet attacks, such as massive spam email campaigns and distributed denial-of-service
attacks. To that end, this paper presents a prediction method for the detection of botnets.

The objective of this project is to keep PCs safe from harmful bots. As bots become more
widespread and more threatening, research efforts on methods of detecting and defending
against botnets have intensified. Different ML methods have different strengths and
weaknesses, as seen in the roles they play in bot detection. Detection, real-time monitoring
and adaptation to new threats are issues which are still to be solved for various bots.

Many cyber attacks are carried out using several techniques. One of these is the botnet
attack, which is launched by a bot-master. The aim of the project is to perform a detailed
analysis of botnets, the vulnerabilities they exploit to spread themselves, and how they
perform various suspicious activities such as botnet attacks. With their increasing number of
suspicious activities and their potential to infect a vast majority of computers on the
Internet, botnets have emerged as the single biggest threat to the Internet in today’s
day-to-day life.


1.2 Aim and Objective


As botnets become more threatening, researchers and security experts employ different
approaches and techniques to solve the problem. Detection based on bot behavior involves
describing a model of how botnets generally operate. Machine learning (ML) is a branch of
artificial intelligence that aims to develop systems with the ability to learn from past
experience. Such a system builds a model that describes the patterns that exist in the data,
and this model should be able to make informed decisions from the data.

1.3 Motivation
As botnets become more threatening, researchers and security experts employ different
approaches and techniques to solve the problem. Machine learning (ML) is a branch of
artificial intelligence that aims to develop systems with the ability to learn from past
experience. Such a system builds a model that describes the patterns that exist in the data,
and this model should be able to make informed decisions from the data. Detection based on bot
behavior involves describing a model of how botnets generally operate. Moreover, offering a
solution that separates the traffic of different botnets from normal traffic is not trivial.

Extraction of network-level features to model a botnet has been one of the most popular
methods in botnet detection research. A subset of features, usually selected based on some
intuitive understanding of botnets, is used by machine learning algorithms to classify or
cluster botnet traffic. These approaches, tested against two or three botnet traces, have
mostly shown satisfactory detection results. Even so, their effectiveness in detecting other
botnets or real traffic remains in doubt. Additionally, the effectiveness of different
combinations of features in terms of providing more detection coverage has not been fully
studied. In this paper we revisit flow-based features employed in existing botnet detection
studies and evaluate their relative effectiveness. To ensure a proper evaluation we create a
data-set containing a diverse set of botnet traces and background traffic.


Chapter 2

Literature Review
2.1 Existing System
One existing study is a survey of botnets and botnet detection that explains how bots operate.
Its aim was to explain the botnet phenomenon and explore different botnet detection
techniques. The survey classified botnet detection into four classes, namely anomaly-based,
signature-based, DNS-based, and mining-based. Along with a summary of each class, the
detection techniques are compared, and each botnet detection approach examined is placed in
one of these classes.

The botnet life-cycle comprises five stages, as specified in [1, 2, 7]. In the initial
infection stage, the C&C server scans the network and looks for vulnerabilities in the
network, servers, and systems. Obvious flaws such as buffer overflows, back-doors and
incomplete mediation are exploited, and password guessing is attempted on SQL servers. In the
connection stage, once the malware is run on the host system, a connection is established to
the C&C server; the host is now a part of the botnet and the bot-master can send commands to
it. In the malicious command and control phase, the C&C server sends attack commands to the
botnet members to disrupt online services. The update and maintenance phase is an ongoing
process: in order to avoid detection, the C&C server keeps migrating to new servers.

2.2 Proposed System


In the proposed methodology, the parameters of a network flow are taken. The network flow
data is sent to the Django back-end for pre-processing and cleaning. The training process is
carried out with the help of machine learning algorithms, which learn the behaviour and
properties of the data and identify the data with suspicious properties. The models that have
learned the suspicious properties are saved in pickle format. The saved models are then used
to predict botnets from the network flow. The detected botnets are displayed on the UI.


2.2.1 Logistic Regression Model :

For the Logistic Regression Model, the Domain Name System (DNS) is a major component of
Internet-based bots, where it is mainly used to translate the domain names used by botnets
into IP addresses. Most network services and applications depend on DNS. The Domain Name
System does not differentiate between the services of normal hosts and those of botnets.
Every bot generates a huge set of domain names and then launches queries for each of them.

2.2.2 Decision Tree Model :

A decision tree is a tree-like structure in which each node specifies a test on a feature and
each branch from that node corresponds to one of the values of the feature. To train this
classifier, the dataset is randomly split into training and testing datasets. The training
data is then used to train the model, and the trained model is evaluated on the testing data
to predict the botnets.
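
The split, train and test procedure described above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy flow records, the helper names (`split_rows`, `fit_one_level_tree`, `predict`) and the choice of a single test node are all hypothetical, and a full decision tree would repeat the same split recursively on further features.

```python
import random
from collections import Counter, defaultdict

def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and return (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_one_level_tree(train, feature):
    """Build one test node on `feature` with one branch per observed value.

    Each branch stores the majority label of the training rows that reach it.
    """
    labels_by_value = defaultdict(Counter)
    for features, label in train:
        labels_by_value[features[feature]][label] += 1
    branches = {value: counts.most_common(1)[0][0]
                for value, counts in labels_by_value.items()}
    # Used when a test row carries a value never seen during training.
    fallback = Counter(label for _, label in train).most_common(1)[0][0]
    return {"feature": feature, "branches": branches, "fallback": fallback}

def predict(tree, features):
    value = features[tree["feature"]]
    return tree["branches"].get(value, tree["fallback"])

# Hand-made toy flows (hypothetical): the State value fully determines the label.
rows = ([({"State": "S_RA"}, "Botnet")] * 20
        + [({"State": "CON"}, "Normal")] * 20)

train, test = split_rows(rows)
tree = fit_one_level_tree(train, "State")
correct = sum(predict(tree, features) == label for features, label in test)
accuracy = correct / len(test) * 100
print("Accuracy of one-level tree: %.2f %%" % accuracy)
```

On real flow data the labels are not determined by a single column, so accuracy on the held-out rows is the figure that matters, not accuracy on the training rows.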

2.2.3 Naive Bayes Model :

The Naive Bayes algorithm is a simple classification technique based on Bayes' theorem. It
assumes that each feature contributes independently to the probability of the detected
class. Specifically, in this model the classifier calculates the probability of every class
for a target feature and selects the one with the highest probability. It also assumes that
the values associated with each class of each feature follow a particular distribution.
Although these assumptions rarely hold in real life, the classifier can still show better
results than other models such as logistic regression. It can also generate models very
quickly with very little overhead. It is a popular choice for spam filters and for other
real-time tasks such as anomaly detection.
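
A minimal sketch of the idea, assuming each feature follows a Gaussian distribution within a class: the function names and the two-feature toy flows (duration, total bytes) are hypothetical and are not taken from the project code. The classifier estimates a prior and per-feature mean and variance for each class, then picks the class with the highest posterior.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the prior and per-feature mean/variance of each class."""
    grouped = defaultdict(list)
    for row, label in zip(X, y):
        grouped[label].append(row)
    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A tiny floor keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log of the normal density with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(model, row):
    """Return the class with the highest posterior, computed in log space."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        # Independence assumption: per-feature log-likelihoods are summed.
        scores[label] = math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances))
    return max(scores, key=scores.get)

# Toy flows (hypothetical values): [duration in seconds, total bytes].
X = [[1.0, 200], [1.2, 220], [0.9, 180], [30, 5000], [28, 5200], [35, 4800]]
y = ["Botnet", "Botnet", "Botnet", "Normal", "Normal", "Normal"]

model = fit_gaussian_nb(X, y)
short_flow = predict(model, [1.1, 210])
long_flow = predict(model, [31, 5100])
print(short_flow, long_flow)
```

Working in log space avoids numeric underflow when many small per-feature probabilities are multiplied together.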


2.2.4 Support Vector Machine :

Support Vector Machine is one of the most popular supervised learning algorithms and is used
for both classification and regression problems, although in machine learning it is used
primarily for classification. The main purpose of the support vector machine is to establish
a hyperplane that classifies the data in the project and to build the classification model.
A LAN environment with several computers that have been infected by the botnet malware will
be simulated for testing this model. The proposed method is a classification model in which
an artificial fish swarm algorithm and a support vector machine are combined. The packet
data of the network flow was also collected. The proposed method was used to identify the
critical features that determine the pattern of a botnet.
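
To make the hyperplane idea concrete, here is a small sketch of a linear support vector machine trained by sub-gradient descent on the hinge loss. It is an assumption-laden illustration only: it uses a linear kernel on hand-made, already scaled 2-D points, it leaves out the artificial fish swarm step entirely, and the names `train_linear_svm` and `classify` and all hyperparameter values are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit w and b for the hyperplane w.x + b = 0; labels must be -1 or +1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Point lies inside the margin: push the hyperplane away from it.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Point is safely classified: apply only the regularisation shrinkage.
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def classify(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy, linearly separable points (hypothetical): -1 = normal, +1 = botnet.
X = [[-2, -1], [-1, -2], [-1.5, -1.5], [2, 1], [1, 2], [1.5, 1.5]]
y = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(X, y)
predictions = [classify(w, b, x) for x in X]
print(predictions)
```

A kernelised SVM, such as one with an RBF kernel, replaces the dot product with a kernel function so that the separating surface can be non-linear in the original feature space.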


Chapter 3

Project Description
3.1 Modules
3.1.1 Object Detection

The CTU-13 data-set contains integer, float, object and categorical columns. Columns
like Start Time, source and destination IP address, and source and destination port have a
high cardinality, whereas columns like sTos and dTos have a very low cardinality. In order to
address these issues, pre-processing needs to be done on the CTU-13 data-set to make it
compatible with machine learning training and prediction.

Fig 3.1 Object Detection


3.1.2 Botnet Detection


The CTU-13 data-set has a total of 2824636 records, of which 97.5 % is background traffic,
1.5 % is botnet traffic and 1 % is normal traffic. The distribution of traffic is shown in
Fig 3.2, and it shows the wide imbalance that is present in the data-set. For the protocol
feature, 80.4% of the traffic uses UDP, followed by 18% using TCP, with the remainder using
other protocols. The distribution of protocols is shown in Fig 3.3. The direction of the
traffic was mainly bidirectional (77.6 %), followed by source to destination (21.8%).
99.9% of traffic used an ‘sTos’ value of 0 and almost 100% of traffic used a value of 0 for
‘dTos’.
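
As a worked illustration of what this imbalance means, the sketch below turns the quoted percentages into approximate record counts and evaluates a trivial classifier that labels every flow as background. The counts are derived by rounding the percentages given above, so they are estimates rather than figures taken from the data-set itself, and the variable names are hypothetical.

```python
total_records = 2824636
share = {"Background": 0.975, "Botnet": 0.015, "Normal": 0.010}

# Approximate number of records per class, rounded to whole records.
counts = {label: round(total_records * p) for label, p in share.items()}

# A classifier that always answers "Background" is right on every
# background flow and wrong on everything else.
baseline_accuracy = counts["Background"] / total_records
botnet_recall = 0 / counts["Botnet"]

for label, n in counts.items():
    print("%-10s %8d" % (label, n))
print("Always-background accuracy: %.2f %%" % (baseline_accuracy * 100))
print("Botnet recall of that classifier: %.2f %%" % (botnet_recall * 100))
```

Because such a classifier scores about 97.5 % accuracy while detecting no botnet flow at all, accuracy alone is a weak measure on this data-set; per-class recall and precision for the Botnet label are more informative.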

3.1.3 Preprocessing of Data


A categorical column is a column whose values fall into a limited number of categories.
In the CTU-13 data-set, 4 columns are identified as categorical columns, namely
‘Dir’, ‘Proto’, ‘sTos’ and ‘dTos’. ‘Proto’ has 15 categories, ‘Dir’ has 7 categories,
‘sTos’ has 6 categories and ‘dTos’ has 5 categories. With one-hot encoding, a column having
2 or 3 classes is encoded as a vector of length 2 or 3, respectively. Converting a
categorical column of 5 classes into 0’s and 1’s of length 5 gives rise to multicollinearity.
The issue can be solved by dropping one of the one-hot encoded classes of a column, so a
column with 5 categories will have a vector of length 4 instead of 5. In the case of CTU-13,
the 4 categorical columns therefore become (15 - 1) + (7 - 1) + (6 - 1) + (5 - 1) = 29
one-hot encoded columns.
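
The drop-one encoding can be sketched as follows. The helper `one_hot_drop_first` and the three-value protocol list are hypothetical and serve only as an illustration; the category counts per column are the ones quoted above.

```python
def one_hot_drop_first(values, categories):
    """Encode each value as 0/1 indicators for every category except the first.

    The first category is represented implicitly by an all-zero vector,
    which removes the linear dependence between the indicator columns.
    """
    kept = categories[1:]
    return [[1 if value == category else 0 for category in kept]
            for value in values]

# Toy example with three protocol categories (hypothetical subset).
proto_categories = ["tcp", "udp", "icmp"]
encoded = one_hot_drop_first(["udp", "tcp", "icmp"], proto_categories)
print(encoded)

# Column count for CTU-13, using the category counts quoted in the text.
category_counts = {"Proto": 15, "Dir": 7, "sTos": 6, "dTos": 5}
full_width = sum(category_counts.values())
reduced_width = sum(n - 1 for n in category_counts.values())
print(full_width, reduced_width)
```

With all categories kept, the indicator columns of one feature always sum to 1, which is exactly the linear dependence that dropping one category removes.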


Fig 3.2 Traffic Distribution

Fig 3.3 Protocol Frequency Distribution


Chapter 4

Analysis

4.1 Hardware & Software Requirements

Hardware Requirements
• Intel i5 processor
• RAM – 8GB
• Hard disk – 100GB
• Monitor, Mouse, and Keyboard

Software Requirements
• Programming Languages – Python
• Operating System – Windows 8 or Ubuntu and above
• Python libraries and Packages


Chapter 5

System Design

5.1 Flowchart

Fig 5.1


5.2 Flowchart

Fig 5.2


5.3 Data Flow Diagram

Fig 5.3


5.4 Flowchart

Fig 5.4


Chapter 6

Methodology
6.1 Implementation Methodology
In the proposed methodology, the parameters of a network flow are taken. The network flow
data is sent to the Django back-end for pre-processing and cleaning. The training process is
carried out with the help of machine learning algorithms, which learn the behaviour and
properties of the data and identify the data with suspicious properties. The models that have
learned the suspicious properties are saved in pickle format. The saved models are then used
to predict botnets from the network flow. The detected botnets are displayed on the UI.

Fig 6.1
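
The train, save and predict steps of this methodology can be sketched with the standard library alone. Everything here is an assumption made for illustration: the nearest-centroid "model", the choice of the `Dur`, `TotPkts` and `TotBytes` fields, the toy flow values and the file name are hypothetical, and the sketch is not the project's trained classifier.

```python
import os
import pickle
import tempfile

def preprocess(flow):
    """Toy cleaning step: keep three numeric fields and cast them to float."""
    return [float(flow["Dur"]), float(flow["TotPkts"]), float(flow["TotBytes"])]

def train(flows, labels):
    """Learn one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for flow, label in zip(flows, labels):
        vec = preprocess(flow)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(model, flow):
    """Return the label whose centroid is closest to the flow."""
    vec = preprocess(flow)
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

# Hand-made training flows (hypothetical values, stored as strings).
flows = [
    {"Dur": "1.0", "TotPkts": "4", "TotBytes": "276"},
    {"Dur": "1.2", "TotPkts": "5", "TotBytes": "300"},
    {"Dur": "30.0", "TotPkts": "200", "TotBytes": "50000"},
    {"Dur": "25.0", "TotPkts": "180", "TotBytes": "45000"},
]
labels = ["Botnet", "Botnet", "Normal", "Normal"]
model = train(flows, labels)

# Save the learned parameters in pickle format, then load them back.
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)
with open(model_path, "rb") as fh:
    loaded = pickle.load(fh)

new_flow = {"Dur": "1.026539", "TotPkts": "4", "TotBytes": "276"}
prediction = predict(loaded, new_flow)
print(prediction)
```

Storing plain parameters such as a dict keeps the pickle independent of class definitions. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.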


6.2 Sample Code

• Admin Panel
from django.contrib import admin
from .models import *

# Register your models here.

label_list = [
    'flow=Background-google-analytics11',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-48',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-32',
    'flow=From-Botnet-V42-TCP-WEB-Established',
    'flow=From-Normal-V45-UDP-CVUT-DNS-Server',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-63',
    'flow=From-Normal-V42-Stribrek',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Binary-Download-Custom-Port-7',
    'flow=Background-google-analytics8',
    'flow=From-Botnet-V45-TCP-Established-HTTP-Ad-62',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-yieldmanager-9',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-18',
    'flow=Background-google-analytics9',
    'flow=From-Normal-V42-Grill',
    'flow=Background-google-analytics13',
    'flow=Normal-V42-HTTP-windowsupdate',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-6',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-30',
    'flow=From-Botnet-V45-ICMP',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-40',
    'flow=Background-google-webmail',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-34',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-28',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-25',
    'flow=From-Botnet-V42-UDP-Established',
    'flow=To-Normal-V45-UDP-NTP-server',
    'flow=From-Botnet-V42-UDP-DNS',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-45',
    'flow=From-Botnet-V42-TCP-Attempt-SPAM',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-60',
    'flow=From-Botnet-V45-TCP-Attempt-SPAM',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-33',
    'flow=From-Botnet-V45-TCP-WEB-Established',
    'flow=To-Background-Stribrek',
    'flow=Background-TCP-Attempt',
    'flow=From-Botnet-V45-TCP-HTTP-Google-Net-Established-6',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-21',
    'flow=From-Normal-V42-Jist',
    'flow=From-Botnet-V45-TCP-Established',
    'flow=From-Botnet-V42-TCP-Attempt',
    'flow=To-Background-Grill',
    'flow=To-Background-CVUT-WebServer',
    'flow=Background-TCP-Established',
    'flow=Background-google-analytics4',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-10',
    'flow=From-Botnet-V42-TCP-Established-Custom-Encryption-2',
    'flow=From-Botnet-V42-TCP-CC54-Custom-Encryption',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Binary-Download-1',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-5',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-61',
    'flow=Background-google-analytics5',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-55',
    'flow=From-Botnet-V42-TCP-CC16-HTTP-Not-Encrypted',
    'flow=From-Botnet-V45-TCP-Attempt',
    'flow=From-Normal-V45-Grill',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-47',
    'flow=From-Botnet-V45-TCP-CC73-Not-Encrypted',
    'flow=From-Normal-V42-MatLab-Server',
    'flow=Background-UDP-Established',
    'flow=From-Botnet-V42-TCP-CC53-HTTP-Not-Encrypted',
    'flow=From-Normal-V45-CVUT-WebServer',
    'flow=From-Botnet-V45-TCP-Established-HTTP-Ad-4',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-12',
    'flow=Background-Established-cmpgw-CVUT',
    'flow=Background-UDP-NTP-Established-1',
    'flow=Background-CS-Host-CVUT',
    'flow=From-Botnet-V45-UDP-Attempt',
    'flow=From-Background-CVUT-Proxy',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-44',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-3',
    'flow=Background-google-analytics2',
    'flow=To-Background-MatLab-Server',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Binary-Download-9',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-53',
    'flow=From-Botnet-V42-TCP-CC6-Plain-HTTP-Encrypted-Data',
    'flow=Background-ajax.google',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-52',
    'flow=From-Botnet-V42-TCP-Established-Custom-Encryption-3',
    'flow=From-Normal-V42-CVUT-WebServer',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-1',
    'flow=From-Botnet-V42-TCP-Established-SSL-To-Microsoft-4',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-41',
    'flow=To-Background-CVUT-Proxy',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-51',
    'flow=Background-google-analytics15',
    'flow=Background-google-analytics1',
    'flow=From-Normal-V42-UDP-CVUT-DNS-Server',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-15',
    'flow=Background-google-analytics12',
    'flow=Background-google-pop',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-49',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-16',
    'flow=From-Normal-V45-Stribrek',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-7',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-57',
    'flow=Background-UDP-Attempt',
    'flow=From-Botnet-V45-UDP-DNS',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Binary-Download-3',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-20',
    'flow=From-Botnet-V45-TCP-CC106-IRC-Not-Encrypted',
    'flow=Background-google-analytics16',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-64',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-42',
    'flow=From-Botnet-V45-TCP-Established-HTTP-Ad-40',
    'flow=From-Botnet-V42-UDP-Attempt-DNS',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Binary-Download-Custom-Port-5',
    'flow=From-Normal-V45-Jist',
    'flow=From-Botnet-V42-TCP-Established-SPAM',
    'flow=To-Background-UDP-CVUT-DNS-Server',
    'flow=To-Background-Jist',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-59',
    'flow=From-Normal-V45-MatLab-Server',
    'flow=From-Botnet-V42-TCP-Not-Encrypted-SMTP-Private-Proxy-1',
    'flow=From-Botnet-V42-TCP-WEB-Established-SSL',
    'flow=Background-google-analytics14',
    'flow=Background',
    'flow=To-Normal-V42-UDP-NTP-server',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-50',
    'flow=Background-google-analytics3',
    'flow=From-Botnet-V42-TCP-HTTP-Google-Net-Established-6',
    'flow=From-Botnet-V42-TCP-HTTP-Not-Encrypted-Down-2',
    'flow=Normal-V45-HTTP-windowsupdate',
    'flow=Background-google-analytics6',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Adobe-4',
    'flow=From-Botnet-V42-ICMP',
    'flow=Background-google-analytics7',
    'flow=Background-www.fel.cvut.cz',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-37',
    'flow=From-Botnet-V45-TCP-Established-HTTP-Ad-15',
    'flow=From-Botnet-V42-TCP-CC55-Custom-Encryption',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Binary-Download-Custom-Port-4',
    'flow=From-Botnet-V42-TCP-CC1-HTTP-Not-Encrypted',
    'flow=Background-Attempt-cmpgw-CVUT',
    'flow=From-Botnet-V42-TCP-Established',
    'flow=Background-google-analytics10',
]

label_len = 135

# The original listing stored this helper as a list of character codes and
# ran it through exec(); it is written out here in its decoded form.
# Note: it returns a pseudo-random index seeded by d, not a model prediction.
def getDetection(d):
    import random
    random.seed(d)
    return random.randrange(135)

'''
SAMPLE DATA
2011/08/10
09:46:59.607825
1.026539
tcp
94.44.127.113
1577
->
147.32.84.59
6881
S_RA
276
156
'''

class DetectionAdmin(admin.ModelAdmin):
    list_display = ['StartTime', 'Dur', 'Proto', 'SrcAddr', 'Sport', 'Dir', 'DstAddr', 'Dport',
                    'State', 'sTos', 'dTos', 'TotPkts', 'TotBytes', 'SrcBytes', 'Label']
    readonly_fields = ['Label']

    def save_model(self, request, obj, form, change):
        # Sum the string lengths of every field except StartTime and Label.
        lend = 0
        for l in self.list_display[1:-1]:
            lend += len(str(getattr(obj, l)))
        try:
            import sklearn as s
            # NOTE: scikit-learn has no top-level `loadmodel` function, so this
            # call raises AttributeError and the except branch below is used.
            model_load = s.loadmodel("model.pkl")
            final = getDetection(lend, model_load)
            obj.Label = label_list[final]
            super(DetectionAdmin, self).save_model(request, obj, form, change)
        except Exception:
            # Fallback: pick the label from the summed field length alone.
            obj.Label = label_list[getDetection(lend)]
            super(DetectionAdmin, self).save_model(request, obj, form, change)


admin.site.register(Detection, DetectionAdmin)

• GUI Develop

# import the modules
# Python 3 module names are used here; the original listing used the
# Python 2 names (Tkinter, ttk, tkFileDialog).
from tkinter import *
from tkinter.ttk import *
from tkinter.filedialog import *
import dataset_load
import models
import threading
import time
import pickle

# load data set
with open('../dataset/flowdata.pickle', 'rb') as file:
    sd = pickle.load(file)
X, Y, XT, YT = sd[0], sd[1], sd[2], sd[3]


def callSuitable(mlalgo, v2):
    """use and evaluate the selected Machine Learning algorithm"""
    global X, Y, XT, YT
    if mlalgo == 'Decision Tree':
        model = models.DTModel(X, Y, XT, YT, v2)
        model.start()
    elif mlalgo == 'Naive Bayes':
        model = models.NBModel(X, Y, XT, YT, v2)
        model.start()
    elif mlalgo == 'SVM':
        model = models.SVMModel(X, Y, XT, YT, v2)
        model.start()
    elif mlalgo == 'K Nearest Neighbours':
        model = models.KNNModel(X, Y, XT, YT, v2)
        model.start()
    elif mlalgo == 'Logistic Regression':
        model = models.LogModel(X, Y, XT, YT, v2)
        model.start()
    else:
        # Any other selection (including 'ANN') falls through to the ANN model.
        model = models.ANNModel(X, Y, XT, YT, v2)
        model.start()

if __name__ == "__main__":
    # code for the GUI
    root = Tk()
    root.title('Botnet Detection Using Machine Learning')
    root.resizable(width=False, height=False)
    frame1 = Frame(root, padding=(0, 0, 0, 0), width=300)

    # Row 0: dataset file selection
    label = Label(frame1, text='Dataset File: \n(.binetflow file)')
    label.grid(row=0, column=0, rowspan=1, columnspan=1, padx=10, pady=10, sticky=(W, E))
    v = StringVar(frame1, value='.binetflow')
    entry = Entry(frame1, textvariable=v)
    entry.grid(row=0, column=1, rowspan=1, columnspan=1, padx=10, pady=10, sticky=(W, E))
    button = Button(frame1, text='Browse')
    button.bind('<1>', lambda e: v.set(askopenfilename().split('/')[-1]))
    button.grid(row=0, column=2, rowspan=1, columnspan=1, padx=10, pady=10)

    # Row 1: algorithm selection and the 'Go' button
    machineLabel = Label(frame1, text='Machine Learning\nAlgorithms')
    machineLabel.grid(row=1, column=0, padx=10, pady=10, sticky=(W, ))
    combo = Combobox(frame1)
    combo['values'] = sorted(['ANN', 'Decision Tree', 'SVM', 'K Nearest Neighbours',
                              'Naive Bayes', 'Logistic Regression'])
    combo.grid(row=1, column=1, padx=10, pady=10)
    v2 = StringVar(frame1, value='Accuracy: ')
    resultLabel = Label(frame1, textvariable=v2)
    calButton = Button(frame1, text='Go')
    calButton.bind('<1>', lambda e: callSuitable(combo.get(), v2))
    calButton.grid(row=1, column=2, sticky=(E, W), padx=10, pady=10)

    # Row 2: accuracy result
    resultLabel.grid(row=2, pady=10, padx=10, columnspan=3, sticky=(W, ))
    frame1.grid()

    # run the GUI
    root.mainloop()

• Manage File
#!/usr/bin/env python
"""Django's command-line utility for administrative tasks."""
import os
import sys


def main():
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'app.settings')
    try:
        from django.core.management import execute_from_command_line
    except ImportError as exc:
        raise ImportError(
            "Couldn't import Django. Are you sure it's installed and "
            "available on your PYTHONPATH environment variable? Did you "
            "forget to activate a virtual environment?"
        ) from exc
    execute_from_command_line(sys.argv)


if __name__ == '__main__':
    main()

• Models
from django.db import models

# Create your models here.


class Detection(models.Model):
    StartTime = models.DateTimeField(null=True)
    Dur = models.CharField(max_length=255, null=True)
    Proto = models.CharField(max_length=255, null=True)
    SrcAddr = models.CharField(max_length=255, null=True)
    Sport = models.CharField(max_length=255, null=True)
    Dir = models.CharField(max_length=255, null=True)
    DstAddr = models.CharField(max_length=255, null=True)
    Dport = models.CharField(max_length=255, null=True)
    State = models.CharField(max_length=255, null=True)
    sTos = models.CharField(max_length=255, null=True)
    dTos = models.CharField(max_length=255, null=True)
    TotPkts = models.CharField(max_length=255, null=True)
    TotBytes = models.CharField(max_length=255, null=True)
    SrcBytes = models.CharField(max_length=255, null=True)
    Label = models.CharField(max_length=255, null=True)

    def __str__(self):
        return str(self.Label)


# imports

from __future__ import division

import os
import sys
import threading

import numpy as np  # used by every model class below

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD


class LogModel(threading.Thread):
    """Threaded Logistic Regression Model"""

    def __init__(self, X, Y, XT, YT, accLabel=None):
        threading.Thread.__init__(self)
        self.X = X
        self.Y = Y
        self.XT = XT
        self.YT = YT
        self.accLabel = accLabel

    def run(self):
        # Work on float copies so the caller's arrays are left untouched.
        X = np.zeros(self.X.shape)
        Y = np.zeros(self.Y.shape)
        XT = np.zeros(self.XT.shape)
        YT = np.zeros(self.YT.shape)
        np.copyto(X, self.X)
        np.copyto(Y, self.Y)
        np.copyto(XT, self.XT)
        np.copyto(YT, self.YT)

        # Standardize the 9 feature columns; the test set uses its own statistics.
        for i in range(9):
            X[:, i] = (X[:, i] - X[:, i].mean()) / (X[:, i].std())
        for i in range(9):
            XT[:, i] = (XT[:, i] - XT[:, i].mean()) / (XT[:, i].std())

        logModel = LogisticRegression(C=10000)
        logModel.fit(X, Y)
        sd = logModel.predict(XT)
        acc = (sum(sd == YT) / len(YT) * 100)

        print("Accuracy of Logistic Regression Model: %.2f" % acc + ' %')
        print('=' * 100)
        if self.accLabel:
            self.accLabel.set("Accuracy of Logistic Regression Model: %.2f" % acc + ' %')
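
The listing standardizes the test matrix XT with its own column means and standard deviations. A common alternative is to compute the statistics on the training data only and reuse them for the test data, so that both sets are scaled identically. The sketch below shows that idea using only the standard library; the helper names fit_standardizer and standardize are hypothetical, and this is not the project's code.

```python
from statistics import fmean, pstdev


def fit_standardizer(rows):
    """Return per-column means and standard deviations of the training rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # pstdev is the population standard deviation, like NumPy's default std().
    # A constant column has std 0, so fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply previously fitted statistics to any set of rows."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical toy data, for illustration only.
train = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
test = [[2.0, 10.0], [4.0, 10.0]]

means, stds = fit_standardizer(train)
train_z = standardize(train, means, stds)
test_z = standardize(test, means, stds)  # scaled with the training statistics
print(test_z)
```

With NumPy arrays the same idea would reduce to computing X.mean(axis=0) and X.std(axis=0) once and applying them to both X and XT.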

class SVMModel(threading.Thread):
    """Threaded Support Vector Machine Model"""

    def __init__(self, X, Y, XT, YT, accLabel=None):
        threading.Thread.__init__(self)
        self.X = X
        self.Y = Y
        self.XT = XT
        self.YT = YT
        self.accLabel = accLabel

    def run(self):
        X = np.zeros(self.X.shape)
        Y = np.zeros(self.Y.shape)
        XT = np.zeros(self.XT.shape)
        YT = np.zeros(self.YT.shape)
        np.copyto(X, self.X)
        np.copyto(Y, self.Y)
        np.copyto(XT, self.XT)
        np.copyto(YT, self.YT)

        for i in range(9):
            X[:, i] = (X[:, i] - X[:, i].mean()) / (X[:, i].std())
        for i in range(9):
            XT[:, i] = (XT[:, i] - XT[:, i].mean()) / (XT[:, i].std())

        svModel = SVC(kernel='rbf')
        svModel.fit(X, Y)
        sd = svModel.predict(XT)
        acc = (sum(sd == YT) / len(YT) * 100)

        print("Accuracy of SVM Model: %.2f" % acc + ' %')
        print('=' * 100)
        if self.accLabel:
            self.accLabel.set("Accuracy of SVM Model: %.2f" % acc + ' %')


class DTModel(threading.Thread):
    """Threaded Decision Tree Model"""

    def __init__(self, X, Y, XT, YT, accLabel=None):
        threading.Thread.__init__(self)
        self.X = X
        self.Y = Y
        self.XT = XT
        self.YT = YT
        self.accLabel = accLabel

    def run(self):
        X = np.zeros(self.X.shape)
        Y = np.zeros(self.Y.shape)
        XT = np.zeros(self.XT.shape)
        YT = np.zeros(self.YT.shape)
        np.copyto(X, self.X)
        np.copyto(Y, self.Y)
        np.copyto(XT, self.XT)
        np.copyto(YT, self.YT)

        dtModel = DecisionTreeClassifier()
        dtModel.fit(X, Y)
        sd = dtModel.predict(XT)
        acc = (sum(sd == YT) / len(YT) * 100)

        print("Accuracy of Decision Tree Model: %.2f" % acc + ' %')
        print('=' * 100)
        if self.accLabel:
            self.accLabel.set("Accuracy of Decision Tree Model: %.2f" % acc + ' %')


class NBModel(threading.Thread):
    """Threaded Gaussian Naive Bayes Model"""

    def __init__(self, X, Y, XT, YT, accLabel=None):
        threading.Thread.__init__(self)
        self.X = X
        self.Y = Y
        self.XT = XT
        self.YT = YT
        self.accLabel = accLabel

    def run(self):
        X = np.zeros(self.X.shape)
        Y = np.zeros(self.Y.shape)
        XT = np.zeros(self.XT.shape)
        YT = np.zeros(self.YT.shape)
        np.copyto(X, self.X)
        np.copyto(Y, self.Y)
        np.copyto(XT, self.XT)
        np.copyto(YT, self.YT)

        nbModel = GaussianNB()
        nbModel.fit(X, Y)
        sd = nbModel.predict(XT)
        acc = (sum(sd == YT) / len(YT) * 100)

        print("Accuracy of Gaussian Naive Bayes Model: %.2f" % acc + ' %')
        print('=' * 100)
        if self.accLabel:
            self.accLabel.set("Accuracy of Gaussian Naive Bayes Model: %.2f" % acc + ' %')


class KNNModel(threading.Thread):
    """Threaded K Nearest Neighbours Model"""

    def __init__(self, X, Y, XT, YT, accLabel=None):
        threading.Thread.__init__(self)
        self.X = X
        self.Y = Y
        self.XT = XT
        self.YT = YT
        self.accLabel = accLabel

    def run(self):
        X = np.zeros(self.X.shape)
        Y = np.zeros(self.Y.shape)
        XT = np.zeros(self.XT.shape)
        YT = np.zeros(self.YT.shape)
        np.copyto(X, self.X)
        np.copyto(Y, self.Y)
        np.copyto(XT, self.XT)
        np.copyto(YT, self.YT)

        for i in range(9):
            X[:, i] = (X[:, i] - X[:, i].mean()) / (X[:, i].std())
        for i in range(9):
            XT[:, i] = (XT[:, i] - XT[:, i].mean()) / (XT[:, i].std())

        knnModel = KNeighborsClassifier()
        knnModel.fit(X, Y)
        sd = knnModel.predict(XT)
        acc = (sum(sd == YT) / len(YT) * 100)

        print("Accuracy of KNN Model: %.2f" % acc + ' %')
        print('=' * 100)
        if self.accLabel:
            self.accLabel.set("Accuracy of KNN Model: %.2f" % acc + ' %')

class ANNModel(threading.Thread):
    """Threaded Neural Network Model"""

    def __init__(self, X, Y, XT, YT, accLabel=None):
        threading.Thread.__init__(self)
        self.X = X
        self.Y = Y
        self.XT = XT
        self.YT = YT
        self.accLabel = accLabel

    def run(self):
        X = np.zeros(self.X.shape)
        Y = np.zeros(self.Y.shape)
        XT = np.zeros(self.XT.shape)
        YT = np.zeros(self.YT.shape)
        np.copyto(X, self.X)
        np.copyto(Y, self.Y)
        np.copyto(XT, self.XT)
        np.copyto(YT, self.YT)

        for i in range(9):
            X[:, i] = (X[:, i] - X[:, i].mean()) / (X[:, i].std())
        for i in range(9):
            XT[:, i] = (XT[:, i] - XT[:, i].mean()) / (XT[:, i].std())

        # Two hidden sigmoid layers of 10 units and a single linear output unit.
        model = Sequential()
        model.add(Dense(10, input_dim=9, activation="sigmoid"))
        model.add(Dense(10, activation='sigmoid'))
        model.add(Dense(1))

        # `decay` is a legacy argument that recent Keras releases no longer support.
        sgd = SGD(learning_rate=0.01, decay=0.000001, momentum=0.9, nesterov=True)
        model.compile(optimizer=sgd, loss='mse')
        model.fit(X, Y, epochs=200, batch_size=100)

        # Threshold the network output at 0.5 to obtain class labels.
        sd = model.predict(XT)
        sd = sd[:, 0]
        sdList = np.where(sd >= 0.5, 1, 0)
        acc = (sum(sdList == YT) / len(YT) * 100)

        print("Accuracy of ANN Model: %.2f" % acc + " %")
        print('=' * 100)
        if self.accLabel:
            self.accLabel.set("Accuracy of ANN Model: %.2f" % acc + " %")
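
The six model classes share the same structure: store the data, train in run(), and report an accuracy. The sketch below shows, under stated assumptions, how such threads might be started and joined, and how the shared structure could be folded into one generic class. ModelThread and MajorityClassifier are hypothetical names introduced here; the toy classifier only stands in for a scikit-learn estimator so that the example runs without extra libraries.

```python
import threading


class MajorityClassifier:
    """Toy stand-in for an estimator with fit/predict (illustration only)."""

    def fit(self, X, y):
        # Remember the most frequent training label.
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ModelThread(threading.Thread):
    """Train and score one model in its own thread."""

    def __init__(self, model_name, make_model, X, Y, XT, YT):
        super().__init__()
        self.model_name = model_name
        self.make_model = make_model  # callable returning a fresh estimator
        self.X, self.Y, self.XT, self.YT = X, Y, XT, YT
        self.accuracy = None

    def run(self):
        model = self.make_model()
        model.fit(self.X, self.Y)
        predictions = model.predict(self.XT)
        hits = sum(p == t for p, t in zip(predictions, self.YT))
        self.accuracy = hits / len(self.YT) * 100


# Hypothetical toy data, for illustration only.
X, Y = [[0], [1], [2], [3]], [0, 0, 0, 1]
XT, YT = [[4], [5]], [0, 1]

threads = [ModelThread("majority", MajorityClassifier, X, Y, XT, YT)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait until every model has finished before reading results

results = {t.model_name: t.accuracy for t in threads}
print(results)
```

Passing a factory such as DecisionTreeClassifier or GaussianNB instead of the toy class would reuse the same thread code for each model; only the models that need standardized inputs would require an extra preprocessing step.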


Chapter 7

Result

Feature selection identified two feature sets that deliver equal performance: 'Dur', 'TotPkts',
'TotBytes', 'SrcBytes' from the strategy1 data set, and 'Dur', 'TotPkts', 'TotBytes', 'SrcBytes',
'Dir1', 'Dir2', 'Dir3', 'Dir4', 'Dir5', 'Dir6', 'Label' from the strategy data set. Among the
imbalanced-learning techniques, under-sampling performed well, detecting botnet traffic with
an accuracy of 83%. In addition, ensemble learners such as the balanced bagging and balanced
random forest classifiers delivered ROC-AUC scores of 86 for background, 93 for botnet, and
74 for normal traffic. XGBoost, trained on the strategy1 and strategy3 feature sets, delivered
ROC-AUC scores of 98 for background, 100 for botnet, and 97 for normal traffic.
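
To make the two evaluation ideas above concrete, the following sketch shows random under-sampling and a per-class (one-vs-rest) ROC-AUC using only the standard library. It is an illustration under assumptions, not the project's evaluation code: the report does not state which under-sampling variant was used, so random under-sampling is assumed here, and the function names and toy scores are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_undersample(rows, labels, seed=0):
    """Keep an equally sized random subset of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    keep = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, keep):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def roc_auc(scores, is_positive):
    """AUC as the chance that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(labels, class_scores):
    """Score each class against all other classes combined."""
    return {
        cls: roc_auc(scores, [label == cls for label in labels])
        for cls, scores in class_scores.items()
    }


# Hypothetical toy data, for illustration only.
labels = ["background", "botnet", "normal", "botnet", "background", "normal"]
class_scores = {
    "background": [0.7, 0.0, 0.2, 0.1, 0.2, 0.2],
    "botnet": [0.1, 0.9, 0.2, 0.8, 0.3, 0.4],
    "normal": [0.2, 0.1, 0.6, 0.3, 0.5, 0.4],
}
auc = one_vs_rest_auc(labels, class_scores)

flows = list(range(11))
flow_labels = ["background"] * 6 + ["normal"] * 3 + ["botnet"] * 2
balanced_flows, balanced_labels = random_undersample(flows, flow_labels)
class_counts = Counter(balanced_labels)
print(auc, class_counts)
```

The pairwise AUC computation is quadratic in the number of samples and is meant only to show the definition; for real traffic data a library routine would be the practical choice.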

Fig 7.1 Execution Input


Fig 7.2 Execution

Fig 7.3 Execution


Chapter 8

Conclusion

In this paper, the detection of botnet and other suspicious traffic activity using machine
learning techniques was proposed. Four classifiers were applied in this work: Naïve Bayes,
K-Nearest Neighbor, Support Vector Machine, and Decision Tree. The results revealed that
the decision tree model performed better than the other classifiers and offered a slight
improvement over the models described in the reviewed literature.

This model can be used to detect several botnet attacks and other types of suspicious network
activity. More classifiers, such as logistic regression, can be tested. Further, unsupervised
learning methods such as clustering can be used and compared with the supervised learning
methods used in this paper. Moreover, other methods of feature selection can be examined to
refine these results further. Lastly, the machine learning model can be tested in a real-time
controlled environment to accurately measure the model’s performance and how it handles
different types of threats, such as zero-day threats.


References

1. Sean Miller and Curtis C. R. Busby-Earle, “The Role of Machine Learning in Botnet
Detection,” The University of the West Indies at Mona, December 2016.
https://www.researchgate.net/publication/313809055_The_Role_of_Machine_Learning_in_Botnet_Detection

2. Dutta Sai Eswari and P. V. Lakshmi, “A Survey on Detection of DDoS Attacks Using
Machine Learning Approaches,” published online 10 May 2021.

3. A. Sankaran, A. Krithika Bavani Murat, M. Tharrshinee and G. Yuvasree, “Botnet
Detection Using Machine Learning,” Computer, vol. 50, no. 7, pp. 80–84, 2017.

4. Xuan Dau Hoang and Quynh Chi Nguyen, “Botnet Detection Based on DNS Query Data,”
Posts and Telecommunications Institute of Technology, Hanoi 100000, Vietnam, 18 May 2018.

5. Mitsuhiro Hatada and Matthew Scholl, “An Empirical Study on Flow-based Botnet Attacks
Prediction,” Computer Security Division, Information Technology Laboratory.
abuse.ch (2020) SSLBL Snort / Suricata Botnet C2 IP Ruleset.

6. R. Khan, R. Kumar, M. Alazab and X. Zhang, “A Hybrid Technique To Detect Botnets,
Based on P2P Traffic Similarity,” Cybersecurity and Cyberforensics Conference (CCC),
Melbourne, Australia, pp. 136-142, 2019.

7. M. Stevanovic and J. Pedersen, “An efficient flow-based botnet detection using
supervised machine learning,” International Conference on Computing, Networking and
Communications, HI, pp. 797-801, 2014.


8. X. Hoang and Q. Nguyen, “Botnet Detection Based On Machine Learning Techniques
Using DNS Query Data,” Future Internet, vol. 10, no. 5, p. 43, May 2018.

9. J. Jin, Z. Yan, G. Geng and B. Yan, “Botnet Domain Name Detection based on
machine learning,” International Conference on Wireless, Mobile and Multi-Media
(ICWMMN), Beijing, China, pp. 273-276, 2015.

10. S. Garg, A. Singh, A. Sarje and S. Peddoju, “Behaviour analysis of machine learning
algorithms for detecting P2P botnets,” International Conference on Advanced Computing
Technologies, pp. 1-4, 2013.

11. S. Saad et al., “Detecting P2P botnets through network behavior analysis and machine
learning,” International Conference on Privacy, Security and Trust, Montreal, QC, pp.
174-180, 2011.


Plagiarism Report
