
UPTEC IT 19004

Degree project 30 credits
June 2019

Botnet detection on flow data
using the reconstruction error
from Autoencoders trained on
Word2Vec network embeddings

Kasper Ramström

Department of Information Technology
Abstract

Botnet detection on flow data using the reconstruction error from Autoencoders trained on Word2Vec network embeddings

Kasper Ramström

Botnet network attacks are a growing issue in network security. These attacks consist of compromised devices which are used for malicious activities. Many traditional systems use pre-defined pattern matching methods for detecting network intrusions based on the characteristics of previously seen attacks. This means that previously unseen attacks often go unnoticed, as they do not have the patterns that the traditional systems are looking for.

This paper proposes an anomaly detection approach which does not use the characteristics of known attacks in order to detect new ones; instead, it looks for anomalous events which deviate from the normal. The approach uses Word2Vec, a neural network model from the field of Natural Language Processing, and applies it to NetFlow data in order to produce meaningful representations of network features. These representations, together with statistical features, are then fed into an Autoencoder model which attempts to reconstruct the NetFlow data, where poor reconstructions could indicate anomalous data.

The approach was evaluated on multiple flow-based network datasets and the results show that the approach has potential for botnet detection, where the reconstruction errors can be used as metrics for finding botnet events. However, the results vary between datasets, and the approach performs poorly as a botnet detector on some of them, indicating that further investigation is required before real-world use.

Supervisor: Håkan Persson
Subject reviewer: Raazesh Sainudiin
Examiner: Lars-Åke Nordén
UPTEC IT 19004
Printed by: Reprocentralen ITC
Popular Science Summary

Botnet attacks are a growing problem in network security. These attacks consist of devices that are unknowingly exploited to carry out malicious activities. Many traditional systems use pre-defined pattern recognition methods to detect network intrusions based on the characteristics of previously seen attacks. This means that attack types whose characteristics do not resemble previously recognized intrusions often go unnoticed, as they lack the patterns that the traditional systems are looking for.

This report presents an anomaly detection method which does not use the characteristics of previously recognized attacks to detect new ones. Instead, the method looks for anomalies that deviate from the normal. The method uses Word2Vec, a type of neural network used in natural language processing, and applies it to NetFlow data in order to produce meaningful representations of network parameters. These representations, together with statistical parameters, are then fed into an Autoencoder model that attempts to reconstruct the NetFlow data, where poor reconstructions can indicate anomalies.

The method was evaluated on several different NetFlow datasets and the results show that the approach has potential for botnet detection, where the reconstructions can be used as metrics for discovering botnet activity. However, the results vary between datasets, and the method performs poorly for botnet detection on some of them, indicating that further investigation is required before the method is applied in real-world situations.
Contents

1 Introduction 1
1.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Background 3
2.1 Network security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Anomaly detection . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 NetFlow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Publicly available NetFlow datasets . . . . . . . . . . . . . . . 6
2.3 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Word2Vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4.1 Skip Gram model . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3 Related work 13
3.1 Network intrusion detection using classification models . . . . . . . . . 13
3.2 Network intrusion detection using anomaly detection models . . . . . . 15
3.3 Flow2Vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 IP2Vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4 Methods 19
4.1 Data selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 System structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3.1 Word2Vec model . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3.2 Autoencoder model . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

5 Results 27
5.1 Flow reconstruction error . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.2 Precision-Recall curves . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3 AUC scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

6 Discussion 34
6.1 Generalizability of the system . . . . . . . . . . . . . . . . . . . . . . 34
6.2 Model architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.3 Dataset selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

7 Conclusion 37
7.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7.1.1 Threshold placement . . . . . . . . . . . . . . . . . . . . . . . 38
7.1.2 Advanced Autoencoder architectures . . . . . . . . . . . . . . 38
7.1.3 Use reconstruction error as metric for other models . . . . . . . 39
Glossary
AUC Area Under Curve. 13, 15, 16, 24, 25, 26, 27, 30, 33, 34

FN False Negative. 24, 25, 27

FP False Positive. 24, 25, 26, 27

FPR False Positive Rate. 13, 15, 16, 25, 26, 30

IDS Intrusion Detection System. 1, 2, 3, 4, 13, 16

LSTM Long short-term memory. 14, 15, 16, 30, 37, 38

MSE Mean Squared Error. 23, 24, 27, 34, 37, 38

NLP Natural Language Processing. 1, 2, 7, 15, 37

PCAP Packet capture. 4, 5, 13

TN True Negative. 25, 26

TP True Positive. 24, 25, 27

TPR True Positive Rate. 13, 15, 16, 24, 25, 26, 30
List of Figures

1 Intrusion Detection System example . . . . . . . . . . . . . . . . . . . 4


2 Neural network example . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 One-hot encoding example . . . . . . . . . . . . . . . . . . . . . . . . 9
4 Word embeddings example . . . . . . . . . . . . . . . . . . . . . . . . 9
5 Word2Vec Skip Gram model comparison . . . . . . . . . . . . . . . . 10
6 Autoencoder architecture example . . . . . . . . . . . . . . . . . . . . 11
7 System structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
8 Word2Vec and Autoencoder models used . . . . . . . . . . . . . . . . 22
9 Reconstruction error histogram plots part 1 . . . . . . . . . . . . . . . 28
10 Reconstruction error histogram plots part 2 . . . . . . . . . . . . . . . 29
11 Precision-Recall curve part 1 . . . . . . . . . . . . . . . . . . . . . . . 31
12 Precision-Recall curve part 2 . . . . . . . . . . . . . . . . . . . . . . . 32
13 AUC scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

List of Tables

1 Related work - supervised results . . . . . . . . . . . . . . . . . . . . . 15


2 Related work - unsupervised results . . . . . . . . . . . . . . . . . . . 17
3 CTU-13 scenarios used . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4 CTU-13 evaluation splits used . . . . . . . . . . . . . . . . . . . . . . 25
1 Introduction

With rapidly increasing amounts of vulnerable and connected devices, network security
becomes a more and more prevalent topic [8, 16, 24, p. 1-6]. Protecting against network
based attacks becomes increasingly important for corporations, governments and indi-
viduals as more devices are connected to the internet and attacks grow more complex
and harmful [28]. This means that traditional Intrusion Detection Systems (IDSs), which rely on identifying similarities with previously seen attacks, and rule-based systems are no longer able to keep up and identify malicious activity. These systems often use manual threshold values based on statistical features, which are a bad fit for systems with large numbers of measurements and poorly understood behaviour.
This project proposes a solution which applies Natural Language Processing (NLP)
techniques, which capture deep semantic relationships in language data [22, 23, 32], to network traffic logs in order to extract similar features by treating network communi-
cation as words and sentences. By extracting this information, an Autoencoder neural
network can be trained on the network traffic logs in order to reconstruct normal data
and identify botnet data as anomalies without any predefined definitions of botnet data.
The rest of the report is outlined as follows: Section 2 provides background information
regarding network security, anomaly detection, neural networks, the NLP techniques
that are later applied to network data, and Autoencoders. In Section 3, similar solutions and related work are presented. Section 4 presents the proposed solution, evaluation criteria and experiments for the system, and Section 5 analyzes the results. Finally, in Section 6 the results are discussed and Section 7 provides a conclusion.

1.1 Purpose

This project aims to investigate the potential of combining advanced techniques from
the field of NLP with Autoencoders for identifying network threats in the form of bot-
net traffic, allowing network forensics to handle and protect against them accordingly.
Traditionally, this is done using pattern matching algorithms for finding known attacks
[8, p. 1-6]. This traditional approach allows domain experts to set up classification sys-
tems which handle and prevent previously known attacks based on their characteristics.
These systems can use rules or supervised machine learning algorithms which classify
data. However, they do not cope with anomalies: data with previously unseen properties that do not fall into existing categories. Instead, unsupervised machine learning
models can be trained to identify what normal data looks like and then identify abnor-
malities, data which deviates from what the model deems as normal. The project is done in cooperation with Combient AB, an IT company specializing in providing data science solutions for industry problems. With this project, Combient wants to investigate the feasibility and usefulness of a network anomaly detection system.

1.2 Delimitations

As this project aims to investigate the potential of Autoencoders in combination with NLP techniques for botnet detection, it does not focus on comparing different techniques in order to find the best possible system for botnet detection. Similarly, it offers no solution for handling botnet events, such as preventing intrusions or giving instructions on how to incorporate a botnet detector into an IDS. It also does not aim to offer a solution for collecting network data or for monitoring a network and running a botnet detection system in real time. Instead, the project aims to use pre-existing datasets and analyze
and run experiments on them in an offline manner, where processing time and similar
metrics are not of special concern.


2 Background

This section gives a brief introduction to the concepts and techniques related to net-
work security, including a short review of the area and an introduction to NetFlow and
anomaly detection. Additionally, a brief description is given of neural networks as well
as two specific neural network types: Word2Vec and Autoencoders.

2.1 Network security

Network security has gained tremendous traction over the last few years due to the large
increase in data over the Internet [5, p.v-x]. With this large data increase, IDSs increas-
ingly leverage data science models in multiple stages of network security. This includes
directly preventing network attacks, visualizing network behaviour and providing in-
sights for network administrators. However, with network intrusions becoming more frequent and more sophisticated, traditional pattern recognition systems are becoming inadequate.
In order to protect a network against malicious behaviour, one must first capture record-
ings of the network’s traffic. This is often done by using sniffing programs such as
WireShark, which listens to a network and records each packet sent and received [33, 2,
p.2-14, p.3-32]. This recording can be done at multiple locations in a network, e.g. at router level and device level. A router-level sniffer records all packets sent to and from the network; however, it does not capture traffic sent between computers inside the network.
At device-level the sniffer records packets sent to and from the individual computer,
which allows for capturing traffic between computers inside the network, but loses the
capability of capturing traffic sent externally from the network.
Botnets are currently one of the most serious network threats and their use is growing heavily [34, 1]. A botnet is a group of compromised connected devices controlled by a malicious host (called the botmaster) in order to perform some sort of network attack. Due to the fact that a botnet attack leverages previously benign devices which have been compromised, such attacks can be difficult to detect and prevent. Beigi et al. [1] show
that it is easy for these types of intrusions to circumvent traditional statistical features
used in pattern recognition systems by sending junk traffic, meaningless traffic that is
meant to distract systems from intrusions. Therefore, modern machine learning ap-
proaches which don’t rely on pre-set statistical patterns are more suitable and are being
increasingly used in IDSs which often leverage different subsystems with specific pur-
poses [26]. This could be a combination of anomaly detection methods for detecting
unknown network behaviour, such as previously unseen attacks, and a supervised classification model looking for known patterns of malicious behaviour [9]. Supervised models, for example, are often used in anti-malware software which has pre-installed intrusion detection patterns.

2.1.1 Anomaly detection

Anomaly detection is the process of finding outliers, data points that are unusual and
unexpected [8, p. 7-9]. To discover these abnormalities one must first have a sense of
what is normal in order to determine what is not. Normal behaviour is expected and
can therefore be predicted, even though it may be highly complex. The purpose of an
anomaly detection system is to discover such anomalies. Natural extensions to this include
alert systems for informing about anomalies being discovered or systems that handle
anomalies and take appropriate action. In the context of network security, anomaly
detection can be used to detect fraudulent network behaviour, targeted attacks or other
types of network events that are unusual. It is then up to some other system or a network
administrator to take action upon the anomalous event. Figure 1 illustrates a typical IDS
where an anomaly detection model is used in conjunction with a classifier where both
models are fed input data. The classifier tries to match the input data to previously
known attacks whilst the anomaly detector determines whether the data is considered
normal, allowing it to find previously unknown anomalies.

Figure 1: Example IDS set up where input data is fed to both an anomaly detector and a
classifier which in turn report to an alarm system. The anomaly detector is responsible
for detecting new, unknown patterns while the classifier tries to find patterns similar to
previously known patterns.

2.2 NetFlow

NetFlow is a network traffic logging standard developed by Cisco [7, p. 30-33] which
is now used by multiple network hardware manufacturers. It is generally generated by network routers and switches, or by conversion of Packet capture (PCAP) data to NetFlow. Originally intended for billing purposes, it has also gained considerable traction for network analysis, as it contains compact network information.
Since being developed and first released during the nineties [19, p. 10-12], NetFlow has
undergone many revisions, with version five being the most commonly used standard
and version nine being the latest. However, as version five does not support Internet
Protocol version 6 (IPv6), version nine becomes increasingly more useful. The latest
version also allows for a modular set of features to be logged.
The central part of NetFlow data is a flow, which is an aggregation of a sequence of
packets within a narrow timeframe that have the same source address, source port, des-
tination address, destination port and protocol. Along with these features, a flow at its
core also contains total packets sent, total bytes sent, start time, and end time. Further-
more, modifications can be made to version nine so that more parameters are logged.
Thus, flow data provides a compact view of network traffic within timeframes. A flow
can come in two forms: unidirectional and bidirectional. Unidirectional flows are aggregated as described above. Bidirectional flows are further aggregated: reverse pairs of flows, where the source IP address and port of one flow are the destination IP address and port of the other (and vice versa), are combined into a single flow. This effectively results in half as many flows in a bidirectional dataset compared to a unidirectional one. Us-
ing bidirectional flows circumvents the problem of differentiating between a network’s
client and server; however, due to aggregation, information is also lost [10].
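As a hedged illustration of this reverse-pair aggregation, the sketch below pairs flows whose endpoints are mirrored; the field layout and pairing logic are simplifying assumptions, not the exact procedure used by NetFlow exporters:

```python
# Sketch: merging unidirectional flows into bidirectional ones by pairing
# reverse 5-tuples. Field layout and pairing logic are illustrative assumptions.
from collections import defaultdict

flows = [
    # (src_ip, src_port, dst_ip, dst_port, proto, packets, bytes)
    ("10.0.0.1", 51234, "93.184.216.34", 80, "TCP", 5, 400),
    ("93.184.216.34", 80, "10.0.0.1", 51234, "TCP", 4, 3200),
]

def canonical_key(flow):
    src, sport, dst, dport, proto = flow[:5]
    a, b = (src, sport), (dst, dport)
    # Order the endpoints deterministically so a flow and its reverse
    # map to the same key.
    return (min(a, b), max(a, b), proto)

merged = defaultdict(lambda: [0, 0])  # key -> [total packets, total bytes]
for flow in flows:
    key = canonical_key(flow)
    merged[key][0] += flow[5]
    merged[key][1] += flow[6]

for key, (pkts, byts) in merged.items():
    print(key, pkts, byts)  # one bidirectional record per reverse pair
```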
A flow does not contain any of the packet payload normally present in PCAP. This can
be advantageous since a lot of traffic is encrypted nowadays, which renders the packet
payload useless regardless [26, 1]. Similarly, logging regular PCAP data can be a pri-
vacy concern due to sensitive information in non-encrypted packet payloads, which flow
data circumvents, although the privacy concern regarding IP address communication is
still present. For network forensics and analysis purposes, packet payloads can contain
valuable information which is lost when converted to flow format. However, PCAP logs
often grow too large and become difficult to parse, due to overwhelming amounts of
information [20]. Since a flow is a form of compact aggregation of PCAP data, it is
much smaller and easier to store and handle. Other forms of aggregation often remove too much important traffic information in favor of smaller log files. NetFlow, on the other hand, provides a good compromise: a representation that is compact yet detailed enough for meaningful analysis.
Bilge et al. [3] have found that the two (rather contradictory) major difficulties with
botnet detection are the lack of big and rich datasets and that these mostly appear as
PCAP logs, which provide too much information to be easily processed. Instead, the
authors found that flow datasets should be used, which can be big and rich enough yet are easily processed thanks to the lack of packet payload.


2.2.1 Publicly available NetFlow datasets

There are many publicly available flow-based datasets which contain both regular network traffic and malicious traffic, such as traffic caused by botnets. However, many of them contain severe flaws which render them ineffective for network analysis [20].
Often, the datasets contain obsolete anomaly types, such as attacks that are no longer
relevant and the legitimate (regular) traffic is often missing or not representative of that
of real networks. Similarly, the datasets are often outdated and no longer resemble modern network traffic. Another issue is that many of the datasets used in research papers are not disclosed due to privacy concerns and difficulties in labeling them. Labeling
a dataset requires expert knowledge and even then is not always reliable. To determine
a good dataset, the authors define some general criteria that must be met, namely the
dataset should be recent, labeled, rich, long and real:

Recent The value of a dataset decreases over time as network behaviour is evolving and increasing in complexity. Therefore, a dataset must be recent in order to be representative of real networks.

Labeled All points in a dataset must be labeled, in as much detail as possible, to provide a description for each flow.

Rich A dataset must be rich, with many different types of network traffic, and correlated such that malware and regular traffic are captured at the same time, rather than injecting one or the other afterwards in order to enrich the dataset.

Long A dataset must span over enough time such that meaningful analysis can be
applied.

Real To be as close to a real network environment as possible, a dataset should be formed from a real production or laboratory network. The opposite of this is a synthetic dataset, which contains simulated or generated traffic. The downside of using a synthetic dataset is that it is difficult to ensure the realism of the network's topology.
An example of a dataset that fulfils these criteria is the CTU-13 dataset provided by Stratosphere IPS, captured at CTU University, Czech Republic [20, 10]. The dataset is split into thirteen different scenarios, each containing traffic from botnet attacks mixed with normal network traffic, in bidirectional format. The datasets were captured at router level in a laboratory environment with malware running on a subnetwork of infected hosts. A downside of these datasets is that all traffic from the botnets is labelled as hostile, even though some of it might be harmless. The CIDDS dataset is another dataset which was purposefully modelled after the above criteria [31] and is similar to the CTU-13 dataset.

2.3 Neural networks

A neural network is a type of machine learning model loosely modeled after the neu-
rons in the brain [6, p.8-19]. A neural network consists of multiple layers stacked together, and these layers contain the neurons, often called nodes. A node
is basically a function with a set of parameters, often called weights [25, p.565-566].
When training a neural network, it is the nodes’ parameters that are updated according
to some algorithm in order to make better predictions. The layers of the model can be
separated into three different types: input layer, output layer and hidden layer. The input layer is the entry point of the network, essentially where data enters the model. The out-
put layer is the final layer of the model, its output resulting in the model’s predictions.
Between the input and output layers are the hidden layers; a neural network with many hidden layers is often called a deep neural network. Data flows through these layers
in the network, from input layer to hidden layer(s) and finally to the output layer, each
layer applying its own set of functions to the data before passing it to the next layer.
Neural networks come in many different forms of varying complexity, and a neural
network with many hidden layers or neurons can often produce more accurate results
at the cost of more parameters to be updated. This can cause the model to be more
difficult to optimize for a certain task, or even cause the network to over-optimize, ba-
sically memorizing its input, rendering it useless for making predictions on previously
unseen data. On the other hand, a shallow neural network with few layers or nodes often
produces worse predictions. It can therefore be difficult to find a network architecture
which makes the best predictions possible and avoids memorizing its input. A visual
illustration of a neural network can be seen in Figure 2.
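As a minimal sketch of this layered computation, the following toy forward pass uses dense layers with ReLU activations; the sizes and random weights are arbitrary stand-ins for a trained network:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
# Toy network: 4 input features -> one hidden layer of 8 nodes -> 2 outputs.
# Each layer holds a weight matrix and bias vector (the parameters that
# training would update).
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)

x = rng.normal(size=(1, 4))  # one input sample entering the input layer
h = relu(x @ W1 + b1)        # hidden layer applies its function to the data
y = h @ W2 + b2              # output layer produces the prediction
print(y)
```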

2.4 Word2Vec

Word2Vec is a type of shallow neural network with one hidden layer, originally developed by Google, that produces word embeddings for text data. These embeddings can improve the quality of NLP algorithms by capturing meaningful semantic relationships in dense vector representations [22, 23, 32].

7
2 Background

Figure 2: Neural network example illustration. Input data is passed into the input layer
and through each of the hidden layers and finally the output layer to produce a predic-
tion. Each layer contains nodes which apply functions on the data before passing it to
the next layer.

For a neural network to process text data, it must first be transformed into numeric vec-
tors. This is typically done using tokenization, where characters, words or sentences are
mapped to numerical tokens (scalar values) [6, p. 178-195]. Traditionally, one-hot en-
coding (illustrated in Figure 3) is a common approach used for turning text tokens into
vectors, where each token in the text is associated with a corresponding index in the vec-
tor. If performing tokenization on word-level, this means that for each word-token it will
have a ”1” in its corresponding index and a ”0” for every other index. This approach
does not retain any relation between sentences or words and produces very large and
sparse vectors. Word2Vec on the other hand produces word embeddings which are N -
dimensional vector representations of words where N is the number of nodes in the net-
work’s hidden layer [22]. The network is typically trained with a task such as predicting
a word given a sentence, or vice versa. After training, the actual network is discarded,
instead the weights of the network’s hidden layer are extracted. These weights produce
word embeddings, or word vectors, where each word has a corresponding vector. Since
the network is trained to predict a surrounding sentence or word in a sentence, with a
large enough corpus, similar words will likely appear in similar contexts. The hope of
this approach is that similar words will have word vectors close to each other, creating
compact representations of words. Compared to sparse one-hot encoding vectors where
each word adds a dimension, word embeddings have much fewer dimensions with con-
tinuous values. This results in a much more dense and feature rich representation where
the quality of the word embeddings generally increases with the number of dimensions
[23, 32]. A toy example of this rich representation can be seen in Figure 4.
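The contrast between the two representations can be sketched in a few lines; here the random matrix is a stand-in for trained embedding weights:

```python
import numpy as np

vocab = ["cat", "dog", "king", "queen"]
index = {w: i for i, w in enumerate(vocab)}

# One-hot: one dimension per vocabulary word, sparse and relation-free.
def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# Embedding: a dense (vocab_size x N) matrix; a word's vector is one row.
# Random values stand in for what training would produce.
N = 3
embeddings = np.random.default_rng(0).normal(size=(len(vocab), N))

print(one_hot("king"))            # [0. 0. 1. 0.]
print(embeddings[index["king"]])  # dense N-dimensional vector
```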

8
2 Background

Figure 3: Illustration of using one-hot encoding on a dataset with one feature which
has six different classes. As can be seen this results in a new dataset with six different
features.

Figure 4: Simple visualization of a three-dimensional word embedding, showing the relationship between genders and sibling type.


2.4.1 Skip Gram model

The Skip Gram model is a type of Word2Vec model where the network is trained, given
a word, to predict its context, the surrounding K words, where K is set manually [22,
23, 32]. An extension of this technique is to use negative sampling which results in
much more computationally efficient weight updates for the neural network, whilst still
maintaining high performance. In this version the input to the model is both a word
and a context and the model is tasked to predict whether the word belongs to the given
context. With this technique, words and contexts are sampled based on their frequency
in the given dataset, where positive samples are defined such that the word belongs
within the context and negative samples are words out of context, i.e. random words
from the corpus. A comparison of the regular Skip Gram model and the Skip Gram
model with negative sampling can be seen in Figure 5.

Figure 5: The left part of this Figure illustrates the normal Skip Gram Word2Vec model
which takes as input a word w(t) and tries to predict the surrounding 2n context words.
The right part illustrates the Skip Gram model with negative sampling. This model takes
as input a word w(t) and a context word w(c) and tries to predict whether w(t) belongs
to the same context as w(c).

2.5 Autoencoders

An Autoencoder is a type of unsupervised neural network that is trained to copy its input
data to its output, i.e. reconstruct its input [11, 25, p. 493-516, p. 1004-1007]. The model
consists of two distinct parts, an encoding part and a decoding part, as illustrated in


Figure 6. The encoding part consists of the first few layers of the network, which
typically encodes the input to a lower dimension. The latter part of the network is the
decoding part, which decodes the previously encoded input back to its original size.
In its most basic form, an Autoencoder has only one hidden layer with fewer nodes
than input features. This hidden layer encodes the input to a lower dimension and then
passes the encodings to an output layer which projects them to the original input size.
However, the encoding part can be done in several layers, for example reducing the
number of dimensions successively. Similarly, the decoding part can also be done in
multiple layers, increasing the number of dimensions for each layer. These types of
techniques can allow for more accurate reconstructions of complex data [11, p. 493-
516]. Due to the fact that Autoencoders most often have a bottleneck encoding layer,
they cannot learn to completely copy input data and provide perfect reconstructions.
Instead they have to learn distinct feature representations of the input in order to make
accurate reconstructions.

Figure 6: Example Autoencoder architecture which takes as input a k-dimensional vec-


tor, encodes it through n layers, decodes it through m layers and finally produces a
k-dimensional vector as output.

Similarly to Word2Vec, after training an Autoencoder, the actual model can be thrown
away and the encoding layer can be extracted in order to get an encoded representation
of the input data for dimensionality reduction purposes [11, p. 493-516]. Another use
case is to calculate the difference between the original input and the reconstructed out-
put, resulting in a reconstruction error value. The larger the difference between the input
and the output, the worse the reconstruction. This reconstruction error can be used in anomaly detection systems, where an Autoencoder is trained
to reconstruct normal input data. The assumption for such a system is that anomalous
data points should be much more difficult to reconstruct, resulting in a large reconstruc-
tion error [8, p. 30]. By setting threshold values for the reconstruction error, anomalies
can be defined as any data point which results in a reconstruction error larger than the
threshold value.
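A minimal sketch of this thresholding logic, assuming reconstruction errors have already been computed; the percentile-based threshold is one illustrative choice among many:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for per-sample reconstruction errors: mostly small values
# (normal data) plus a few large ones (hard-to-reconstruct samples).
errors = np.concatenate([rng.exponential(0.1, 990), rng.exponential(2.0, 10)])

# One possible threshold: the 99th percentile of errors on normal data.
threshold = np.percentile(errors, 99)

# Flag every sample whose reconstruction error exceeds the threshold.
anomalies = errors > threshold
print(f"threshold={threshold:.3f}, flagged={anomalies.sum()} of {len(errors)}")
```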
Autoencoders have been found successful in areas of anomaly detection where previ-
ously supervised approaches have been used, such as determining healthy behaviour of
supercomputers [4]. For datasets where clean data is not guaranteed, Robust Deep Au-
toencoders are supposedly suitable [39]. This type of Autoencoder adds a filter layer
which filters out inputs that are difficult to reconstruct, so that they do not affect the weight updates and the ability to reconstruct normal inputs.


3 Related work

There are many examples of botnet detection approaches using different techniques
ranging from pattern-based detection and supervised machine learning systems to un-
supervised anomaly detection algorithms. Often these systems are part of a larger IDS
which leverages different approaches [9]. A supervised system uses the patterns of ma-
licious traffic in order to distinguish them from legitimate traffic. An anomaly detection
system on the other hand aims to understand the normal patterns of network traffic and
alerts on anything that deviates from its perception of normality.
A common difficulty for network analysis is the topic of processing IP addresses and
port numbers. Often aggregation based on IP addresses is performed in order to com-
pute insightful statistical features. However, when dealing with NetFlow data, which is
already an aggregation of PCAP data, valuable information can be lost. To process IP
addresses and extract meaningful information from them, expert knowledge is required,
such as determining the hierarchical structure of IP addresses. Another approach is to
do IP geolocation estimation, which converts IP addresses to longitude and latitude co-
ordinates leveraging techniques such as neural networks [15, 13]. With IP geolocation
estimation, distance measurements can be made when comparing IP addresses, how-
ever the problem with parsing port numbers is still present. In Sections 3.3 and 3.4
two other approaches for extracting information from IP addresses and port numbers are
described.
Another difficulty when comparing different botnet detection approaches is the discrepancy in metrics and datasets used, which makes it difficult to evaluate model performance.
Commonly used metrics (explained in detail in Section 4.5) include True Positive Rate
(TPR) (also known as Recall), False Positive Rate (FPR), Area Under Curve (AUC),
Precision, Accuracy and F1 (the harmonic mean of Precision and Recall), however,
information is often lost when using a single metric for evaluation. Therefore, it is
important to use a combination of metrics which complement each other. Section 3.1
describes some network intrusion approaches done using supervised techniques and
Section 3.2 describes unsupervised techniques. Some of these approaches were evalu-
ated on undisclosed datasets which makes it increasingly difficult to assess the quality
of the results.

3.1 Network intrusion detection using classification models

Botnet detection is commonly approached in a supervised manner, using machine learning models in order to detect botnet flows. The results of the approaches mentioned in this section can be seen in Table 1.


Multiple neural network architectures have been used for different forms of network
attack detection; these range from shallow one-layer networks to more complex deep
Long short-term memory (LSTM) architectures. An LSTM neural network is a type
of network which is specialized on sequential data, using information from previous
timesteps as inputs to the current timestep [11, 25, p. 363-408, p. 570-571]. In a sense
they have a form of memory which allow them to capture time sensitive information over
long sequences [6, 29, p. 196-223, p. 537-549]. Idhammad et al. [14] employed a neural
network with a single hidden layer for classifying Denial of Service (DoS) attacks. The
model was evaluated on two publicly available datasets, the NSL-KDD dataset, based
on the KDD dataset from 1999, was used, which contains four types of DoS attacks
and a total of 148.517 samples. UNSW-NB15 was the other dataset used, generated in
2015, which has nine different attack types and 257.705 samples. Moradi et al. [24]
similarly used a neural network with two hidden layers for detecting two different types
of attacks on the DARPA public dataset from 1999. Also using the DARPA dataset, Tran
et al. [36] used a neural network for real time network attack detection. With a more
complex architecture, an LSTM neural network, Lin et al. [18] detected unauthorized
devices on undisclosed corporate datasets.
Decision tree and Random forest models have also been found useful for network at-
tack detection. A Decision tree is a tree-like model which branches into different (often
nested) outcomes based on threshold values [25, p. 546-547]. It contains conditional
statements based on these values and are therefore easy to interpret as one can tell ex-
actly why a certain choice was made. A Random forest is an ensemble of Decision
trees, which are often small and trained on separate features in order to avoid correla-
tion between trees [25, p. 552-553]. A Random forest uses the collective prediction of
its Decision trees in order to come up with a final prediction. Beigi et al. [1] merged
three publicly available datasets in order to create a single synthetic dataset for botnet detection
using a Decision tree, with sixteen different types of botnets. Similarly, Bilge et al. [3]
used a Random forest for botnet detection on an undisclosed dataset. Also using a Ran-
dom forest, Stevanovic et al. [34] performed botnet detection on the publicly available
ISOP dataset from 2005, which contains three different types of botnets. Li et al. [17]
compared SVMs and Decision Trees for determining host roles in undisclosed campus
networks and found SVMs superior for the task. An SVM is a type of model which
constructs a set of hyperplanes in order to separate classes as distinctly as possible by
maximizing the distance between data points [25, p. 498-504]. Najafabadi et al. [26]
compared three different machine learning models for SSH brute force attack detection
on an undisclosed campus network. Decision trees, K-Nearest Neighbour (K-NN) and
Naive Bayes were compared. K-NN is a memory-based model which looks at the K
nearest points in its memory dataset that are closest to input X and classifies X as the most frequent class in the nearest K points [25, p. 16-17]. Naive Bayes on the other
hand applies Bayes’ theorem with the naive assumption that there is conditional inde-
pendence between every pair of features and the label, which often is not the case [25,
p. 84]. In another paper, Najafabadi et al. [27] compared Decision trees and K-NN for
DoS detection on the publicly available SANTA dataset from 2015.

Source                  Model                  Score                                 Dataset
Idhammad et al. [14]    Neural network         99% TPR                               NSL-KDD
Idhammad et al. [14]    Neural network         97% TPR                               UNSW-NB15
Moradi et al. [24]      Neural network         90% Accuracy                          DARPA
Lin et al. [18]         LSTM neural network    0.99 AUC                              Undisclosed
Tran et al. [36]        Neural network         5.14% FPR, 99.92% TPR                 DARPA
Beigi et al. [1]        Decision tree          68% TPR, 3% FPR                       Merge of ISOT, ISCX 2012 IDS and CTU-13
Bilge et al. [3]        Random forest          65% TPR, 1% FPR                       Undisclosed
Stevanovic et al. [34]  Random forest          95.73% TPR, 0.9596 F1                 ISOP
Li et al. [17]          SVM                    96% Accuracy                          Undisclosed
Li et al. [17]          Decision Tree          99% Accuracy                          Undisclosed
Najafabadi et al. [26]  Decision tree          0.9965 AUC                            Undisclosed
Najafabadi et al. [26]  K-NN                   0.9946 AUC                            Undisclosed
Najafabadi et al. [26]  Naive Bayes            0.9966 AUC                            Undisclosed
Najafabadi et al. [27]  Decision tree          0.9988 AUC, 98.73% TPR, 0.0282% FPR   Undisclosed
Najafabadi et al. [27]  K-NN                   0.9999 AUC, 98.83% TPR, 0.0316% FPR   Undisclosed

Table 1: This table shows the results of the different supervised intrusion detection approaches described in Section 3.1.

3.2 Network intrusion detection using anomaly detection models

Similarly to supervised approaches, unsupervised models come in various forms. The results of the approaches mentioned in this section can be seen in Table 2.


Mathur et al. [21] used a hybrid approach combining anomaly detection and classification models to determine whether outgoing connections from internal IP addresses
were malicious. The approach used both publicly available lists with known malicious
IP addresses and an unsupervised similarity model which was deployed to determine
whether new IP addresses were similar to the blacklisted IP addresses. The authors
used this combination of the unsupervised model with the blacklist for botnet detection
on an undisclosed dataset. Radford et al. [28] utilized an LSTM neural network in order
to predict flow sequences in an unsupervised manner where poorly predicted sequences
were classified as anomalous. This was done using NLP techniques similar to that of
Word2Vec, where flows were encoded into tokens that form sentences between comput-
ers. The authors used this approach to detect four different network attacks including
botnet attacks, on the publicly available ISCX IDS dataset from 2012. Winter et al. [38]
applied a one-class Support Vector Machine (SVM) which is an unsupervised modifica-
tion of a traditional SVM that only predicts whether a data point is normal or an outlier.
It is trained oppositely to other forms of anomaly detectors, which are usually trained to learn the patterns of normal behaviour; instead, the model is only trained on outliers, i.e. malicious flows. The authors evaluated the model on an undisclosed dataset with
multiple different attack types. Terzi et al. [35] used K-means, a model which attempts
to group data into different clusters based on proximity [25, p. 354], on the 10th sce-
nario of the CTU-13 dataset in order to separate normal flows from botnet flows into two
different clusters. Also using the CTU-13 dataset, Wang et al. [37] used a graph-based
unsupervised model for botnet detection on five different scenarios in the dataset.

3.3 Flow2Vec

Flow2Vec is a model for applying Word2Vec to network flows, such as NetFlow [12]. It
treats individual flows as words from the Word2Vec algorithm and surrounding words
(flows) as its context, thus a sequence of flows form a sentence. The approach tokenizes
each individual flow so that they are all represented as a single numerical token. From
these tokens, a Word2Vec Skip Gram model with negative sampling is trained to create
continuous vector representations of each flow. With this approach, since each individ-
ual flow is tokenized, no relationship between IP addresses or port numbers is preserved.
Even if both the IP addresses and port numbers are identical between flows, if the pro-
tocol (or any single one of the features) differ, they will be treated as two completely
different flows unless they appear within each others context (close in time). By using
Flow2Vec, one can create vector representations of flows which can then be fed into a
machine learning model or other systems that accept numerical inputs. However, it is debatable how useful this is, since no relationship between IP addresses or port numbers is preserved.


Source               Model                Score                                         Dataset
Mathur et al. [21]   Hybrid model         92% TPR                                       Undisclosed
Radford et al. [28]  LSTM neural network  0.84 AUC                                      ISCX IDS
Winter et al. [38]   One-class SVM        0% FPR, 98% Accuracy                          Undisclosed
Terzi et al. [35]    K-means              99.6% Accuracy, 27.6% FPR                     10th scenario of CTU-13
Wang et al. [37]     Graph-model          1.7% FPR, 7.7% TPR, 86% Precision, 0.14 F1    1st scenario of CTU-13
Wang et al. [37]     Graph-model          2.2% FPR, 4.6% TPR, 80% Precision, 0.088 F1   2nd scenario of CTU-13
Wang et al. [37]     Graph-model          11% FPR, 12% TPR, 69% Precision, 0.21 F1      6th scenario of CTU-13
Wang et al. [37]     Graph-model          4% FPR, 4.4% TPR, 60% Precision, 0.82 F1      8th scenario of CTU-13
Wang et al. [37]     Graph-model          3.4% FPR, 8% TPR, 73% Precision, 0.14 F1      9th scenario of CTU-13

Table 2: This table shows the results of the different unsupervised intrusion detection approaches described in Section 3.2.


3.4 IP2Vec

IP2Vec is a model heavily inspired by Word2Vec and Flow2Vec for finding similarities
between IP addresses [30]. Its main concept relies on the assumption that IP addresses
are similar if they appear in similar contexts, where context is defined as other net-
work features in the same flow such as port numbers and protocol, and the words in
Word2Vec are the IP addresses. Just like Word2Vec maps words to continuous vectors,
IP2Vec creates continuous vector representations for IP addresses. Traditional distance
measures which have difficulties calculating distances with features of both categorical
and continuous values are thus able to operate solely on continuous values when applied
on flow or packet data. By using these distance measurements, close IP addresses in the
continuous vector space imply that the IP addresses are similar. As the authors state,
multiple IDSs exist which either discard IP addresses, apply domain knowledge or use
aggregation methods in order to extract features from them. IP2Vec is able to extract
meaningful representations of IP addresses without aggregation or domain knowledge.
In the paper, Ring et al. [30] train the IP2Vec model on two publicly available datasets,
the 9th scenario of the CTU-13 dataset and their own CIDDS-001 dataset, both of which
contain botnet attacks. In the paper, the model is trained on source and destination IP
addresses, source and destination port numbers and the protocol of each flow, however,
the authors only extract vector representations for the IP addresses. The model was
trained for ten epochs with an embedding layer of size 32, resulting in a 32-dimensional
vector representation for each IP address. The authors found these continuous vector
representations of IP addresses to be superior for botnet detection compared to more
traditional statistical methods for detecting anomalous IP addresses. Several advantages
and disadvantages of the proposed model were acknowledged, with the main advantage
being that continuous vector representations of categorical network data allows for more
data mining and machine learning models to be applied, to more data. However, a disad-
vantage is that the behaviour of an IP address can change over time, which could cause
problems for the model, and that previously unseen IP addresses do not have continuous vec-
tor representations available. To solve the latter problem, the model could be retrained
regularly to account for new IP addresses or some default vector representation could
be applied which is the mean of all the known IP addresses.
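A minimal sketch of the mean-vector fallback described above, assuming the embeddings are kept in a plain dictionary; the addresses and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical trained IP2Vec-style embeddings: IP address -> 32-dim vector.
embeddings = {
    "147.32.84.165": rng.normal(size=32),
    "147.32.84.191": rng.normal(size=32),
    "147.32.84.192": rng.normal(size=32),
}

# Default vector for previously unseen IPs: the mean of all known vectors.
default_vec = np.mean(list(embeddings.values()), axis=0)

def lookup(ip):
    return embeddings.get(ip, default_vec)

print(lookup("147.32.84.165")[:4])  # known IP: its own embedding
print(lookup("8.8.8.8")[:4])        # unseen IP: mean-vector fallback
```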


4 Methods

This section describes the methods used for tackling the problem described in Section
1.1. This involves a process with multiple steps, including data selection and feature ex-
traction followed by the proposed machine learning models used for anomaly detection
and evaluation criteria for interpretation of given results.

4.1 Data selection

The datasets chosen for training and evaluating the models are scenarios 2 through 13
from the CTU-13 dataset; scenario 1 is excluded because it was not available.
As described in Section 2.2.1, these are publicly available NetFlow datasets from real
networks with both normal network traffic and botnet traffic which were found to (unlike
many others) fulfill the criteria of a good network dataset by Malowidzki et al. [20].
For the purpose of training and evaluating, the botnet traffic is treated as anomalous and
everything else as normal, as done by the different approaches in Section 3.2. Table 3
shows the distribution of normal and bot flows in each scenario and the botnet used.
Scenario Total flows Normal flows Bot flows
2 1808122 1787181 (98.84%) 20941 (1.16%)
3 4710638 4683816 (99.43%) 26822 (0.57%)
4 1121076 1118496 (99.77%) 2580 (0.23%)
5 129832 128931 (99.31%) 901 (0.69%)
6 558919 554289 (99.17%) 4630 (0.83%)
7 114077 114014 (99.94%) 63 (0.06%)
8 2954230 2948103 (99.79%) 6127 (0.21%)
9 2087508 1902521 (91.14%) 184987 (8.86%)
10 1309791 1203439 (91.88%) 106352 (8.12%)
11 107251 99087 (92.39%) 8164 (7.61%)
12 325471 323303 (99.33%) 2168 (0.67%)
13 1925149 1885146 (97.92%) 40003 (2.08%)

Table 3: Scenarios from the CTU-13 dataset used for the experiments.

4.2 Feature extraction

From each dataset, seven features are extracted for each flow, namely flow duration,
total packets, total bytes, bytes per second, bytes per packet, packets per second and time since last flow, described below.

Flow Duration Flow duration is the total time between the first packet sent within a
flow and when the last packet was received, measured in seconds.

Total packets Total packets is the total number of packets sent bidirectionally within
a flow.

Total bytes Total bytes is the total number of bytes sent bidirectionally within a flow.

Bytes per packet Bytes per packet is the total number of bytes divided by total num-
ber of packets.

Bytes per second Bytes per second is the total number of bytes divided by the flow
duration.

Packets per second Packets per second is the total number of packets divided by the
flow duration.

Time since last flow Time since last flow is the duration between when the previous
flow was last seen and the current flow was first seen.
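A sketch of how these derived features could be computed from raw flow fields, assuming a pandas DataFrame with hypothetical column names; the zero-duration guard is an added assumption:

```python
import pandas as pd

# Hypothetical flow records; the column names are illustrative, not the
# actual CTU-13 field names.
df = pd.DataFrame({
    "start_time": [0.0, 1.5, 2.0],   # seconds
    "duration":   [2.0, 0.5, 1.0],
    "tot_pkts":   [10, 4, 7],
    "tot_bytes":  [1200, 300, 900],
})

df["bytes_per_packet"] = df["tot_bytes"] / df["tot_pkts"]
# Guard against zero-duration flows (an added assumption about the data).
dur = df["duration"].replace(0, 1e-9)
df["bytes_per_second"]   = df["tot_bytes"] / dur
df["packets_per_second"] = df["tot_pkts"] / dur
# Duration between the previous flow and this one; the first flow gets 0.
df["time_since_last_flow"] = df["start_time"].diff().fillna(0)
print(df)
```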
The features are also standardized using

$$x_i' = \frac{x_i - u}{s}, \qquad i = 1, \dots, n$$

where n is the number of flows, x_i is the value of feature x for the ith flow, u is the mean value of feature x, and s is the standard deviation of feature x. This is done so that the features will cause similar reconstruction differences when comparing inputs and outputs for the Autoencoder model.
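In code, this standardization is a per-feature z-score; a minimal numpy sketch (scikit-learn's StandardScaler computes the same transformation):

```python
import numpy as np

X = np.array([[2.0, 10.0],
              [0.5,  4.0],
              [1.0,  7.0]])   # rows = flows, columns = features

u = X.mean(axis=0)            # per-feature mean
s = X.std(axis=0)             # per-feature standard deviation
X_std = (X - u) / s           # each feature now has mean 0 and std 1
print(X_std.round(3))
```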


4.3 System structure

The proposed anomaly detection system structure consists of several parts, which
can broadly be separated into a preprocessing part and an anomaly detector part. The
first part, used for preprocessing, is a Word2Vec model, trained on features similar
to that of IP2Vec as described in Section 3.4. This allows for vector representations
of IP addresses, port numbers and protocols to be used as input features for a second
model, which Ring et al. [30] found superior to other approaches which often perform
some form of aggregation based on these features instead, or completely discards them.
This also avoids the requirement of expert knowledge in order to extract meaningful
information from these features, which could be network specific.
The embeddings produced by the Word2Vec model in conjunction with the extracted
features from Section 4.2 are then fed into an anomaly detector. Due to the complexity
of the problem, with almost a hundred features for each flow, an Autoencoder model is
used as an anomaly detector, which Ted Dunning [8, p. 30] claims to be suitable for such
difficult tasks. The Autoencoder is fed word vectors and extracted features as input
and attempts to reconstruct each flow. The assumption is that an anomalous botnet
flow should be much more difficult to reconstruct for the model, compared to a normal
flow. This allows for one-dimensional thresholds to determine whether a flow is to be
considered anomalous or not, where a large error (big difference between input and
reconstruction) is considered anomalous. A visualization of the system structure can be
seen in Figure 7.

Figure 7: This Figure illustrates the general system structure of the proposed solution.
It takes NetFlow data as input; features are extracted, and a Word2Vec model is then
trained to provide vector representations of the IP addresses, ports and protocols. These
features and vector representations are then fed into an Autoencoder which attempts to
reconstruct its input.


(a) Word2Vec model


(b) Autoencoder model

Figure 8: (a) shows the Word2Vec model used, which has an embedding layer size of
16. (b) shows the Autoencoder model used, which has an input and output layer size of
87, the 5 hidden layers have size 64, 32, 16, 32 and 64 respectively.

4.3.1 Word2Vec model

The chosen Word2Vec model is a Skip Gram version with negative sampling, due to
its high performance and computational efficiency as described in Section 2.4. As pre-
viously mentioned, this model takes a word and a word context as input and predicts
whether the given word belongs to the given context, an illustration of which can be
seen in Figure 8a. Before training the model, the flow data is preprocessed similarly to
that of IP2Vec, namely from each flow, the source and destination IP addresses, source
and destination port numbers and protocol are extracted. These are then tokenized into
numerical tokens so that they can be fed into a machine learning model. Unlike the
original Word2Vec model which has a context window which can be arbitrarily chosen
(often as the length of a sentence) [22, 23], each flow forms its own context and can
be interpreted as a sentence. From the tokenization of flow features, word-context pairs
are generated for training the Word2Vec model. The word is one of the above extracted
flow features and the context is another feature from the dataset. Each pair has a la-
bel which indicates whether the word belongs to the pair’s context or if it is a negative
sample. Thus, the word-context label generation process involves creating true samples
for each combination of features within a given flow, as well as creating twenty-five
negative samples for each true feature, where a negative sample is a sample where the
word and context are taken from two separate flows. The full process, from flow to word
embeddings, can be seen in Figure 8a.
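A sketch of this pair-generation step, with one positive pair per ordered feature combination in a flow and twenty-five negatives per positive; the tokenization is simplified, and negatives are drawn uniformly here for brevity, whereas the described scheme samples by frequency:

```python
import itertools
import random

random.seed(0)

# Each flow contributes five "words": src IP, dst IP, src port, dst port, protocol.
flows = [
    ["147.32.84.165", "77.75.77.9", "1025", "80", "TCP"],
    ["147.32.84.191", "8.8.8.8", "53211", "53", "UDP"],
]

# Tokenize: every distinct value gets an integer id.
vocab = {w: i for i, w in enumerate(sorted({w for f in flows for w in f}))}
all_tokens = list(vocab.values())

pairs = []  # (word, context, label)
for flow in flows:
    tokens = [vocab[w] for w in flow]
    for word, context in itertools.permutations(tokens, 2):
        pairs.append((word, context, 1))      # positive: same flow
        for _ in range(25):                   # 25 negatives per positive
            pairs.append((word, random.choice(all_tokens), 0))

print(len(pairs), pairs[:3])
```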
Finally, after generating word-context pairs, the model can be trained. Afterwards, the model's predictions are discarded; instead, the embeddings from the model's hidden layer are extracted. An embedding size of sixteen dimensions is used, half of what was used for IP2Vec [30], which could result in a less rich embedding representation.
However, unlike in IP2Vec, the port and protocol embeddings are not discarded, which
means that a richer representation from each flow can be extracted, even though each
individual embedding vector is smaller. These embeddings provide a word-to-vector
mapping, giving each word a sixteen-dimensional numerical vector representation. For
each flow the five extracted features can each be replaced by a sixteen-dimensional
vector that can be used as numerical features for the Autoencoder model.

4.3.2 Autoencoder model

Combining the features extracted from each flow, as described in Section 4.2 with
the vector representations of the IP addresses, port numbers and protocols from the
Word2Vec model results in a total of 87 features which are used as input for the Au-
toencoder model. The Autoencoder itself is a neural network, as described in Section
2.5, with five hidden layers. The first three hidden layers create the encoding part of the
network, which encodes the input vectors to a lower dimensional space whilst trying
to maintain the same feature representation. This encoding part is done successively,
the encoding layers having 64, 32 and 16 nodes respectively. The final two hidden lay-
ers create the decoding part of the network, responsible for decoding the input from its
lower dimensional transformation into its original size. This is done similarly to the
encoding part, with the decoding layers having 32 and 64 nodes respectively. The final
output layer has 87 nodes, the same amount as the input layer. A visualization of the
model’s architecture can be seen in Figure 8b. The model is trained by minimizing the
metric Mean Squared Error (MSE) defined in Equation 1 when reconstructing its input.
In this equation, n is the number of input features for the model, Yi is the ith input fea-
ture and Ŷi is the Autoencoder’s reconstruction of the ith input feature. A small MSE
value indicates a reconstruction that closely resembles the input.

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 \qquad (1)$$
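A sketch of this architecture in Keras; the framework, optimizer and activation functions are assumptions, as the text does not state them:

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 87  # 7 statistical features + 5 embeddings x 16 dimensions

# Encoder 87 -> 64 -> 32 -> 16, decoder 16 -> 32 -> 64 -> 87.
autoencoder = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(n_features,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(16, activation="relu"),   # bottleneck layer
    layers.Dense(32, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(n_features, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")  # minimizes Equation (1)
autoencoder.summary()
```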

4.4 Experiments

Each of the scenarios listed in Table 3 are run through the same experiment, which is
described below.

1. Features are extracted from each flow as described in Section 4.2.


2. A Skip Gram Word2Vec model with negative sampling is trained as described in Section 4.3.1.

3. For each flow, the IP addresses, port numbers and protocol are replaced by the
embedding vector representations from the Word2Vec model.

4. The botnet flows are separated from the normal flows, into two different datasets,
dataset X containing the normal flows and dataset Xbot containing the botnet
flows.

5. Dataset X is shuffled and split into two parts: one with 90% of the data, called Xtrain, and one with the remaining 10%, called Xtest. The combined number of flows for Xtest and Xbot can be seen in Table 4.

6. An Autoencoder model is trained on Xtrain as described in Section 4.3.2.

7. After training, the Autoencoder is used to make predictions (reconstructions) on Xtest, resulting in dataset Xtest_pred, and the same is done for Xbot, resulting in Xbot_pred.

8. Finally, the reconstruction error is computed between dataset Xtest and Xtest_pred, as well as between Xbot and Xbot_pred, for comparison.

As can be seen in step 6 in the above list, the Autoencoders are only trained on the
normal flows, not botnet flows. As described in Section 2.5, this allows the Autoencoder
to learn to reconstruct normal input data well, without being affected by outliers, which
should be difficult to reconstruct. To verify that the model does not simply memorize the
flows it was trained on, the normal flows are separated into a train and test set, Xtrain
and Xtest , where the training set is used for training the model and the test set is used
for evaluating the model together with the botnet flows, Xbot .
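A sketch of steps 4-8, assuming X (the normal flows) and X_bot (the botnet flows) are NumPy matrices of the 87 features and that autoencoder is the model from the sketch in Section 4.3.2; the number of epochs and the batch size are assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
X = X[rng.permutation(len(X))]               # step 5: shuffle the normal flows
split = int(0.9 * len(X))
X_train, X_test = X[:split], X[split:]       # 90% training, 10% test

autoencoder.fit(X_train, X_train, epochs=10, batch_size=256)  # step 6

X_test_pred = autoencoder.predict(X_test)    # step 7: reconstructions
X_bot_pred = autoencoder.predict(X_bot)

# Step 8: per-flow reconstruction error (Equation 1 applied to each flow).
mse_test = np.mean((X_test - X_test_pred) ** 2, axis=1)
mse_bot = np.mean((X_bot - X_bot_pred) ** 2, axis=1)
```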

4.5 Evaluation

For evaluating the system, three main methods are used: the Precision-Recall curve, the AUC score and a visualization of the reconstruction error histograms of the test and botnet data.
The Precision-Recall curve illustrates the Precision and Recall scores for different thresholds [25, p. 184-185]. In this setting, the threshold is based on the MSE values. Precision can be thought of as the percentage of flows marked as anomalous that actually are anomalies; a high score indicates that most of the marked anomalies are indeed anomalies.


Scenario   Total flows   Normal flows        Bot flows
2          199660        178719 (89.51%)     20941 (10.49%)
3          495207        468385 (94.58%)     26822 (5.42%)
4          114430        111850 (97.75%)     2580 (2.25%)
5          13795         12894 (93.47%)      901 (6.53%)
6          60059         55429 (92.29%)      4630 (7.71%)
7          11465         11402 (99.45%)      63 (0.55%)
8          300939        294812 (97.96%)     6127 (2.04%)
9          375242        190255 (50.70%)     184987 (49.30%)
10         226697        120345 (53.09%)     106352 (46.91%)
11         18073         9909 (54.83%)       8164 (45.17%)
12         34499         32331 (93.72%)      2168 (6.28%)
13         228520        188517 (82.49%)     40003 (17.51%)

Table 4: Flow count for each scenario's evaluation dataset, a combination of Xtest and Xbot.

The score tells nothing about anomalies that are marked as normal; anomalies can therefore go unnoticed by the model while it still produces a high Precision score. It is calculated using the True Positives (TPs) and False Positives (FPs) of a system's predictions. In this case, TP is the number of anomalous data points correctly marked as anomalous by the system and FP is the number of normal data points wrongly marked as anomalous by the system. How the Precision score is calculated can be seen in Equation 2. Recall, or TPR, is complementary to Precision and tells what percentage of the actual anomalies are marked as anomalies. As such, a high score indicates that a large part of the anomalies are correctly identified; however, it does not take into account how many of the marked anomalies are in fact normal. Both Tran et al. [36] and Wang et al. [37] note that for botnet detection, Recall is a much more important metric for evaluation than Precision, since botnet flows may cause serious harm if they go unnoticed, whilst false alarms (measured by Precision) are more tolerable. Recall is calculated using TPs and False Negatives (FNs), where FN is the number of anomalous data points incorrectly marked as normal by the system. Equation 3 shows how Recall is calculated. The Precision-Recall curve gives insight into where the best placement of a threshold would have been in order to achieve certain Precision and Recall scores.

$$Precision = \frac{TP}{TP + FP} \qquad (2)$$

$$Recall\ (TPR) = \frac{TP}{TP + FN} \qquad (3)$$


Similarly to Precision-Recall curves, the AUC score is also based on computing scores for different thresholds of a value. In the context of machine learning, AUC most commonly refers to the Area under the Receiver Operating Characteristic (ROC) curve [25, p. 183-184]. The ROC curve illustrates the tradeoff between TPR and FPR for a set of thresholds, where FPR can intuitively be thought of as the percentage of all normal flows that are mistakenly marked as anomalous by the system. It is therefore desirable to have as low an FPR as possible. FPR is calculated using the FPs and True Negatives (TNs), where TN is the number of normal data points correctly marked as normal by the system.
The calculation for computing FPR can be seen in Equation 4. AUC is used to summarize the ROC curve into a single score between 0 and 1. A high score is better and indicates a low tradeoff between the TPR and the FPR; however, the score is not well suited for imbalanced datasets, where one class label occurs much more frequently than another. In this case, the Precision-Recall curve gives a more representative view of how well a model predicts different classes.

$$FPR = \frac{FP}{FP + TN} \qquad (4)$$
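A sketch of how these metrics could be computed with scikit-learn, using the per-flow MSE values from the experiments as anomaly scores (label 1 for botnet flows, 0 for normal flows):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve

scores = np.concatenate([mse_test, mse_bot])
labels = np.concatenate([np.zeros(len(mse_test)), np.ones(len(mse_bot))])

# Precision and Recall (Equations 2 and 3) for every candidate threshold.
precision, recall, pr_thresholds = precision_recall_curve(labels, scores)

# TPR and FPR (Equation 4) for every candidate threshold, and the AUC.
fpr, tpr, roc_thresholds = roc_curve(labels, scores)
auc = roc_auc_score(labels, scores)
```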


5 Results

This section presents the results from the experiments described in Section 4.4 for each scenario in Table 4, based on the evaluation criteria listed in Section 4.5, namely:

1. Reconstruction error visualization using histograms.

2. Precision-Recall curves for moving thresholds for botnet detection.

3. AUC scores.

These results are based on the Xtest and Xbot datasets for each CTU-13 scenario used,
explained in Section 4.4.

5.1 Flow reconstruction error

Figures 9 and 10 visualize the reconstruction error histograms for each individual dataset. To make the histograms easier to interpret, and to make the difference in reconstruction error between the regular flows Xtest and the botnet flows Xbot visible, the histograms are normalized with regards to height, such that the botnet flows and regular flows are of equal height. The actual distribution between flows for each scenario can be seen in Table 4. For scenarios 9-11, the distribution is almost equal between botnet flows and normal flows, meaning that for these scenarios the histogram plots closely resemble reality; for the other scenarios, the normal flows greatly outnumber the botnet flows. As can be seen in Figure 9, botnet and normal flows seem to be easily separable in Scenarios 2, 3, 4 and 6, whereas Scenarios 5 and 7 are not as easily separable. Similarly, all Scenarios (8-13) in Figure 10 seem to be easily separable by reconstruction error. There does not seem to be a single value that could serve as a threshold separating the botnet flows from the normal flows perfectly in all scenarios. For scenarios 3, 4, 6 and 10, a threshold between 0.15 and 0.2 seems possible as a separator, 0.1 could work for scenarios 8, 9 and 10, and possibly 0.3 for scenarios 11 and 12. Scenarios 2, 5 and 7 all have very low reconstruction errors compared to the other scenarios, and a threshold that successfully separates botnet flows from normal flows in these scenarios would, judging from the histograms, have to be placed between 0.04 and 0.075.
These visualizations can provide insight into whether it seems possible to distinguish between botnet and normal flows using the reconstruction error metric; however, to actually assess this possibility, it is useful to look at other metrics as well.
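A minimal sketch of how such height-normalized histograms could be produced, assuming matplotlib and the per-flow MSE arrays from the experiments; note that density=True normalizes the area of each histogram rather than its peak height, an assumption that serves the same comparative purpose:

```python
import matplotlib.pyplot as plt

plt.hist(mse_test, bins=100, density=True, alpha=0.5, color="blue", label="regular flows")
plt.hist(mse_bot, bins=100, density=True, alpha=0.5, color="red", label="botnet flows")
plt.xlabel("Reconstruction error (MSE)")
plt.ylabel("Normalized frequency")
plt.legend()
plt.show()
```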


Figure 9: Histogram plots of the reconstruction error (MSE) for Scenarios 2 through 7, shown in panels (a) through (f). The blue areas show the histogram of the regular flows and the red areas show the histogram of the botnet flows.


Figure 10: Histogram plots of the reconstruction error (MSE) for Scenarios 8 through 13, shown in panels (a) through (f). The blue areas show the histogram of the regular flows and the red areas show the histogram of the botnet flows.


5.2 Precision-Recall curves

Figures 11 and 12 visualize the Precision-Recall curves for each individual scenario. These figures illustrate the Precision and Recall scores for varying thresholds of the reconstruction error. This threshold can be thought of as a line that separates the histograms in Figures 9 and 10. However, the scores are calculated without normalizing the scenarios, resulting in more representative visualizations. The best performance can be seen in scenarios 6 and 9-13, which are all able to maintain a Precision score of 90-100% with a similar Recall score, resulting in almost no tradeoff between finding all botnet flows and avoiding false alarms (normal flows which are marked as anomalous). This means that a threshold could be placed in such a way that it separates the normal flows from the botnet flows, producing very few FPs and FNs and a large number of TPs. This is comparable to the scores presented by other approaches in Tables 1 and 2, and much better than what Wang et al. [37] achieved on scenarios 1-2, 6 and 8-9. However, it is important to note that the Precision-Recall curves look at multiple threshold values, where the approaches in Section 3 do not (except when presenting AUC scores). One way of viewing this is that the approaches in Section 3 evaluate their systems based on a single threshold, while the Precision-Recall curves in Figures 11 and 12 evaluate model potential across multiple thresholds, making a direct comparison unfair.
Scenarios 2-4 each produce decent results, showing no real tradeoff between Precision and Recall; however, they do not produce as high Precision scores, which would result in a lot of false alarms. Scenarios 5 and 8 produce much worse Precision scores, which would indicate more false alarms than true alarms. Finally, the approach produces the worst results on scenario 7, meaning that from the reconstruction error metric alone, botnet flows and normal flows are inseparable. In order to set a threshold which correctly marks botnet flows as anomalous for this scenario, one would have to accept an FPR of nearly 100%.

5.3 AUC scores

Figure 13 shows the resulting AUC scores for each scenario. As can be seen, almost all of them have an AUC score of roughly 0.98, comparable to the AUC scores achieved by the supervised methods in Table 1. Worse scores are achieved for scenario 5 with a score of 0.91 and scenario 7 with 0.83. This indicates that for most of the scenarios, there is a low tradeoff between FPR and TPR (Recall). However, as stated in Section 4.5, the score is not suited for imbalanced datasets, which means that for scenarios 2-8 and 12-13, the Precision-Recall curves are more representative of model performance.


Figure 11: Precision-Recall curve plots for Scenarios 2 through 7, shown in panels (a) through (f).


Figure 12: Precision-Recall curve plots for Scenarios 8 through 13, shown in panels (a) through (f).


This can be seen especially for scenario 7, which achieves a relatively high AUC score of 0.83, comparable to what Radford et al. [28] achieved with an LSTM neural network approach. However, when looking at the Precision-Recall plot in Figure 11f, the Precision and Recall scores indicate poor performance on scenario 7.

Figure 13: AUC scores for each scenario


6 Discussion

In this section, the methods used and the results achieved are discussed, with special
regard to the following three points:

1. How generalizable is the system? I.e., how well do the results from one experiment transfer to another, and are there general rules that can be drawn from the different experiments?
2. Possible alterations to the architectures of both the Word2Vec model and Autoen-
coder model.
3. The selected datasets, specifically how useful they are for evaluating the methods.

6.1 Generalizability of the system

As can be seen in Section 5, the results of the proposed system vary between datasets. This raises the question of why the botnet flows in some datasets are so much easier to distinguish from normal flows than in others. It would for example make sense to look more closely at scenario 7, on which the system performed the worst when looking at both the AUC scores and Precision-Recall curves.
Additionally, looking at the histogram plots of the reconstruction error in Figures 9 and 10, with MSE values ranging from 0 to 1.5, it is clear that no static threshold can be chosen that satisfactorily separates normal flows from botnet flows in all scenarios. A threshold of 1.2 would almost completely separate botnet flows from normal flows in Scenario 10, but for every other scenario, the same threshold would not be able to make any distinction between botnet flows and normal flows. A smarter way of setting a threshold that generalizes across datasets would be required to deploy the method successfully in an unsupervised manner. A possible solution is to use percentiles of the training set's reconstruction error and then evaluate the system using a percentile-based threshold on the test and botnet datasets. This idea is further discussed in Section 7.1.1.

6.2 Model architectures

The chosen models, Word2Vec and the Autoencoder, both have hundreds of different possible parameters and architecture settings that could be tuned and altered. Some examples of possible alterations include:


1. Changing the size of the Word2Vec model's embedding layer, providing larger or smaller vector representations. The Word2Vec model used has an embedding layer size of 16, half of what was used for IP2Vec [30]. It is possible that this size is too small to sufficiently capture the similarities between IP addresses, ports and protocols. It could also be too large, making it difficult to measure how dissimilar two vector representations are due to the large number of values.

2. Increasing or decreasing the number of hidden layers in both the encoding and decoding parts of the Autoencoder for better or worse reconstructions of the flows. Since the system only looks at how well it is able to distinguish botnet flows from normal flows, it does not matter in absolute terms how well the Autoencoder reconstructs its input. What matters is that it should be sufficiently worse at reconstructing botnet flows that a threshold can be set. This means that it could be interesting to look at both more complex and simpler architectures.

3. On the same note as the previous item, the size of the Autoencoder's hidden layers could be altered for effects similar to those of altering the number of hidden layers. A larger number of nodes in the hidden layers would result in a smaller compression of the input features, which would likely allow the Autoencoder to better reconstruct its input. However, this could also result in the Autoencoder becoming too good at reconstructing inputs, yielding similar reconstruction errors for both normal and botnet flows. Similarly, fewer nodes could be used for an increased compression. Clearly, deciding whether to increase or decrease complexity, by altering the number of nodes in each layer or the number of layers in the model, is a difficult problem.

6.3 Dataset selection

The datasets used for evaluating the proposed system were determined to be of high quality according to the criteria set up by Malowidzki et al. [20]. One of these criteria states that for a dataset to be good and representative of real networks, it should be recent. The CTU-13 datasets were created in 2014 [10], which means that they were five years old at the time of these experiments. With ever evolving network traffic, and especially new forms of attack patterns for traditional attacks as well as botnet attacks, it is questionable whether this dataset still holds up to the previously mentioned criteria. Whether the datasets are still to be considered recent and representative of real, modern networks is therefore debatable and could be investigated. One way to at least partly circumvent this issue would be to also use other datasets, such as the CIDDS dataset, briefly mentioned in Section 2.2.1, which was modeled after the CTU-13 datasets but generated at a later time [31].


The CTU-13 datasets are also very well labeled, with each flow being perfectly marked as either a botnet flow or a normal flow. In a real setting this would likely not be the case, or it would cost too many resources to obtain perfect labels. It could therefore be interesting to evaluate the approach on another dataset with worse labeling, or to artificially degrade the labels, to see how the proposed approach would perform in such a setting.


7 Conclusion

In conclusion:

1. An Autoencoder was used for reconstructing network flow data. From these re-
constructions, the MSE was calculated to determine the reconstruction error, i.e.
quantifying how well the Autoencoder managed to reconstruct each flow.
2. The reconstruction error was inspected to view the difference between how well the model managed to reconstruct normal network traffic and botnet traffic, with the hope of finding a clear distinction between the ability to reconstruct normal traffic and the ability to reconstruct botnet traffic.
3. The Autoencoder used multiple features as input from each flow, including statis-
tical features computed based on flow properties and 16-dimensional continuous
vector representations of IP addresses, port numbers and protocols.
4. The vector representations were computed using a Word2Vec model, trained to
find similarities between flow features, similar to how NLP techniques are used
to find related words in sentences.

The results, which evaluate the potential of the above approach for botnet detection, show that the reconstruction error can indeed be used as a metric for botnet detection. However, performance varies between datasets, and the metric can likely not be used as the single criterion for complete botnet detection in a real setting while maintaining a low false alarm rate. The technique produces results similar to those of previously used approaches and shows that there is potential in this technique, which could be incorporated into other systems or be further developed.

7.1 Future work

This section discusses some of the possible future additions that could be implemented
to improve the proposed system. Three main possible extensions are discussed:

1. Determining a good method for placing a threshold based on the reconstruction error that is used for separating normal flows from botnet flows. As described in Section 4.5, the proposed system is evaluated on a number of thresholds in order to determine the potential of using the reconstruction error as a metric for separation.


2. More advanced Autoencoder architectures such as LSTM Autoencoders or Robust Deep Autoencoders.
3. Combining the reconstruction error from the Autoencoder with other systems, or
using the reconstruction error as a feature for other models.

7.1.1 Threshold placement

The current solution is evaluated on many different thresholds, as can be seen in Section 5. To use the system in a real setting, a threshold has to be placed which classifies flows as anomalous or not. As discussed in Section 6.1, no single value could be used as a threshold that would work well for all the datasets on which experiments were run. Instead, one possible way of setting this threshold could be to use reconstruction error percentiles from the training set. For example, setting the threshold at the 99th percentile would theoretically result in 99% of all flows being classified as normal, while the top 1% (the 1% with the largest reconstruction error) would be marked as anomalous, which could possibly result in finding most of the botnet flows. Clearly, some consideration would have to be put into choosing a percentile depending on how many flows are present in the network. Setting the threshold at the 99th percentile for a network which produces a million flows per day would result in ten thousand alarms per day, which would likely be overwhelming.
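A minimal sketch of the percentile-based threshold, assuming mse_train holds the per-flow reconstruction errors on the training set:

```python
import numpy as np

threshold = np.percentile(mse_train, 99)   # 99th percentile of the normal errors
alarms = mse_test > threshold              # flows flagged as anomalous
print(f"alarm rate: {alarms.mean():.2%}")  # expected to be roughly 1%
```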
Another technique could be to look at the Cosine Similarity (CS) between the reconstructions and the original flows. CS is a metric which measures the similarity between two vectors. Similarly to how MSE was used for determining how well the Autoencoder reconstructs its input, CS could be used to measure how similar the reconstructions and the original flows are. This technique would then allow for threshold placements similar to those used with MSE. An even more advanced threshold could possibly be based on both the MSE and CS values.
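A sketch of the CS alternative, assuming the same Xtest and reconstruction matrices as in the experiment sketch in Section 4.4; one similarity value is computed per flow:

```python
import numpy as np

def row_cosine_similarity(a, b):
    """Row-wise cosine similarity between two matrices of flows."""
    dot = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dot / norms

cs_test = row_cosine_similarity(X_test, X_test_pred)
# Values close to 1 indicate good reconstructions; a threshold on low CS
# values would play the same role as a threshold on high MSE values.
```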

7.1.2 Advanced Autoencoder architectures

More advanced Autoencoder architectures could be used for creating a more robust and
generalizable system. Radford et al. [28] and Lin et al. [18] both used LSTM neural
networks for intrusion detection. An LSTM architecture would be interesting to use
since it specializes in sequential data, such as NetFlow. This could allow for detecting time-dependent patterns in the datasets and possibly more advanced intrusions; however, LSTMs are more computationally expensive and more difficult to tune.
The Robust Deep Autoencoder is another advanced type of Autoencoder, which Zhou et al.


[39] found useful for training on datasets where clean (perfectly labeled) data cannot be guaranteed. This could be a promising type of architecture since it is robust against outliers and less affected by anomalous points during training, which could potentially result in a model that can separate botnet and normal flows even for dirty (poorly labeled) datasets. A setting where this could be highly useful is, for example, corporate networks where flow data is logged but not labeled. In such an environment, a Robust Deep Autoencoder trained on this unlabeled data could be used in conjunction with a smart threshold placement, as discussed in Section 7.1.1, as a potential botnet detector.

7.1.3 Use reconstruction error as metric for other models

The reconstruction error metric used in this proposed method for botnet detection does not have to be the single metric for detecting botnets in NetFlow data. The metric could be integrated as a feature for other models, similar to how IP2Vec was used to provide continuous vector representations of IP addresses [30]. The metric could be used as a feature for other machine learning methods, for example the methods discussed in Sections 3.1 and 3.2.
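A sketch of this idea, using the reconstruction error as an extra feature for a random forest; note that, unlike the unsupervised setting of the proposed approach, this assumes labeled data, here reusing the matrices and error arrays from the earlier sketches:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_all = np.concatenate([X_test, X_bot])
mse_all = np.concatenate([mse_test, mse_bot]).reshape(-1, 1)
y_all = np.concatenate([np.zeros(len(X_test)), np.ones(len(X_bot))])

# Original 87 features plus the reconstruction error as an 88th feature.
X_aug = np.hstack([X_all, mse_all])
clf = RandomForestClassifier(n_estimators=100).fit(X_aug, y_all)
```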
It could also be worth investigating whether using the reconstruction error metric with a threshold classifier is better or worse at detecting certain botnet flows than other models. Perhaps a combination of multiple models, each producing probabilities of detected botnet flows, could result in a superior detector, similar to how random forests leverage multiple decision trees [25, p. 552-553].


References

[1] E. B. Beigi, H. H. Jazi, N. Stakhanova, and A. A. Ghorbani, “Towards effective feature selection in machine learning-based botnet detection approaches,” in 2014 IEEE Conference on Communications and Network Security. IEEE, 2014, pp. 247–255.

[2] R. Bejtlich, The practice of network security monitoring: understanding incident detection and response. No Starch Press, 2013.

[3] L. Bilge, D. Balzarotti, W. Robertson, E. Kirda, and C. Kruegel, “Disclosure: detecting botnet command and control servers through large-scale netflow analysis,” in Proceedings of the 28th Annual Computer Security Applications Conference. ACM, 2012, pp. 129–138.

[4] A. Borghesi, A. Bartolini, M. Lombardi, M. Milano, and L. Benini, “Anomaly detection using autoencoders in high performance computing systems,” arXiv preprint arXiv:1811.05269, 2018.

[5] I. P. Carrascosa, H. K. Kalutarage, and Y. Huang, Data Analytics and Decision Support for Cybersecurity: Trends, Methodologies and Applications. Springer, 2017.

[6] F. Chollet, Deep learning with python. Manning Publications Co., 2017.

[7] M. Collins, Network Security Through Data Analysis: Building Situational Awareness. O’Reilly Media, Inc., 2014.

[8] T. Dunning and E. Friedman, Practical machine learning: a new look at anomaly detection. O’Reilly Media, Inc., 2014.

[9] K. Gai, M. Qiu, L. Tao, and Y. Zhu, “Intrusion detection techniques for mobile
cloud computing in heterogeneous 5g,” Security and Communication Networks,
vol. 9, no. 16, pp. 3049–3058, 2016.

[10] S. Garcia, M. Grill, J. Stiborek, and A. Zunino, “An empirical comparison of bot-
net detection methods,” computers & security, vol. 45, pp. 100–123, 2014.

[11] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning. MIT press
Cambridge, 2016, vol. 1.

[12] E. Henry. (2016) Netflow and word2vec -> flow2vec. (Date last accessed: 2019-
02-22). [Online]. Available: https://web.archive.org/web/20190125072650/https:
//edhenry.github.io/2016/12/21/Netflow-flow2vec/


[13] Z. Hu, J. Heidemann, and Y. Pradkin, “Towards geolocation of millions of ip addresses,” in Proceedings of the 2012 Internet Measurement Conference. ACM, 2012, pp. 123–130.

[14] M. Idhammad, K. Afdel, and M. Belouch, “Dos detection method based on arti-
ficial neural networks,” International Journal of Advanced Computer Science and
Applications, vol. 8, no. 4, pp. 465–471, 2017.

[15] H. Jiang, Y. Liu, and J. N. Matthews, “Ip geolocation estimation using neural
networks with stable landmarks,” in 2016 IEEE Conference on Computer Com-
munications Workshops (INFOCOM WKSHPS). IEEE, 2016, pp. 170–175.

[16] A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, and J. Srivastava, “A comparative study of anomaly detection schemes in network intrusion detection,” in Proceedings of the 2003 SIAM International Conference on Data Mining. SIAM, 2003, pp. 25–36.

[17] B. Li, M. H. Gunes, G. Bebis, and J. Springer, “A supervised machine learning approach to classify host roles on line using sflow,” in Proceedings of the first edition workshop on High performance and programmable networking. ACM, 2013, pp. 53–60.

[18] D. Lin and B. Tang, “Detecting unmanaged and unauthorized devices on the net-
work with long short-term memory network,” in 2018 IEEE International Confer-
ence on Big Data (Big Data). IEEE, 2018, pp. 2980–2985.

[19] M. W. Lucas, Network flow analysis. No Starch Press, 2010.

[20] M. Małowidzki, P. Berezinski, and M. Mazur, “Network intrusion detection: Half a kingdom for a good dataset,” in Proceedings of NATO STO SAS-139 Workshop, Portugal, 2015.

[21] S. Mathur, B. Coskun, and S. Balakrishnan, “Detecting hidden enemy lines in ip address space,” in Proceedings of the 2013 New Security Paradigms Workshop. ACM, 2013, pp. 19–30.

[22] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word rep-
resentations in vector space,” arXiv preprint arXiv:1301.3781, 2013.

[23] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.


[24] M. Moradi and M. Zulkernine, “A neural network based system for intrusion de-
tection and classification of attacks,” in Proceedings of the IEEE International
Conference on Advances in Intelligent Systems-Theory and Applications, 2004,
pp. 15–18.
[25] K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT press Cam-
bridge, 2012.
[26] M. M. Najafabadi, T. M. Khoshgoftaar, C. Calvert, and C. Kemp, “Detection of
ssh brute force attacks using aggregated netflow data,” in 2015 IEEE 14th Interna-
tional Conference on Machine Learning and Applications (ICMLA). IEEE, 2015,
pp. 283–288.
[27] M. M. Najafabadi, T. M. Khoshgoftaar, A. Napolitano, and C. Wheelus, “Rudy
attack: Detection at the network level and its important features,” in The twenty-
ninth international flairs conference, 2016.
[28] B. J. Radford, L. M. Apolonio, A. J. Trias, and J. A. Simpson, “Net-
work traffic anomaly detection using recurrent neural networks,” arXiv preprint
arXiv:1803.10769, 2018.
[29] S. Raschka and V. Mirjalili, Python machine learning, 2nd ed. Packt Publishing
Ltd, 2017.
[30] M. Ring, A. Dallmann, D. Landes, and A. Hotho, “Ip2vec: Learning similarities
between ip addresses,” in Data Mining Workshops (ICDMW), 2017 IEEE Interna-
tional Conference on. IEEE, 2017, pp. 657–666.
[31] M. Ring, S. Wunderlich, D. Grüdl, D. Landes, and A. Hotho, “Flow-based bench-
mark data sets for intrusion detection,” in Proceedings of the 16th European Con-
ference on Cyber Warfare and Security. ACPI, 2017, pp. 361–369.
[32] X. Rong, “word2vec parameter learning explained,” arXiv preprint
arXiv:1411.2738, 2014.
[33] C. Sanders, Practical packet analysis: Using Wireshark to solve real-world net-
work problems. No Starch Press, 2017.
[34] M. Stevanovic and J. M. Pedersen, “An efficient flow-based botnet detection us-
ing supervised machine learning,” in 2014 international conference on computing,
networking and communications (ICNC). IEEE, 2014, pp. 797–801.
[35] D. S. Terzi, R. Terzi, and S. Sagiroglu, “Big data analytics for network anomaly
detection from netflow data,” in 2017 International Conference on Computer Sci-
ence and Engineering (UBMK). IEEE, 2017, pp. 592–597.


[36] Q. A. Tran, F. Jiang, and J. Hu, “A real-time netflow-based intrusion detection system with improved bbnn and high-frequency field programmable gate arrays,” in 2012 IEEE 11th International Conference on Trust, Security and Privacy in Computing and Communications. IEEE, 2012, pp. 201–208.

[37] J. Wang and I. C. Paschalidis, “Botnet detection based on anomaly and community
detection,” IEEE Transactions on Control of Network Systems, vol. 4, no. 2, pp.
392–404, 2017.

[38] P. Winter, E. Hermann, and M. Zeilinger, “Inductive intrusion detection in flow-based network data using one-class support vector machines,” in 2011 4th IFIP international conference on new technologies, mobility and security. IEEE, 2011, pp. 1–5.

[39] C. Zhou and R. C. Paffenroth, “Anomaly detection with robust deep autoencoders,”
in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. ACM, 2017, pp. 665–674.
