Investigation of The Problem of Classifying Unbalanced Datasets in Identifying Distributed Denial of Service Attacks

Journal of Physics: Conference Series
PAPER • OPEN ACCESS

To cite this article: I Bolodurina et al 2020 J. Phys.: Conf. Ser. 1679 042020

APITECH II IOP Publishing
Journal of Physics: Conference Series 1679 (2020) 042020 doi:10.1088/1742-6596/1679/4/042020
Investigation of the problem of classifying unbalanced datasets in identifying distributed denial of service attacks
I Bolodurina¹, A Shukhman², D Parfenov¹, A Zhigalov¹ and L Zabrodina¹

¹ Department of Applied Mathematics, Orenburg State University, 13 Prospekt Pobedy, Orenburg, 460018, Russia
² Department of Geometry and Computer Science, Orenburg State University, 13 Prospekt Pobedy, Orenburg, 460018, Russia
E-mail: parfenovdi@mail.ru
Abstract. This paper examines how data balancing algorithms affect network traffic classification
for various types of distributed denial of service attacks on the CICDDoS2019 dataset, which
contains information about reflection-based and exploitation-based attacks. Computational
experiments show that data balancing algorithms such as naive random sampling, synthetic minority
oversampling, and adaptive synthetic sampling are effective in identifying network attacks. A
comparative analysis of the data sampling approaches shows that adaptive synthetic sampling
combined with the random forest algorithm achieves the highest classification accuracy.
1. Introduction
At present, threats to the confidentiality and integrity of data demand careful attention to
information security. Existing cryptographic protection, access control, and intrusion detection
systems cannot identify all possible types of attacks, because there is no universal method for
classifying attacks. Most methods rely on forming an attack profile, which makes it difficult to
identify new types of threats. In addition, most attack detection algorithms depend on the type of
network and on its topology. For these reasons, the problem of identifying network attacks is being
studied by researchers all over the world.
One of the most common approaches to analyzing network traffic is the use of machine learning
methods, which identify dependencies associated with attacks from the recorded characteristics of
data packets and the frequency characteristics of the network [1, 2]. This approach has proven
effective, and almost any traffic analysis system includes a classification component.
The main problem in building such classifiers, apart from selecting and configuring classification
algorithms, is the quality of the training data and of the test set used to evaluate the accuracy
of the results. Data quality covers not only missing values, duplicates, anomalies, outliers, etc.,
but also the difference in the number of records belonging to different classes. The latter
phenomenon is called the class imbalance problem, and a whole class of algorithms exists to address
it and implement the principle of data diversity [3].
This paper examines the impact of data balancing algorithms on network traffic classification for
various types of distributed denial of service (DDoS) attacks on the CICDDoS2019 dataset, which
contains information about reflection-based and exploitation-based attacks [4].
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
2. Related works
Several studies describe the identification of various types of DDoS attacks, including on the
CICDDoS2019 dataset. In [5], new attacks on the TCP/UDP protocols are analyzed using artificial
neural networks (ANN), support vector machines (SVM), Gaussian naive Bayes, the k-nearest neighbor
method, and other algorithms, with a reported accuracy of 99.8%.
The authors of [6] proposed DDoSNet, a new intrusion detection system for software-defined networks
(SDN) based on recurrent neural networks with an autoencoder. Compared with machine learning
algorithms such as Random Forest, Naive Bayes, and SVM, it showed an increased accuracy of 99% in
detecting various types of DDoS attacks.
The study [7], based on the CICDDoS2019 dataset from the Canadian Institute for Cybersecurity,
identified the best features for predicting DDoS attacks, taking into account the discretization
of the number of class instances. Training several classifiers with cross-validation yielded an
accuracy of 96.9%, and an ensemble approach to training made it possible to balance performance
against time spent.
The most effective features for detecting DDoS attacks were proposed in [8] by I. Sharafaldin,
A. H. Lashkari and co-authors from the Canadian Institute for Cybersecurity. That study also
introduces a new DDoS taxonomy for the application layer and assesses the main advantages and
disadvantages of existing datasets describing DDoS attacks.
The study [9] proposes a new system for detecting DoS attacks based on machine learning methods
that classify attacks using signatures previously extracted from network traffic samples.
Experiments have shown that on the presented reference datasets the attack detection rate is higher
than 96% and is accompanied by a low false alarm rate of 20% of network traffic.
The study [10] used three different deep learning models to detect DDoS attacks and compared the
performance of each model for different types of attacks. The proposed models outperform
traditional machine learning classification models and reach up to 93.3% accuracy in identifying
network attacks.
Thus, prior research has shown that traditional machine learning methods can identify network
attacks such as DDoS fairly accurately. However, these studies considered neither the quality of
the dataset nor the influence of class balance on the accuracy of the resulting classifiers. In this
article, we apply various data balancing algorithms and analyze their impact on the accuracy of the
resulting attack classifiers.
3. Problem statement of network traffic classification

Consider the problem of classifying network traffic to identify attacks on the CICDDoS2019
dataset, which describes the most common DDoS attacks. The raw experimental network traffic data
and the event log data were previously converted using CICFlowMeter-V3, which extracted about 80
traffic characteristics.
We need to build a classifier $f : Z \to K$ that identifies the types of DDoS attacks by assigning
to each traffic flow $z_i \in Z$, where $Z$ is the set of all network flows in the analyzed network,
a label $k_j \in K$ corresponding to normal traffic or to one of the DDoS attack types. A traffic
flow $z_i$ is described by its traffic characteristics $z_i = (x_1, \ldots, x_m)$, where $m$ is the
number of recorded characteristics.
The set of class labels is $K$ = {Normal, PortMap, NetBIOS, UDPLag, LDAP}. The traffic
characteristics are described in more detail in [8]; all of them are used except timestamps, source
and destination IP addresses, and source and destination ports.
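As an illustration of how a flow record could be turned into the pair of a feature vector and a label, the following minimal sketch drops the identifying fields and keeps the remaining numeric characteristics. The column names, the record layout, and the function name `to_feature_vector` are assumptions made for this example; actual CICFlowMeter exports may name the fields differently.

```python
from __future__ import annotations

# Hypothetical column names, assumed for illustration only.
IDENTIFIER_COLUMNS = {
    "Timestamp", "Source IP", "Destination IP", "Source Port", "Destination Port",
}
LABELS = ("Normal", "PortMap", "NetBIOS", "UDPLag", "LDAP")


def to_feature_vector(flow: dict) -> tuple[list[float], str]:
    """Map one flow record to (z_i, k_j): numeric features and a class label."""
    label = flow["Label"]
    if label not in LABELS:
        raise ValueError(f"unknown class label: {label!r}")
    # Keep every characteristic except the identifying fields and the label.
    names = sorted(n for n in flow if n not in IDENTIFIER_COLUMNS and n != "Label")
    return [float(flow[n]) for n in names], label


# Made-up record used only to show the call.
example_flow = {
    "Timestamp": "t0", "Source IP": "10.0.0.1", "Destination IP": "10.0.0.2",
    "Source Port": 634, "Destination Port": 60495,
    "Flow Duration": 28415, "Total Fwd Packets": 97, "Label": "LDAP",
}
features, label = to_feature_vector(example_flow)
```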
We will investigate the influence of unbalanced data sampling on the most common classifiers:
support vector machine (SVM), random forest (RF), and gradient boosting (GBM). The study proceeds
in several stages:
• Pre-processing of data: handling of missing values, duplicates, anomalies, and outliers, and the encoding of features.
• Application of data balancing methods: sampling data to form equivalent classes.
• Classification of network attacks by the algorithms described above.
• Assessment of the quality of the classification results obtained.
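The first stage can be sketched as follows. This is a minimal example under stated assumptions, not the preprocessing used in the experiments: it only drops rows with missing or non-finite feature values and exact duplicates, and it leaves out anomaly and outlier handling and feature encoding. The function name `preprocess` and the row layout (a list of features paired with a label) are hypothetical.

```python
import math


def preprocess(rows):
    """Drop rows with missing or non-finite features, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for features, label in rows:
        # Skip rows that contain None, NaN, or infinite values.
        if any(v is None or not math.isfinite(v) for v in features):
            continue
        key = (tuple(features), label)
        if key in seen:  # exact duplicate of an earlier row
            continue
        seen.add(key)
        cleaned.append((list(features), label))
    return cleaned


rows = [
    ([1.0, 2.0], "Normal"),
    ([1.0, 2.0], "Normal"),          # duplicate
    ([float("nan"), 1.0], "LDAP"),   # missing value encoded as NaN
    ([None, 3.0], "LDAP"),           # missing value encoded as None
    ([float("inf"), 0.0], "LDAP"),   # non-finite value
    ([3.0, 4.0], "LDAP"),
]
cleaned = preprocess(rows)
```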
In this paper, we consider the following data sampling algorithms: naive random sampling (ROS),
synthetic minority oversampling (SMOTE), and adaptive synthetic sampling (ADASYN).
4. Sampling unbalanced data

One of the main approaches to the problem of unbalanced data is the use of sampling algorithms.
Deleting a certain number of entries of the majority class is called undersampling; increasing the
number of entries of the minority class is called oversampling.
4.1. Naive random sampling (ROS)

The naive random sampling algorithm increases the number of instances of a minority class by
randomly duplicating existing instances of that class in the overall dataset. The class ratio must
be specified; it determines how many random duplicate entries are added. The main advantage of this
approach is its simplicity of implementation. Its disadvantages include a decrease in the
representativeness of the sample.
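A minimal sketch of random oversampling is given below, assuming the data are held as plain Python lists. The function name `random_oversample` and the `ratio` parameter (target size of each class relative to the largest class) are illustrative choices, not taken from the experiments.

```python
import random
from collections import Counter


def random_oversample(X, y, ratio=1.0, seed=0):
    """Duplicate random rows of smaller classes up to ratio * size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = int(ratio * max(counts.values()))
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        # Add random duplicates only for classes below the target size.
        for _ in range(max(0, target - count)):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


X = [[0], [1], [2], [3], [10]]
y = ["a", "a", "a", "a", "b"]
X_bal, y_bal = random_oversample(X, y)
```

Because the added rows are exact copies, the balanced set contains no new information about the minority class, which is the loss of representativeness mentioned above.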
4.2. Synthetic minority oversampling technique (SMOTE)

The synthetic minority oversampling algorithm generates instances of the minority class using the
k nearest neighbor method to evaluate the similarity of instances in the feature space. This
approach creates synthetic instances that are similar to the instances of the minority class but do
not duplicate them.
Let $Z^1$ be the set of instances in the training sample and let $Z^1_{min}$ be the set of
instances of the minority class, where $Z^1_{min} \subset Z^1$, $Z = Z^1 \cup Z^2$,
$Z^1 \cap Z^2 = \emptyset$, and $Z^2$ is the set of instances of the test sample. In this case
$z_i = (z_i^1, z_i^2, \ldots, z_i^n) \in Z$, $i = 1, \ldots, S_{min}$, is an instance of the class,
where $n$ is the total number of instance characteristics.
SMOTE algorithm:
• For each instance $z_i$ of the minority class, find its $k$ nearest neighbors within the minority class (the Euclidean metric is used to estimate the distance).
• Randomly select one of the $k$ nearest neighbors of $z_i$: $z_K = (z_K^1, z_K^2, \ldots, z_K^n)$, $1 \le K \le k$.
• Synthesize a new instance of the minority class according to the rule $t_i = z_i + (z_K - z_i) \cdot rand$, where $rand$ is a random number from the segment [0, 1], so that $t_i$ lies on the segment between $z_i$ and $z_K$.
• Repeat the three steps above until the required number of synthesized instances of the minority class has been created.
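The steps above can be sketched in a few lines of Python. This is an illustrative sketch that follows the standard SMOTE formulation, in which the new point is interpolated between $z_i$ and its randomly chosen neighbor $z_K$; the function name `smote` and the parameters `n_new`, `k`, and `seed` are assumptions for the example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority instances")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for t in range(n_new):
        i = t % len(minority)  # cycle through the minority instances
        z_i = minority[i]
        # k nearest minority neighbors of z_i by Euclidean distance.
        others = [z for j, z in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda z: math.dist(z_i, z))[:k]
        z_K = rng.choice(neighbors)
        gap = rng.random()  # random number in [0, 1)
        synthetic.append([a + (b - a) * gap for a, b in zip(z_i, z_K)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=10, k=2)
```

Each synthetic point lies on a segment joining two existing minority instances, so it stays inside the region already occupied by the minority class.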
4.3. Adaptive synthetic sampling method (ADASYN)

The ADASYN algorithm is a systematic method that adaptively generates different amounts of
synthetic data for different minority samples according to their distributions. The input to the
algorithm is a training dataset $D_r$ with $m$ samples $\{x_i, y_i\}$, $i = 1, \ldots, m$, where
$x_i$ is an n-dimensional vector in the feature space and $y_i$ is the corresponding class label.
Let $m_r$ and $m_x$ be the numbers of samples of the minority and majority classes, respectively,
such that $m_r \le m_x$ and $m_r + m_x = m$.
ADASYN algorithm:
• Calculate the class ratio $d = m_r / m_x$.
• If $d < d_x$ (where $d_x$ is the specified threshold for the maximum allowable class imbalance), then:
  • Find the number of synthetic samples of the minority class to generate, $G = (m_x - m_r) \cdot \beta$, where $\beta$ is the parameter used to set the desired level of balance ($\beta = 1$ means full balance of the classes).
  • For each $x_i$ in the minority class, find its $K$ nearest neighbors using the Euclidean distance and calculate $r_i = \Delta_i / K$, where $\Delta_i$ is the number of those neighbors that belong to the majority class. Normalize $\hat{r}_i = r_i / \sum_i r_i$ so that $\hat{r}_i$ becomes a density distribution.
  • Calculate the number of synthetic samples $g_i = \hat{r}_i \cdot G$ to generate for each sample $x_i$ of the minority class, where $G$ is the total number of synthetic examples.
  • For each minority sample $x_i$, create $g_i$ synthetic examples by repeating the following steps $g_i$ times:
    • Randomly select one minority example $x_u$ from the $K$ nearest neighbors of $x_i$;
    • Create a synthetic example $s_i = x_i + (x_u - x_i) \cdot \lambda$, where $(x_u - x_i)$ is a difference vector in the n-dimensional Euclidean space and $\lambda$ is a random number, $\lambda \in [0, 1]$.
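The following sketch follows the standard two-class ADASYN formulation: minority samples surrounded by more majority neighbors receive more synthetic examples. It is an illustration under assumptions, not the implementation used in the experiments. The function name `adasyn` and the parameter names `beta`, `d_threshold`, and `k` are hypothetical, and the fallback to the closest minority sample when none of the $K$ neighbors belongs to the minority class is a choice made for this example.

```python
import math
import random


def adasyn(X, y, minority_label, beta=1.0, k=5, d_threshold=1.0, seed=0):
    """Return synthetic minority points, more of them near hard-to-learn samples."""
    rng = random.Random(seed)
    min_idx = [i for i, lab in enumerate(y) if lab == minority_label]
    m_r, m_x = len(min_idx), len(y) - len(min_idx)
    if m_r < 2 or m_x == 0 or m_r / m_x >= d_threshold:
        return []  # nothing to do: too few samples or classes balanced enough
    G = (m_x - m_r) * beta  # total number of synthetic samples
    k = min(k, len(X) - 1)
    ratios, candidates = [], []
    for i in min_idx:
        nearest = sorted((j for j in range(len(X)) if j != i),
                         key=lambda j: math.dist(X[i], X[j]))
        knn = nearest[:k]
        # Share of majority samples among the K nearest neighbors.
        ratios.append(sum(y[j] != minority_label for j in knn) / k)
        same = [j for j in knn if y[j] == minority_label]
        if not same:  # assumption: fall back to the closest minority sample
            same = [next(j for j in nearest if y[j] == minority_label)]
        candidates.append(same)
    total = sum(ratios)
    if total == 0:
        return []  # no minority sample has majority neighbors
    synthetic = []
    for i, r_i, same in zip(min_idx, ratios, candidates):
        g_i = round(r_i / total * G)  # samples to generate around x_i
        for _ in range(g_i):
            x_u = X[rng.choice(same)]
            lam = rng.random()
            synthetic.append([a + (b - a) * lam for a, b in zip(X[i], x_u)])
    return synthetic


X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1], [0.5, 0.5], [1.5, 0.5]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
new_points = adasyn(X, y, minority_label=1, k=3)
```

Unlike SMOTE, which spreads synthetic points evenly over the minority instances, this scheme concentrates them near the class boundary.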
5. Experiment
We investigated the influence of unbalanced data sampling on the most common classifiers: support
vector machine (SVM), random forest (RF), and gradient boosting (GBM). All calculations were
implemented in Python.
To compare the classification results, the following metrics were calculated: accuracy, balanced
accuracy, recall, and F1-measure. The results for the support vector machine are presented in
table 1. According to the balanced accuracy values, this classifier identifies the PortMap attack
most effectively (94.81%, obtained with SMOTE). Among the sampling methods, SVM combined with SMOTE
showed the more accurate results, with every reported metric above 79.9%.
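The four metrics can be computed per class as sketched below. The paper does not state how the per-class values were obtained, so this example assumes a one-vs-rest view in which the class of interest is treated as positive and all other classes as negative; balanced accuracy is then the mean of recall and specificity. The function name `one_vs_rest_metrics` and the toy labels are illustrative.

```python
def one_vs_rest_metrics(y_true, y_pred, positive):
    """Accuracy, balanced accuracy, F1 and recall for one class against the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": f1,
        "recall": recall,
    }


# Toy example: 8 normal flows, 2 LDAP flows, one LDAP flow misclassified.
y_true = ["Normal"] * 8 + ["LDAP"] * 2
y_pred = ["Normal"] * 8 + ["LDAP", "Normal"]
metrics = one_vs_rest_metrics(y_true, y_pred, positive="LDAP")
```

In this toy case the accuracy is 0.9 while the balanced accuracy is only 0.75, which shows why plain accuracy can look high on unbalanced data even when half of the minority class is missed.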
Table 1. Results of classification by the support vector method (SVM).

Sampling method   Type of attack   Accuracy   Balanced accuracy   F1-measure   Recall
Unbalanced data   Normal           0.9656     0.6752              0.7759       0.7634
                  PortMap          0.9146     0.7219              0.8017       0.8625
                  NetBIOS          0.8917     0.6821              0.7976       0.8073
                  UDPLag           0.9305     0.7493              0.8742       0.8706
                  LDAP             0.8034     0.6917              0.7571       0.7749
ROS               Normal           0.9591     0.8751              0.8297       0.8238
                  PortMap          0.9164     0.8692              0.8239       0.9263
                  NetBIOS          0.9368     0.8463              0.9023       0.8462
                  UDPLag           0.8942     0.7872              0.7965       0.8735
                  LDAP             0.8845     0.7975              0.8134       0.7957
SMOTE             Normal           0.9784     0.9106              0.9274       0.9365
                  PortMap          0.9724     0.9481              0.9548       0.9653
                  NetBIOS          0.9622     0.8936              0.9013       0.9043
                  UDPLag           0.8934     0.8394              0.8862       0.8898
                  LDAP             0.8738     0.7996              0.8126       0.7994
ADASYN            Normal           0.9582     0.9024              0.9175       0.9067
                  PortMap          0.9318     0.9276              0.9064       0.9363
                  NetBIOS          0.8974     0.8473              0.8856       0.8936
                  UDPLag           0.8916     0.8795              0.8915       0.8851
                  LDAP             0.8585     0.7854              0.8396       0.7946
Table 2. Results of classification by the random forest algorithm.

Sampling method   Type of attack   Accuracy   Balanced accuracy   F1-measure   Recall
Unbalanced data   Normal           0.9487     0.7452              0.8034       0.8043
                  PortMap          0.9192     0.758               0.8255       0.854
                  UDPLag           0.9364     0.7843              0.8815       0.8905
                  LDAP             0.9289     0.8784              0.8407       0.8363
ROS               Normal           0.9478     0.8763              0.832        0.8275
                  PortMap          0.9451     0.9344              0.9243       0.9318
                  NetBIOS          0.9332     0.92                0.9112       0.9283
                  UDPLag           0.9032     0.8949              0.9262       0.9211
                  LDAP             0.9163     0.9081              0.9219       0.9405
SMOTE             Normal           0.9521     0.9504              0.9518       0.9572
                  PortMap          0.959      0.9385              0.9627       0.9688
                  NetBIOS          0.9567     0.9206              0.9302       0.9225
                  UDPLag           0.9345     0.9262              0.9588       0.9532
                  LDAP             0.9693     0.9504              0.9664       0.9435
ADASYN            Normal           0.9748     0.9621              0.9509       0.9783
                  PortMap          0.9757     0.9788              0.9217       0.944
                  NetBIOS          0.9802     0.9681              0.9656       0.9694
                  UDPLag           0.9862     0.9796              0.9771       0.9705
                  LDAP             0.9725     0.9663              0.9684       0.9852
Table 3. Results of classification by the gradient boosting algorithm.

Sampling method   Type of attack   Accuracy   Balanced accuracy   F1-measure   Recall
Unbalanced data   Normal           0.9713     0.8052              0.791        0.7885
                  PortMap          0.9221     0.7393              0.8155       0.8734
                  UDPLag           0.9312     0.7588              0.8762       0.8752
                  LDAP             0.8339     0.8117              0.7687       0.7921
ROS               Normal           0.9616     0.8784              0.8304       0.827
                  PortMap          0.9263     0.8721              0.824        0.9273
                  NetBIOS          0.9405     0.8519              0.9112       0.848
                  UDPLag           0.8998     0.8349              0.8262       0.8883
                  LDAP             0.9063     0.8081              0.8219       0.805
SMOTE             Normal           0.9802     0.911               0.9318       0.9372
                  PortMap          0.979      0.9485              0.9627       0.9688
                  NetBIOS          0.967      0.9206              0.9302       0.9225
                  UDPLag           0.9445     0.9476              0.9388       0.9232
                  LDAP             0.9593     0.9459              0.9577       0.9338
ADASYN            Normal           0.9648     0.923               0.9209       0.9183
                  PortMap          0.9657     0.9588              0.9217       0.944
                  NetBIOS          0.9573     0.9627              0.8951       0.9294
                  UDPLag           0.9542     0.9583              0.895        0.9104
                  LDAP             0.9634     0.9554              0.9332       0.9237
Tables 2 and 3 present the corresponding results for random forest and gradient boosting,
respectively. In all the considered cases, the use of sampling methods allowed us to obtain a
higher classification accuracy than on unbalanced data (figure 1). Within the scheme described in
this paper, the best classification accuracy (more than 98%) was achieved by applying the ADASYN
class balancing algorithm followed by the random forest algorithm.
Figure 1. Results of applying sampling methods with different classification algorithms.
6. Conclusion
In this paper, we investigated how to improve the accuracy of classifying network attacks on the
unbalanced CICDDoS2019 data using class sampling algorithms such as ROS, SMOTE, and ADASYN. The
results of computational experiments have shown the effectiveness of data balancing algorithms in
identifying network attacks. In particular, the ADASYN adaptive synthetic sampling method raised
the accuracy of attack classification to 98%, higher than the other algorithms. Finally, the
problem considered in this study can also be explored with other classification algorithms, such as
recurrent neural networks and deep learning methods, and existing data sampling algorithms can be
improved.
Acknowledgments
The study was carried out with the financial support of the RFBR within scientific project
No. 20-07-01065, a grant from the President of the Russian Federation for state support of leading
scientific schools of the Russian Federation (NSh-2502.2020.9), and a grant from the President of
the Russian Federation for state support of young Russian scientists (MK-860.2019.9).
References
[1] Elovici Y, Shabtai A, Moskovitch R, Tahan G and Glezer C 2007 Applying Machine Learning
Techniques for Detection of Malicious Code in Network Traffic KI 4667 ed Hertzberg J, Beetz
M et al. (Berlin: Springer) p 44-50
[2] Xin Ya, Kong L, Liu Zh, Chen Yu, Li Ya, Zhu H, Gao M, Hou H and Wang C 2018 Machine
Learning and Deep Learning Methods for Cybersecurity IEEE Access 6 35365-81
[3] Sun Ya, Wong A K and Kamel M S 2009 Classification of imbalanced data: a review
International Journal of Pattern Recognition and Artificial Intelligence 23(4) 687-719
[4] Canadian Institute for Cybersecurity DDoS evaluation dataset (CICDDoS2019) Retrieved from:
https://www.unb.ca/cic/datasets/ddos-2019.html
[5] Zekri M, El Kafhali S, Aboutabit N and Saadi Yo 2017 DDoS attack detection using machine
learning techniques in cloud computing environments 3rd International Conf. of Cloud
Computing Technologies and Application (CloudTech) Rabat 1-7
[6] Elsayed M S, Le-Khac N, Dev S and Jurcut A 2020 DDoSNet: A Deep-Learning Model for
Detecting Network Attacks 2020 Proc. IEEE World of Wireless Mobile and Multimedia
networks (WoWMoM) 1-7
[7] Hussain Y S 2020 Network Intrusion Detection for Distributed Denial-of-Service (DDoS) Attacks
using Machine Learning Classification Techniques (Toronto: University of Toronto) p 49
[8] Sharafaldin I, Lashkari A H and Ghorbani A A 2019 Developing Realistic Distributed Denial of
Service (DDoS) Attack Dataset and Taxonomy International Carnahan Conf. on Security
Technology (Chennai) 1-8
[9] Lima F, Silveira A F, Medeiros A and Vargas-Solar G 2019 Smart Detection: An Online
Approach for DoS/DDoS Attack Detection Using Machine Learning Security and
Communication Networks ed Maglaras L 1-15
[10] Li J 2020 Detection of ddos attacks based on dense neural networks, autoencoders and pearson
correlation coefficient (Halifax: Dalhousie University) p 89
